NVIDIA DGX Spark - Performance Degradation & GPU Power Issue
This document consolidates technical data regarding the “Low Power/Low Clock” issue affecting the NVIDIA DGX Spark (GB10). It covers symptoms, root causes, diagnostic tools, and recovery procedures.
---
1. Issue Overview: GPU Power Capping
Users have reported a performance regression where the DGX Spark GPU becomes “stuck” in a low-power state, severely impacting AI inference and training speeds.
Symptoms
- Drastic Performance Drop: Token generation speed (e.g., when running Llama 3) drops by 50-70%.
- Low Power Consumption: Under full load, the GPU draws only 5W – 15W (Expected: ~100W+).
- Capped Clock Speeds: GPU clocks are locked at ~400 MHz – 650 MHz (Expected: ~2400 MHz+).
- Software Power Capping: nvidia-smi still reports performance state P0, but the “SW Power Capping” counter accumulates rapidly, meaning the software power cap is what holds the clocks down.
Common Triggers
- System Crashes: Hard reboots after OOM (Out of Memory) errors or kernel panics.
- Sleep/Wake Cycles: Waking the system from a suspended state.
- Field Diagnostics: Ironically, running the official NVIDIA Field Diagnostic tool has been known to trigger this state on healthy units.
---
2. Hardware Reference: DGX Spark (GB10)
Understanding the hardware limits is essential for identifying when the system is underperforming.
| Feature | Specification |
|---|---|
| Architecture | NVIDIA Grace Blackwell |
| GPU | Blackwell Architecture (GB10) |
| CPU | 20-core Arm (10x Cortex-X925 + 10x Cortex-A725) |
| Unified Memory | 128 GB LPDDR5x |
| TDP (Superchip) | 140W |
| Power Supply | 240W External PSU (USB-C PD) |
| Networking | ConnectX-7 NIC (200 Gbps) |
---
3. Diagnostic & Troubleshooting Tools
A. Manual Check (nvidia-smi)
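Run the following command while a GPU workload is active:
nvidia-smi -q -d PERFORMANCE,POWER
Look for SW Power Cap among the clock event (throttle) reasons in the PERFORMANCE output, and for the power draw, Power Limit, and Enforced Power Limit readings in the POWER output. Field names vary slightly between driver versions. If power draw stays around 10W under load, your unit is affected.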
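The manual check can also be scripted. The following is a minimal sketch, assuming nvidia-smi's `--query-gpu` CSV interface with the `power.draw` and `clocks.sm` fields. The function names `classify_sample` and `sample_gpu` and the thresholds (20 W and 650 MHz, taken from the degraded ranges quoted in this document) are illustrative assumptions, not part of any NVIDIA tool. Some platforms may report `[N/A]` for power, which the sketch treats as "unknown".

```shell
#!/bin/sh
# classify_sample POWER_W SM_CLOCK_MHZ
# Prints "degraded" when both readings fall inside the degraded ranges
# quoted in this document, "ok" otherwise, and "unknown" when a reading
# is not a plain number (for example "[N/A]").
# Only meaningful for samples taken under sustained GPU load.
classify_sample() {
  awk -v p="$1" -v c="$2" 'BEGIN {
    num = "^[0-9]+(\\.[0-9]+)?$"
    if (p !~ num || c !~ num) { print "unknown"; exit }
    # Assumed thresholds: at most 20 W and at most 650 MHz under load.
    if (p + 0 <= 20 && c + 0 <= 650) print "degraded"; else print "ok"
  }'
}

# sample_gpu: take one reading from the first GPU and classify it.
# Runs in a subshell so its variables do not leak into the caller.
sample_gpu() (
  line=$(nvidia-smi --query-gpu=power.draw,clocks.sm \
           --format=csv,noheader,nounits | head -n 1)
  [ -n "$line" ] || { echo "no reading from nvidia-smi" >&2; exit 1; }
  power=${line%%,*}    # text before the first comma
  clock=${line##*, }   # text after the last ", "
  printf '%s W, %s MHz: %s\n' "$power" "$clock" \
    "$(classify_sample "$power" "$clock")"
)
```

Start a workload first, then call `sample_gpu` a few times. A single low reading at idle is expected and does not indicate a problem.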
B. GPU Throttle Check Tool
A community tool developed by parallelArchitect provides a deeper look into the mailbox and firmware states.
- Function: Decodes throttle causes, monitors PCIe link status, and checks for “insufficient power” flags reported by the mlx5_core driver (the kernel driver for the ConnectX NIC).
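The last of these checks can be roughly approximated by hand from the kernel log. This is only a sketch: the helper name `filter_insufficient_power` is hypothetical, and the match on the phrase "insufficient power" assumes the wording quoted above; the exact log text may differ between driver versions. It does not reproduce what the community tool does internally.

```shell
#!/bin/sh
# filter_insufficient_power: read kernel log lines on stdin and print only
# those that mention both mlx5_core and "insufficient power".
# Matching is case-insensitive because the exact wording is an assumption.
filter_insufficient_power() {
  grep -i 'mlx5_core' | grep -i 'insufficient power'
}
```

Typical use would be `sudo dmesg | filter_insufficient_power`. No output means no matching message was found in the current kernel log.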
C. Firmware Verification
Check your current firmware components:
sudo dmidecode -t 45 | grep -A2 -E "UEFI|EC|PD|FLASH"
---
4. Resolution & Recovery
The “Cold Boot” Fix (Primary Solution)
The issue is often caused by the USB-C Power Delivery (PD) controller entering a bad state.
- Shut down the DGX Spark.
- Unplug the power adapter from the wall outlet and disconnect the USB-C cable from the unit.
- Wait at least 60 seconds to allow the capacitors to discharge.
- Plug everything back in and boot. The full power cycle resets the PD controller.
Official System Recovery
If the OS environment is corrupted, perform a factory reset using a recovery image.
- Preparation: Download the recovery image and flash it to a 16GB+ USB drive.
- UEFI Settings: Set Secure Boot to “Custom” and restore Factory Keys.
- Boot: Select the USB drive as the primary boot device and follow the NVIDIA automated recovery prompts.
---
5. Summary Table: Expected vs. Degraded State
| Metric | Expected (Healthy) | Degraded (Affected) |
|---|---|---|
| Idle Power | ~5W | ~5W |
| Load Power | 80W - 140W | 7W - 20W |
| GPU Clock | ~2496 MHz | 400 MHz - 650 MHz |
| LLM Inference | ~60 tokens/s | ~25 tokens/s |
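As a quick consistency check, the inference row of the table corresponds to a drop of roughly 58%, which sits inside the 50-70% range listed under Symptoms. The helper below is an illustrative sketch; `percent_drop` is a hypothetical name, not an existing utility.

```shell
#!/bin/sh
# percent_drop HEALTHY DEGRADED
# Prints the relative drop from HEALTHY to DEGRADED as a percentage
# with one decimal place. Fails if HEALTHY is not positive.
percent_drop() {
  awk -v h="$1" -v d="$2" 'BEGIN {
    if (h + 0 <= 0) exit 1
    printf "%.1f\n", (h - d) / h * 100
  }'
}

percent_drop 60 25   # tokens/s from the table; prints 58.3
```

The same function can be applied to your own before and after measurements, for example tokens per second recorded before and after a cold boot.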
References:
- NVIDIA Developer Forums: DGX Spark Performance & Power Issue Discussion
- Official Documentation: NVIDIA DGX Spark System Recovery Guide
- Diagnostic Utilities: parallelArchitect Spark GPU Throttle Check Tool
- Hardware Reference: DGX Spark Hardware Specifications