NVIDIA DGX Spark - Performance Degradation & GPU Power Issue

This document consolidates technical data regarding the “Low Power/Low Clock” issue affecting the NVIDIA DGX Spark (GB10). It covers symptoms, root causes, diagnostic tools, and recovery procedures.

1. Issue Overview: GPU Power Capping

Users have reported a performance regression where the DGX Spark GPU becomes “stuck” in a low-power state, severely impacting AI inference and training speeds.

Symptoms

  • Drastic Performance Drop: Token generation speeds (e.g., Llama 3) drop by 50-70%.
  • Low Power Consumption: Under full load, the GPU draws only 5W – 15W (Expected: ~100W+).
  • Capped Clock Speeds: GPU clocks are locked at ~400 MHz – 650 MHz (Expected: ~2400 MHz+).
  • Software Power Capping: nvidia-smi shows the state as P0, but the “SW Power Capping” counter accumulates rapidly.

Common Triggers

  • System Crashes: Hard reboots after OOM (Out of Memory) errors or kernel panics.
  • Sleep/Wake Cycles: Waking the system from a suspended state.
  • Field Diagnostics: Running the official NVIDIA Field Diagnostic tool has been reported to trigger this state on otherwise healthy units.

2. Hardware Reference: DGX Spark (GB10)

Understanding the hardware limits is essential for identifying when the system is underperforming.

Feature            Specification
-----------------  ---------------------------------------------------
Architecture       NVIDIA Grace Blackwell
GPU                Blackwell Architecture (GB10)
CPU                20-core Arm (10x Cortex-X925 + 10x Cortex-A725)
Unified Memory     128 GB LPDDR5x
TDP (Superchip)    140 W
Power Supply       240 W External PSU (USB-C PD)
Networking         ConnectX-7 NIC (200 Gbps)

3. Diagnostic & Troubleshooting Tools

A. Manual Check (nvidia-smi)

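Run the following command while a workload is active. The POWER section is included because the power limits are reported there rather than in the PERFORMANCE section:

nvidia-smi -q -d PERFORMANCE,POWER

Look for: Power Limit, Enforced Power Limit, and SW Power Cap. If the GPU draws only about 10 W under sustained load, your unit is likely affected.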
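
The same check can be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH and that the power.draw, clocks.sm, and utilization.gpu query fields are reported on this platform. Fields the driver does not support come back as a bracketed placeholder such as [N/A], which the parser maps to None. The function names are illustrative and not part of any NVIDIA tool.

```python
import subprocess

# Query fields passed to nvidia-smi; the order here defines the CSV column order.
QUERY_FIELDS = ["power.draw", "clocks.sm", "utilization.gpu"]


def parse_sample(csv_line):
    """Parse one 'csv,noheader,nounits' line into a dict of floats (None if unavailable)."""
    values = [v.strip() for v in csv_line.strip().split(",")]
    sample = {}
    for field, raw in zip(QUERY_FIELDS, values):
        try:
            sample[field] = float(raw)
        except ValueError:  # e.g. "[N/A]" when the driver does not report the field
            sample[field] = None
    return sample


def read_gpu_sample():
    """Query nvidia-smi once and return the parsed sample for the first GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])
```

Call read_gpu_sample() while the workload is running; readings taken at idle say nothing about this issue, since idle power is about the same on healthy and affected units.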

B. GPU Throttle Check Tool

A community tool developed by parallelArchitect provides a deeper look into the mailbox and firmware states.

  • Function: Decodes throttle causes, monitors PCIe link status, and checks for “insufficient power” flags reported by the mlx5_core driver.
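
To illustrate what decoding throttle causes involves, the sketch below maps the active-reasons bitmask exposed by the NVIDIA driver to readable names. It is not the community tool's implementation. The bit values follow the nvmlClocksThrottleReason* constants in NVIDIA's nvml.h; the function and dictionary names are hypothetical. The mask itself can be read with an nvidia-smi query field whose name varies by driver version (clocks_throttle_reasons.active on older drivers, clocks_event_reasons.active on newer ones); nvidia-smi --help-query-gpu lists what your driver supports.

```python
# Bit values as defined for the nvmlClocksThrottleReason* constants in nvml.h.
THROTTLE_REASONS = {
    0x001: "GPU idle",
    0x002: "Applications clocks setting",
    0x004: "SW power cap",
    0x008: "HW slowdown",
    0x010: "Sync boost",
    0x020: "SW thermal slowdown",
    0x040: "HW thermal slowdown",
    0x080: "HW power brake slowdown",
    0x100: "Display clock setting",
}


def decode_throttle_mask(mask):
    """Return the names of all reasons set in the mask, lowest bit first.

    Accepts an int or a hex string such as "0x0000000000000004".
    """
    if isinstance(mask, str):
        mask = int(mask.strip(), 16)
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]
```

On an affected unit under load, the decoded list would be expected to include "SW power cap", matching the symptom described in the overview.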

C. Firmware Verification

Check your current firmware components:

sudo dmidecode -t 45 | grep -A2 -E "UEFI|EC|PD|FLASH"
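
For comparing firmware versions across units, it can help to turn the dmidecode output into structured records. The sketch below is a generic parser for dmidecode's layout (one "Handle ..." line per record, followed by indented "Key: Value" lines). It assumes nothing about which keys the DGX Spark reports, and the function name is hypothetical.

```python
def parse_firmware_inventory(text):
    """Split dmidecode output into one dict per record.

    A new record starts at each line beginning with "Handle ";
    indented "Key: Value" lines are stored in the current record.
    """
    records = []
    current = None
    for line in text.splitlines():
        if line.startswith("Handle "):
            current = {}
            records.append(current)
        elif current is not None and line[:1] in ("\t", " ") and ":" in line:
            # Split on the first colon only, so values containing colons survive.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    # Drop records that contained no key/value lines.
    return [r for r in records if r]
```

Feed it the full text of sudo dmidecode -t 45 (for example, captured with subprocess.run) and diff the resulting records between a healthy and an affected unit.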

4. Resolution & Recovery

The “Cold Boot” Fix (Primary Solution)

The issue is often caused by the USB-C Power Delivery (PD) controller entering a bad state.

  1. Shut down the DGX Spark.
  2. Physically unplug the power adapter from the wall outlet and the USB-C cable from the unit.
  3. Wait for at least 60 seconds to allow capacitors to discharge.
  4. Plug everything back in and boot. Removing all power in this way resets the PD controller.

Official System Recovery

If the OS environment is corrupted, perform a factory reset using a recovery image.

  1. Preparation: Download the recovery image and flash it to a 16GB+ USB drive.
  2. UEFI Settings: Set Secure Boot to “Custom” and restore Factory Keys.
  3. Boot: Select the USB drive as the primary boot device and follow the NVIDIA automated recovery prompts.

5. Summary Table: Expected vs. Degraded State

Metric           Expected (Healthy)    Degraded (Affected)
---------------  --------------------  --------------------
Idle Power       ~5 W                  ~5 W
Load Power       80 W - 140 W          7 W - 20 W
GPU Clock        ~2496 MHz             400 MHz - 650 MHz
LLM Inference    ~60 tokens/s          ~25 tokens/s
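
The table can be turned into a simple check. This is a rough sketch using the table's ranges as thresholds, with one assumption: 2400 MHz is taken as the lower bound for a healthy clock, based on the "~2400 MHz+" figure in the symptoms list. Readings that fall between the two ranges are reported as indeterminate rather than guessed. Both function names are hypothetical.

```python
def classify_load_state(power_w, sm_clock_mhz):
    """Classify a reading taken under full load against the table's ranges."""
    if power_w <= 20 and sm_clock_mhz <= 650:
        return "degraded"
    # Assumption: 2400 MHz as the healthy lower bound (from the symptoms list).
    if power_w >= 80 and sm_clock_mhz >= 2400:
        return "healthy"
    return "indeterminate"


def throughput_drop_percent(healthy_tps, degraded_tps):
    """Percentage drop in tokens/s relative to the healthy baseline."""
    return 100.0 * (healthy_tps - degraded_tps) / healthy_tps
```

With the table's inference figures, throughput_drop_percent(60, 25) gives about 58%, which sits inside the 50-70% range quoted in the symptoms.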

Reference:
NVIDIA Developer Forums: DGX Spark Performance & Power Issue Discussion
Official Documentation: NVIDIA DGX Spark System Recovery Guide
Diagnostic Utilities: parallelArchitect Spark GPU Throttle Check Tool
Hardware Reference: DGX Spark Hardware Specifications