TraceOpt guides

Diagnose training slowdowns.

Practical guides for finding DataLoader stalls, low GPU utilization, DDP rank stragglers, memory creep, and run-to-run regressions.

Powered by TraceML, our open-source step-level diagnostics layer.

The triage loop
01 Run TraceML
traceml run train.py

02 Read fingerprint
Check step time, input loading, host-to-device (H2D) transfer, compute, wait time, GPU utilization, memory, and rank skew.

03 Choose guide
Use the matching symptom guide before changing DataLoader, model, or distributed settings.
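
As a rough illustration of steps 02 and 03, the sketch below turns per-step phase timings into shares of step time and picks a symptom guide. It does not use TraceML's API or output format: the `Fingerprint` fields, the `rank_skew` and `choose_guide` functions, and every threshold are hypothetical placeholders that you would tune for your own workload.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class Fingerprint:
    """Hypothetical per-run summary; not TraceML's actual schema."""
    input_s: float            # mean time per step waiting on the input pipeline
    h2d_s: float              # mean host-to-device copy time per step
    compute_s: float          # mean forward/backward/optimizer time per step
    wait_s: float             # mean time per step blocked on other ranks
    gpu_util: float           # mean GPU utilization, 0.0 to 1.0
    mem_mb: list[float]       # memory samples over the run, in MB
    rank_step_s: list[float]  # mean step time of each rank

    @property
    def step_s(self) -> float:
        # Total step time as the sum of the measured phases.
        return self.input_s + self.h2d_s + self.compute_s + self.wait_s


def rank_skew(rank_step_s: list[float]) -> float:
    """Spread between slowest and fastest rank, relative to the median."""
    return (max(rank_step_s) - min(rank_step_s)) / median(rank_step_s)


def choose_guide(fp: Fingerprint) -> str:
    # All thresholds are illustrative assumptions, not TraceML defaults.
    if fp.input_s / fp.step_s > 0.30:
        return "DataLoader stalls"
    if len(fp.rank_step_s) > 1 and rank_skew(fp.rank_step_s) > 0.10:
        return "DDP rank stragglers"
    if fp.mem_mb[-1] > 1.20 * fp.mem_mb[0]:
        return "memory creep"
    if fp.gpu_util < 0.60:
        return "low GPU utilization"
    # Nothing stands out in this run alone, so compare against a baseline.
    return "run-to-run regressions"


fp = Fingerprint(input_s=0.12, h2d_s=0.01, compute_s=0.15, wait_s=0.02,
                 gpu_util=0.55, mem_mb=[8000, 8100], rank_step_s=[0.30, 0.30])
print(choose_guide(fp))  # DataLoader stalls (input is about 40% of step time)
```

The checks run in a fixed order so that an input-bound run is reported as a DataLoader problem first, even though it also shows low GPU utilization; that ordering is a design choice of this sketch, not a documented TraceML rule.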

Open-source diagnostics for training performance.

TraceML helps you inspect slow training runs, compare summaries, and understand where time and memory go.
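
To make "compare summaries" concrete, here is a minimal sketch that diffs the phase timings of a baseline run and a candidate run and lists the phases that slowed down the most. The dictionary layout, the `compare_summaries` name, and the 5% tolerance are assumptions for illustration; they do not reflect TraceML's summary format or CLI.

```python
from __future__ import annotations


def compare_summaries(baseline: dict[str, float],
                      candidate: dict[str, float],
                      tolerance: float = 0.05) -> list[tuple[str, float]]:
    """Return (phase, relative change) for phases slower than tolerance.

    Both dicts map a phase name to mean seconds per step. Phases missing
    from the candidate, or with a non-positive baseline, are skipped.
    """
    regressions = []
    for phase, base in baseline.items():
        if phase not in candidate or base <= 0:
            continue
        change = (candidate[phase] - base) / base
        if change > tolerance:
            regressions.append((phase, change))
    # Largest relative slowdown first.
    return sorted(regressions, key=lambda item: item[1], reverse=True)


# Hypothetical numbers: input loading more than doubled, compute barely moved.
baseline = {"input": 0.020, "h2d": 0.005, "compute": 0.150}
candidate = {"input": 0.045, "h2d": 0.005, "compute": 0.153}
for phase, change in compare_summaries(baseline, candidate):
    print(f"{phase}: {change:+.0%}")  # input: +125%
```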