Always on, yet light
Diagnose training stalls before they burn GPU hours.
TraceOpt helps ML teams monitor performance, compare runs, and detect regressions across training workloads. Powered by TraceML, our open-source step-level diagnostics layer.
Why TraceOpt
TraceOpt extends open-source TraceML across the full training lifecycle. Here's what we're building next.
Regression finder for CI/CD
Flag a slow or broken change before it reaches a full run.
Live multi-node views
Live terminal and browser views across multi-node runs.
Compare runs across experiments
Diff each run's fingerprint to see what changed and why.
Continuous regression analysis
Catch performance and GPU regressions automatically.
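As a rough illustration of what a step-level regression check could look like, the sketch below compares per-phase timings from two runs and flags phases whose median time grew past a tolerance. It is not TraceOpt's or TraceML's implementation: the function name `find_regressions`, the dictionary layout, and the 10% default tolerance are all assumptions made for this example.

```python
from statistics import median


def find_regressions(baseline, candidate, tolerance=0.10):
    """Return phases whose median time grew by more than `tolerance`.

    `baseline` and `candidate` map a phase name to a list of per-step
    timings in milliseconds. This layout is hypothetical, not a TraceML format.
    """
    regressions = {}
    for phase, base_times in baseline.items():
        cand_times = candidate.get(phase)
        if not base_times or not cand_times:
            continue  # phase missing in one run: nothing to compare
        base_med = median(base_times)
        cand_med = median(cand_times)
        # Relative slowdown of the candidate against the baseline median.
        if base_med > 0 and (cand_med - base_med) / base_med > tolerance:
            regressions[phase] = (base_med, cand_med)
    return regressions


# Made-up timings: the backward pass is about 29% slower in the candidate run.
baseline = {"forward": [10.0, 11.0, 10.5], "backward": [20.0, 21.0, 20.5]}
candidate = {"forward": [10.2, 10.8, 10.6], "backward": [26.0, 27.0, 26.5]}

print(find_regressions(baseline, candidate))
```

Medians are used here rather than means so that a single slow step, such as a checkpoint write, does not trigger a false alarm. A CI job could fail the build whenever the returned dictionary is non-empty.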
Get started in three steps
From install to live diagnostics in minutes. Three lines of code, no config files. Read the docs →
Install TraceML
One command. Pure pip, no Docker, no account.
pip install traceml-ai
Wrap your training step
Add three lines: import, init, and one context manager around the step.
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        ...  # your existing step
Run and watch
Launch with one command. TraceML profiles every phase and saves a run summary you can compare later.
traceml run train.py --mode=summary
Frequently asked questions
What does TraceOpt offer today?
What does it actually detect?
Does it work for distributed / multi-GPU training?
How does it compare to deep-dive tools and experiment trackers?
What is the performance overhead?
Can I turn it off, and can it ever break my training?
--disable-traceml (or TRACEML_DISABLED=1) runs your script natively. And it's best-effort by design: instrumentation fails quietly rather than ever interrupting training.
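To make "best-effort" concrete, here is a minimal sketch of the general pattern: a context manager that times a step, lets training exceptions through untouched, and swallows any failure in its own bookkeeping. This is not TraceML's code. `best_effort_step`, `step_times`, and the `record` parameter are hypothetical names, and the environment variable check only mirrors the `TRACEML_DISABLED=1` switch described above.

```python
import os
import time
from contextlib import contextmanager

step_times = []  # hypothetical in-memory sink for step durations


def _disabled():
    # Mirrors the TRACEML_DISABLED=1 switch; the real lookup logic is not shown in the source.
    return os.environ.get("TRACEML_DISABLED") == "1"


@contextmanager
def best_effort_step(record=step_times.append):
    """Time one training step without ever interrupting it."""
    if _disabled():
        yield  # run the step natively, with no instrumentation
        return
    start = time.perf_counter()
    try:
        yield  # exceptions raised by the training step propagate unchanged
    finally:
        try:
            record(time.perf_counter() - start)
        except Exception:
            pass  # instrumentation failure is dropped quietly


def broken_recorder(_elapsed):
    raise RuntimeError("instrumentation bug")


# The recorder fails, but the step itself completes normally.
with best_effort_step(record=broken_recorder):
    loss = 0.5  # stand-in for a real training step

print(loss)
```

The key design choice is the asymmetry: errors from the wrapped step are never caught, while errors from the recorder are never raised.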
Where does my data go? Does it need an account or the cloud?
What if my team needs TensorFlow, JAX, or another stack?
Work with us
- you run PyTorch training on single-node multi-GPU systems today
- you run multi-node or Slurm-based training workflows
- you want to understand slow or unstable runs faster
- your team needs clearer diagnostics during and after training