Always on, yet light

Diagnose training stalls before they burn GPU hours.

TraceOpt helps ML teams monitor performance, compare runs, and detect regressions across training workloads. Powered by TraceML, our open-source step-level diagnostics layer.

SLOW TRAINING DETECTED · step 1,240
Live step view · last 100 steps
Median step 23.1 ms · Slowest worker 25.9 ms

Data loader 13.4 ms
Forward 4.4 ms
Backward 3.2 ms
Optimizer 1.8 ms

Worker behavior
Slow-worker gap +12.1% on worker 3

Memory
GPU memory 14.2 / 96 GB · peak 17.1 GB ↑

Stack
PyTorch · NVIDIA CUDA · Lightning · Hugging Face · Ray Train · Slurm · DeepSpeed · Megatron-LM

Why TraceOpt

TraceOpt extends open-source TraceML across the full training lifecycle. Here's what we're building next.

Before the run

Regression finder for CI/CD

Flag a slow or broken change before it reaches a full run.

While it runs

Live multi-node views

Live terminal and browser views across multi-node runs.

After the run

Compare runs across experiments

Diff each run's fingerprint to see what changed and why.

Over time

Continuous regression analysis

Catch performance and GPU regressions automatically.
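
The regression checks described above are roadmap items, so no shipped API exists to show. As a rough sketch of the underlying idea only, the snippet below compares per-phase median step times between a baseline run and a candidate run and flags any phase that slowed down by more than a tolerance. The function name `find_regressions`, the dictionary layout, and the 10% tolerance are all assumptions made for illustration; the baseline numbers are borrowed from the live step view, and the candidate numbers are made up.

```python
def find_regressions(baseline_ms, candidate_ms, tolerance=0.10):
    """Return {phase: relative_slowdown} for phases slower than the tolerance.

    Hypothetical helper: both arguments map a phase name to its median
    time in milliseconds. This is not a TraceML or TraceOpt API.
    """
    regressions = {}
    for phase, base in baseline_ms.items():
        new = candidate_ms.get(phase)
        if new is None or base <= 0:
            continue  # phase missing from the candidate, or unusable baseline
        change = (new - base) / base
        if change > tolerance:
            regressions[phase] = change
    return regressions


# Baseline taken from the live step view; candidate values are invented.
baseline = {"data_loader": 13.4, "forward": 4.4, "backward": 3.2, "optimizer": 1.8}
candidate = {"data_loader": 16.1, "forward": 4.5, "backward": 3.2, "optimizer": 1.8}

flagged = find_regressions(baseline, candidate)
for phase, change in flagged.items():
    print(f"{phase}: {change:+.1%} vs baseline")
```

In a CI job, a non-empty result could fail the build before the change reaches a full run. A real check would likely need to account for run-to-run noise as well.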

Missing a capability? Help shape the roadmap. Request a feature on GitHub →

Get started in three steps

From install to live diagnostics in minutes. Three lines of code, no config files. Read the docs →

1

Install TraceML

One command. Pure pip, no Docker, no account.

bash
pip install traceml-ai
~30 sec

2

Wrap your training step

Add three lines: import, init, and one context manager around the step.

python
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        ...  # your existing step
3 lines

3

Run and watch

Launch with one command. TraceML profiles every phase and saves a run summary you can compare later.

bash
traceml run train.py --mode=summary
run as usual
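
To make "profiles every phase" more concrete, here is a minimal pure-Python sketch of step-level phase timing. It is not TraceML's implementation: `PhaseTimer` is a hypothetical helper, and the `time.sleep` calls stand in for real data loading and forward passes.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import median


class PhaseTimer:
    """Hypothetical helper that records wall-clock time per named phase."""

    def __init__(self):
        # phase name -> list of durations in milliseconds
        self.samples = defaultdict(list)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples[name].append(elapsed_ms)

    def summary(self):
        """Median duration per phase, in milliseconds."""
        return {name: median(values) for name, values in self.samples.items()}


timer = PhaseTimer()
for _ in range(5):  # five fake training steps
    with timer.phase("data_loader"):
        time.sleep(0.002)  # stand-in for fetching a batch
    with timer.phase("forward"):
        time.sleep(0.001)  # stand-in for the forward pass

report = timer.summary()
for name, ms in report.items():
    print(f"{name}: {ms:.1f} ms")
```

On GPU workloads, kernels launch asynchronously, so a wall-clock timer like this one would generally need device synchronization or CUDA events to attribute time to the right phase. How TraceML handles that is not covered here.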

Frequently asked questions

What does TraceOpt offer today?
Open-source TraceML. It auto-instruments PyTorch training (data loading, forward, backward, optimizer), shows live in your terminal where time goes, and produces a compact end-of-run summary. It runs on PyTorch / NVIDIA CUDA with drop-in integrations for PyTorch Lightning, Hugging Face, and Ray Train, and usually takes about three lines to add. Cross-run regression detection is on the roadmap.
What does it actually detect?
Input stragglers, compute stragglers, memory imbalance across ranks, wait-heavy steps, memory creep, and GPU under-utilization, all per rank, with a plain-language diagnosis.
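As a rough illustration of two of these signals, the sketch below computes a slow-worker gap and a memory-creep slope from per-rank measurements. The page does not say how TraceML defines either metric. The gap formula here (slowest rank relative to the median rank) is an assumption, though it does reproduce the live view's +12.1% from a 23.1 ms median and a 25.9 ms slowest worker. The function names and the sample values other than those two are made up.

```python
from statistics import median


def slow_worker_gap(step_ms_by_rank):
    """Return (slowest_rank, gap), with gap relative to the median rank.

    Hypothetical definition: (slowest - median) / median.
    """
    mid = median(step_ms_by_rank.values())
    slowest_rank = max(step_ms_by_rank, key=step_ms_by_rank.get)
    gap = (step_ms_by_rank[slowest_rank] - mid) / mid
    return slowest_rank, gap


def memory_creep_gb_per_step(memory_gb):
    """Least-squares slope of memory usage against step index."""
    n = len(memory_gb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(memory_gb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(memory_gb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Illustrative per-rank step times in ms; only 23.1 and 25.9 come from the page.
step_ms = {0: 23.0, 1: 23.1, 2: 23.1, 3: 25.9, 4: 23.2}
rank, gap = slow_worker_gap(step_ms)
print(f"Slow-worker gap {gap:+.1%} on worker {rank}")

# Illustrative memory readings in GB over four consecutive steps.
slope = memory_creep_gb_per_step([14.0, 14.1, 14.2, 14.3])
print(f"Memory trend: {slope:+.2f} GB per step")
```

A steadily positive slope over many steps is one simple way to spot memory creep; a production detector would likely smooth the signal and ignore warm-up steps.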
Does it work for distributed / multi-GPU training?
Yes. Single-node multi-GPU DDP and FSDP today, plus multi-node DDP in summary mode. (Multi-node live views are on the roadmap.)
How does it compare to deep-dive tools and experiment trackers?
It complements both. PyTorch trace tools and Nsight Systems are heavyweight, kernel-level tools for deep dives. TraceML is the always-on first pass that shows you where to look, so you can point a deep-dive tool there. And it doesn't replace W&B, MLflow, or TensorBoard: it runs alongside your tracking stack, which keeps owning experiments and metrics while TraceML finds where training time and memory go.
What is the performance overhead?
TraceML is designed to stay lightweight. Benchmarking studies are coming soon. Actual overhead depends on workload and configuration.
Can I turn it off, and can it ever break my training?
Yes. --disable-traceml (or TRACEML_DISABLED=1) runs your script natively. And it's best-effort by design: instrumentation fails quietly rather than ever interrupting training.
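The best-effort idea can be sketched in a few lines of plain Python. This is an illustration of the pattern, not TraceML's code: `best_effort` and the hook arguments are hypothetical, and only the `TRACEML_DISABLED` variable name comes from the answer above. Instrumentation errors are swallowed, while errors raised by the training step itself still propagate.

```python
import os
from contextlib import contextmanager


def instrumentation_disabled():
    """True when the opt-out environment variable is set to 1."""
    return os.environ.get("TRACEML_DISABLED") == "1"


@contextmanager
def best_effort(start_hook, stop_hook):
    """Hypothetical wrapper: run the body even if the hooks raise."""
    started = False
    if not instrumentation_disabled():
        try:
            start_hook()
            started = True
        except Exception:
            pass  # instrumentation failed; carry on without it
    try:
        yield  # the training step runs here; its own errors propagate
    finally:
        if started:
            try:
                stop_hook()
            except Exception:
                pass  # never let teardown interrupt training


def broken_hook():
    raise RuntimeError("profiler failed")


steps_done = 0
with best_effort(broken_hook, broken_hook):
    steps_done += 1  # still executes despite the failing hook

print(f"steps completed: {steps_done}")
```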
Where does my data go? Does it need an account or the cloud?
Nowhere external. TraceML is local-first: it writes summaries to local files, needs no account, and works fully offline.
What if my team needs TensorFlow, JAX, or another stack?
If your team runs another stack and this problem matters to you, please reach out to us.

Work with us

  • you run PyTorch training on single-node multi-GPU systems today
  • you run multi-node or Slurm-based training workflows
  • you want to understand slow or unstable runs faster
  • your team needs clearer diagnostics during and after training