Diagnose training stalls before they burn GPU hours.
TraceOpt (coming soon) is a training-performance platform for monitoring slow runs, comparing regressions, and scaling visibility across training workloads.
Powered by TraceML, our open-source step-level diagnostics layer.
Why TraceOpt
Training slowdowns are hard to compare across runs, teams, and infrastructure. TraceOpt is being built to turn TraceML diagnostics into a shared performance layer for training workloads.
Regression finder for CI/CD
Flag a slow or broken change before it reaches a full run.
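To illustrate the idea, a CI gate could compare step times from a short smoke run against a stored baseline. The sketch below is hypothetical: the function name `is_regression`, the 10% tolerance, and the sample timings are assumptions made for illustration and are not part of TraceML or TraceOpt.

```python
from statistics import median


def is_regression(baseline_ms, candidate_ms, tolerance=0.10):
    """Return True when the candidate's median step time exceeds the
    baseline median by more than `tolerance` (a fraction, 0.10 = 10%)."""
    if not baseline_ms or not candidate_ms:
        raise ValueError("both runs need at least one step time")
    base = median(baseline_ms)
    cand = median(candidate_ms)
    return cand > base * (1.0 + tolerance)


# Hypothetical step times in milliseconds from two short smoke runs.
baseline = [101.0, 99.0, 100.0, 102.0, 98.0]
candidate = [120.0, 118.0, 125.0, 119.0, 121.0]

if is_regression(baseline, candidate):
    print("step time regressed beyond tolerance")
```

The median is used here rather than the mean so that a single slow step, such as a checkpoint write, does not trip the gate on its own.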
Live multi-node views
Live terminal and browser views across multi-node runs.
Compare runs across experiments
Diff each run's fingerprint to see what changed and why.
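As a rough sketch of what a fingerprint diff might look like, assume a fingerprint is a flat dictionary of settings and metrics for one run. That representation, the function `diff_fingerprints`, and the example keys are all assumptions for illustration; the actual fingerprint format is not described here.

```python
def diff_fingerprints(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs.

    A key present in only one run is reported with None for the missing
    side, so this sketch cannot tell "missing" apart from an explicit None.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical fingerprints for two runs of the same experiment.
run_a = {"batch_size": 64, "num_workers": 8, "precision": "bf16"}
run_b = {"batch_size": 64, "num_workers": 2, "precision": "bf16", "compile": True}

changes = diff_fingerprints(run_a, run_b)
for key, (before, after) in changes.items():
    print(f"{key}: {before} -> {after}")
```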
Continuous regression analysis
Catch performance and GPU regressions automatically.
Get started in three steps
From install to diagnostics in 3 lines of code.
Install TraceML
Install from PyPI. No Docker, account setup, or hosted backend required.
pip install traceml-ai
Wrap your training step
Add one import, one init call, and one context manager around your existing step.
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        ...  # your existing step
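For readers unfamiliar with the pattern, the sketch below shows how a context manager can time each step of a loop using only the Python standard library. It illustrates the general technique and is not TraceML's implementation; `timed_step` and `step_times_ms` are hypothetical names, and the loop body is a stand-in for a real training step.

```python
import time
from contextlib import contextmanager

step_times_ms = []  # hypothetical in-memory store, one entry per step


@contextmanager
def timed_step():
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the step raises, so failed steps are still recorded.
        step_times_ms.append((time.perf_counter() - start) * 1000.0)


for batch in range(3):  # stand-in for a dataloader
    with timed_step():
        sum(i * i for i in range(10_000))  # stand-in for forward/backward/optimizer

print(f"recorded {len(step_times_ms)} steps")
```

Plain wall-clock timing like this measures only the host side. GPU work is launched asynchronously, so without synchronization the recorded time may under-report the real duration of a step.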
Run TraceML
Launch with one command. TraceML profiles the training step and saves a run summary you can inspect or compare later.
traceml run train.py
Frequently asked questions
What do we offer today?
What bottlenecks do we detect?
Does it work for distributed and multi-GPU training?
What is the performance overhead?
Do I need an account or hosted service?
Does TraceOpt support TensorFlow, JAX, or other training stacks?
Work with us
We are looking for design partners and early collaborators as we build TraceOpt for training-performance diagnostics. We would especially like to hear from you if:
- you run PyTorch training on single-node multi-GPU, multi-node, or Slurm-based systems
- slow, unstable, or hard-to-debug training runs cost your team time or GPU hours
- you want clearer diagnostics across runs, machines, and experiments
- you want to help shape the TraceOpt platform with direct feedback from real training workloads