Diagnose training stalls before they burn GPU hours.

TraceOpt (coming soon) is a training-performance platform for monitoring slow runs, comparing runs to find regressions, and scaling visibility across training workloads.

Powered by TraceML, our open-source step-level diagnostics layer.

SLOW TRAINING DETECTED · step 1,240
Live step view · last 100 steps
Median step: 23.1 ms · Slowest worker: 25.9 ms

Data loader: 13.4 ms
Forward: 4.4 ms
Backward: 3.2 ms
Optimizer: 1.8 ms

Worker behavior
Slow-worker gap: +12.1% on worker 3

Memory
GPU memory: 14.2 / 96 GB · peak 17.1 GB ↑

Stack
PyTorch, NVIDIA CUDA, Lightning, Hugging Face, Ray Train, Slurm, DeepSpeed, Megatron-LM
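
The numbers in the sample view above fit together with simple arithmetic. The following is a minimal sketch, not TraceML code: it assumes the slow-worker gap is the slowest worker's step time relative to the median (the page does not state the exact formula), and the function names are hypothetical.

```python
def slow_worker_gap(median_ms: float, slowest_ms: float) -> float:
    """Relative gap of the slowest worker over the median step time, in percent."""
    if median_ms <= 0:
        raise ValueError("median_ms must be positive")
    return (slowest_ms - median_ms) / median_ms * 100.0


def phase_shares(phases_ms: dict[str, float]) -> dict[str, float]:
    """Fraction of the summed phase time spent in each phase."""
    total = sum(phases_ms.values())
    return {name: ms / total for name, ms in phases_ms.items()}


# Values taken from the sample view above.
gap = slow_worker_gap(median_ms=23.1, slowest_ms=25.9)
shares = phase_shares(
    {"data_loader": 13.4, "forward": 4.4, "backward": 3.2, "optimizer": 1.8}
)
print(f"slow-worker gap: +{gap:.1f}%")
print(f"data loader share: {shares['data_loader']:.0%}")
```

With the sample values this prints a gap of +12.1% and a data loader share of 59%, which is why the view points at input loading rather than the forward or backward pass.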

Why TraceOpt

Training slowdowns are hard to compare across runs, teams, and infrastructure. TraceOpt is being built to turn TraceML diagnostics into a shared performance layer for training workloads.

Before the run

Regression finder for CI/CD

Flag a slow or broken change before it reaches a full run.

While it runs

Live multi-node views

Live terminal and browser views across multi-node runs.

After the run

Compare runs across experiments

Diff each run's fingerprint to see what changed and why.

Over time

Continuous regression analysis

Catch performance and GPU regressions automatically.
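
To make the regression idea concrete, here is a minimal sketch of the kind of check a CI job could run. It is not the TraceOpt implementation: the function name, the 5% tolerance, and the sample timings are assumptions chosen for illustration.

```python
from statistics import median


def step_time_regression(baseline_ms, candidate_ms, tolerance=0.05):
    """Return (regressed, relative_change) comparing median step times.

    tolerance is the allowed relative slowdown (0.05 = 5%); the value is an
    arbitrary illustration, not a TraceOpt default.
    """
    base = median(baseline_ms)
    cand = median(candidate_ms)
    change = (cand - base) / base
    return change > tolerance, change


# Hypothetical per-step timings (ms) from a baseline run and a candidate change.
baseline = [23.0, 23.2, 23.1, 22.9, 23.4]
candidate = [25.8, 26.1, 25.9, 25.7, 26.3]

regressed, change = step_time_regression(baseline, candidate)
print(f"median step time change: {change:+.1%} (regressed={regressed})")
```

In a CI job, exiting with a non-zero status when `regressed` is true would fail the build before the change reaches a full run. Comparing medians rather than means keeps a single slow step from triggering a false alarm.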

Get started in three steps

From install to diagnostics in 3 lines of code.

1. Install TraceML

Install from PyPI. No Docker, account setup, or hosted backend required.

bash
pip install traceml-ai

~30 sec

2. Wrap your training step

Add one import, one init call, and one context manager around your existing step.

python
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        ...  # your existing step

3 lines

3. Run TraceML

Launch with one command. TraceML profiles the training step and saves a run summary you can inspect or compare later.

bash
traceml run train.py

run as usual

Frequently asked questions

What do we offer today?
TraceML is available today as the open-source diagnostics engine behind TraceOpt. It helps PyTorch teams see where training time goes, generate run summaries, and compare performance changes.

What bottlenecks do we detect?
TraceML detects common PyTorch training bottlenecks: slow input loading, low GPU utilization, rank stragglers, wait-heavy steps, memory imbalance, and memory creep.

Does it work for distributed and multi-GPU training?
Yes. TraceML supports single-node multi-GPU DDP and FSDP, plus multi-node DDP in summary mode.

What is the performance overhead?
TraceML is designed to be lightweight. Early internal runs show low overhead, but actual impact depends on workload, hardware, and configuration.

Do I need an account or hosted service?
TraceML runs locally by default. It writes diagnostics and summaries to your local run directory, requires no account, and does not depend on a hosted backend.

Does TraceOpt support TensorFlow, JAX, or other training stacks?
TraceOpt is focused on PyTorch training today through its open-source TraceML diagnostics engine. If your team needs similar diagnostics for TensorFlow, JAX, or another stack, contact us at support@traceopt.ai so we can understand the use case.
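
As an illustration of one bottleneck named in the FAQ above, memory creep can be thought of as a steady upward trend in memory use across steps. The sketch below is one simple way to flag such a trend and is not how TraceML detects it: the function names, the 1 MB-per-step threshold, and the sample values are all assumptions.

```python
def memory_slope_mb_per_step(samples_mb):
    """Least-squares slope of memory usage against step index."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mb) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_creep(samples_mb, min_slope_mb=1.0):
    """Flag a steady upward trend; the 1 MB/step threshold is illustrative."""
    return memory_slope_mb_per_step(samples_mb) >= min_slope_mb


# Hypothetical per-step memory readings in MB.
steady = [14200, 14210, 14195, 14205, 14198, 14202]
creeping = [14200, 14230, 14262, 14290, 14321, 14350]

print(looks_like_creep(steady))    # noise around a flat level
print(looks_like_creep(creeping))  # roughly 30 MB gained per step
```

Fitting a slope over a window of steps, rather than comparing the first and last reading, makes the check less sensitive to a single noisy sample.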

Work with us

We are looking for design partners and early collaborators as we build TraceOpt for training-performance diagnostics. We would like to hear from you if:

  • you run PyTorch training on single-node multi-GPU, multi-node, or Slurm-based systems
  • slow, unstable, or hard-to-debug training runs cost your team time or GPU hours
  • you want clearer diagnostics across runs, machines, and experiments
  • you want to help shape the TraceOpt platform with direct feedback from real training workloads