Diagnose training stalls before they burn GPU hours.

TraceOpt (coming soon) is a training-performance platform for monitoring slow runs, comparing regressions, and extending visibility across training workloads.


                    
SLOW TRAINING DETECTED · step 1,240

Live step view · last 100 steps

- Median step: 23.1 ms
- Slowest worker: 25.9 ms
- Data loader: 13.4 ms
- Forward: 4.4 ms
- Backward: 3.2 ms
- Optimizer: 1.8 ms

Worker behavior

- Slow-worker gap: +12.1% on worker 3

Memory

- GPU memory: 14.2 / 96 GB, peak 17.1 GB ↑

Stack

PyTorch · NVIDIA CUDA · Lightning · Hugging Face · Ray Train · Slurm · DeepSpeed · Megatron-LM

Why TraceOpt

Training slowdowns are hard to compare across runs, teams, and infrastructure. TraceOpt is being built to turn TraceML diagnostics into a shared performance layer for training workloads.

Before the run

Regression finder for CI/CD

Flag a slow or broken change before it reaches a full run.

While it runs

Live multi-node views

Live terminal and browser views across multi-node runs.

After the run

Compare runs across experiments

Diff each run's fingerprint to see what changed and why.

Over time

Continuous regression analysis

Catch performance and GPU regressions automatically.
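As a rough picture of what a regression check between two runs might look like, the sketch below compares per-phase median timings from a baseline and a candidate run. The summary layout (a dict of phase name to median milliseconds), the function name `find_regressions`, the sample numbers, and the 10% tolerance are all hypothetical; this page does not describe TraceML's actual run-summary format.

```python
def find_regressions(baseline, candidate, tolerance=0.10):
    """Return phases whose median time grew by more than `tolerance`."""
    regressions = {}
    for phase, base_ms in baseline.items():
        cand_ms = candidate.get(phase)
        if cand_ms is None or base_ms <= 0:
            continue  # phase missing or baseline unusable; skip it
        change = (cand_ms - base_ms) / base_ms
        if change > tolerance:
            regressions[phase] = change
    return regressions


# Made-up numbers for illustration only.
baseline = {"data_loader": 5.0, "forward": 4.4, "backward": 3.2, "optimizer": 1.8}
candidate = {"data_loader": 13.4, "forward": 4.5, "backward": 3.2, "optimizer": 1.8}

regressions = find_regressions(baseline, candidate)
for phase, change in regressions.items():
    print(f"{phase}: {change:+.0%}")

# In a CI job, a non-empty result would typically fail the build.
exit_code = 1 if regressions else 0
```

With these sample values only the data loader exceeds the tolerance, so a CI gate built this way would block the change before a full run.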

Get started in three steps

From install to diagnostics in 3 lines of code.

1. Install TraceML

Install from PyPI. No Docker, account setup, or hosted backend required.

```bash
pip install traceml-ai
```

~30 sec
                

2. Wrap your training step

Add one import, one init call, and one context manager around your existing step.

```python
import traceml_ai as traceml

traceml.init(mode="auto")

for batch in dataloader:
    with traceml.trace_step(model):
        ...  # your existing step
```

3 lines
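To make the idea of wrapping a step concrete, here is a generic timing context manager in plain Python. It is not TraceML's implementation or API: `StepTimer`, `phase`, `step`, and the `time.sleep` placeholders are hypothetical names used only to show how a per-step wrapper can attribute wall-clock time to phases.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)  # phase name -> total seconds
        self.steps = 0

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the wrapped code raises.
            self.totals[name] += time.perf_counter() - start

    @contextmanager
    def step(self):
        with self.phase("step"):
            yield
        self.steps += 1  # count only steps that completed

    def mean_ms(self, name):
        return 1000.0 * self.totals[name] / max(self.steps, 1)


timer = StepTimer()
for _ in range(5):
    with timer.step():
        with timer.phase("data_loader"):
            time.sleep(0.002)  # placeholder for fetching a batch
        with timer.phase("forward"):
            time.sleep(0.001)  # placeholder for the forward pass

for name in ("step", "data_loader", "forward"):
    print(f"{name}: {timer.mean_ms(name):.2f} ms/step")
```

CUDA kernels launch asynchronously, so wall-clock timers around GPU work need synchronization (for example `torch.cuda.synchronize()` or CUDA events) to be meaningful. This sketch ignores that and only times CPU-side placeholders.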
                

3. Run TraceML

Launch with one command. TraceML profiles the training step and saves a run summary you can inspect or compare later.

```bash
traceml run train.py
```

run as usual
                

Frequently asked questions

What do we offer today?

TraceML is available today as the open-source diagnostics engine behind TraceOpt. It helps PyTorch teams see where training time goes, generate run summaries, and compare performance changes.

What bottlenecks do we detect?

TraceML detects common PyTorch training bottlenecks: slow input loading, low GPU utilization, rank stragglers, wait-heavy steps, memory imbalance, and memory creep.
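To illustrate what two of these checks involve, the sketch below applies simple threshold rules to the sample timings from the dashboard at the top of this page (23.1 ms median step, 13.4 ms data loader, 25.9 ms slowest worker). The function names and both thresholds are assumptions chosen for the example, not TraceML's actual detection logic.

```python
def input_share(data_loader_ms, step_ms):
    """Fraction of the step spent in the data loader."""
    return data_loader_ms / step_ms


def straggler_gap(slowest_ms, median_ms):
    """Relative gap between the slowest worker and the median step."""
    return (slowest_ms - median_ms) / median_ms


# Sample values from the dashboard on this page.
median_step_ms = 23.1
slowest_worker_ms = 25.9
data_loader_ms = 13.4

# Hypothetical thresholds, chosen only for this example.
INPUT_BOUND_SHARE = 0.40
STRAGGLER_GAP = 0.10

share = input_share(data_loader_ms, median_step_ms)
gap = straggler_gap(slowest_worker_ms, median_step_ms)

findings = []
if share > INPUT_BOUND_SHARE:
    findings.append(f"slow input loading: {share:.0%} of step time")
if gap > STRAGGLER_GAP:
    findings.append(f"rank straggler: slowest worker is {gap:+.1%} vs median")

print("\n".join(findings))
# slow input loading: 58% of step time
# rank straggler: slowest worker is +12.1% vs median
```

Defining the gap as (slowest − median) / median reproduces the +12.1% shown on the dashboard. Whether TraceML computes it exactly this way is an assumption.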

Does it work for distributed and multi-GPU training?

Yes. TraceML supports single-node multi-GPU DDP and FSDP, plus multi-node DDP in summary mode.

What is the performance overhead?

TraceML is designed to be lightweight. Early internal runs show low overhead, but actual impact depends on workload, hardware, and configuration.

Do I need an account or hosted service?

TraceML runs locally by default. It writes diagnostics and summaries to your local run directory, requires no account, and does not depend on a hosted backend.

Does TraceOpt support TensorFlow, JAX, or other training stacks?

TraceOpt is focused on PyTorch training today through its open-source TraceML diagnostics engine. If your team needs similar diagnostics for TensorFlow, JAX, or another stack, contact us at support@traceopt.ai so we can understand the use case.

Work with us

We are looking for design partners and early collaborators as we build TraceOpt for training-performance diagnostics. You may be a good fit if:

- you run PyTorch training on single-node multi-GPU, multi-node, or Slurm-based systems
- slow, unstable, or hard-to-debug training runs cost your team time or GPU hours
- you want clearer diagnostics across runs, machines, and experiments
- you want to help shape the TraceOpt platform with direct feedback from real training workloads