Commit f773efe

feat: add experiment classification and plot configuration system
Signed-off-by: Ilana Nguyen <[email protected]>
1 parent 7318b4b commit f773efe

37 files changed, +4938 -373 lines changed

docs/tutorials/plot.md

Lines changed: 116 additions & 0 deletions
@@ -60,6 +60,9 @@ artifacts/sweep_qwen/ # Contains multiple runs
6060
└── Qwen3-0.6B-concurrency4/
6161
```
6262

63+
> [!TIP]
64+
> Use [Experiment Classification](#experiment-classification) to assign semantic colors (grey for baselines, green for treatments) and improve visual distinction in multi-run comparisons.
65+
6366
**Generated plots (3 default):**
6467
2. **TTFT vs Throughput** - Time to first token vs request throughput across concurrency levels
6568
4. **Token Throughput per GPU vs Latency** - GPU efficiency vs latency (when GPU telemetry available)
@@ -174,6 +177,116 @@ aiperf plot <sweep_directory>
174177
aiperf plot <sweep_directory> --output <custom_output_path>
175178
```
176179

180+
## Customizing Plot Grouping
181+
182+
Multi-run comparison plots draw one colored line (series) per group of runs. The field that defines a group is configurable in `~/.aiperf/plot_config.yaml`.
183+
184+
### Default Grouping Behavior
185+
186+
**Without experiment classification:**
187+
- **Default**: Each run gets its own line (groups by `run_name`)
188+
- **Customize**: Edit `groups:` in plot presets to group by other fields like `[model]`, `[experiment_group]`, or `[concurrency]`
189+
190+
**With experiment classification enabled:**
191+
- **Always**: Groups by `experiment_group` (directory name) with semantic baseline/treatment colors
192+
- This overrides any explicit `groups:` setting, so each treatment variant stays a separate line
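
As a rough illustration of the precedence described above, the following Python sketch resolves which fields a multi-run plot would group by. The helper name `effective_groups` and its parameters are hypothetical and are not taken from the aiperf source; only the rules it encodes come from this section.

```python
from typing import Optional, Sequence


def effective_groups(
    configured: Optional[Sequence[str]],
    classification_enabled: bool,
) -> list[str]:
    """Return the fields used to split runs into plot lines (hypothetical helper)."""
    if classification_enabled:
        # Classification overrides any explicit `groups:` setting.
        return ["experiment_group"]
    if configured:
        # Honor the preset's `groups:` value, e.g. [model] or [concurrency].
        return list(configured)
    # Default: one line per run.
    return ["run_name"]
```

Under these assumptions, `effective_groups(["model"], classification_enabled=True)` ignores the configured value and returns `["experiment_group"]`.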
193+
194+
### Example Customizations
195+
196+
**Group by model** (useful when comparing different models):
197+
```yaml
198+
multi_run_plots:
199+
  pareto_curve_throughput_per_gpu_vs_latency:
200+
    # ... other settings ...
201+
    groups: [model]
202+
```
203+
204+
**Group by directory** (useful for hierarchical experiment structures):
205+
```yaml
206+
multi_run_plots:
207+
  pareto_curve_throughput_per_gpu_vs_latency:
208+
    # ... other settings ...
209+
    groups: [experiment_group]
210+
```
211+
212+
**Group by run name** (default - each run is a separate line):
213+
```yaml
214+
multi_run_plots:
215+
  pareto_curve_throughput_per_gpu_vs_latency:
216+
    # ... other settings ...
217+
    groups: [run_name]
218+
```
219+
220+
> [!TIP]
221+
> Edit `~/.aiperf/plot_config.yaml` to customize grouping. See the file's CONFIGURATION GUIDE section for detailed examples and options.
222+
223+
## Experiment Classification
224+
225+
Classify runs as "baseline" or "treatment" for semantic color assignment in multi-run comparisons:
226+
- **Baselines**: Grey shades, listed first in legend
227+
- **Treatments**: NVIDIA green shades, listed after baselines
228+
- **Use case**: Clear visual distinction for A/B testing and performance comparisons
229+
230+
### Configuration
231+
232+
Edit `~/.aiperf/plot_config.yaml` (auto-created on first run):
233+
234+
```yaml
235+
experiment_classification:
236+
  baselines:
237+
    - "*baseline*" # Glob patterns
238+
    - "*_agg_*"
239+
  treatments:
240+
    - "*treatment*"
241+
    - "*_disagg_*"
242+
  default: treatment # Fallback when no match
243+
```
244+
245+
> [!IMPORTANT]
246+
> When experiment classification is enabled, **all multi-run plots automatically group by `experiment_group`** (directory name). This keeps each treatment variant as its own line while applying semantic baseline/treatment colors, and it overrides any explicit `groups:` settings in the config.
247+
248+
**Pattern notes**: Patterns use glob syntax (`*` matches any sequence of characters), are case-sensitive, and the first matching pattern wins.
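
The matching rules can be sketched in Python with the standard library's `fnmatch.fnmatchcase`, which performs case-sensitive glob matching. This is an illustration only, not the aiperf implementation: the `classify` helper is hypothetical, and checking baseline patterns before treatment patterns is an assumption about what "first match wins" means across the two lists.

```python
from fnmatch import fnmatchcase

# Patterns copied from the example config in this section.
BASELINES = ["*baseline*", "*_agg_*"]
TREATMENTS = ["*treatment*", "*_disagg_*"]
DEFAULT = "treatment"  # Fallback when no pattern matches


def classify(name: str) -> str:
    """Return "baseline" or "treatment" for a directory name (hypothetical helper)."""
    # Assumption: baseline patterns are tried before treatment patterns.
    for label, patterns in (("baseline", BASELINES), ("treatment", TREATMENTS)):
        # fnmatchcase never normalizes case, so "Baseline" does not match "*baseline*".
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return label
    return DEFAULT
```

With these patterns, a name such as `Baseline_run` matches nothing because of the capital `B`, so it falls through to the `default` value.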
249+
250+
### Quick Example
251+
252+
**Directory structure:**
253+
```
254+
artifacts/
255+
├── baseline_moderate_io_isl100_osl200_streaming/ # → Grey (baseline; ISL=100, OSL=200)
256+
│ ├── concurrency_1/
257+
│ ├── concurrency_2/
258+
│ ├── concurrency_4/
259+
│ └── ... (other concurrency runs)
260+
├── treatment_large_context_isl500_osl50_streaming/ # → Green (treatment; ISL=500, OSL=50)
261+
│ ├── concurrency_1/
262+
│ ├── concurrency_2/
263+
│ ├── concurrency_4/
264+
│ └── ...
265+
├── treatment_long_generation_isl50_osl500_streaming/ # → Green (treatment; ISL=50, OSL=500)
266+
│ ├── concurrency_1/
267+
│ ├── concurrency_2/
268+
│ ├── concurrency_4/
269+
│ └── ...
270+
└── treatment_cancellation_10pct_isl100_osl200_streaming/ # → Green (treatment; ISL=100, OSL=200, 10% cancels)
271+
├── concurrency_1/
272+
├── concurrency_2/
273+
├── concurrency_4/
274+
└── ...
275+
```
276+
277+
**Result**: 4 lines in plots (1 baseline line + 3 treatment lines, each with semantic colors)
278+
279+
**Advanced**: Use `group_extraction_pattern` to aggregate variants into named groups:
280+
- Pattern `"^(treatment_\d+)"` groups `treatment_1_variantA` + `treatment_1_variantB` → `"treatment_1"`
281+
- See config file for `group_display_names` and other options
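
To show what the extraction pattern above does, here is a minimal Python sketch using the standard `re` module. The pattern and the example names come from this section. The `extract_group` helper and its fallback of returning the full name when nothing matches are assumptions made for illustration, not confirmed aiperf behavior.

```python
import re

# Pattern from the example above: capture "treatment_<digits>" at the start of the name.
GROUP_PATTERN = re.compile(r"^(treatment_\d+)")


def extract_group(name: str) -> str:
    """Map a directory name to its group name (hypothetical helper)."""
    match = GROUP_PATTERN.match(name)
    # Assumption: names that do not match keep their full name as the group.
    return match.group(1) if match else name


names = ["treatment_1_variantA", "treatment_1_variantB", "treatment_2_variantA"]
groups = sorted({extract_group(n) for n in names})
```

Here both `treatment_1_variantA` and `treatment_1_variantB` collapse into the single group `treatment_1`, so they would share one line in the plot.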
282+
283+
> [!TIP]
284+
> See `src/aiperf/plot/default_plot_config.yaml` for detailed options.
285+
286+
![Pareto Curve: Throughput per GPU vs Interactivity (Experiment Classification)](../diagrams/plot_examples/multi_run/config_experiment_classification/pareto_curve_throughput_per_gpu_vs_interactivity.png)
287+
288+
![TTFT vs Throughput (Experiment Classification)](../diagrams/plot_examples/multi_run/config_experiment_classification/ttft_vs_throughput.png)
289+
177290
## Theme Options
178291
179292
Choose between light and dark themes for your plots:
@@ -270,6 +383,9 @@ plots/
270383
> **Consistent Configurations**: When comparing runs, keep all parameters identical except the one you're testing (e.g., only vary concurrency). This ensures plots show the impact of that specific parameter.
271384
> Future features in interactive mode will allow pop-ups to show specific configurations of plotted runs.
272385
386+
> [!TIP]
387+
> **Use Experiment Classification**: For multi-run comparisons, configure [experiment classification](#experiment-classification) to distinguish baselines from treatments with semantic colors. This makes it easier to identify reference points and experimental variations.
388+
273389
> [!TIP]
274390
> **Include Warmup**: Use `--warmup-request-count` to ensure the server reaches steady state before measurement. This reduces noise in your visualizations.
275391
