
Conversation


@alexnorell commented Nov 18, 2025

Description

Compiles PyTorch 2.8.0, torchvision 0.23.0, and onnxruntime 1.20.0 from source with numpy 2.x support, enabling the modern ML ecosystem while achieving better performance and smaller image size than the wheel-based approach.

Key Results

Performance on Jetson AGX Orin: 65.7 FPS (RF-DETR with TensorRT) vs 62.2 FPS baseline, a 5.6% speedup.

Image Size: 6.74GB vs 8.28GB baseline, 1.54GB (18.6%) smaller.

Benefits

  • NumPy 2.x support - Compiled with numpy>=2.0.0, no longer constrained by outdated Jetson wheels
  • Better performance - 5.6% faster inference with Jetson-specific optimizations
  • Smaller image - 1.54GB smaller despite compiling from source
  • flash-attn 2.8.3 - Latest version with PyTorch built-in flash attention enabled
  • Optimized build - Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
  • ARM+CUDA optimized - USE_PRIORITIZED_TEXT_FOR_LD=1 linker optimization

Size Optimizations Implemented

  1. cuDNN/TensorRT symlink preservation: ~2GB saved (only copy versioned .so files, create symlinks; see the sketch after this list)
  2. Remove unnecessary packages: ~250MB saved (Jupyter, debugpy, examples, benchmarks, docs)
  3. Remove torch dev tools: ~119MB saved (torch/bin, torch/include)
  4. Remove test directories: ~60MB saved (onnx tests, scipy tests, pandas tests)
  5. Conservative cleanup: Preserve numpy.testing, torch.testing (public APIs depend on them)
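For illustration, the symlink-preservation step from item 1 amounts to something like the following shell fragment (library names and versions are representative for cuDNN 9.3 on aarch64; the exact paths live in the Dockerfile):

```bash
# Copy only the fully versioned libraries into the image, then recreate the
# symlink chain rather than materializing multiple copies of each large .so.
cd /usr/lib/aarch64-linux-gnu
for lib in libcudnn*.so.9.3.0; do
    base="${lib%.so.9.3.0}"
    ln -sf "$lib" "${base}.so.9"
    ln -sf "${base}.so.9" "${base}.so"
done
```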

PyTorch Build Optimizations

Comprehensive PyTorch build flags for a minimal, inference-only Jetson binary (a condensed sketch follows this list):

  • CUDA/cuDNN enabled, single GPU arch (8.7)
  • Disabled: MKLDNN, OpenMP, Distributed (GLOO, MPI, TensorPipe, NCCL), all quantization backends (QNNPACK, XNNPACK, NNPACK, FBGEMM), Kineto profiler, CUPTI, MPS, ROCm
  • Enabled: Flash Attention, Memory Efficient Attention
  • MAX_JOBS=12 to prevent OOM during linking (64GB RAM)
  • USE_PRIORITIZED_TEXT_FOR_LD=1 for ARM+CUDA linker optimization
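The sketch below shows a representative subset of that build environment; the authoritative flag list is in the Dockerfile.

```bash
export USE_CUDA=1 USE_CUDNN=1
export TORCH_CUDA_ARCH_LIST="8.7"                     # Jetson Orin only
export USE_DISTRIBUTED=0 USE_NCCL=0 USE_GLOO=0 USE_MPI=0 USE_TENSORPIPE=0
export USE_MKLDNN=0 USE_OPENMP=0
export USE_QNNPACK=0 USE_PYTORCH_QNNPACK=0 USE_XNNPACK=0 USE_NNPACK=0 USE_FBGEMM=0
export USE_KINETO=0 USE_MPS=0 USE_ROCM=0
export USE_FLASH_ATTENTION=1 USE_MEM_EFF_ATTENTION=1
export USE_PRIORITIZED_TEXT_FOR_LD=1                  # ARM+CUDA linker optimization
export MAX_JOBS=12                                    # keep the link step under 64GB RAM
python3 setup.py bdist_wheel
```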

onnxruntime Build Optimizations

  • TensorRT execution provider enabled (example build invocation sketched after this list)
  • --parallel 12 to prevent OOM
  • Disabled onnxruntime's flash attention (using TensorRT's optimized kernels instead)
  • CMake 3.31.10 with nsync compatibility patch
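The corresponding onnxruntime build invocation looks roughly like this (paths and the exact CMake defines are illustrative; see the Dockerfile for the real values):

```bash
./build.sh --config Release --build_wheel --skip_tests \
    --parallel 12 \
    --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
    --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
    --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87 onnxruntime_USE_FLASH_ATTENTION=OFF
```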

Tradeoffs

Cons:

  • Build time: ~1.5-2 hours (vs 40 min with pre-built wheels) - but mostly cached after first build
  • Complexity: More build flags to maintain
  • Depot cost: Longer builds on Depot runners

Pros outweigh cons:

  • Modern numpy 2.x ecosystem access
  • Better performance AND smaller size
  • Full control over optimizations

What's Compiled From Source

  1. PyTorch 2.8.0 - CUDA 12.6, cuDNN 9.3, Jetson Orin arch 8.7, numpy 2.x
  2. torchvision 0.23.0 - CUDA support, numpy 2.x
  3. onnxruntime 1.20.0 - TensorRT EP, CUDA 12.6, numpy 2.x
  4. flash-attn 2.8.3 - Latest version
  5. GDAL 3.11.5 - Same as #1718 (Optimize Jetson 6.2.0 Docker image with l4t-cuda base, 41.7% size reduction)

Type of Change

  • Performance improvement (5.6% faster inference)
  • Size optimization (18.6% smaller image)
  • Feature enablement (numpy 2.x support)
  • This change modifies the Jetson 6.2.0 Dockerfile

How Has This Been Tested?

Build: Successfully built on Depot ARM64 builder (~1.5 hrs with caching)
Runtime: Container runs successfully, all imports working, GPU acceleration active
Benchmark: RF-DETR 65.7 FPS with TensorRT verified on Jetson AGX Orin
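For anyone reproducing the runtime check, a quick smoke test inside the container looks like this (commands are illustrative, not the exact CI script):

```bash
python3 -c "import numpy; print(numpy.__version__)"                        # expect >= 2.0
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import torchvision; print(torchvision.__version__)"
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
# TensorrtExecutionProvider and CUDAExecutionProvider should appear in the provider list.
```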

Deployment Considerations

  • First run: 10-15 min for TensorRT engine compilation (cached thereafter)
  • Use --volume ~/.inference/cache:/tmp:rw to persist the TensorRT cache (see the run example after this list)
  • MAXN mode recommended for best performance
  • numpy>=2.0.0 support enabled
  • Build time: ~1.5-2 hours on Depot (mostly cached), ~6 hours full rebuild
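Putting those together, a representative launch on the device looks like this (the image name/tag is a placeholder, not the published one):

```bash
# MAXN power mode for best performance (AGX Orin: nvpmodel mode 0)
sudo nvpmodel -m 0 && sudo jetson_clocks
# Mount a host directory over /tmp so the TensorRT engine cache survives restarts
docker run -d --runtime nvidia --network host \
    --volume ~/.inference/cache:/tmp:rw \
    roboflow/inference-server:jetson-6.2.0    # placeholder image name
```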

Docs

N/A

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch 3 times, most recently from dd73a73 to 0bcc14f on November 18, 2025 22:32
@alexnorell changed the base branch from jetson-620-cuda-base-pr to main on November 19, 2025 22:50
@alexnorell (author) commented:

Image Size Breakdown: 6.74GB

@sberan Here's the complete breakdown of what makes up the 6.74GB final image:

Base Layer (2.09GB)

  • nvcr.io/nvidia/l4t-cuda:12.6.11-runtime base image

Added Layers (4.65GB total)

1. cuDNN Libraries: ~1GB (with symlinks properly preserved)

  • libcudnn_adv.so.9.3.0: 276M
  • libcudnn_engines_precompiled.so.9.3.0: 487M
  • libcudnn_heuristic.so.9.3.0: 52M
  • Other cuDNN components: ~185M
  • Note: Only actual .so.X.Y.Z files copied, symlinks created afterward (saves ~2GB)

2. Python Packages: 2.67GB (after cleanup)

  • torch: ~600M (source-compiled, optimized for Jetson)
  • onnxruntime-gpu: ~350M (TensorRT EP)
  • bitsandbytes: 325M
  • jaxlib: 278M (required by mediapipe)
  • flash-attn: ~200M (v2.8.3)
  • scipy: ~90M
  • transformers: 55M
  • mediapipe: 55M
  • pandas: 47M
  • OpenCV (3 variants): ~160M
  • Other packages: ~500M

3. TensorRT Libraries: 986MB

  • libnvinfer*.so: Main TensorRT runtime
  • libnvonnxparser*.so: ONNX parser
  • libnvparsers*.so: Additional parsers

4. Runtime APT Packages: 247MB

  • libvips42, libopenblas0, libproj22, libavcodec58, libavformat58, etc.

5. GDAL: ~100MB

  • Binaries (gdal*, ogr*, gnm*): ~5MB
  • Libraries (libgdal*): 95MB
  • Data files: ~3MB

6. Other: ~200MB

  • Application code: ~8MB
  • cupti/nvToolsExt: ~22MB
  • Python dist-info metadata: ~20MB
  • Other system libs: ~150MB
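For reference, the per-layer and per-package numbers above can be reproduced roughly like this (the dist-packages path assumes the Python 3.10 layout of the l4t base image; the image tag is a placeholder):

```bash
# Per-layer sizes of the built image
docker history roboflow/inference-server:jetson-6.2.0
# Per-package footprint inside the container
du -sh /usr/local/lib/python3.10/dist-packages/* 2>/dev/null | sort -rh | head -20
```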

Size Optimizations Applied

What we removed (~500MB saved; a condensed sketch of this cleanup pass follows these lists):

  • Test directories: scipy/pandas/onnx tests (~60MB)
  • torch/bin, torch/include dev tools (~119MB)
  • Jupyter/IPython/debugpy packages (~50MB)
  • examples/benchmarks/docs across packages (~50MB)
  • skimage/data test images (~7.5MB)
  • `__pycache__` directories
  • GDAL/cuDNN headers (~1MB)

What we preserved (required for functionality):

  • numpy.testing, torch.testing (public APIs depend on them)
  • .pyi stub files (lazy_loader/type checkers need them)

What we optimized:

  • cuDNN/TensorRT symlinks instead of duplicates: ~2GB saved
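A condensed sketch of that cleanup pass (globs are illustrative; the Dockerfile has the exact list, and numpy.testing / torch.testing are deliberately untouched):

```bash
SITE=/usr/local/lib/python3.10/dist-packages           # assumed Python 3.10 layout
rm -rf "$SITE"/torch/bin "$SITE"/torch/include          # dev tools not needed at runtime
rm -rf "$SITE"/scipy/*/tests "$SITE"/pandas/tests "$SITE"/onnx/test
pip uninstall -y jupyter ipython debugpy || true        # interactive tooling
find "$SITE" -type d -name '__pycache__' -prune -exec rm -rf {} +
# Keep numpy/testing, torch/testing (public APIs) and .pyi stubs (lazy_loader needs them).
```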

Comparison

Baseline (pre-built wheels): 8.28GB image, 62.2 FPS. This PR (source-compiled): 6.74GB image, 65.7 FPS. The source compilation approach produces a leaner, faster image despite compiling everything from scratch!

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from c9590c9 to 63626a5 on November 20, 2025 23:48
@alexnorell changed the title from "[WIP] Compile PyTorch/torchvision from source for numpy 2.x support" to "Jetpack 6.2 Support" on November 20, 2025
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 63626a5 to 502826c on November 21, 2025 00:00
…upport

- PyTorch 2.8.0 with Jetson Orin optimizations (arch 8.7, ARM+CUDA linker optimization)
- Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
- torchvision 0.23.0 with CUDA support
- onnxruntime 1.20.0 with TensorRT EP
- flash-attn 2.8.3 (latest version)

Performance: 65.7 FPS (vs 62.2 FPS baseline = 5.6% faster)
Image size: 6.74GB (vs 8.28GB baseline = 18.6% smaller)

Size optimizations:
- cuDNN/TensorRT symlink preservation: ~2GB saved
- Remove test directories, dev tools, examples: ~500MB saved
- Conservative cleanup preserving public APIs (numpy.testing, torch.testing)

TensorRT optimization:
- FP16 precision enabled
- Engine caching enabled with 2GB workspace
- Builder optimization level 3
- Aux streams optimized for memory efficiency
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 502826c to f26bf0a on November 21, 2025 00:42