
Conversation


@alexnorell commented Nov 18, 2025

Description

Compiles PyTorch 2.8.0, torchvision 0.23.0, and onnxruntime 1.20.0 from source with numpy 2.x support, enabling the modern ML ecosystem while achieving better performance and smaller image size than the wheel-based approach.

Key Results

Performance on Jetson AGX Orin: 65.7 FPS (RF-DETR with TensorRT) vs 62.2 FPS baseline, a 5.6% speedup.

Image Size: 6.74GB vs 8.28GB baseline, 1.54GB (18.6%) smaller.

Benefits

  • NumPy 2.x support - Compiled with numpy>=2.0.0, no longer constrained by outdated Jetson wheels
  • Better performance - 5.6% faster inference with Jetson-specific optimizations
  • Smaller image - 1.54GB smaller despite compiling from source
  • flash-attn 2.8.3 - Latest version with PyTorch built-in flash attention enabled
  • Optimized build - Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
  • ARM+CUDA optimized - USE_PRIORITIZED_TEXT_FOR_LD=1 linker optimization

Size Optimizations Implemented

  1. cuDNN/TensorRT symlink preservation: ~2GB saved (only copy versioned .so files, create symlinks; see the sketch after this list)
  2. Remove unnecessary packages: ~250MB saved (Jupyter, debugpy, examples, benchmarks, docs)
  3. Remove torch dev tools: ~119MB saved (torch/bin, torch/include)
  4. Remove test directories: ~60MB saved (onnx tests, scipy tests, pandas tests)
  5. Conservative cleanup: Preserve numpy.testing, torch.testing (public APIs depend on them)
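For illustration, the symlink-preservation step from item 1 amounts to something like the following shell fragment (library names and versions are representative for cuDNN 9.3 on aarch64; the exact paths live in the Dockerfile):

```bash
# Copy only the fully versioned libraries into the image, then recreate the
# symlink chain rather than materializing multiple copies of each large .so.
cd /usr/lib/aarch64-linux-gnu
for lib in libcudnn*.so.9.3.0; do
    base="${lib%.so.9.3.0}"
    ln -sf "$lib" "${base}.so.9"
    ln -sf "${base}.so.9" "${base}.so"
done
```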

PyTorch Build Optimizations

Comprehensive PyTorch build flags for a minimal, inference-only Jetson binary (a condensed sketch follows this list):

  • CUDA/cuDNN enabled, single GPU arch (8.7)
  • Disabled: MKLDNN, OpenMP, Distributed (GLOO, MPI, TensorPipe, NCCL), all quantization backends (QNNPACK, XNNPACK, NNPACK, FBGEMM), Kineto profiler, CUPTI, MPS, ROCm
  • Enabled: Flash Attention, Memory Efficient Attention
  • MAX_JOBS=12 to prevent OOM during linking (64GB RAM)
  • USE_PRIORITIZED_TEXT_FOR_LD=1 for ARM+CUDA linker optimization
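The sketch below shows a representative subset of that build environment; the authoritative flag list is in the Dockerfile.

```bash
export USE_CUDA=1 USE_CUDNN=1
export TORCH_CUDA_ARCH_LIST="8.7"                     # Jetson Orin only
export USE_DISTRIBUTED=0 USE_NCCL=0 USE_GLOO=0 USE_MPI=0 USE_TENSORPIPE=0
export USE_MKLDNN=0 USE_OPENMP=0
export USE_QNNPACK=0 USE_PYTORCH_QNNPACK=0 USE_XNNPACK=0 USE_NNPACK=0 USE_FBGEMM=0
export USE_KINETO=0 USE_MPS=0 USE_ROCM=0
export USE_FLASH_ATTENTION=1 USE_MEM_EFF_ATTENTION=1
export USE_PRIORITIZED_TEXT_FOR_LD=1                  # ARM+CUDA linker optimization
export MAX_JOBS=12                                    # keep the link step under 64GB RAM
python3 setup.py bdist_wheel
```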

onnxruntime Build Optimizations

  • TensorRT execution provider enabled (example build invocation sketched after this list)
  • --parallel 12 to prevent OOM
  • Disabled onnxruntime's flash attention (using TensorRT's optimized kernels instead)
  • CMake 3.31.10 with nsync compatibility patch
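The corresponding onnxruntime build invocation looks roughly like this (paths and the exact CMake defines are illustrative; see the Dockerfile for the real values):

```bash
./build.sh --config Release --build_wheel --skip_tests \
    --parallel 12 \
    --use_cuda --cuda_home /usr/local/cuda --cudnn_home /usr/lib/aarch64-linux-gnu \
    --use_tensorrt --tensorrt_home /usr/lib/aarch64-linux-gnu \
    --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES=87 onnxruntime_USE_FLASH_ATTENTION=OFF
```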

Tradeoffs

Cons:

  • Build time: ~1.5-2 hours (vs 40 min with pre-built wheels) - but mostly cached after first build
  • Complexity: More build flags to maintain
  • Depot cost: Longer builds on Depot runners

Pros outweigh cons:

  • Modern numpy 2.x ecosystem access
  • Better performance AND smaller size
  • Full control over optimizations

What's Compiled From Source

  1. PyTorch 2.8.0 - CUDA 12.6, cuDNN 9.3, Jetson Orin arch 8.7, numpy 2.x
  2. torchvision 0.23.0 - CUDA support, numpy 2.x
  3. onnxruntime 1.20.0 - TensorRT EP, CUDA 12.6, numpy 2.x
  4. flash-attn 2.8.3 - Latest version
  5. GDAL 3.11.5 - Same as #1718 (Optimize Jetson 6.2.0 Docker image with l4t-cuda base, 41.7% size reduction)

Type of Change

  • Performance improvement (5.6% faster inference)
  • Size optimization (18.6% smaller image)
  • Feature enablement (numpy 2.x support)
  • This change modifies the Jetson 6.2.0 Dockerfile

How Has This Been Tested?

Build: Successfully built on Depot ARM64 builder (~1.5 hrs with caching)
Runtime: Container runs successfully, all imports working, GPU acceleration active
Benchmark: RF-DETR 65.7 FPS with TensorRT verified on Jetson AGX Orin
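For anyone reproducing the runtime check, a quick smoke test inside the container looks like this (commands are illustrative, not the exact CI script):

```bash
python3 -c "import numpy; print(numpy.__version__)"                        # expect >= 2.0
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python3 -c "import torchvision; print(torchvision.__version__)"
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
# TensorrtExecutionProvider and CUDAExecutionProvider should appear in the provider list.
```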

Deployment Considerations

  • First run: 10-15 min for TensorRT engine compilation (cached thereafter)
  • Use --volume ~/.inference/cache:/tmp:rw to persist the TensorRT cache (see the run example after this list)
  • MAXN mode recommended for best performance
  • numpy>=2.0.0 support enabled
  • Build time: ~1.5-2 hours on Depot (mostly cached), ~6 hours full rebuild
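Putting those together, a representative launch on the device looks like this (the image name/tag is a placeholder, not the published one):

```bash
# MAXN power mode for best performance (AGX Orin: nvpmodel mode 0)
sudo nvpmodel -m 0 && sudo jetson_clocks
# Mount a host directory over /tmp so the TensorRT engine cache survives restarts
docker run -d --runtime nvidia --network host \
    --volume ~/.inference/cache:/tmp:rw \
    roboflow/inference-server:jetson-6.2.0    # placeholder image name
```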

Docs

N/A

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch 3 times, most recently from dd73a73 to 0bcc14f on November 18, 2025 22:32
@alexnorell changed the base branch from jetson-620-cuda-base-pr to main on November 19, 2025 22:50
@alexnorell (author) commented:

Image Size Breakdown: 6.74GB

@sberan Here's the complete breakdown of what makes up the 6.74GB final image:

Base Layer (2.09GB)

  • nvcr.io/nvidia/l4t-cuda:12.6.11-runtime base image

Added Layers (4.65GB total)

1. cuDNN Libraries: ~1GB (with symlinks properly preserved)

  • libcudnn_adv.so.9.3.0: 276M
  • libcudnn_engines_precompiled.so.9.3.0: 487M
  • libcudnn_heuristic.so.9.3.0: 52M
  • Other cuDNN components: ~185M
  • Note: Only actual .so.X.Y.Z files copied, symlinks created afterward (saves ~2GB)

2. Python Packages: 2.67GB (after cleanup)

  • torch: ~600M (source-compiled, optimized for Jetson)
  • onnxruntime-gpu: ~350M (TensorRT EP)
  • bitsandbytes: 325M
  • jaxlib: 278M (required by mediapipe)
  • flash-attn: ~200M (v2.8.3)
  • scipy: ~90M
  • transformers: 55M
  • mediapipe: 55M
  • pandas: 47M
  • OpenCV (3 variants): ~160M
  • Other packages: ~500M

3. TensorRT Libraries: 986MB

  • libnvinfer*.so: Main TensorRT runtime
  • libnvonnxparser*.so: ONNX parser
  • libnvparsers*.so: Additional parsers

4. Runtime APT Packages: 247MB

  • libvips42, libopenblas0, libproj22, libavcodec58, libavformat58, etc.

5. GDAL: ~100MB

  • Binaries (gdal*, ogr*, gnm*): ~5MB
  • Libraries (libgdal*): 95MB
  • Data files: ~3MB

6. Other: ~200MB

  • Application code: ~8MB
  • cupti/nvToolsExt: ~22MB
  • Python dist-info metadata: ~20MB
  • Other system libs: ~150MB
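For reference, the per-layer and per-package numbers above can be reproduced roughly like this (the dist-packages path assumes the Python 3.10 layout of the l4t base image; the image tag is a placeholder):

```bash
# Per-layer sizes of the built image
docker history roboflow/inference-server:jetson-6.2.0
# Per-package footprint inside the container
du -sh /usr/local/lib/python3.10/dist-packages/* 2>/dev/null | sort -rh | head -20
```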

Size Optimizations Applied

What we removed (~500MB saved; a condensed sketch of this cleanup pass follows these lists):

  • Test directories: scipy/pandas/onnx tests (~60MB)
  • torch/bin, torch/include dev tools (~119MB)
  • Jupyter/IPython/debugpy packages (~50MB)
  • examples/benchmarks/docs across packages (~50MB)
  • skimage/data test images (~7.5MB)
  • `__pycache__` directories
  • GDAL/cuDNN headers (~1MB)

What we preserved (required for functionality):

  • numpy.testing, torch.testing (public APIs depend on them)
  • .pyi stub files (lazy_loader/type checkers need them)

What we optimized:

  • cuDNN/TensorRT symlinks instead of duplicates: ~2GB saved
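A condensed sketch of that cleanup pass (globs are illustrative; the Dockerfile has the exact list, and numpy.testing / torch.testing are deliberately untouched):

```bash
SITE=/usr/local/lib/python3.10/dist-packages           # assumed Python 3.10 layout
rm -rf "$SITE"/torch/bin "$SITE"/torch/include          # dev tools not needed at runtime
rm -rf "$SITE"/scipy/*/tests "$SITE"/pandas/tests "$SITE"/onnx/test
pip uninstall -y jupyter ipython debugpy || true        # interactive tooling
find "$SITE" -type d -name '__pycache__' -prune -exec rm -rf {} +
# Keep numpy/testing, torch/testing (public APIs) and .pyi stubs (lazy_loader needs them).
```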

Comparison

Baseline (pre-built wheels): 8.28GB image, 62.2 FPS. This PR (source-compiled): 6.74GB image, 65.7 FPS. The source compilation approach produces a leaner, faster image despite compiling everything from scratch!

@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from c9590c9 to 63626a5 on November 20, 2025 23:48
@alexnorell changed the title from "[WIP] Compile PyTorch/torchvision from source for numpy 2.x support" to "Jetpack 6.2 Support" on November 20, 2025
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 63626a5 to 502826c on November 21, 2025 00:00
…upport

- PyTorch 2.8.0 with Jetson Orin optimizations (arch 8.7, ARM+CUDA linker optimization)
- Disabled unnecessary features (NCCL, QNNPACK, XNNPACK, FBGEMM, Kineto, etc.)
- torchvision 0.23.0 with CUDA support
- onnxruntime 1.20.0 with TensorRT EP
- flash-attn 2.8.3 (latest version)

Performance: 65.7 FPS (vs 62.2 FPS baseline = 5.6% faster)
Image size: 6.74GB (vs 8.28GB baseline = 18.6% smaller)

Size optimizations:
- cuDNN/TensorRT symlink preservation: ~2GB saved
- Remove test directories, dev tools, examples: ~500MB saved
- Conservative cleanup preserving public APIs (numpy.testing, torch.testing)

TensorRT optimization:
- FP16 precision enabled
- Engine caching enabled with 2GB workspace
- Builder optimization level 3
- Aux streams optimized for memory efficiency
@alexnorell force-pushed the jetson-620-compile-pytorch-from-source branch from 502826c to f26bf0a on November 21, 2025 00:42