diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/1-prerequisites.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/1-prerequisites.md
index 489255785..8e9b4eeb6 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/1-prerequisites.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/1-prerequisites.md
@@ -8,7 +8,7 @@ layout: learningpathall

## Host machine requirements

-This Learning Path demonstrates how to improve the performance of camera pipelines using KleidiAI and KleidiCV on Arm. You’ll need an Arm64 machine, preferably running an Ubuntu-based distribution. The instructions have been tested on Ubuntu 24.04.
+This Learning Path demonstrates how to improve the performance of camera pipelines using SME2 on Arm. You’ll need an Arm64 machine with SME2 support, preferably running an Ubuntu-based distribution. The instructions have been tested on Ubuntu 24.04.

## Install required software

diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/2-overview.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/2-overview.md
index 1ee5968d7..36bfd4956 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/2-overview.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/2-overview.md
@@ -6,13 +6,32 @@ weight: 4
layout: learningpathall
---

-## KleidiAI
+## About SME/SME2
+
+The Arm Scalable Matrix Extension (SME) is an architecture extension that provides enhanced support for matrix operations. SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to process matrices efficiently.
+
+Key features include:
+- Outer product of two vectors
+- Matrix tile storage
+- Load, store, insert, and extract of tile vectors, including on-the-fly transposition
+- Streaming SVE mode
+
+In more detail, SME defines:
+- A new architectural state capable of holding two-dimensional matrix tiles.
+- Streaming SVE mode, which supports execution of SVE2 instructions with a vector length that matches the tile width.
+- New instructions that accumulate the outer product of two vectors into a tile.
+- New load, store, and move instructions that transfer a vector to or from a tile row or column.
+- Like SVE2, SME is a scalable vector-length extension, enabling Vector Length Agnostic (VLA) programming, per-lane predication, and predicate-driven loop control and management.
+
+This makes SME and SME2 particularly well suited to image processing. In this Learning Path, you will use SME2 instructions through the KleidiAI and KleidiCV libraries:
+
+### KleidiAI

[KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) is an open-source library of optimized, performance-critical routines (micro-kernels) for AI workloads on Arm CPUs. These routines are tuned for specific Arm microarchitectures to maximize performance and are designed for straightforward integration into C/C++ ML and AI frameworks.

Several popular AI frameworks already take advantage of KleidiAI to improve performance on Arm platforms.

-## KleidiCV
+### KleidiCV

[KleidiCV](https://gitlab.arm.com/kleidi/kleidicv) is an open-source library that provides high-performance image-processing functions for AArch64. It is lightweight and simple to integrate, and computer-vision frameworks such as OpenCV can leverage KleidiCV to accelerate image processing on Arm devices.
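
Since the prerequisites above call for an Arm64 machine with SME2 support, it is worth verifying that the feature is actually exposed before building anything. Below is a minimal sketch, assuming an Arm64 Linux system whose kernel is recent enough (roughly Linux 6.3 onward) to report the `sme2` feature flag in `/proc/cpuinfo`:

```bash
# Check whether the kernel reports SME2 on this machine.
# Assumes Arm64 Linux; older kernels do not know the "sme2" flag and will
# report it missing even on SME2-capable hardware.
if grep -wq sme2 /proc/cpuinfo; then
    echo "SME2 reported by the kernel"
else
    echo "SME2 not reported (unsupported CPU, or kernel too old)"
fi
```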
diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md
index 8fa4f9aa9..7667febb6 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md
@@ -44,24 +44,25 @@ docker run --rm --volume $PWD:/home/cv-examples/example -it ai-camera-pipelines

Inside the container, run the following commands:

```bash
-ENABLE_SME2=0
-TENSORFLOW_GIT_TAG="v2.19.0"
+ENABLE_SME2=1
+TENSORFLOW_GIT_TAG="v2.20.0"
+
# Build flatbuffers
git clone https://github.com/google/flatbuffers.git
cd flatbuffers
git checkout v24.3.25
mkdir build
cd build
-cmake .. -DCMAKE_INSTALL_PREFIX=../install
-cmake --build . -j16
+cmake -GNinja -DCMAKE_INSTALL_PREFIX=../install ..
+cmake --build .
cmake --install .
cd ../..

# Build the pipelines
mkdir build
cd build
-cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DARMNN_TFLITE_PARSER=0 -DTENSORFLOW_GIT_TAG=$TENSORFLOW_GIT_TAG -DTFLITE_HOST_TOOLS_DIR=../flatbuffers/install/bin -DENABLE_SME2=$ENABLE_SME2 -DENABLE_KLEIDICV:BOOL=ON -DXNNPACK_ENABLE_KLEIDIAI:BOOL=ON -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake -S ../example -B .
-cmake --build . -j16
+cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DTENSORFLOW_GIT_TAG=$TENSORFLOW_GIT_TAG -DTFLITE_HOST_TOOLS_DIR=../flatbuffers/install/bin -DENABLE_SME2=$ENABLE_SME2 -DCMAKE_TOOLCHAIN_FILE=../example/toolchain.cmake -S ../example
+cmake --build .
cmake --install .

# Package and export the pipelines.
@@ -71,16 +72,9 @@ tar cfz example/install.tar.gz install
```

Leave the container by pressing `Ctrl+D`.

-## Notes on the CMake configuration options
-
-The `cmake` command-line options relevant to this learning path are:
-
-| Command-line option | Description |
-|-------------------------------------|----------------------------------------------------------------------------------------------|
-| `ENABLE_SME2=$ENABLE_SME2` | SME2 (Scalable Matrix Extension 2) is disabled in this build with `ENABLE_SME2=0`. |
-| `ARMNN_TFLITE_PARSER=0` | Configures the `ai-camera-pipelines` repository to use LiteRT with XNNPack instead of ArmNN. |
-| `ENABLE_KLEIDICV:BOOL=ON` | Enables KleidiCV for optimized image processing. |
-| `XNNPACK_ENABLE_KLEIDIAI:BOOL=ON` | Enables KleidiAI acceleration for LiteRT workloads via XNNPack. |
+{{% notice Note %}}
+In the `cmake` command line above, the pipelines are built with SME2 enabled (`-DENABLE_SME2=$ENABLE_SME2`). SME2 can also be disabled with the same option; you will use this later when benchmarking the performance improvements brought by SME2.
+{{% /notice %}}

## Install the pipelines

diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/4-run.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/4-run.md
index 6dfd39788..10fff7c39 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/4-run.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/4-run.md
@@ -16,16 +16,16 @@ In the previous section, you built the AI Camera Pipelines. In this section, you
cd $HOME/ai-camera-pipelines
python3 -m venv venv
. 
venv/bin/activate
-pip install -r ai-camera-pipelines.git/docker/python-requirements.txt
+pip install numpy opencv-python pillow torch
```

## Background blur

-Run the background Blur pipeline, using `resources/test_input.png` as the input image and write the transformed image to `test_output.png`:
+Run the Background Blur pipeline, using `resources/test_input.png` as the input image, and write the transformed image to `test_output_cinematic_mode.png`:

```bash
cd $HOME/ai-camera-pipelines
-bin/cinematic_mode resources/test_input.png test_output.png resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
+bin/cinematic_mode resources/test_input.png test_output_cinematic_mode.png resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
```

![example image alt-text#center](test_input2.webp "Input image")

## Low-Light Enhancement

-Run the Low-Light Enhancement pipeline, using `resources/test_input.png` as the input image and write the transformed image to `test_output2_lime.png`:
+Run the Low-Light Enhancement pipeline, using `resources/test_input.png` as the input image, and write the transformed image to `test_output_lime.png`:

```bash
cd $HOME/ai-camera-pipelines
-bin/low_light_image_enhancement resources/test_input.png test_output2_lime.png resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
+bin/low_light_image_enhancement resources/test_input.png test_output_lime.png resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
```

![example image alt-text#center](test_input2.webp "Input image")

diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/5-performances.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/5-performances.md
index d75ebf3e7..77f08d89e 100644
--- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/5-performances.md
+++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/5-performances.md
@@ -18,36 +18,38 @@ These benchmarks demonstrate the performance improvements enabled by KleidiCV an
- KleidiCV enhances OpenCV performance with computation kernels optimized for Arm processors.
- KleidiAI accelerates LiteRT+XNNPack inference using AI-optimized micro-kernels tailored for Arm CPUs.

-## Performance with KleidiCV and KleidiAI
+## Performance with SME2

-By default, the OpenCV library is built with KleidiCV support, and LiteRT+XNNPack is built with KleidiAI support.
-
-You can run the benchmarks using the applications you built earlier.
+You can run the benchmarks using the applications you built earlier (with SME2 support).

Run the Background Blur benchmark:

```bash
-bin/cinematic_mode_benchmark 20 resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
+bin/cinematic_mode_benchmark 20 --tflite_file resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
```

The output is similar to:

```output
+INFO: Frame rate throttling is turned OFF
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
-Total run time over 20 iterations: 2028.745 ms
+INFO: Total run time over 20 iterations: 614.724 ms
+INFO: Average FPS: 32.5349
```

Run the Low Light Enhancement benchmark:

```bash
-bin/low_light_image_enhancement_benchmark 20 resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
+bin/low_light_image_enhancement_benchmark 20 --tflite_file resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
```

The output is similar to:

```output
+INFO: Frame rate throttling is turned OFF
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
-Total run time over 20 iterations: 58.2126 ms
+INFO: Total run time over 20 iterations: 57.3958 ms
+INFO: Average FPS: 348.457
```

Last, run the Neural Denoising benchmark:

@@ -59,77 +61,36 @@ bin/neural_denoiser_temporal_benchmark_4K 20

The output is similar to:

```output
-Total run time over 20 iterations: 37.6839 ms
+INFO: Frame rate throttling is turned OFF
+INFO: Total run time over 20 iterations: 36.2114 ms
+INFO: Average FPS: 552.312
```

From these results, you can see that:

-- `cinematic_mode_benchmark` performed 20 iterations in 2028.745 ms
-- `low_light_image_enhancement_benchmark` performed 20 iterations in 58.2126 ms
-- `neural_denoiser_temporal_benchmark_4K` performed 20 iterations in 37.6839 ms
-
-## Benchmark results without KleidiCV and KleidiAI
-
-To measure the performance without these optimizations, recompile the pipelines using the following flags in your CMake command:
-```bash
--DENABLE_KLEIDICV:BOOL=OFF -DXNNPACK_ENABLE_KLEIDIAI:BOOL=OFF
-```
-
-Re-run the Background Blur benchmark:
+- `cinematic_mode_benchmark` performed 20 iterations in 614.724 ms, or 32.5349 FPS
+- `low_light_image_enhancement_benchmark` performed 20 iterations in 57.3958 ms, or 348.457 FPS
+- `neural_denoiser_temporal_benchmark_4K` performed 20 iterations in 36.2114 ms, or 552.312 FPS

-```bash
-bin/cinematic_mode_benchmark 20 resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
-```
+## Performance without SME2

-The new output is similar to:
+To measure performance without the benefits of SME2, set `ENABLE_SME2=0` and recompile the pipelines as shown on the [build](/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build/) page.

-```output
-INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
-Total run time over 20 iterations: 2030.5525 ms
-```
-
-Re-run the Low Light Enhancement benchmark:
-
-```bash
-bin/low_light_image_enhancement_benchmark 20 resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
-```
-
-The new output is similar to:
-
-```output
-INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
-Total run time over 20 iterations: 58.0613 ms
-```
-
-Re-run the Neural Denoising benchmark:
-
-```bash
-bin/neural_denoiser_temporal_benchmark_4K 20
-```
-
-The new output is similar to:
-
-```output
-Total run time over 20 iterations: 38.0813 ms
-```
+You can now re-run the benchmarks and compare the performance benefits of using SME2.
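
As a concrete sketch of that rebuild, assuming you are back inside the build container and build directory used on the build page, only the `ENABLE_SME2` value changes; every other option stays as before:

```bash
# Sketch: reconfigure and rebuild the pipelines with SME2 disabled.
# Same container, build directory, and options as on the build page;
# only the ENABLE_SME2 value differs.
ENABLE_SME2=0
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install \
      -DTENSORFLOW_GIT_TAG=$TENSORFLOW_GIT_TAG \
      -DTFLITE_HOST_TOOLS_DIR=../flatbuffers/install/bin \
      -DENABLE_SME2=$ENABLE_SME2 \
      -DCMAKE_TOOLCHAIN_FILE=../example/toolchain.cmake -S ../example
cmake --build .
cmake --install .
```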
-## Comparison table and future performance uplift with SME2
+## Example performance with a Vivo X300 Android phone

-| Benchmark | Without KleidiCV+KleidiAI | With KleidiCV+KleidiAI |
-|-------------------------------------------|---------------------------|------------------------|
-| `cinematic_mode_benchmark` | 2030.5525 ms | 2028.745 ms (-0.09%) |
-| `low_light_image_enhancement_benchmark` | 58.0613 ms | 58.2126 ms (0.26%) |
-| `neural_denoiser_temporal_benchmark_4K` | 38.0813 ms | 37.6839 ms (-1.04%) |
+The table below shows the frame rates, in frames per second (FPS), measured on a Vivo X300 Android phone:

-As shown, the Background Blur (`cinematic_mode_benchmark`) and Neural Denoising
-pipelines gain only a minor improvement, while the low-light enhancement pipeline
-sees a minor performance degradation (0.26%) when KleidiCV and KleidiAI are
-enabled.
+| Benchmark | Without SME2 | With SME2 | Uplift |
+|-----------------------------------------------------------|--------------|-----------|---------|
+| `cinematic_mode_benchmark` | 17 | 27 | +58.8% |
+| `low_light_image_enhancement_benchmark` | 51 | 84 | +64.7% |
+| `neural_denoiser_temporal_benchmark_4K` (temporal only) | 249 | 678 | +172.3% |
+| `neural_denoiser_temporal_benchmark_4K` (spatio-temporal) | - | 87 | |

-A major benefit of using KleidiCV and KleidiAI though is that they can
-automatically leverage new Arm architecture features - such as SME2 (Scalable
-Matrix Extension v2) - without requiring changes to your application code.
+{{% notice Note %}}
+The Android system enforces throttling, so your own results may vary slightly.
+{{% /notice %}}

-As KleidiCV and KleidiAI operate as performance abstraction layers, any future
-hardware instruction support can be utilized by simply rebuilding the
-application. This enables better performance on newer processors without
-additional engineering effort.
+As shown, SME2 brings a dramatic performance improvement. This extra compute headroom also makes it
+possible to run much more complex processing algorithms, such as spatio-temporal denoising.
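
For reference, the uplift column follows from the two FPS columns as `(with SME2 - without SME2) / without SME2`; a quick sanity check with `awk`:

```bash
# Reproduce the uplift column from the FPS figures in the table above.
awk 'BEGIN { printf "cinematic_mode:    +%.1f%%\n", (27 - 17) / 17 * 100 }'    # +58.8%
awk 'BEGIN { printf "low_light:         +%.1f%%\n", (84 - 51) / 51 * 100 }'    # +64.7%
awk 'BEGIN { printf "temporal denoiser: +%.1f%%\n", (678 - 249) / 249 * 100 }' # +172.3%
```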
diff --git a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md index b3d992d96..43afa62b2 100644 --- a/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md +++ b/content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md @@ -1,5 +1,5 @@ --- -title: Accelerate Denoising, Background Blur and Low-Light Camera Effects with KleidiAI and KleidiCV +title: Accelerate Denoising, Background Blur and Low-Light Camera Effects with SME2 minutes_to_complete: 30 @@ -7,7 +7,7 @@ who_is_this_for: This introductory topic is for mobile and computer-vision devel learning_objectives: - Build and run AI-powered camera pipeline applications - - Use KleidiCV and KleidiAI to improve the performance of real-time camera pipelines + - Use SME2 to improve the performance of real-time camera pipelines prerequisites: - A computer running Arm Linux or macOS with Docker installed @@ -47,6 +47,22 @@ further_reading: title: TensorFlow Lite is now LiteRT link: https://developers.googleblog.com/en/tensorflow-lite-is-now-litert/ type: blog + - resource: + title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture + type: website + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 1) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction + type: blog + - resource: + title: Arm Scalable Matrix Extension (SME) Introduction (Part 2) + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2 + type: blog + - resource: + title: (Part 3) Matrix-matrix multiplication. Neon, SVE, and SME compared + link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared + type: blog ### FIXED, DO NOT MODIFY # ================================================================================