Commit 337b972

[ai-camera-pipeline] Switch the pitch to SME2 instead of KleidiAI/KleidiCV.
1 parent 5d766e5 commit 337b972

6 files changed

+87
-97
lines changed

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/1-prerequisites.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ layout: learningpathall
88

99
## Host machine requirements
1010

11-
This Learning Path demonstrates how to improve the performance of camera pipelines using KleidiAI and KleidiCV on Arm. You’ll need an Arm64 machine, preferably running an Ubuntu-based distribution. The instructions have been tested on Ubuntu 24.04.
11+
This Learning Path demonstrates how to improve the performance of camera pipelines using SME2 on Arm. You’ll need an Arm64 machine with SME2 support, preferably running an Ubuntu-based distribution. The instructions have been tested on Ubuntu 24.04.
1212

1313
## Install required software
1414

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/2-overview.md

Lines changed: 21 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -6,13 +6,32 @@ weight: 4
66
layout: learningpathall
77
---
88

9-
## KleidiAI
9+
## About SME/SME2
10+
11+
Arm Scalable Matrix Extension (SME) is an architecture extension that provides enhanced support for matrix operations. SME builds on the Scalable Vector Extensions (SVE and SVE2), adding new capabilities to efficiently process matrices.
12+
13+
Key features include:
14+
- Outer product between two vectors
15+
- Matrix tile storage
16+
- Load, store, insert, and extract of tile vectors, including on-the-fly transposition
17+
- Streaming SVE mode
18+
19+
SME defines the following new features:
20+
- A new architectural state capable of holding two-dimensional matrix tiles.
21+
- Streaming SVE mode which supports execution of SVE2 instructions with a vector length that matches the tile width.
22+
- New instructions that accumulate the outer product of two vectors into a tile.
23+
- New load, store, and move instructions that transfer a vector to or from a tile row or column.
24+
- Like SVE2, SME is a scalable vector length extension that enables Vector Length Agnostic (VLA) programming, per-lane predication, predicate-driven loop control, and management features.
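
To make the outer-product feature concrete, here is a minimal sketch in plain Python. It is not SME code and is not taken from KleidiAI or KleidiCV; the helper names `outer_product_accumulate` and `matmul_outer_product` are hypothetical, and the "tile" is just a list of lists. It only illustrates the access pattern that accumulating outer products into a tile is meant to accelerate: a matrix product built up one outer product at a time.

```python
def outer_product_accumulate(tile, col, row):
    """Add the outer product of `col` and `row` into `tile`, in place."""
    for i, a in enumerate(col):
        for j, b in enumerate(row):
            tile[i][j] += a * b


def matmul_outer_product(a, b):
    """Compute a @ b as a sum of outer products (hypothetical helper).

    Step k takes column k of `a` and row k of `b`; their outer product is
    accumulated into the tile, which ends up holding the full product.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    tile = [[0] * cols for _ in range(rows)]  # stands in for a matrix tile
    for k in range(inner):
        col_k = [a[i][k] for i in range(rows)]
        outer_product_accumulate(tile, col_k, b[k])
    return tile


if __name__ == "__main__":
    print(matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # prints [[19, 22], [43, 50]]
```

Each iteration of the `k` loop touches every element of the accumulator once, which is why keeping the accumulator in dedicated two-dimensional tile storage is attractive for this kind of workload.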
25+
26+
This makes SME and SME2 particularly well suited to image processing. In this Learning Path, you will use SME2 instructions through the KleidiAI and KleidiCV libraries:
27+
28+
### KleidiAI
1029

1130
[KleidiAI](https://gitlab.arm.com/kleidi/kleidiai) is an open-source library of optimized, performance-critical routines (micro-kernels) for AI workloads on Arm CPUs. These routines are tuned for specific Arm microarchitectures to maximize performance and are designed for straightforward integration into C/C++ ML and AI frameworks.
1231

1332
Several popular AI frameworks already take advantage of KleidiAI to improve performance on Arm platforms.
1433

15-
## KleidiCV
34+
### KleidiCV
1635

1736
[KleidiCV](https://gitlab.arm.com/kleidi/kleidicv) is an open-source library that provides high-performance image-processing functions for AArch64. It is lightweight and simple to integrate, and computer-vision frameworks such as OpenCV can leverage KleidiCV to accelerate image processing on Arm devices.
1837

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build.md

Lines changed: 10 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -44,24 +44,25 @@ docker run --rm --volume $PWD:/home/cv-examples/example -it ai-camera-pipelines
4444
Inside the container, run the following commands:
4545

4646
```bash
47-
ENABLE_SME2=0
48-
TENSORFLOW_GIT_TAG="v2.19.0"
47+
ENABLE_SME2=1
48+
TENSORFLOW_GIT_TAG="v2.20.0"
49+
4950
# Build flatbuffers
5051
git clone https://github.com/google/flatbuffers.git
5152
cd flatbuffers
5253
git checkout v24.3.25
5354
mkdir build
5455
cd build
55-
cmake .. -DCMAKE_INSTALL_PREFIX=../install
56-
cmake --build . -j16
56+
cmake -GNinja -DCMAKE_INSTALL_PREFIX=../install ..
57+
cmake --build .
5758
cmake --install .
5859
cd ../..
5960

6061
# Build the pipelines
6162
mkdir build
6263
cd build
63-
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DARMNN_TFLITE_PARSER=0 -DTENSORFLOW_GIT_TAG=$TENSORFLOW_GIT_TAG -DTFLITE_HOST_TOOLS_DIR=../flatbuffers/install/bin -DENABLE_SME2=$ENABLE_SME2 -DENABLE_KLEIDICV:BOOL=ON -DXNNPACK_ENABLE_KLEIDIAI:BOOL=ON -DCMAKE_TOOLCHAIN_FILE=toolchain.cmake -S ../example -B .
64-
cmake --build . -j16
64+
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=../install -DTENSORFLOW_GIT_TAG=$TENSORFLOW_GIT_TAG -DTFLITE_HOST_TOOLS_DIR=../flatbuffers/install/bin -DENABLE_SME2=$ENABLE_SME2 -DCMAKE_TOOLCHAIN_FILE=../example/toolchain.cmake -S ../example
65+
cmake --build .
6566
cmake --install .
6667

6768
# Package and export the pipelines.
@@ -71,16 +72,9 @@ tar cfz example/install.tar.gz install
7172

7273
Leave the container by pressing `Ctrl+D`.
7374

74-
## Notes on the CMake configuration options
75-
76-
The `cmake` command-line options relevant to this learning path are:
77-
78-
| Command-line option | Description |
79-
|-------------------------------------|----------------------------------------------------------------------------------------------|
80-
| `ENABLE_SME2=$ENABLE_SME2` | SME2 (Scalable Matrix Extension 2) is disabled in this build with `ENABLE_SME2=0`. |
81-
| `ARMNN_TFLITE_PARSER=0` | Configures the `ai-camera-pipelines` repository to use LiteRT with XNNPack instead of ArmNN. |
82-
| `ENABLE_KLEIDICV:BOOL=ON` | Enables KleidiCV for optimized image processing. |
83-
| `XNNPACK_ENABLE_KLEIDIAI:BOOL=ON` | Enables KleidiAI acceleration for LiteRT workloads via XNNPack. |
75+
{{% notice Note %}}
76+
In the `cmake` command line above, the pipelines are built with SME2 enabled (`-DENABLE_SME2=$ENABLE_SME2`, with `ENABLE_SME2=1`). Setting `ENABLE_SME2=0` disables it. You will use this option later to benchmark the performance improvements brought by SME2.
77+
{{% /notice %}}
8478

8579
## Install the pipelines
8680

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/4-run.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -16,28 +16,28 @@ In the previous section, you built the AI Camera Pipelines. In this section, you
1616
cd $HOME/ai-camera-pipelines
1717
python3 -m venv venv
1818
. venv/bin/activate
19-
pip install -r ai-camera-pipelines.git/docker/python-requirements.txt
19+
pip install numpy opencv-python pillow torch
2020
```
2121

2222
## Background blur
2323

24-
Run the background Blur pipeline, using `resources/test_input.png` as the input image and write the transformed image to `test_output.png`:
24+
Run the Background Blur pipeline, using `resources/test_input.png` as the input image and writing the transformed image to `test_output_cinematic_mode.png`:
2525

2626
```bash
2727
cd $HOME/ai-camera-pipelines
28-
bin/cinematic_mode resources/test_input.png test_output.png resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
28+
bin/cinematic_mode resources/test_input.png test_output_cinematic_mode.png resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
2929
```
3030

3131
![example image alt-text#center](test_input2.webp "Input image")
3232
![example image alt-text#center](test_output2.webp "Image with blur applied")
3333

3434
## Low-Light Enhancement
3535

36-
Run the Low-Light Enhancement pipeline, using `resources/test_input.png` as the input image and write the transformed image to `test_output2_lime.png`:
36+
Run the Low-Light Enhancement pipeline, using `resources/test_input.png` as the input image and writing the transformed image to `test_output_lime.png`:
3737

3838
```bash
3939
cd $HOME/ai-camera-pipelines
40-
bin/low_light_image_enhancement resources/test_input.png test_output2_lime.png resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
40+
bin/low_light_image_enhancement resources/test_input.png test_output_lime.png resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
4141
```
4242

4343
![example image alt-text#center](test_input2.webp "Input image")

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/5-performances.md

Lines changed: 32 additions & 71 deletions
Original file line numberDiff line numberDiff line change
@@ -18,36 +18,38 @@ These benchmarks demonstrate the performance improvements enabled by KleidiCV an
1818
- KleidiCV enhances OpenCV performance with computation kernels optimized for Arm processors.
1919
- KleidiAI accelerates LiteRT+XNNPack inference using AI-optimized micro-kernels tailored for Arm CPUs.
2020

21-
## Performance with KleidiCV and KleidiAI
21+
## Performance with SME2
2222

23-
By default, the OpenCV library is built with KleidiCV support, and LiteRT+XNNPack is built with KleidiAI support.
24-
25-
You can run the benchmarks using the applications you built earlier.
23+
You can run the benchmarks using the applications you built earlier (with SME2 support).
2624

2725
Run the Background Blur benchmark:
2826

2927
```bash
30-
bin/cinematic_mode_benchmark 20 resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
28+
bin/cinematic_mode_benchmark 20 --tflite_file resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
3129
```
3230

3331
The output is similar to:
3432

3533
```output
34+
INFO: Frame rate throttling is turned OFF
3635
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
37-
Total run time over 20 iterations: 2028.745 ms
36+
INFO: Total run time over 20 iterations: 614.724 ms
37+
INFO: Average FPS: 32.5349
3838
```
3939

4040
Run the Low Light Enhancement benchmark:
4141

4242
```bash
43-
bin/low_light_image_enhancement_benchmark 20 resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
43+
bin/low_light_image_enhancement_benchmark 20 --tflite_file resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
4444
```
4545

4646
The output is similar to:
4747

4848
```output
49+
INFO: Frame rate throttling is turned OFF
4950
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
50-
Total run time over 20 iterations: 58.2126 ms
51+
INFO: Total run time over 20 iterations: 57.3958 ms
52+
INFO: Average FPS: 348.457
5153
```
5254

5355
Last, run the Neural Denoising benchmark:
@@ -59,77 +61,36 @@ bin/neural_denoiser_temporal_benchmark_4K 20
5961
The output is similar to:
6062

6163
```output
62-
Total run time over 20 iterations: 37.6839 ms
64+
INFO: Frame rate throttling is turned OFF
65+
INFO: Total run time over 20 iterations: 36.2114 ms
66+
INFO: Average FPS: 552.312
6367
```
6468

6569
From these results, you can see that:
66-
- `cinematic_mode_benchmark` performed 20 iterations in 2028.745 ms
67-
- `low_light_image_enhancement_benchmark` performed 20 iterations in 58.2126 ms
68-
- `neural_denoiser_temporal_benchmark_4K` performed 20 iterations in 37.6839 ms
69-
70-
## Benchmark results without KleidiCV and KleidiAI
71-
72-
To measure the performance without these optimizations, recompile the pipelines using the following flags in your CMake command:
73-
```bash
74-
-DENABLE_KLEIDICV:BOOL=OFF -DXNNPACK_ENABLE_KLEIDIAI:BOOL=OFF
75-
```
76-
77-
Re-run the Background Blur benchmark:
70+
- `cinematic_mode_benchmark` performed 20 iterations in 614.724 ms, which means 32.5349 FPS
71+
- `low_light_image_enhancement_benchmark` performed 20 iterations in 57.3958 ms, which means 348.457 FPS
72+
- `neural_denoiser_temporal_benchmark_4K` performed 20 iterations in 36.2114 ms, which means 552.312 FPS
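
The FPS figures follow directly from the iteration count and the total run time. As a quick cross-check, the sketch below recomputes them from the run times quoted above; `average_fps` is a hypothetical helper written for this illustration, not part of the benchmark binaries.

```python
def average_fps(iterations, total_ms):
    """Frames per second, given an iteration count and a total time in ms."""
    return iterations / (total_ms / 1000.0)


# Total run times (ms) over 20 iterations, copied from the outputs above.
runs = {
    "cinematic_mode_benchmark": 614.724,
    "low_light_image_enhancement_benchmark": 57.3958,
    "neural_denoiser_temporal_benchmark_4K": 36.2114,
}

for name, total_ms in runs.items():
    print(f"{name}: {average_fps(20, total_ms):.2f} FPS")
```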
7873

79-
```bash
80-
bin/cinematic_mode_benchmark 20 resources/depth_and_saliency_v3_2_assortedv2_w_augment_mobilenetv2_int8_only_ptq.tflite
81-
```
74+
## Performance without SME2
8275

83-
The new output is similar to:
76+
To measure performance without SME2, set `ENABLE_SME2=0` and recompile the pipelines as shown on the [build](/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/3-build/) page.
8477

85-
```output
86-
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
87-
Total run time over 20 iterations: 2030.5525 ms
88-
```
89-
90-
Re-run the Low Light Enhancement benchmark:
91-
92-
```bash
93-
bin/low_light_image_enhancement_benchmark 20 resources/HDRNetLIME_lr_coeffs_v1_1_0_mixed_low_light_perceptual_l1_loss_float32.tflite
94-
```
95-
96-
The new output is similar to:
97-
98-
```output
99-
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
100-
Total run time over 20 iterations: 58.0613 ms
101-
```
102-
103-
Re-run the Neural Denoising benchmark:
104-
105-
```bash
106-
bin/neural_denoiser_temporal_benchmark_4K 20
107-
```
108-
109-
The new output is similar to:
110-
111-
```output
112-
Total run time over 20 iterations: 38.0813 ms
113-
```
78+
You can now re-run the benchmarks and compare the performance benefits of using SME2.
11479

115-
## Comparison table and future performance uplift with SME2
80+
## Example performance with a Vivo X300 Android phone
11681

117-
| Benchmark | Without KleidiCV+KleidiAI | With KleidiCV+KleidiAI |
118-
|-------------------------------------------|---------------------------|------------------------|
119-
| `cinematic_mode_benchmark` | 2030.5525 ms | 2028.745 ms (-0.09%) |
120-
| `low_light_image_enhancement_benchmark` | 58.0613 ms | 58.2126 ms (0.26%) |
121-
| `neural_denoiser_temporal_benchmark_4K` | 38.0813 ms | 37.6839 ms (-1.04%) |
82+
The following table shows the frame rates (in FPS, frames per second) measured on a Vivo X300 Android phone:
12283

123-
As shown, the Background Blur (`cinematic_mode_benchmark`) and Neural Denoising
124-
pipelines gain only a minor improvement, while the low-light enhancement pipeline
125-
sees a minor performance degradation (0.26%) when KleidiCV and KleidiAI are
126-
enabled.
84+
| Benchmark | Without SME2 | With SME2 | Uplift |
85+
|-----------------------------------------------------------|--------------|-----------|---------|
86+
| `cinematic_mode_benchmark` | 17 | 27 | +58.8% |
87+
| `low_light_image_enhancement_benchmark` | 51 | 84 | +64.7% |
88+
| `neural_denoiser_temporal_benchmark_4K` (temporal only) | 249 | 678 | +172.3% |
89+
| `neural_denoiser_temporal_benchmark_4K` (spatio-temporal) | - | 87 | |
12790

128-
A major benefit of using KleidiCV and KleidiAI though is that they can
129-
automatically leverage new Arm architecture features - such as SME2 (Scalable
130-
Matrix Extension v2) - without requiring changes to your application code.
91+
{{% notice Note %}}
92+
The Android system enforces throttling, so your own results may vary slightly.
93+
{{% /notice %}}
13194

132-
As KleidiCV and KleidiAI operate as performance abstraction layers, any future
133-
hardware instruction support can be utilized by simply rebuilding the
134-
application. This enables better performance on newer processors without
135-
additional engineering effort.
95+
As shown, SME2 brings a substantial performance improvement, from +58.8% to +172.3% in these measurements. This extra headroom also makes it possible to use much more
96+
complex processing algorithms, such as spatio-temporal denoising.
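
The uplift column in the table can be reproduced from the two FPS columns. The following sketch assumes the uplift is defined as the relative FPS gain over the build without SME2; `uplift_percent` is a hypothetical helper used only for this illustration.

```python
def uplift_percent(baseline_fps, sme2_fps):
    """Relative FPS gain of the SME2 build over the baseline, in percent."""
    return (sme2_fps / baseline_fps - 1.0) * 100.0


# (FPS without SME2, FPS with SME2), copied from the table above.
measurements = {
    "cinematic_mode_benchmark": (17, 27),
    "low_light_image_enhancement_benchmark": (51, 84),
    "neural_denoiser_temporal_benchmark_4K (temporal only)": (249, 678),
}

for name, (without_sme2, with_sme2) in measurements.items():
    print(f"{name}: +{uplift_percent(without_sme2, with_sme2):.1f}%")
```

The spatio-temporal row is left out because the table gives no measurement without SME2 for it, so no uplift can be computed.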

content/learning-paths/mobile-graphics-and-gaming/ai-camera-pipelines/_index.md

Lines changed: 18 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,13 +1,13 @@
11
---
2-
title: Accelerate Denoising, Background Blur and Low-Light Camera Effects with KleidiAI and KleidiCV
2+
title: Accelerate Denoising, Background Blur and Low-Light Camera Effects with SME2
33

44
minutes_to_complete: 30
55

66
who_is_this_for: This introductory topic is for mobile and computer-vision developers, camera pipeline engineers, and performance-minded practitioners who want to optimize real-time camera effects on Arm using KleidiAI and KleidiCV.
77

88
learning_objectives:
99
- Build and run AI-powered camera pipeline applications
10-
- Use KleidiCV and KleidiAI to improve the performance of real-time camera pipelines
10+
- Use SME2 to improve the performance of real-time camera pipelines
1111

1212
prerequisites:
1313
- A computer running Arm Linux or macOS with Docker installed
@@ -47,6 +47,22 @@ further_reading:
4747
title: TensorFlow Lite is now LiteRT
4848
link: https://developers.googleblog.com/en/tensorflow-lite-is-now-litert/
4949
type: blog
50+
- resource:
51+
title: Introducing the Scalable Matrix Extension for the Armv9-A Architecture
52+
link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/scalable-matrix-extension-armv9-a-architecture
53+
type: website
54+
- resource:
55+
title: Arm Scalable Matrix Extension (SME) Introduction (Part 1)
56+
link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction
57+
type: blog
58+
- resource:
59+
title: Arm Scalable Matrix Extension (SME) Introduction (Part 2)
60+
link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-scalable-matrix-extension-introduction-p2
61+
type: blog
62+
- resource:
63+
title: (Part 3) Matrix-matrix multiplication. Neon, SVE, and SME compared
64+
link: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/matrix-matrix-multiplication-neon-sve-and-sme-compared
65+
type: blog
5066

5167
### FIXED, DO NOT MODIFY
5268
# ================================================================================
