This is the code artifact for the paper "GPUHammer: Rowhammer Attacks on GPU Memories are Practical", presented at USENIX Security 2025.
Authors: Chris S. Lin (University of Toronto), Joyce Qu (University of Toronto), Gururaj Saileshwar (University of Toronto).
Run-time Environment: We suggest using a Linux distribution compatible with g++-11 or newer.
Software Dependencies:
- Anaconda 24.9.2
- CMake 3.26.4+
- g++ with C++17 Support
- Python 3.10+
- NVIDIA CUDA Driver
- NVIDIA CUDA Toolkit
- NVIDIA System Management Interface (nvidia-smi)
Hardware Dependencies:
- NVIDIA GPU sm_80+
We evaluated our artifacts on the following reference system:
- OS: Ubuntu 20.04.6 LTS
- CPU: AMD Ryzen Threadripper PRO 5945WX 12-Cores
- GPU: NVIDIA RTX A6000 (48 GB GDDR6, sm_80)
- Driver: NVIDIA Driver 545.23.08 (includes nvidia-smi)
- CUDA Toolkit: 12.3
- Compiler: g++ 11.4.90 with C++17 support
Ensure you have already cloned the repository:
git clone https://github.com/sith-lab/gpuhammer.git
cd gpuhammer
For the Rowhammer attack, a prerequisite is having ECC disabled. This is already the default setting on many GPUs. If it is enabled, use the following commands to disable it:
sudo nvidia-smi -e 0
rmmod nvidia_drm
rmmod nvidia_modeset
sudo reboot
Our profiling is easier with persistence mode enabled and with fixed GPU and memory clock rates, although these are not prerequisites. The following script enables persistence mode and fixes the clock rates:
# Example usage:
# bash ./util/init_cuda.sh <MAX_GPU_CLOCK> <MAX_MEMORY_CLOCK>
bash ./util/init_cuda.sh 1800 7600
MAX_GPU_CLOCK and MAX_MEMORY_CLOCK can be found with deviceQuery from the CUDA samples. We provide this output for the A6000 in 'deviceQuery.txt'.
These changes can be undone with bash ./util/reset_cuda.sh.
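Before hammering, it can help to confirm the GPU state programmatically. The following is a minimal sketch and not part of the artifact: it assumes the nvidia-smi query fields `ecc.mode.current` and `persistence_mode`, assumes the GPU name contains no commas, and uses hypothetical function names.

```python
import subprocess

# Fields requested from nvidia-smi, in output order (assumed query names).
QUERY_FIELDS = ["name", "ecc.mode.current", "persistence_mode"]


def parse_gpu_status(csv_line):
    """Map one line of `--format=csv,noheader` output to a field dict."""
    values = [value.strip() for value in csv_line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"unexpected nvidia-smi output: {csv_line!r}")
    return dict(zip(QUERY_FIELDS, values))


def ecc_is_disabled(status):
    """True unless ECC is reported as enabled."""
    return status["ecc.mode.current"] != "Enabled"


def query_gpu_status(gpu_index=0):
    """Run nvidia-smi for one GPU and parse its first output line."""
    command = [
        "nvidia-smi",
        "-i", str(gpu_index),
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout.strip().splitlines()[0])
```

On a machine with the driver installed, `ecc_is_disabled(query_gpu_status())` is expected to return True once ECC has been turned off and the system rebooted.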
Our artifact requires the ImageNet 2012 Validation Dataset, which is available from the official ImageNet website. Please note that downloading requires a (free) ImageNet account — please register at https://www.image-net.org/download-images.php before proceeding.
We require the "Validation images (all tasks)" under Images when inside the ImageNet 2012 DataSet webpage. Please obtain the download link and download it to the repository root as follows:
# Make sure you are downloading the file into the repository root directory
cd gpuhammer
wget <download link>
The downloaded file's name should be ILSVRC2012_img_val.tar.
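A quick sanity check on the download can save a failed multi-day run. The sketch below only verifies that a file with the expected name exists in the repository root and is a readable tar archive; the function name is hypothetical and the artifact's own scripts may perform different checks.

```python
import tarfile
from pathlib import Path

# File name stated in this README.
ARCHIVE_NAME = "ILSVRC2012_img_val.tar"


def find_validation_archive(repo_root):
    """Return the archive path, or raise if it is missing or not a tar file."""
    archive = Path(repo_root) / ARCHIVE_NAME
    if not archive.is_file():
        raise FileNotFoundError(f"{ARCHIVE_NAME} not found in {repo_root}")
    if not tarfile.is_tarfile(str(archive)):
        raise ValueError(f"{archive} is not a valid tar archive")
    return archive
```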
Run the following commands to install dependencies, build GPUHammer, and execute experiments.
cd gpuhammer
# Make sure you set HAMMER_ROOT to the repository root directory
export HAMMER_ROOT=`pwd`
bash ./run_artifact.sh
This command will run the following steps:
- Install Prerequisites and Build (~15 mins)
- Run GPUHammer Experiments (~4 days, <1GB disk space)
# Generate Row Sets (takes ~1 day)
bash run_row_sets.sh
# One can also skip the above step if row sets were already generated by a previous run.
# Reverse Engineering Attack Primitives (takes <6 hours)
bash run_fig2.sh
bash run_fig5_6.sh
bash run_fig8.sh
bash run_fig10.sh
# Rowhammer Campaign with 24-sided hammering on 4 banks (takes ~1 day)
bash run_t1_t3.sh
# Bit Flip Characterization for our 8 bit-flips on 4 banks (takes <6 hours)
bash run_fig11.sh # ~2 hours
bash run_fig12.sh # ~4 hours
# Exploit using 4 bit flips on 5 ML models (takes ~1.5 days)
bash run_fig13_t4.sh
If a run script is supposed to generate a figure (Figures 2, 5, 6, 8, 10, 11, 12), it is stored in the respective folders: ./results/fig* .
The results of the campaign run_t1_t3.sh are stored in ./results/campaign and the respective tables (Table 1 and Table 3) are in ./results/campaign/t1.txt and ./results/campaign/t3.txt.
The results of the exploit (Figure 13 and Table 4) are in ./results/fig13_t4/fig13.pdf and ./results/fig13_t4/t4.txt.
NOTE: We additionally provide sample outputs of all experiments in the folder ./results/sample.
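After a long run, a small check can confirm that the table and figure files named above were produced. This sketch covers only the four paths this README states explicitly; the function name is hypothetical, and the per-figure folders under ./results/fig* are not checked because their file names are not listed here.

```python
from pathlib import Path

# Output files named explicitly in this README, relative to the repository root.
EXPECTED_OUTPUTS = [
    "results/campaign/t1.txt",
    "results/campaign/t3.txt",
    "results/fig13_t4/fig13.pdf",
    "results/fig13_t4/t4.txt",
]


def missing_outputs(repo_root):
    """Return the expected output files that do not exist yet."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]
```

An empty list means all four files exist; it says nothing about whether their contents match the paper.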
First the script will install RAPIDS RMM and Python dependencies:
bash ./run_setup.sh
If an error occurs during setup, run the following to clean up:
bash ./run_setup.sh clean
Next, the script activates the RMM development environment. When opening a new terminal, activate it manually:
conda init
source activate base
conda activate rmm_dev
Finally, the script builds GPUHammer:
cmake -S ./src -B ./src/out/build
cd ./src/out/build
make
Row-set generation is done automatically by ./run_rowsets.sh. It uses the script ./util/run_timing_task.py to generate the Row Set of a bank (unique addresses that map to different rows in the same bank). Example usage of this script is as follows:
# Display the tasks available for row-buffer conflict testing
python3 ./util/run_timing_task.py -h
# Display usage for a specific task
python3 ./util/run_timing_task.py <task> -h
To obtain the row sets, we have to (1) identify tRC, (2) generate a Conflict Set (addresses that have a row-buffer conflict with a given address), and (3) generate a Row Set (a set of addresses mapping to unique rows in a bank).
1. Identify tRC: Get a rough estimate of tRC (the time between two row activations) from the output of the gt task. By default, the result is stored in a text file, TIME_VALUE.txt, generated by the script. Inspecting the file should reveal the difference in latency between row-buffer hits and conflicts. For example, here, we observe tRC is ~43.
python3 ./util/run_timing_task.py gt
Example output:
...
328
336
327
338
360
360
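The hit and conflict latencies can also be separated programmatically rather than by eye. The sketch below assumes TIME_VALUE.txt holds one integer latency per line, as in the sample output, and splits the sorted values at their largest gap; the function names are hypothetical and this is not the method used by run_timing_task.py.

```python
def load_latencies(path):
    """Read one integer latency per line, skipping anything non-numeric."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip().isdigit()]


def split_hits_and_conflicts(latencies):
    """Split sorted latencies into (hits, conflicts) at the largest gap."""
    ordered = sorted(latencies)
    if len(ordered) < 2:
        raise ValueError("need at least two latency samples")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)
```

On the six sample values above, the split puts 327 through 338 in the lower cluster and both 360 readings in the upper one, so the cluster means differ by about 28 in the file's units.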
2. Generate Conflict Set: Obtain a Conflict Set from the output of the conf_set task.
A Conflict Set is a list of array offsets (addresses) that map to rows conflicting with the row of a fixed reference address. By default, the result is stored in a text file, CONF_SET.txt, generated by the script.
NOTE: Because of noise, pick a conflict threshold lower than the observed tRC for better accuracy. We observe on the A6000 that values around 25-30ns work well.
python3 ./util/run_timing_task.py conf_set \
--threshold 27 \
--step 256 \
--it 15
Example output:
852224
852736
868352
868864
895232
895744
911360
911872
3. Generate Row Set: Obtain the Row Set for the bank corresponding to step 2 with the output of row_set task.
A Row Set is a matrix of offsets, where each line lists the address offsets mapping to the same DRAM row. By picking one address from each line, we obtain addresses that map to unique rows in the DRAM bank. By default, the result is stored in a text file, ROW_SET.txt, generated by the script.
python3 ./util/run_timing_task.py row_set CONF_SET.txt \
--threshold 27 \
--it 15
Example output:
852224 852736 868352 ...
1153280 1153792 1169408 ...
...
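Picking one address per line of the Row Set can be sketched as follows. The parser assumes whitespace-separated decimal offsets, as in the sample output, and ignores any non-numeric tokens; both function names are hypothetical.

```python
def parse_row_set(text):
    """Parse Row Set text into a list of rows, each a list of integer offsets."""
    rows = []
    for line in text.splitlines():
        offsets = [int(token) for token in line.split() if token.isdigit()]
        if offsets:
            rows.append(offsets)
    return rows


def one_address_per_row(rows):
    """Pick the first offset of every row, giving one address per DRAM row."""
    return [row[0] for row in rows]
```

Note that the index into the returned list starts at 0, while a text editor shows the corresponding Row Set line starting at 1.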
(Optional) 4. Bank Set: Obtain a Bank Set from the output of the bank_set task. A Bank Set lists offsets that correspond to different banks in the GPU. We recommend using a large size and testing with a small max at first, as the program cannot determine the number of banks on its own. By default, the result is stored in a text file, BANK_SET.txt, generated by the script.
python3 ./util/run_timing_task.py bank_set \
--threshold 27 \
--max 5 \
--step 256 \
--it 15
Example output:
0
256
1024
1280
2048
This step obtains the exact delay to add so that the aggressor pattern is synchronized with DRAM REF commands. A helper script, run_delay.sh, is available in the util folder. We assume the row set files are stored at ./results/row_sets/ROW_SET_<bank_id>.txt and that the hammering results are dumped in ./src/log/. Please update the relevant variables in the script, specifically the bank_id and file locations. You can run the script with:
bash ./util/run_delay.sh
You can visualize the result by running:
# Example usage:
# python3 ./util/plot_delay.py <iterations> <trefi> <bank_offset> <num_aggressors>
python3 ./util/plot_delay.py 10000 1407 0 24
The first parameter is the number of iterations, which should match the iterations variable in the bash script. The second parameter is the tREFI period in nanoseconds, which is 1407 for the A6000. The third parameter is the bank offset. The fourth and subsequent parameters specify the numbers of aggressors, so you can plot multiple aggressor configurations on the same plot.
On the output plot, observe that there is a flat-lined area in the middle. This is where synchronization occurs, and the delays in that area are optimal.
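If you prefer to locate the flat region numerically, one possible approach is to look for the longest run of consecutive points whose values barely change. The sketch below is hypothetical throughout: `values` stands for whatever quantity plot_delay.py puts on the y-axis, `tolerance` is a noise bound you would choose yourself, and the code does not read the artifact's log format.

```python
def longest_flat_region(delays, values, tolerance):
    """Return (first_delay, last_delay) of the longest run in which
    consecutive values differ by at most `tolerance`."""
    if len(delays) != len(values) or not delays:
        raise ValueError("delays and values must be non-empty and equal in length")
    best_start, best_end = 0, 0
    start = 0
    for i in range(1, len(values)):
        if abs(values[i] - values[i - 1]) > tolerance:
            start = i  # the run is broken; begin a new one here
        if i - start > best_end - best_start:
            best_start, best_end = start, i
    return delays[best_start], delays[best_end]
```

A delay near the middle of the returned range would be a natural candidate, though the plot remains the reference for judging where synchronization actually occurs.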
The helper script run_campaign.py is available in the util folder. Again, we assume the row set files exist, so please update the relevant variables. In addition, fill in the delay configuration you obtained from the previous step in results/delay/delays.txt by adding a line in the form <bank_id>, <number of aggressors>, <delay_amount>. You can run the script with the code below to start a campaign on four banks (A, B, C, D) with 24-sided hammering.
python3 run_campaign.py --bank_ids A B C D --num_agg 24
The result of your hammering can be found at ./results/campaign/bank_<offset>/<#_of_aggressors>agg_<vic_pat><agg_pat>_log.txt. Any bit-flips will be reported in the form Bit-flip detected! in the log file.
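Campaign logs can be summarized by counting the lines that contain the marker quoted above. This sketch assumes the directory layout given in this README and that each bit-flip produces one line containing the marker; the function name is hypothetical.

```python
from pathlib import Path

# Marker text quoted in this README.
MARKER = "Bit-flip detected!"


def count_bit_flips(campaign_dir):
    """Map each campaign log (relative path) to its number of marker lines."""
    base = Path(campaign_dir)
    counts = {}
    for log in sorted(base.glob("bank_*/*_log.txt")):
        with open(log, errors="replace") as handle:
            hits = sum(1 for line in handle if MARKER in line)
        counts[log.relative_to(base).as_posix()] = hits
    return counts
```

For example, `count_bit_flips("./results/campaign")` returns a dictionary keyed by paths such as bank_<offset>/<#_of_aggressors>agg_<vic_pat><agg_pat>_log.txt.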
In our systematic study, we ran attacks with num_agg = 8, 12, 16, 20, 24 and skip_step = 3. We launched campaigns with both gridded victim data patterns (vic_pat=55, agg_pat=aa and vic_pat=aa, agg_pat=55). For each configuration, we used the following setups:
- 8-sided: 8 warps, 1 thread, 3 rounds
- 12-sided: 6 warps, 2 threads, 2 rounds
- 16-sided: 8 warps, 2 threads, 1 round
- 20-sided: 10 warps, 2 threads, 1 round
- 24-sided: 8 warps, 3 threads, 1 round
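The setups above can be kept as a small lookup table when scripting other configurations. In every listed setup the number of aggressors equals warps times threads; this relation is read off the numbers above rather than stated by the artifact, so treat the check below as an observation. The dictionary and function names are hypothetical.

```python
# Setups listed in this README, keyed by the number of aggressors (N-sided).
HAMMER_SETUPS = {
    8: {"warps": 8, "threads": 1, "rounds": 3},
    12: {"warps": 6, "threads": 2, "rounds": 2},
    16: {"warps": 8, "threads": 2, "rounds": 1},
    20: {"warps": 10, "threads": 2, "rounds": 1},
    24: {"warps": 8, "threads": 3, "rounds": 1},
}


def aggressors_covered(setup):
    """Aggressor count implied by a setup, assuming one aggressor per thread."""
    return setup["warps"] * setup["threads"]
```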
For our artifact evaluation, our scripts only perform 24-sided hammering due to time limitations, but the scripts can be modified to run campaigns for other configurations as well.
Reproducing bitflips from our A6000 can be done with ./util/run_known_flips.sh. You can find the bitflip results in the folder ./results/sample_bitflips.
bash ./util/run_known_flips.sh
By default, our scripts execute the exploits with the bit-flips we already discovered, but one can modify them to use new bit-flips discovered in the campaigns. Once you have found some 0->1 bit-flips that map to the most significant exponent bit in FP16 (i.e., bit location 6 in the second byte), you can run an exploit targeting these bit-flips in an ML model weight. We provide sample exploit scripts, run_hammer_manual_<bitflip>.sh in data_scripts/fig13_t4, that automate this process.
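To see why this particular bit matters, the effect of the flip can be reproduced on the CPU. The sketch below assumes little-endian FP16 storage, where bit 6 of the second byte is bit 14 of the value, the top exponent bit; the function name is hypothetical and no GPU memory is involved.

```python
import struct


def flip_exponent_msb(weight):
    """Apply a 0->1 flip to bit 6 of the second byte of a little-endian FP16 value."""
    raw = bytearray(struct.pack("<e", weight))
    if raw[1] & 0x40:
        raise ValueError("bit is already 1; a 0->1 flip does not apply")
    raw[1] |= 0x40
    return struct.unpack("<e", bytes(raw))[0]
```

For instance, `flip_exponent_msb(0.5)` returns 32768.0: setting the top exponent bit adds 16 to the exponent, which scales a normal FP16 value by 2^16.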
When a bit-flip is observed, as shown in the paper, it can only be triggered by aggressor rows on one side. Once you have observed a set of aggressors (As) that triggers the bit-flip in a victim (V), the pattern is usually of the form:
... A - - - A - V - A - - - A ...
You can find out which aggressor triggered it by shifting the pattern by changing min_rowid:
Left Aggressor: ... A - - - A - V
Right Aggressor: V - A - - - A ...
To exploit given a left side aggressor, we can create a script based on either the run_hammer_manual_B1.sh or run_hammer_manual_D1.sh. You will need to modify the following parameters based on information of your observed bitflip.
- aggressor_row: Row id of the aggressor row left of the victim.
- victim_row: Row id of the victim.
- victim_row_offset: Offset of the victim row in Row Set. (Note Row id starts at 0, but Row Set text file line number may read starting line 1)
- aggressor_row_offset: Offset of the aggressor row in Row Set. (Note Row id starts at 0, but Row Set text file line number may read starting line 1)
- store_dir: Location to store your exploit results.
bash ./data_scripts/fig13_t4/run_hammer_manual_<name>.sh
You may find the results of your exploits in your store_dir, listed in <model>.txt.
To exploit given a right side aggressor, we can create a script based on either the run_hammer_manual_B2.sh or run_hammer_manual_D3.sh. You will need to modify the following parameters based on information of your observed bitflip. The parameters are slightly different than what you have for left side:
- aggressor_row: Row id of the aggressor row right of the victim.
- victim_byte_offset: The exact byte offset of the bitflip.
- aggressor_row_offset: Offset of the aggressor row in Row Set. (Note Row id starts at 0, but Row Set text file line number may read starting line 1)
- store_dir: Location to store your exploit results.
bash ./data_scripts/fig13_t4/run_hammer_manual_<name>.sh
You may find the results of your exploits in your store_dir, listed in <model>.txt.
The scripts in ./data_scripts/fig13_t4 will run the exploit on ImageNet models with specific bitflips we found in our A6000 GPU (B1, B2, D1, and D3). The model accuracy will be recorded in ./results/fig13_t4/<bitflip>.
bash ./data_scripts/fig13_t4/run_hammer_manual_<bitflip>.sh
This entire exploit with 4 bit-flips should take less than 1 day.
- src/: Source code of GPUHammer, including CUDA and C++ implementations.
- util/: Utility scripts for GPU setup, delay tuning, and campaign orchestration.
- data_scripts/fig*/: Scripts for obtaining raw data for each figure.
- plot_scripts: Scripts for generating plots from experiment results.
- results/fig*/: Output directory for experiment results, figures, logs, and sample outputs.
- run_fig*.sh scripts in the repository root: Entry-point scripts for running specific experiments and reproducing figures.
The clock rate output by the console is always the maximum clock rate. Set the lgc value in init_cuda to >= maximum clock rate and the GPU will adjust to maximum clock rate.
The most stable compiler setting is the following, which disables all GPU kernel optimizations. -Xcicc controls the frontend pipeline and -Xptxas controls the backend pipeline (hence PTX).
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-line-info -O3 -Xcicc -O0 -Xptxas -O0")
The following setting gives the lowest latency without optimizing away our memory accesses. -Xptxas gives the largest performance increase, since we mostly use PTX and register allocation is our biggest concern. Raising either one to O3 on its own will not optimize the accesses away, but if the other is also raised to O2 or higher, they are removed. The two stages presumably cooperate in some way, but this is not a major concern for us.
set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-line-info -O3 -Xcicc -O1 -Xptxas -O3")