
GPU-RowHammer (34th USENIX Security Symposium)

Introduction

This is the code artifact for the paper "GPUHammer: Rowhammer Attacks on GPU Memories are Practical", presented at USENIX Security 2025.

Authors: Chris S. Lin (University of Toronto), Joyce Qu (University of Toronto), Gururaj Saileshwar (University of Toronto).

Required Environment

Run-time Environment: We suggest using a Linux distribution compatible with g++-11 or newer.

  • Software Dependencies:

    • Anaconda 24.9.2
    • CMake 3.26.4+
    • g++ with C++17 Support
    • Python 3.10+
    • NVIDIA CUDA Driver
    • NVIDIA CUDA Toolkit
    • NVIDIA System Management Interface nvidia-smi
  • Hardware Dependencies:

    • NVIDIA GPU sm_80+

Reference Environment

We evaluated our artifacts on the following reference system:

  • OS: Ubuntu 20.04.6 LTS
  • CPU: AMD Ryzen Threadripper PRO 5945WX 12-Cores
  • GPU: NVIDIA RTX A6000 (48 GB GDDR6, sm_80)
  • Driver: NVIDIA Driver 545.23.08 (includes nvidia-smi)
  • CUDA Toolkit: 12.3
  • Compiler: g++ 11.4.90 with C++17 support

Affected GPUs

  • NVIDIA A6000 GPU with 48GB GDDR6

Steps for Artifact Evaluation

1. Clone the Repository

Ensure you have already cloned the repository:

git clone https://github.com/sith-lab/gpuhammer.git
cd gpuhammer

2. GPU Setup

For the Rowhammer attack, a prerequisite is having ECC disabled. This is already the default setting on many GPUs, but if ECC is enabled, use the following commands to disable it:

sudo nvidia-smi -e 0
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo reboot
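
After rebooting, you can verify that ECC is now disabled; the query below uses the standard nvidia-smi query interface (not part of this artifact):

nvidia-smi --query-gpu=ecc.mode.current --format=csv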

Our profiling is easier with persistence mode enabled and with fixed GPU and memory clock rates, although these are not prerequisites. The following script performs these actions:

# Example usage: 
#  bash ./util/init_cuda.sh <MAX_GPU_CLOCK> <MAX_MEMORY_CLOCK>
bash ./util/init_cuda.sh 1800 7600

MAX_GPU_CLOCK and MAX_MEMORY_CLOCK can be found with deviceQuery from the CUDA samples. We provide this output for the A6000 in deviceQuery.txt.

These changes can be undone with bash ./util/reset_cuda.sh.
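
For reference, here is a minimal sketch of what such an initialization script typically does, assuming the standard nvidia-smi flags for persistence mode (-pm) and clock locking (-lgc, -lmc); the authoritative logic lives in ./util/init_cuda.sh:

# Hypothetical sketch of ./util/init_cuda.sh <MAX_GPU_CLOCK> <MAX_MEMORY_CLOCK>
sudo nvidia-smi -pm 1        # enable persistence mode
sudo nvidia-smi -lgc "$1"    # lock GPU clock to MAX_GPU_CLOCK
sudo nvidia-smi -lmc "$2"    # lock memory clock to MAX_MEMORY_CLOCK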

3. Download ImageNet Validation Dataset

Our artifact requires the ImageNet 2012 Validation Dataset, which is available from the official ImageNet website. Please note that downloading requires a (free) ImageNet account; register at https://www.image-net.org/download-images.php before proceeding.

Download the "Validation images (all tasks)" file listed under Images on the ImageNet 2012 dataset page. Obtain the download link and download the file to the repository root as follows:

# Make sure you are downloading the file into the repository root directory
cd gpuhammer
wget <download link>

The downloaded file's name should be ILSVRC2012_img_val.tar.

4. Run the Artifact

Run the following commands to install dependencies, build GPUHammer, and execute experiments.

cd gpuhammer

# Make sure you set HAMMER_ROOT to the repository root directory
export HAMMER_ROOT=`pwd`
bash ./run_artifact.sh

This command will run the following steps:

  • Install Prerequisites and Build (~15 mins)

  • Run GPUHammer Experiments (~4 days, <1GB disk space)

    # Generate Row Sets (takes ~1 day)
    bash run_row_sets.sh
    # One can also skip the above step if row-sets were already generated by a previous run.
    
    # Reverse Engineering Attack Primitives (takes <6 hours)
    bash run_fig2.sh
    bash run_fig5_6.sh
    bash run_fig8.sh
    bash run_fig10.sh 
    
    # Rowhammer Campaign with 24-sided hammering on 4 banks (takes ~1 day)
    bash run_t1_t3.sh
    
    # Bit Flip Characterization for our 8 bit-flips on 4 banks (takes <6 hours)
    bash run_fig11.sh        # ~2 hours
    bash run_fig12.sh        # ~4 hours
    
    # Exploit using 4 bit flips on 5 ML models (takes ~1.5 day)
    bash run_fig13_t4.sh

Run scripts that generate a figure (Figures 2, 5, 6, 8, 10, 11, and 12) store it in the respective folder: ./results/fig*.

The results of the campaign run_t1_t3.sh are stored in ./results/campaign and the respective tables (Table 1 and Table 3) are in ./results/campaign/t1.txt and ./results/campaign/t3.txt.

The results of the exploit (Figure 13 and Table 4) are in ./results/fig13_t4/fig13.pdf and ./results/fig13_t4/t4.txt.

NOTE: We additionally provide sample outputs of all experiments in the folder ./results/sample.

Detailed Instructions for Building the Artifact

Step 1. Installing Prerequisites

First, the script installs RAPIDS RMM and the Python dependencies:

bash ./run_setup.sh

If an error occurs during setup, run the following to clean up: bash ./run_setup.sh clean

Next, activate the RMM development environment; this must be done manually whenever you open a new terminal:

conda init
source activate base
conda activate rmm_dev

Step 2. Build GPUHammer Artifact

cmake -S ./src -B ./src/out/build
cd ./src/out/build
make

Detailed Instructions for Running GPU Rowhammer Campaign & Exploit

Step 1: Obtain Row Addresses to Hammer (i.e., Row-Sets)

This step is automatically done by ./run_row_sets.sh. It uses the script ./util/run_timing_task.py to generate the Row Set of a bank (unique addresses that map to the different rows in the same bank). Example usage of this script is as follows:

# Display the available row-buffer-conflict tasks
python3 ./util/run_timing_task.py -h

# Display usage for a specific task
python3 ./util/run_timing_task.py <task> -h

To obtain the row-sets, we have to (1) identify tRC, (2) generate a Conflict Set (addresses that have a row-buffer conflict with a given address), and (3) generate a Row Set (a set of addresses mapping to unique rows in a bank).

1. Identify tRC: Get a rough idea of tRC (the time between two row activations) from the output of the gt task. After inspecting the file, you should be able to observe the difference in latency between row-buffer hits and conflicts. By default, the result is stored in a text file TIME_VALUE.txt generated by the script. For example, here we observe that tRC is ~43.

python3 ./util/run_timing_task.py gt 

Example output:

...
328
336
327
338
360
360
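
The hit/conflict gap is easier to spot from the latency distribution than from the raw stream. A quick, hypothetical way to histogram the samples, assuming TIME_VALUE.txt holds one latency per line:

# Count occurrences of each latency value, sorted by latency
sort -n TIME_VALUE.txt | uniq -c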

2. Generate Conflict Set: Obtain a Conflict Set with the output of the conf_set task. The conflict set is a list of array offsets (addresses) that map to rows conflicting with the row of a fixed reference address. By default, the result is stored in a text file CONF_SET.txt generated by the script.

NOTE: When selecting the conflict threshold, due to noise, pick a number lower than the observed tRC for better accuracy. We observe on the A6000 that values around 25-30ns work well.

python3 ./util/run_timing_task.py conf_set \
--threshold 27 \
--step 256 \
--it 15

Example output:

852224
852736
868352
868864
895232
895744
911360
911872

3. Generate Row Set: Obtain the Row Set for the bank corresponding to step 2 with the output of the row_set task.

The Row Set is a matrix of offsets, where each line lists the address offsets mapping to the same DRAM row. By picking one address from each line, we can map to unique rows in the DRAM bank. By default, the result is stored in a text file ROW_SET.txt generated by the script.

python3 ./util/run_timing_task.py row_set CONF_SET.txt \
--threshold 27 \
--it 15

Example output:

852224	852736	868352  ...
1153280	1153792	1169408  ...
...
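
Since each line of ROW_SET.txt corresponds to one DRAM row and the offsets are tab-separated, a hypothetical one-liner to pick one address per unique row is:

# Take the first offset on each line: one address per unique row
cut -f1 ROW_SET.txt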

(Optional) 4. Bank Set: Obtain a Bank Set with the output of the bank_set task. The Bank Set contains the offsets that correspond to different banks in the GPU. It is recommended to use a large size and to test with a small max at first, as this program cannot determine the number of banks on its own. By default, the result is stored in a text file BANK_SET.txt generated by the script.

python3 ./util/run_timing_task.py bank_set \
--threshold 27 \
--max 5 \
--step 256 \
--it 15

Example output:

0
256
1024
1280
2048

Step 2: Observe Correct Delay for Alignment

This step helps to obtain the exact delay to be added to synchronize the aggressor pattern with DRAM REF commands. A helper script, run_delay.sh, is available in the util folder. We assume the row set files are stored at ./results/row_sets/ROW_SET_<bank_id>.txt, and the hammering results are dumped in ./src/log/. Please update the relevant variables in the script, specifically the bank_id and the file locations. You can run the script with:

bash ./util/run_delay.sh

You can visualize the result by running:

# Example usage:
# python3 ./util/plot_delay.py <iterations> <trefi> <bank_offset> <num_aggressors>
python3 ./util/plot_delay.py 10000 1407 0 24

The first parameter is the number of iterations, which should match the iterations variable in the bash script. The second parameter is the tREFI period in nanoseconds, which is 1407 for the A6000. The third parameter is the bank offset. The fourth and onward parameters specify the number of aggressors; you can plot multiple aggressor configurations on the same plot.
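
For example, to overlay several aggressor counts on one plot (a hypothetical invocation, following the parameter description above):

python3 ./util/plot_delay.py 10000 1407 0 8 16 24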

On the output plot, observe that there is a flat-lined area in the middle. This is where synchronization occurs, and the delays in that area are optimal.

Step 3: Profile for Bit-flips

The helper script run_campaign.py is available in the util folder. Again, we assume the row set files exist, so please update the relevant variables. In addition, fill in the delay configuration you obtained from the previous step in results/delay/delays.txt by adding a line of the form <bank_id>, <number of aggressors>, <delay_amount>, as in the hypothetical example below.
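
A hypothetical delays.txt entry for bank 0 with 24 aggressors and a delay of 350 (illustrative values only) would be:

0, 24, 350

You can then run the script with the command below to start a campaign on four banks (A, B, C, D) with 24-sided hammering: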

python3 run_campaign.py --bank_ids A B C D --num_agg 24

The result of your hammering can be found at ./results/campaign/bank_<offset>/<#_of_aggressors>agg_<vic_pat><agg_pat>_log.txt. Any bitflips will be reported in the form Bit-flip detected! in the log file.
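
A quick way to scan all campaign logs for hits, using the message and output location above:

grep -rn "Bit-flip detected!" ./results/campaign/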

Campaign Configurations

In our systematic study, we ran attacks with num_agg = 8, 12, 16, 20, 24 and skip_step = 3. We launched campaigns with both gridded victim data patterns (vic_pat=55, agg_pat=aa and vic_pat=aa, agg_pat=55). For each configuration, we used the following setups:

  • 8-sided: 8 warps, 1 thread, 3 rounds
  • 12-sided: 6 warps, 2 threads, 2 rounds
  • 16-sided: 8 warps, 2 threads, 1 round
  • 20-sided: 10 warps, 2 threads, 1 round
  • 24-sided: 8 warps, 3 threads, 1 round

For our artifact evaluation, our scripts only perform hammering with 24-sided patterns due to time limitations, but the script can be modified to run campaigns for the other configurations as well, as sketched below.
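
A hypothetical sweep over the remaining aggressor counts, using only the run_campaign.py flags documented above (the per-configuration warp/thread/round setup is assumed to be handled inside the script):

# Run campaigns for the other aggressor counts from our systematic study
for n in 8 12 16 20; do
    python3 run_campaign.py --bank_ids A B C D --num_agg "$n"
done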

Existing Bitflips

Reproducing bitflips from our A6000 can be done with ./util/run_known_flips.sh. You can find the bitflip results in the folder ./results/sample_bitflips.

bash ./util/run_known_flips.sh

Step 4: Perform Time-Sliced Exploit

By default, our scripts execute the exploits with the bit flips we already discovered, but one can modify this to use new bit flips discovered in the campaigns. Once you have found some 0->1 bit-flips that map to the most significant bit of the FP16 exponent (i.e., bit location 6 in the second byte), you can run an exploit targeting these bit-flips in an ML model weight. We provide sample exploit scripts, run_hammer_manual_<bitflip>.sh, in data_scripts/fig13_t4 that automate this process.
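
To see why this particular bit matters, here is an illustrative check (assuming numpy is installed; 0.5 is an arbitrary example weight): a 0->1 flip of bit 6 in the second byte, i.e., bit 14 of the little-endian FP16 word, multiplies a typical weight by 2^16.

python3 -c "
import numpy as np
w = np.array([0.5], dtype=np.float16)  # example weight
w.view(np.uint16)[0] ^= (1 << 14)      # flip bit 14: the exponent MSB
print(w[0])                            # prints 32768.0
"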

Finding the Proper Aggressor Row for Victim Row

When a bit-flip is observed, as shown in the paper, it can only be triggered by aggressor rows on one side. Once you have observed a set of aggressors (As) that triggers the bitflip in a victim (V), usually in the form of:

... A - - - A - V - A - - - A ...

You can find out which aggressor triggered it by shifting the pattern, i.e., by changing min_rowid:

Left Aggressor:   ... A - - - A - V
Right Aggressor:  V - A - - - A ...

Aggressor on the Left/Top of the Victim

To exploit given a left-side aggressor, we can create a script based on either run_hammer_manual_B1.sh or run_hammer_manual_D1.sh. You will need to modify the following parameters based on your observed bit-flip:

  • aggressor_row: Row ID of the aggressor row to the left of the victim.
  • victim_row: Row ID of the victim.
  • victim_row_offset: Offset of the victim row in the Row Set. (Note that row IDs start at 0, while line numbers in the Row Set text file may start at 1.)
  • aggressor_row_offset: Offset of the aggressor row in the Row Set. (Same note as above.)
  • store_dir: Location to store your exploit results.

Then run:

bash ./data_scripts/fig13_t4/run_hammer_manual_<name>.sh

You may find the results of your exploits in your store_dir, listed in <model>.txt.

Aggressor on the Right/Bottom of the Victim

To exploit given a right-side aggressor, we can create a script based on either run_hammer_manual_B2.sh or run_hammer_manual_D3.sh. You will need to modify the following parameters based on your observed bit-flip. The parameters are slightly different from those for the left side:

  • aggressor_row: Row ID of the aggressor row to the right of the victim.
  • victim_byte_offset: The exact byte offset of the bitflip.
  • aggressor_row_offset: Offset of the aggressor row in the Row Set. (Note that row IDs start at 0, while line numbers in the Row Set text file may start at 1.)
  • store_dir: Location to store your exploit results.

Then run:

bash ./data_scripts/fig13_t4/run_hammer_manual_<name>.sh

You may find the results of your exploits in your store_dir, listed in <model>.txt.

Exploit with Existing Bitflips

The scripts in ./data_scripts/fig13_t4 will run the exploit on ImageNet models with specific bitflips we found in our A6000 GPU (B1, B2, D1, and D3). The model accuracy will be recorded in ./results/fig13_t4/<bitflip>.

bash ./data_scripts/fig13_t4/run_hammer_manual_<bitflip>.sh

This entire exploit with 4 bit-flips should take less than 1 day.

Folder Structure

  • src/ — Source code of GPUHammer, including CUDA and C++ implementations.
  • util/ — Utility scripts for GPU setup, delay tuning, and campaign orchestration.
  • data_scripts/fig*/ — Scripts for obtaining raw data for each figure.
  • plot_scripts/ — Scripts for generating plots from experiment results.
  • results/fig*/ — Output directory for experiment results, figures, logs, and sample outputs.
  • run_fig*.sh scripts in the repository root — Entry-point scripts for running specific experiments and reproducing figures.

Additional Notes

Clock Rate Note

The clock rate printed to the console is always the maximum clock rate. Set the lgc (lock GPU clock) value in init_cuda to a value at or above the maximum clock rate, and the GPU will adjust to the maximum clock rate.

Compiler Settings Note

The most stable compiler setting is the following, which disables all GPU kernel optimizations. -Xcicc passes flags to the compiler frontend, while -Xptxas passes flags to the PTX assembler backend (hence the name):

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-line-info -O3 -Xcicc -O0 -Xptxas -O0")

The setting below gives the lowest latency while still not optimizing away our memory accesses. -Xptxas gives the largest performance increase, since we mostly use PTX and register allocation is our biggest concern. Raising either flag to -O3 on its own will not optimize the accesses away, but doing so with the other raised past -O2 will. The two optimization stages evidently interact, but this is not a major concern for us.

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --generate-line-info -O3 -Xcicc -O1 -Xptxas -O3")
