This project implements a Double DQN agent from scratch, including custom neural networks and GPU support in CUDA C++.
This implementation is easy to read and tweak, self-contained (no external packages), and reasonably fast thanks to NVIDIA GPU support. It is well suited to learning, experiments, and especially off-policy workflows like Q-learning, because training data does not have to be loaded the traditional way: the replay buffer and environment already live in GPU memory.
- Self-contained & zero-deps: Everything is written in CUDA C++, which makes modifications straightforward.
- Heavy lifting on the GPU: Most computations are offloaded to the GPU.
- Everything stays in VRAM: Keeps the working set in GPU memory, reducing copy overhead and enabling great parallelism.
- Fast baseline: With supervised training, MNIST can be trained in under a second (once the data is loaded).
- Allocates all GPU memory upfront at the start of training
- Gradient and value clipping supported
- Optimizers: SGD and Adam (an illustrative Adam update is sketched below)
- Full-batch training (entire batch at once)
- Runs on CUDA with custom kernels and uses Tensor Cores via cuBLAS
- Built-in NN inference
- DQN (vanilla) and Double DQN (see the target-computation sketch below)
- Replay buffer fully on the GPU (see the buffer layout sketch below)
- Environment simulation fully on the GPU
- Epsilon-greedy exploration policy (see the action-selection sketch below)
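The replay-buffer and upfront-allocation items can be pictured as a plain struct of device pointers that is sized once before the training loop starts. This is only a minimal sketch under assumed names and layout, not the structures actually used in `Native/Optimization/`:

```cuda
#include <cuda_runtime.h>

// Illustrative layout for a replay buffer that lives entirely in VRAM.
// All device memory is allocated once, before the training loop starts.
struct GpuReplayBuffer {
    float* states;       // [capacity, state_dim]
    int*   actions;      // [capacity]
    float* rewards;      // [capacity]
    float* next_states;  // [capacity, state_dim]
    float* dones;        // [capacity], 1.0f if the transition ended the episode
    int    capacity;
    int    state_dim;
};

inline GpuReplayBuffer allocate_replay_buffer(int capacity, int state_dim) {
    GpuReplayBuffer b{};
    b.capacity  = capacity;
    b.state_dim = state_dim;
    size_t n = static_cast<size_t>(capacity);
    cudaMalloc(&b.states,      sizeof(float) * n * state_dim);
    cudaMalloc(&b.actions,     sizeof(int)   * n);
    cudaMalloc(&b.rewards,     sizeof(float) * n);
    cudaMalloc(&b.next_states, sizeof(float) * n * state_dim);
    cudaMalloc(&b.dones,       sizeof(float) * n);
    return b;
}
```

With this layout, sampling a minibatch reduces to generating random indices on the device and gathering rows, so no transition has to cross the PCIe bus during training.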
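For the Double DQN item, the bootstrapped target can be computed entirely on the GPU with one thread per transition: the online network selects the next action and the target network evaluates it. The kernel name, parameter names, and row-major `[batch, num_actions]` layout below are illustrative assumptions, not the repository's exact code:

```cuda
// Illustrative sketch: one thread per transition.
// Double DQN: the online network picks argmax a', the target network evaluates it.
__global__ void double_dqn_targets(const float* __restrict__ q_online_next,  // [batch, num_actions]
                                   const float* __restrict__ q_target_next,  // [batch, num_actions]
                                   const float* __restrict__ rewards,        // [batch]
                                   const float* __restrict__ dones,          // [batch], 1.0f if terminal
                                   float*       __restrict__ targets,        // [batch]
                                   int batch, int num_actions, float gamma)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= batch) return;

    // argmax over the online network's Q-values for the next state
    const float* q_on = q_online_next + i * num_actions;
    int best = 0;
    for (int a = 1; a < num_actions; ++a)
        if (q_on[a] > q_on[best]) best = a;

    // evaluate the chosen action with the target network
    float q_next = q_target_next[i * num_actions + best];

    targets[i] = rewards[i] + gamma * (1.0f - dones[i]) * q_next;
}
```

The only difference from vanilla DQN is that `best` would be taken from `q_target_next` itself instead of from the online network.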
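Epsilon-greedy action selection maps just as naturally to one thread per parallel environment. The sketch below assumes per-environment cuRAND states initialised with `curand_init` elsewhere; names and layout are again placeholders:

```cuda
#include <curand_kernel.h>

// Illustrative epsilon-greedy selection, one thread per parallel environment.
__global__ void select_actions(const float* __restrict__ q_values,  // [num_envs, num_actions]
                               int* __restrict__ actions,           // [num_envs]
                               curandState* states,
                               int num_envs, int num_actions, float epsilon)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= num_envs) return;

    curandState local = states[e];
    int action;
    if (curand_uniform(&local) < epsilon) {
        // explore: uniformly random action
        action = static_cast<int>(curand(&local) % num_actions);
    } else {
        // exploit: greedy action w.r.t. the online network's Q-values
        const float* q = q_values + e * num_actions;
        action = 0;
        for (int a = 1; a < num_actions; ++a)
            if (q[a] > q[action]) action = a;
    }
    actions[e] = action;
    states[e] = local;  // persist the RNG state for the next step
}
```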
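Finally, an Adam step with element-wise gradient clipping is a simple per-parameter kernel. The hyperparameter names and the clipping scheme below are assumptions for illustration, not the project's exact implementation:

```cuda
// Illustrative Adam update with element-wise gradient clipping, one thread per parameter.
__global__ void adam_step(float* __restrict__ w,        // parameters
                          const float* __restrict__ g,  // gradients
                          float* __restrict__ m,        // first-moment estimate
                          float* __restrict__ v,        // second-moment estimate
                          int n, int t, float lr, float beta1, float beta2,
                          float eps, float clip)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // clip each gradient component into [-clip, clip]
    float grad = fminf(fmaxf(g[i], -clip), clip);

    // exponential moving averages of the gradient and its square
    m[i] = beta1 * m[i] + (1.0f - beta1) * grad;
    v[i] = beta2 * v[i] + (1.0f - beta2) * grad * grad;

    // bias-corrected moments (t is the 1-based step counter)
    float m_hat = m[i] / (1.0f - powf(beta1, (float)t));
    float v_hat = v[i] / (1.0f - powf(beta2, (float)t));

    w[i] -= lr * m_hat / (sqrtf(v_hat) + eps);
}
```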
- Linux
- NVIDIA driver (matching your CUDA Toolkit)
- CUDA Toolkit (including cuBLAS support)
- C++17 toolchain (nvcc + gcc or clang)
For Tensor Core support, use a GPU with Compute Capability ≥ 7.0 (FP16) or ≥ 8.0 (TF32) and enable the corresponding math modes.
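As a rough illustration of what enabling Tensor Core math through cuBLAS can look like (the wrapper function, matrix shapes, and column-major convention below are placeholders, not code from this repository):

```cuda
#include <cublas_v2.h>

// Illustrative: run an FP32 GEMM on Tensor Cores via TF32 (CUDA 11+, CC >= 8.0).
// FP16 inputs with an FP32 compute type are the Tensor Core path available from CC 7.0.
void gemm_tf32(cublasHandle_t handle,
               const float* A, const float* B, float* C,
               int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Allow cuBLAS to use TF32 Tensor Core kernels for FP32 data.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    // Column-major GEMM: C[m x n] = A[m x k] * B[k x n]
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_32F, m,
                 B, CUDA_R_32F, k,
                 &beta,
                 C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F_FAST_TF32,
                 CUBLAS_GEMM_DEFAULT);
}
```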
- Contains some dead code, as the project was not fully finished due to system errors that resulted in data loss and loss of motivation.
- The most relevant parts are located in:
  - `Native/Optimization/` – core training logic and GPU kernels
  - `Native/Inference/` – inference examples and minimal usage
- Everything mentioned under Implementation works correctly if used properly.
- One original motivation was maximizing speed by keeping everything on the GPU, but in many cases modern frameworks like PyTorch are still faster, thanks to years of engineering and heavy optimization.