Feat: Implement CUDA programming with Unified Memory for dataset loader (train large data on limited VRAM) #1608
base: master
Conversation
- Allow the dataset to be loaded into CPU RAM for training large datasets on limited VRAM
- Add a low_vram option to the example Python runner
Tom94
left a comment
Thanks for the contribution! I'd be happy to merge this in -- it's cool functionality in principle. Please see my individual comments for what still needs to change.
```cpp
uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_low_vram = false;
```
This option should be under `m_nerf.training` and should be named more precisely, e.g. `dataset_in_cpu_ram` to match the existing `m_nerf.training.dataset` member.
Function args and locals should be named accordingly, e.g. within `NerfDataset` it's fine to use just `in_cpu_ram`, whereas the CLI arg should probably be `--nerf_dataset_in_cpu_ram`.
Sure! I also like this much better.
```cpp
void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr);
void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr, bool low_vram = 0);
```
`= false`
```cpp
CUDA_CHECK_THROW(cudaDeviceSynchronize());
// free memory
for (uint32_t i = 0; i < result.n_images; ++i) {
	// free memory
```
I don't recall exactly why, but I believe there was a reason for the two loops to be separate. I think the underlying memory might be aliased in some cases -- please revert. Putting the progress bar in the first loop likely matches current behavior closely enough.
Thank you for the heads up.
You didn't actually put the progress into the first loop. Putting it in the second is somewhat meaningless -- calling free is pretty much free.
src/nerf_loader.cu
Outdated
```cpp
	progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copy / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());
```
Copied / converted
src/nerf_loader.cu
Outdated
```cpp
// copy or convert the pixels
pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
```
Don't leave old code around as a comment. That's what we have git for.
src/nerf_loader.cu
Outdated
```cpp
linear_kernel(from_rgba32<__half>, 0, nullptr, n_pixels, (uint8_t*)pixels, (__half*)images_data_half.data(), white_transparent, black_transparent, mask_color);
pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);
```
remove
src/nerf_loader.cu
Outdated
```cpp
pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_half.data());
```
Memory leak if pixelmemory was already set before.
src/nerf_loader.cu
Outdated
```cpp
pixelmemory[frame_idx] = std::move(images_data_sharpened);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_sharpened);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_sharpened.data());
```
Same here
src/nerf_loader.cu
Outdated
```cpp
metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
size_t total_size = n * sizeof(TrainingImageMetadata);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, total_size, cudaMemcpyHostToDevice));
```
Unnecessary change
src/nerf_loader.cu
Outdated
```cpp
void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
size_t total_image_mem_size = img_size * image_type_size(image_type);
void *pixelmemory[frame_idx] = { nullptr };
```
You're shadowing `NerfDataset::pixelmemory` here (and worse, unnecessarily making an array as far as I can tell). Is there something I am missing here?
As far as I can tell, you'd be much better off using `pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram);`, which'll give you managed memory if `low_vram` is set without any of the modifications / memory leaks that I'm pointing out in the above comments.
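The constructor-flag pattern the reviewer suggests can be sketched in plain C++. The class below is an illustrative stand-in for tiny-cuda-nn's `GPUMemory` (the name `Buffer` and its internals are assumptions, and `std::malloc` stands in for `cudaMalloc`/`cudaMallocManaged` so the sketch runs without a GPU); the point is that a move-only RAII buffer makes `pixelmemory[frame_idx] = Buffer(size, low_vram);` release any previous allocation automatically, avoiding the leaks flagged above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Illustrative RAII buffer whose constructor takes a "managed" flag.
// In the real GPUMemory, the flag selects cudaMallocManaged (unified
// memory that can spill to host RAM) instead of cudaMalloc.
class Buffer {
public:
	Buffer() = default;
	Buffer(size_t n_bytes, bool managed = false)
		: m_size{n_bytes}, m_managed{managed}, m_data{std::malloc(n_bytes)} {}
	~Buffer() { std::free(m_data); }

	// Move-only: assigning a fresh Buffer frees the old allocation,
	// which is exactly what avoids the leak flagged in the review.
	Buffer(Buffer&& other) noexcept { swap(other); }
	Buffer& operator=(Buffer&& other) noexcept { swap(other); return *this; }
	Buffer(const Buffer&) = delete;
	Buffer& operator=(const Buffer&) = delete;

	void swap(Buffer& other) noexcept {
		std::swap(m_size, other.m_size);
		std::swap(m_managed, other.m_managed);
		std::swap(m_data, other.m_data);
	}

	void* data() const { return m_data; }
	size_t size() const { return m_size; }

private:
	size_t m_size = 0;
	bool m_managed = false;
	void* m_data = nullptr;
};
```

With this shape, re-assigning `pixelmemory[frame_idx]` for an already-loaded frame is safe: the temporary's move assignment hands the old allocation to the temporary, whose destructor frees it.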
The reason I made an array by doing `void *pixelmemory[frame_idx] = { nullptr };` is the compiler error I got, shown below.
I still do not fully understand why the compiler wants it to be an array.
```shell
cmake . -B build \
	-DCMAKE_BUILD_TYPE=Release \
	-DPython_EXECUTABLE:FILEPATH=/python-venv/bin/python3 \
	-DPython_LIBRARIES:FILEPATH=/python/lib/libpython3.10.so \
	-DPython_INCLUDE_DIR:PATH=/python/include/python3.11 && \
	cmake --build build --config Release -j $(nproc)
# [ 89%] Building CUDA object CMakeFiles/ngp.dir/src/nerf_loader.cu.o
# /app/instant-ngp/src/nerf_loader.cu(778): error: initialization with "{...}" expected for aggregate object
#
# 1 error detected in the compilation of "/app/instant-ngp/src/nerf_loader.cu".
# gmake[2]: *** [CMakeFiles/ngp.dir/build.make:273: CMakeFiles/ngp.dir/src/nerf_loader.cu.o] Error 2
# gmake[1]: *** [CMakeFiles/Makefile2:442: CMakeFiles/ngp.dir/all] Error 2
# gmake: *** [Makefile:136: all] Error 2
```

Thank you for pointing out `pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram);` in instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h. This is a much cleaner approach, and I absolutely love it! The reason I didn't use it is simply that I didn't know it was already implemented in tiny-cuda-nn.
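The shadowing the reviewer describes can be demonstrated in plain host C++ (the exact nvcc diagnostic depends on the initializer form, but the core mechanism is ordinary name hiding): `void* pixelmemory[1] = { nullptr };` inside a member function *declares* a brand-new local array, it does not assign to the member. A minimal sketch, where the class and member names mirror the PR but everything else is a simplified stand-in:

```cpp
#include <vector>

// Simplified stand-in for NerfDataset; only the shadowing behavior matters here.
struct NerfDataset {
	std::vector<void*> pixelmemory = std::vector<void*>(8, nullptr);

	void set_training_image_shadowing(void* src) {
		// This line DECLARES a new local array named `pixelmemory`,
		// hiding the member above -- it never touches the member.
		void* pixelmemory[1] = { nullptr };
		pixelmemory[0] = src; // writes into the local array only
	}

	void set_training_image_assign(int frame_idx, void* src) {
		// Plain assignment to the member -- what the code actually needs.
		pixelmemory[frame_idx] = src;
	}
};
```

With the reviewer's suggested `pixelmemory[frame_idx] = GPUMemory(...)` form, no local of that name is needed at all, so the question of why the compiler "wants an array" disappears.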
- The option name is much more intuitive
src/nerf_loader.cu
Outdated
```cpp
	progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copied / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());
```
Converted -> converted (consistency with above)
```cpp
uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_dataset_in_cpu_ram = false;
```
This member should be under `m_nerf.training`, i.e. `m_nerf.training.dataset_in_cpu_ram`.
src/nerf_loader.cu
Outdated
```cpp
size_t total_image_mem_size = img_size * image_type_size(image_type);
void* dst;
if (in_cpu_ram) {
	pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, true);
```
Better to move this out of the `if` statement and use `pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, in_cpu_ram);`
Then you can drop the `else` branch entirely (`resize` no longer needed).
```cpp
metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
if (!in_cpu_ram) {
```
This seems wrong to me. The metadata is still stored on the GPU -- and it's so small that it wouldn't make sense to offload it to CPU RAM anyway.
Tom94
left a comment
See individual comments
Pros for enabling low_vram
Cons for enabling low_vram
Screenshot examples
In this test, I was able to load 30 GB+ of data and train on a 16 GB GPU, while also maxing out the batch size and enabling per-image latent features. The training dataset contains about 3456 images.