
Conversation


@darrenchang commented Nov 18, 2025

  • This implementation adds the command-line option low_vram.

Pros of enabling low_vram

  • Allows the dataset to be loaded into CPU RAM instead of GPU VRAM.
  • Allows the entire GPU VRAM to be used for training.

Cons of enabling low_vram

  • Training is a bit slower because the data needs to be moved back and forth between CPU RAM and GPU VRAM (see the sketch after this list).
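
For context, the mechanism behind this trade-off is CUDA unified (managed) memory: an allocation can exceed GPU VRAM, with pages migrating between host and device on demand. Below is a minimal standalone sketch of the idea, illustrative only and not the actual instant-ngp code; the PR ultimately reaches the same mechanism through tiny-cuda-nn's GPUMemory, as discussed in the review further down.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
	// A buffer larger than a 16 GB GPU's VRAM, e.g. a 30 GB dataset.
	size_t n_bytes = size_t{30} << 30;
	float* data = nullptr;
	// Managed memory is addressable from both CPU and GPU; pages migrate
	// to the GPU as kernels touch them, which is the source of the slowdown.
	if (cudaMallocManaged(&data, n_bytes) != cudaSuccess) {
		std::printf("allocation failed\n");
		return 1;
	}
	// ... fill `data` on the CPU, then launch training kernels on it ...
	cudaFree(data);
	return 0;
}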

Example screenshots

In this test, I was able to load 30 GB+ of data and train on a 16 GB GPU, while also maxing out the batch size and enabling the per-image latent feature. There are about 3456 images in this training dataset.

[Three screenshots of the training run, taken Nov 18, 2025]

- Allow dataset to be loaded into RAM for training large datasets on limited VRAM
- Add low_vram option to the example Python runner

@Tom94 (Collaborator) left a comment:

Thanks for the contribution! I'd be happy to merge this in -- it's cool functionality in principle. Please see my individual comments for what still needs to change.


uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_low_vram = false;

@Tom94 (Collaborator):

This option should be under m_nerf.training and should be named more precisely. E.g. dataset_in_cpu_ram to match the existing m_nerf.training.dataset member.

Function args and locals should be named accordingly. E.g. within NerfDataset it's fine to use just in_cpu_ram, whereas the CLI arg should probably be --nerf_dataset_in_cpu_ram
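
For illustration, wiring the flag up might look roughly like this (a hypothetical sketch, assuming the args-parsing library already used by instant-ngp's main.cu; `parser` and `testbed` stand in for the existing locals there):

// Hypothetical CLI flag, named per the convention suggested above.
args::Flag nerf_dataset_in_cpu_ram_flag{
	parser, "NERF_DATASET_IN_CPU_RAM",
	"Keep the NeRF training dataset in CPU RAM instead of GPU VRAM.",
	{"nerf_dataset_in_cpu_ram"},
};

// After parsing, forward it to the proposed member:
if (nerf_dataset_in_cpu_ram_flag) {
	testbed.m_nerf.training.dataset_in_cpu_ram = true;
}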


@darrenchang (Author):

Sure! I also like this much better.

}

void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr);
void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr, bool low_vram = 0);

@Tom94 (Collaborator):

= false

CUDA_CHECK_THROW(cudaDeviceSynchronize());
// free memory
for (uint32_t i = 0; i < result.n_images; ++i) {
// free memory

@Tom94 (Collaborator):

I don't recall exactly why, but I believe there was a reason for the two loops to be separate. I think the underlying memory might be aliased in some cases -- please revert. Putting the progress bar in the first loop likely matches current behavior closely enough.


@darrenchang (Author):

Thank you for the heads up.


@Tom94 (Collaborator):

You didn't actually put the progress into the first loop. Putting it in the second is somewhat meaningless -- calling free is pretty much free.
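
Sketched from the surrounding snippets, the requested structure would be roughly the following (illustrative, not the exact nerf_loader.cu code):

// Report progress in the loop that does the actual work...
for (uint32_t i = 0; i < result.n_images; ++i) {
	// ... copy / convert image i to the GPU ...
	progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
// ...and keep the (cheap) freeing loop separate, with no progress bar.
for (uint32_t i = 0; i < result.n_images; ++i) {
	// ... free per-image staging memory ...
}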

progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copy / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());

@Tom94 (Collaborator):

Copied / converted

// copy or convert the pixels
pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));

@Tom94 (Collaborator):

Don't leave old code around as a comment. That's what we have git for.

linear_kernel(from_rgba32<__half>, 0, nullptr, n_pixels, (uint8_t*)pixels, (__half*)images_data_half.data(), white_transparent, black_transparent, mask_color);
pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);

@Tom94 (Collaborator):

remove

pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_half.data());

@Tom94 (Collaborator):

Memory leak if pixelmemory was already set before.
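
Sketch of the leak (illustrative; assumes pixelmemory[frame_idx] now holds a raw pointer rather than an owning GPUMemory, as the diff suggests):

// If an earlier call already populated this entry, overwriting the raw
// pointer drops the previous allocation without freeing it:
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_half.data()); // old buffer leaked

// The original RAII form avoided this, because GPUMemory's move assignment
// releases whatever the destination previously owned:
// pixelmemory[frame_idx] = std::move(images_data_half);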

pixelmemory[frame_idx] = std::move(images_data_sharpened);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_sharpened);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_sharpened.data());

@Tom94 (Collaborator):

Same here

metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
size_t total_size = n * sizeof(TrainingImageMetadata);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, total_size, cudaMemcpyHostToDevice));

@Tom94 (Collaborator):

Unnecessary change

void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
size_t total_image_mem_size = img_size * image_type_size(image_type);
void *pixelmemory[frame_idx] = { nullptr };

@Tom94 (Collaborator) commented Nov 18, 2025:

You're shadowing NerfDataset::pixelmemory here (and worse, unnecessarily making an array as far as I can tell). Is there something I am missing here?

As far as I can tell, you'd be much better off using pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram);, which'll give you managed memory if low_vram is set without any of the modifications / memory leaks that I'm pointing out in the above comments.


@darrenchang (Author):

The reason I made an array by doing void *pixelmemory[frame_idx] = { nullptr }; is because of the following compiler error, shown below.
I still do not fully understand why the compiler wants it to be an array.

cmake . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DPython_EXECUTABLE:FILEPATH=/python-venv/bin/python3 \
-DPython_LIBRARIES:FILEPATH=/python/lib/libpython3.10.so \
-DPython_INCLUDE_DIR:PATH=/python/include/python3.11 && \
cmake --build build --config Release -j $(nproc)

# [ 89%] Building CUDA object CMakeFiles/ngp.dir/src/nerf_loader.cu.o
# /app/instant-ngp/src/nerf_loader.cu(778): error: initialization with "{...}" expected for aggregate object
#
# 1 error detected in the compilation of "/app/instant-ngp/src/nerf_loader.cu".
# gmake[2]: *** [CMakeFiles/ngp.dir/build.make:273: CMakeFiles/ngp.dir/src/nerf_loader.cu.o] Error 2
# gmake[1]: *** [CMakeFiles/Makefile2:442: CMakeFiles/ngp.dir/all] Error 2
# gmake: *** [Makefile:136: all] Error 2
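
For what it's worth, here is what that line actually declares (a hypothetical standalone repro, not the real nerf_loader.cu code). Putting a type in front of pixelmemory[frame_idx] turns an intended assignment to the member into a declaration of a new local array of frame_idx pointers (with a runtime frame_idx, a variable-length array, which some compilers accept as an extension). The array shadows the member, and an array can only be initialized with a braced list, which is why the compiler insisted on { ... }.

struct Dataset {
	void* pixelmemory[16]; // stands in for the real NerfDataset member

	void set_training_image(int frame_idx) {
		void* pixelmemory[4] = { nullptr };     // declares a NEW local array; shadows the member
		pixelmemory[0] = nullptr;               // touches the local array, not the member
		this->pixelmemory[frame_idx] = nullptr; // plain assignment to the member: no type, no braces
	}
};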

Thank you for pointing out pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram); in instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h. This is a much cleaner approach, and I absolutely love it! The reason I didn't use it is simply because I didn't know it was already implemented in tiny-cuda-nn.

@darrenchang requested a review from @Tom94 on November 19, 2025
progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copied / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());

@Tom94 (Collaborator):

Converted -> converted (consistency with above)


uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_dataset_in_cpu_ram = false;

@Tom94 (Collaborator) commented Nov 19, 2025:

This member should be under m_nerf.training, i.e. m_nerf.training.dataset_in_cpu_ram.

size_t total_image_mem_size = img_size * image_type_size(image_type);
void* dst;
if (in_cpu_ram) {
pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, true);

@Tom94 (Collaborator) commented Nov 19, 2025:

Better to move this out of the if statement and use pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, in_cpu_ram);

Then you can drop the else branch entirely (resize no longer needed).
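
Concretely, the suggested shape would be roughly the following (a sketch based on this comment, using the tiny-cuda-nn GPUMemory constructor referenced earlier in the thread):

size_t total_image_mem_size = img_size * image_type_size(image_type);
// One path for both cases: managed (CPU-RAM-capable) memory when in_cpu_ram
// is true, a regular device allocation otherwise. No if/else, no resize().
pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, in_cpu_ram);
void* dst = pixelmemory[frame_idx].data();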


metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
if (!in_cpu_ram) {

@Tom94 (Collaborator):

This seems wrong to me. The metadata is still stored on the GPU -- and it's so small that it wouldn't make sense to offload it to CPU RAM anyway.


@Tom94 (Collaborator) left a comment:

See individual comments

@darrenchang requested a review from @Tom94 on November 21, 2025