Feat: Implement CUDA programming with Unified Memory for dataset loader (train large data on limited VRAM) #1608
base: master
Conversation
- Allow the dataset to be loaded into CPU RAM for training large datasets on limited VRAM
- Add a low_vram option to the example Python runner
Tom94
left a comment
Thanks for the contribution! I'd be happy to merge this in -- it's cool functionality in principle. Please see my individual comments for what still needs to change.
```cpp
uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_low_vram = false;
```
This option should be under `m_nerf.training` and should be named more precisely, e.g. `dataset_in_cpu_ram` to match the existing `m_nerf.training.dataset` member.
Function args and locals should be named accordingly, e.g. within `NerfDataset` it's fine to use just `in_cpu_ram`, whereas the CLI arg should probably be `--nerf_dataset_in_cpu_ram`.
Sure! I also like this much better.
```cpp
void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr);
void set_training_image(int frame_idx, const ivec2& image_resolution, const void* pixels, const void* depth_pixels, float depth_scale, bool image_data_on_gpu, EImageDataType image_type, EDepthDataType depth_type, float sharpen_amount = 0.f, bool white_transparent = false, bool black_transparent = false, uint32_t mask_color = 0, const Ray *rays = nullptr, bool low_vram = 0);
```
`= false`
```cpp
CUDA_CHECK_THROW(cudaDeviceSynchronize());
// free memory
for (uint32_t i = 0; i < result.n_images; ++i) {
	// free memory
```
I don't recall exactly why, but I believe there was a reason for the two loops to be separate. I think the underlying memory might be aliased in some cases -- please revert. Putting the progress bar in the first loop likely matches current behavior closely enough.
Thank you for the heads up.
You didn't actually put the progress into the first loop. Putting it in the second is somewhat meaningless -- calling free is pretty much free.
src/nerf_loader.cu
Outdated
```cpp
	progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copy / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());
```
Copied / converted
src/nerf_loader.cu
Outdated
```cpp
// copy or convert the pixels
pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
```
Don't leave old code around as a comment. That's what we have git for.
src/nerf_loader.cu
Outdated
```cpp
linear_kernel(from_rgba32<__half>, 0, nullptr, n_pixels, (uint8_t*)pixels, (__half*)images_data_half.data(), white_transparent, black_transparent, mask_color);
pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);
```
remove
src/nerf_loader.cu
Outdated
```cpp
pixelmemory[frame_idx] = std::move(images_data_half);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_half);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_half.data());
```
Memory leak if pixelmemory was already set before.
src/nerf_loader.cu
Outdated
```cpp
pixelmemory[frame_idx] = std::move(images_data_sharpened);
dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx] = std::move(images_data_sharpened);
pixelmemory[frame_idx] = reinterpret_cast<int*>(images_data_sharpened.data());
```
Same here
src/nerf_loader.cu
Outdated
```cpp
metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
size_t total_size = n * sizeof(TrainingImageMetadata);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, total_size, cudaMemcpyHostToDevice));
```
Unnecessary change
src/nerf_loader.cu
Outdated
```cpp
void* dst = pixelmemory[frame_idx].data();
// pixelmemory[frame_idx].resize(img_size * image_type_size(image_type));
size_t total_image_mem_size = img_size * image_type_size(image_type);
void *pixelmemory[frame_idx] = { nullptr };
```
You're shadowing `NerfDataset::pixelmemory` here (and worse, unnecessarily making an array as far as I can tell). Is there something I am missing here?
As far as I can tell, you'd be much better off using `pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram);`, which'll give you managed memory if `low_vram` is set without any of the modifications / memory leaks that I'm pointing out in the above comments.
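The constructor-flag pattern the reviewer suggests can be sketched in plain C++. The class below is an illustrative stand-in for tiny-cuda-nn's `GPUMemory` (the name `Buffer` and its internals are assumptions, and `std::malloc` stands in for `cudaMalloc`/`cudaMallocManaged` so the sketch runs without a GPU); the point is that a move-only RAII buffer makes `pixelmemory[frame_idx] = Buffer(size, low_vram);` release any previous allocation automatically, avoiding the leaks flagged above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Illustrative RAII buffer whose constructor takes a "managed" flag.
// In the real GPUMemory, the flag selects cudaMallocManaged (unified
// memory that can spill to host RAM) instead of cudaMalloc.
class Buffer {
public:
	Buffer() = default;
	Buffer(size_t n_bytes, bool managed = false)
		: m_size{n_bytes}, m_managed{managed}, m_data{std::malloc(n_bytes)} {}
	~Buffer() { std::free(m_data); }

	// Move-only: assigning a fresh Buffer frees the old allocation,
	// which is exactly what avoids the leak flagged in the review.
	Buffer(Buffer&& other) noexcept { swap(other); }
	Buffer& operator=(Buffer&& other) noexcept { swap(other); return *this; }
	Buffer(const Buffer&) = delete;
	Buffer& operator=(const Buffer&) = delete;

	void swap(Buffer& other) noexcept {
		std::swap(m_size, other.m_size);
		std::swap(m_managed, other.m_managed);
		std::swap(m_data, other.m_data);
	}

	void* data() const { return m_data; }
	size_t size() const { return m_size; }

private:
	size_t m_size = 0;
	bool m_managed = false;
	void* m_data = nullptr;
};
```

With this shape, re-assigning `pixelmemory[frame_idx]` for an already-loaded frame is safe: the temporary's move assignment hands the old allocation to the temporary, whose destructor frees it.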
The reason I made an array by doing `void *pixelmemory[frame_idx] = { nullptr };` is the compiler error I got, shown below.
I still do not fully understand why the compiler wants it to be an array.
```shell
cmake . -B build \
	-DCMAKE_BUILD_TYPE=Release \
	-DPython_EXECUTABLE:FILEPATH=/python-venv/bin/python3 \
	-DPython_LIBRARIES:FILEPATH=/python/lib/libpython3.10.so \
	-DPython_INCLUDE_DIR:PATH=/python/include/python3.11 && \
	cmake --build build --config Release -j $(nproc)
# [ 89%] Building CUDA object CMakeFiles/ngp.dir/src/nerf_loader.cu.o
# /app/instant-ngp/src/nerf_loader.cu(778): error: initialization with "{...}" expected for aggregate object
#
# 1 error detected in the compilation of "/app/instant-ngp/src/nerf_loader.cu".
# gmake[2]: *** [CMakeFiles/ngp.dir/build.make:273: CMakeFiles/ngp.dir/src/nerf_loader.cu.o] Error 2
# gmake[1]: *** [CMakeFiles/Makefile2:442: CMakeFiles/ngp.dir/all] Error 2
# gmake: *** [Makefile:136: all] Error 2
```

Thank you for pointing out `pixelmemory[frame_idx] = GPUMemory(img_size * image_type_size(image_type), low_vram);` in instant-ngp/dependencies/tiny-cuda-nn/include/tiny-cuda-nn/gpu_memory.h. This is a much cleaner approach, and I absolutely love it! The reason I didn't use it is simply that I didn't know it was already implemented in tiny-cuda-nn.
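The shadowing the reviewer describes can be demonstrated in plain host C++ (the exact nvcc diagnostic depends on the initializer form, but the core mechanism is ordinary name hiding): `void* pixelmemory[1] = { nullptr };` inside a member function *declares* a brand-new local array, it does not assign to the member. A minimal sketch, where the class and member names mirror the PR but everything else is a simplified stand-in:

```cpp
#include <vector>

// Simplified stand-in for NerfDataset; only the shadowing behavior matters here.
struct NerfDataset {
	std::vector<void*> pixelmemory = std::vector<void*>(8, nullptr);

	void set_training_image_shadowing(void* src) {
		// This line DECLARES a new local array named `pixelmemory`,
		// hiding the member above -- it never touches the member.
		void* pixelmemory[1] = { nullptr };
		pixelmemory[0] = src; // writes into the local array only
	}

	void set_training_image_assign(int frame_idx, void* src) {
		// Plain assignment to the member -- what the code actually needs.
		pixelmemory[frame_idx] = src;
	}
};
```

With the reviewer's suggested `pixelmemory[frame_idx] = GPUMemory(...)` form, no local of that name is needed at all, so the question of why the compiler "wants an array" disappears.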
- The option name is much more intuitive
src/nerf_loader.cu
Outdated
```cpp
	progress_to_gpu.update(i);
}
CUDA_CHECK_THROW(cudaDeviceSynchronize());
tlog::success() << "Copied / Converted " << images.size() << " images to GPU after " << tlog::durationToString(progress_to_gpu.duration());
```
Converted -> converted (consistency with above)
```cpp
uint32_t m_training_step = 0;
uint32_t m_training_batch_size = 1 << 18;
bool m_dataset_in_cpu_ram = false;
```
This member should be under `m_nerf.training`, i.e. `m_nerf.training.dataset_in_cpu_ram`.
src/nerf_loader.cu
Outdated
```cpp
size_t total_image_mem_size = img_size * image_type_size(image_type);
void* dst;
if (in_cpu_ram) {
	pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, true);
```
Better to move this out of the `if` statement and use `pixelmemory[frame_idx] = GPUMemory<uint8_t>(total_image_mem_size, in_cpu_ram);`
Then you can drop the `else` branch entirely (`resize` no longer needed).
```cpp
metadata_gpu.enlarge(last);
CUDA_CHECK_THROW(cudaMemcpy(metadata_gpu.data() + first, metadata.data() + first, n * sizeof(TrainingImageMetadata), cudaMemcpyHostToDevice));
if (!in_cpu_ram) {
```
This seems wrong to me. The metadata is still stored on the GPU -- and it's so small that it wouldn't make sense to offload it to CPU RAM anyway.
Tom94
left a comment
See individual comments
Pros for enabling low_vram
Cons for enabling low_vram
Screenshot examples
In this test, I was able to load 30 GB+ of data and train on a 16 GB GPU, while also maxing out the batch size and enabling per-image latent features. The training dataset contains about 3456 images.