
Thread safety and the HDF5 error stack in 4.9.3 #3193

@djhoese

Further details are in a conda-forge issue I opened a while ago, but most of that thread is me debugging and narrowing things down; the results are summarized below.

Use Case and Disclaimer

I use NetCDF C through netcdf4-python in a multi-threaded setting, usually via the Python dask library. When reading files in parallel, it is always through the Python xarray library, which to my understanding uses locks for different files. However, I originally ran into the issue below when creating new files (one per thread). I understand, although I had forgotten, that these use cases are likely not supported or intended to work, but I would be interested to hear any details on the state of thread safety in NetCDF.

Related: #382 #1373

The Problem

I discovered that when a new NetCDF file is created from a second thread after the first thread has already initialized the NetCDF C library (and therefore the HDF5 library's error handling), HDF5's error stack messages are printed/leaked. These messages are for expected failures/error cases of the HDF5 library.

Reproducer

Shame: ChatGPT gave me the skeleton of the below code. I'm primarily a python developer.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <netcdf.h>
#include <unistd.h>
#include <sys/types.h>

#define FILE_PATH "created_file.nc"  // Will be created (overwritten if exists)

// Macro to handle NetCDF errors
#define NC_CHECK(call)                                 \
    do {                                                \
        int retval = (call);                            \
        if (retval != NC_NOERR) {                       \
            fprintf(stderr, "NetCDF error: %s\n", nc_strerror(retval)); \
            exit(EXIT_FAILURE);                         \
        }                                               \
    } while (0)

// Thread function to create a NetCDF-4 file
void* create_netcdf4_file(void* arg) {
    const char* path = (const char*)arg;
    int ncid;

    pid_t tid = gettid();
    printf("Thread ID: %d\n", tid);
    printf("Thread: Creating NetCDF-4 file: %s\n", path);

    NC_CHECK(nc_set_default_format(NC_FORMAT_NETCDF4, NULL));

    // Create a new NetCDF-4 file, overwriting if it exists
    //NC_CHECK(nc_create(path, NC_NETCDF4 | NC_CLOBBER, &ncid));
    NC_CHECK(nc_create(path, NC_CLOBBER, &ncid));
    printf("Thread: File created (ID: %d)\n", ncid);

    // Close the file
    NC_CHECK(nc_close(ncid));
    printf("Thread: NetCDF-4 file created and closed successfully.\n");

    return NULL;
}

int main(void) {
    pthread_t thread;
    pid_t tid = gettid();

    printf("Main: Starting thread to create NetCDF-4 file...\n");
    printf("Thread ID: %d\n", tid);
    nc_rc_set("HTTP.SSL.CAINFO", "");

    if (pthread_create(&thread, NULL, create_netcdf4_file, (void*)FILE_PATH) != 0) {
        perror("pthread_create");
        return EXIT_FAILURE;
    }

    // Wait for the thread to finish
    pthread_join(thread, NULL);

    printf("Main: Thread finished.\n");
    return EXIT_SUCCESS;
}

Put the above in create_netcdf4_threaded.c and build with:

gcc -o create_netcdf4_threaded create_netcdf4_threaded.c -lnetcdf -lpthread 

Before running it, make sure to delete any existing output file; otherwise the error doesn't show.

rm -f created_file.nc
./create_netcdf4_threaded

In the output there should be something like:

HDF5-DIAG: Error detected in HDF5 (1.14.6) thread 1:
  #000: H5F.c line 496 in H5Fis_accessible(): unable to determine if file is accessible as HDF5
    major: File accessibility
    minor: Not an HDF5 file
  #001: H5VLcallback.c line 3913 in H5VL_file_specific(): file specific failed
    major: Virtual Object Layer
    minor: Can't operate on object
  #002: H5VLcallback.c line 3848 in H5VL__file_specific(): file specific failed
    major: Virtual Object Layer
    minor: Can't operate on object
  #003: H5VLnative_file.c line 344 in H5VL__native_file_specific(): error in HDF5 file check
    major: File accessibility
    minor: Can't get value
  #004: H5Fint.c line 1055 in H5F__is_hdf5(): unable to open file
    major: File accessibility
    minor: Unable to initialize object
  #005: H5FD.c line 788 in H5FD_open(): can't open file
    major: Virtual File Layer
    minor: Unable to open file
  #006: H5FDsec2.c line 324 in H5FD__sec2_open(): unable to open file: name = 'test.nc', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0
    major: File accessibility
    minor: Unable to open file

This happens in NetCDF C 4.9.3 but not 4.9.2.

Environment

This has been tested on various flavors of Linux using the conda-forge libnetcdf 4.9.3 package. I've also built it from source and used git bisect to find the change that introduced this. Version 4.9.2 does not show the problem. Builds so far have used gcc 15.1 from conda-forge. HDF5 1.14.6 from conda-forge is used, but the problem also appears when HDF5 is built from source for debugging.

Bisect

This started happening in f37fe57 which is part of #2021.

It has been a while since I looked at this and tracked down the actual functions, but if I remember correctly, the NC4/HDF5 initialization functions set the "auto" function for HDF5's error stack printing to NULL so that nothing is printed:

static herr_t
set_auto(void* func, void *client_data)
{
#ifdef DEBUGH5
    return H5Eset_auto2(H5E_DEFAULT,(H5E_auto2_t)h5catch,client_data);
#else
    return H5Eset_auto2(H5E_DEFAULT,(H5E_auto2_t)func,client_data);
#endif
}

/**
 * @internal Provide a function to do any necessary initialization of
 * the HDF5 library.
 */
void
nc4_hdf5_initialize(void)
{
    if (set_auto(NULL, NULL) < 0)
        LOG((0, "Couldn't turn off HDF5 error messages!"));
    LOG((1, "HDF5 error messages have been turned off."));
    NC4_hdf5_filter_initialize();
    nc4_hdf5_initialized = 1;
}

But this is only called once and is not repeated in a new thread. In my example program I specifically call nc_rc_set to force this initialization in the main thread. This mimics some calls in the netcdf4-python library.
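
To illustrate the suspected mechanism without involving HDF5 at all, here is a minimal sketch. It assumes the "print errors" setting is stored per thread, which is how thread-safe HDF5 builds are understood to handle error stacks; that is an assumption, not something verified here. All names (print_errors, init_once, flag_seen_by_new_thread, and so on) are hypothetical stand-ins and not netcdf-c or HDF5 code.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread "print errors" setting.
 * 1 = print (the default every new thread starts with), 0 = silenced.
 * This is NOT HDF5 code; it only models per-thread state. */
static _Thread_local int print_errors = 1;

/* Process-wide flag, like nc4_hdf5_initialized in the snippet above. */
static int initialized = 0;

/* Once-only initializer: silences errors, but only for the thread
 * that happens to run it first. */
void init_once(void) {
    if (!initialized) {
        print_errors = 0;
        initialized = 1;
    }
}

/* Worker: calls the initializer (a no-op by now) and reports its own flag. */
static void *worker(void *arg) {
    init_once();
    *(int *)arg = print_errors;
    return NULL;
}

/* Setting as seen by the thread that performed the initialization. */
int flag_seen_by_calling_thread(void) {
    init_once();
    return print_errors;
}

/* Setting as seen by a thread started after initialization.
 * Returns -1 if the thread could not be created. */
int flag_seen_by_new_thread(void) {
    pthread_t t;
    int seen = -1;
    init_once();
    if (pthread_create(&t, NULL, worker, &seen) != 0)
        return -1;
    pthread_join(t, NULL);
    return seen;
}
```

Under that assumption, the initializing thread sees the silenced value while a later thread still sees the default, which would match the behavior of the reproducer: the main thread triggers initialization via nc_rc_set and the worker thread prints the HDF5 error stack.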

The changes in #2021 seem to try to open the specified file for reading even though we're explicitly creating a new file and/or clobbering it if it exists. It relies on HDF5 raising an error to tell whether the file exists:

https://github.com/Unidata/netcdf-c/pull/2021/files#diff-1e6e01c6a7a1ed4c38d2bb4760c1f0740ee7dd825c0ba8cac86162bafa50949fR1597-R1600
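
For comparison, a plain existence check does not go through HDF5 and so cannot push anything onto its error stack. The sketch below is only an illustration of that idea using POSIX access(); path_exists and needs_format_probe are hypothetical helpers, not functions in netcdf-c, and this is not a claim about how the library should be changed.

```c
#include <unistd.h>

/* Hypothetical helper, not part of netcdf-c:
 * returns 1 if `path` exists, 0 otherwise. */
int path_exists(const char *path) {
    return access(path, F_OK) == 0;
}

/* Hypothetical decision helper: a format probe (for example via
 * H5Fis_accessible) is only worth attempting when the file exists.
 * A missing file then never reaches the HDF5 layer. */
int needs_format_probe(const char *path) {
    return path_exists(path);
}
```

Note that a check like this is inherently racy if another thread or process creates or removes the file between the check and the later create call, so it could only reduce the expected-error noise, not replace proper error handling.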

Questions

So I guess my main question is: how much of this is expected? Given that it didn't happen in 4.9.2 but happens in 4.9.3 I'm hoping that it is unintended and not that I was just getting lucky not hitting it prior to this.

Edit: I should have added that everything works fine. This is just a printed error message; the error itself is expected by the code in NetCDF, and the file is created correctly in the end.
