Conversation

@ianhi
Contributor

@ianhi ianhi commented Nov 19, 2025

In #10804 we made a breaking change without warning. That PR significantly restricted the URLs that the netcdf4 backend claims to be able to open (guess_can_open). The netcdf4 engine previously claimed to be able to open any URL, which is the wrong behavior. It results in bugs (#10801) for anyone using a non-netcdf data format. The original behavior adds a complexity burden for users who want to open remote zarr, or any other custom xarray backend (examples)

Even though that PR was a bugfix, it was also a breaking change that introduced new bugs for users who relied on xr.open("some_url") defaulting to netcdf4. See #10804 (comment)

This PR restores backward compatibility while fixing the specific case of zarr, and adding a deprecation path toward stricter URL detection.

Goals

1. Existing workflows are not broken

We need to make sure that all the workflows that relied on the prior behavior are not suddenly broken. This PR restores the behavior of netcdf4 grabbing almost all URLs, with a small exception for zarr.

2. Do not re-introduce the bug for zarr users

zarr users are an important and growing demographic of xarray users. I don't want to re-introduce the original bug where opening a remote zarr store produced an error about netcdf being unable to read it. Very confusing!

So I added an exception to the netcdf4 URL guessing: if "zarr" appears anywhere in the URL, netcdf4 passes on it. This might have a consequence for anyone using netcdf4 for reading zarr. But I think the tradeoff of those users being forced to use engine="netcdf4" vs the convenience for people who want the zarr backend is a worthwhile one.
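For concreteness, the exception could look something like the following sketch. The function name and the simplified control flow are illustrative assumptions, not the actual xarray implementation (the real guess_can_open handles local paths, file objects, and more suffixes):

```python
def netcdf4_guess_can_open(filename_or_obj) -> bool:
    # Hypothetical simplification of the netcdf4 backend's guess_can_open:
    # decline any URL mentioning zarr so the zarr backend can claim it.
    if isinstance(filename_or_obj, str) and filename_or_obj.startswith(
        ("http://", "https://")
    ):
        if "zarr" in filename_or_obj.lower():
            return False
        return True  # restored prior behavior: claim almost any other URL
    return False  # local-path and file-object handling elided in this sketch
```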

3. Future proof the guess_can_open

zarr will not be the last format that gets invented, and people will continue to write custom xarray backends. This means that the zarr exception in the netcdf4 guess_can_open in this PR will be inadequate to protect custom backends or new formats that need remote access. So we can't just continue to add exceptions to the netcdf4 URL guessing. To future proof here I have:

  1. Significantly expanded what is detected as a DAP URL by netcdf4
    • DAP protocol schemes: dap://, dap2://, dap4://
    • Server-specific paths: /dodsC/ (THREDDS), /dods/ (GrADS), /opendap/ (Hyrax), /erddap/ (ERDDAP)
    • This centralized logic is shared between the netcdf4 and pydap backends
  2. Added a deprecation warning - If all specific URL checks fail and netcdf4 falls back to claiming a URL, it now emits a FutureWarning explaining what users should do
  3. Added DAP URL sanitization - The netcdf4 backend now converts dap2:// and dap4:// schemes to https:// (following pydap's convention), since the underlying netCDF4 library doesn't understand these custom schemes
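The expanded detection in item 1 might look roughly like this sketch. The function name `is_likely_dap_url` and the exact pattern lists are assumptions for illustration, not xarray's actual code:

```python
from urllib.parse import urlparse

# Hypothetical centralized DAP URL detection shared by the netcdf4 and
# pydap backends; scheme and path markers mirror the list above.
DAP_SCHEMES = ("dap", "dap2", "dap4")
DAP_PATH_MARKERS = ("/dodsC/", "/dods/", "/opendap/", "/erddap/")

def is_likely_dap_url(url: str) -> bool:
    """Return True if the URL matches a known DAP scheme or server path."""
    parsed = urlparse(url)
    if parsed.scheme in DAP_SCHEMES:
        return True
    return any(marker in parsed.path for marker in DAP_PATH_MARKERS)
```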

attn: @ocefpaf @dopplershift @Mikejmnez

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

emit_user_level_warning(
    f"The NetCDF4 backend is guessing that {filename_or_obj!r} is a NetCDF file. "
    "In the future, xarray will require remote URLs to either have a .nc, .nc4, or .cdf "
    "extension, use a DAP protocol (dap2://, dap4://, or /dap/ in the path), or specify "
Contributor Author

this isn't strictly true, but the full truth is, I think, too verbose for this already long warning

@dcherian
Contributor

I'll wait for @ocefpaf and @dopplershift to chime in before merging and issuing a bugfix release.

ianhi and others added 2 commits November 19, 2025 19:35
@ocefpaf
Contributor

ocefpaf commented Nov 20, 2025

The plan going forward seems to be:

The NetCDF4 backend now emits a ``FutureWarning`` when opening remote URLs without
  a ``.nc``/``.nc4``/``.cdf`` extension or DAP protocol indicators (e.g., ``dap2://``,
  ``/dodsC/``, ``/erddap/``). In the future, xarray will require remote URLs to either
  have a NetCDF file extension, use a recognized DAP URL pattern, or specify the backend
  explicitly via ``engine="netcdf4"``. 

That is not OK IMO. Most users are not data providers and cannot change the URL. While they can choose an engine, we lose one of the best things we had before: if you have pydap or netcdf4 installed, that would just work. Sure, this seems to break in edge cases for biologists using zarr. That seems to be as easy as checking for .zarr b/c, different from opendap, remote zarr URLs are easily identified.

At the end of the day it doesn't matter what we want, devs will do what they have to. Users will get a breaking change and will have to adapt. I'm biased here b/c I believe that the number of Met/Ocean users that will be on the "hard lifting part of this task" is much larger than those that will benefit from it, and identifying .zarr would probably make both sides happy.

TL;DR that is my take as a user! As a developer I would deprecate and remove all guessing.

@ianhi
Contributor Author

ianhi commented Nov 20, 2025

@ocefpaf thanks for the feedback, I'd really like to get this as right as possible and this is really helpful. I have some follow up questions below

TL;DR that is my take as a user! As a developer I would deprecate and remove all guessing.

Thanks for the dual perspective :)

That is not OK IMO. Most users are not data providers and cannot change the URL.

I take it that my effort to expand the possible scope of dap URLs is inadequate to solve this problem? I'd be happy to try to encode a larger set of common data providers and URL patterns. It's easier to put in positive includes rather than the arbitrary and unlimited set of excludes. It's sad for us that there is no single, or small, set of unifying features that definitively shows that something is a DAP URL. My hope was to cover ~90% of data URL cases, and ask the remaining 10% of dap users to add an explicit engine. Does that seem feasible/acceptable?

Sure, this seems to break in edge cases for biologists using zarr

I'd amend this to: "break [anyone/most] using zarr", including geo users. For example, I know that some weather modellers are moving towards providing data via zarr. In those cases especially, remote access is critical. So I disagree that this is an edge case. It's a mainline use case for a growing number of users in multiple fields, and they similarly deserve the "just work" behavior.

That seems to be as easy as checking for .zarr b/c, different from opendap, remote zarr URLs are easily identified.

This is a fine solution for officially supported backends, but anyone who defines a custom backend will not be able to add an exception. Although, maybe that's ok. But the workaround would be just naming their backend in a way that comes alphabetically ahead of netcdf, which is not amazing.

An alternate future proofing could be to have an actual precedence system for backends where they each report a number.
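To make that idea concrete, here is a purely illustrative sketch of such a precedence system. Nothing like this exists in xarray today (BackendEntrypoint has no priority attribute); the class and function names are hypothetical:

```python
# Illustrative sketch: each backend reports a numeric priority, and among
# the backends whose guess_can_open() accepts the target, the highest wins.
class Backend:
    def __init__(self, name: str, priority: int, can_open):
        self.name = name
        self.priority = priority
        self.can_open = can_open

def pick_backend(backends, target: str) -> Backend:
    """Select the highest-priority backend that claims `target`."""
    candidates = [b for b in backends if b.can_open(target)]
    if not candidates:
        raise ValueError(f"no backend claims {target!r}")
    # Higher priority wins; the name breaks ties deterministically.
    return max(candidates, key=lambda b: (b.priority, b.name))

backends = [
    Backend("netcdf4", 0, lambda t: t.startswith("http")),  # low-priority catch-all
    Backend("zarr", 10, lambda t: ".zarr" in t),            # specific, outranks the catch-all
]
```

With priorities, a catch-all backend like netcdf4 can keep claiming URLs without shadowing more specific backends, and custom backends would not need alphabetical-ordering tricks.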

Fixing this bug

Maybe the move here is to punt on the deprecation/future proofing in order to release a bugfix. Then open a new PR with possibilities for thinking about the future.

@keewis
Collaborator

keewis commented Nov 20, 2025

Maybe the move here is to punt on the deprecation/future proofing in order to release a bugfix.

I think that would be the easiest, yes. Just revert the original PR (if possible) and release, and then we can have a more relaxed conversation about the way forward.

@ocefpaf
Contributor

ocefpaf commented Nov 20, 2025

I take it that my effort to expand the possible scope of dap URLs is inadequate to solve this problem? I'd be happy to try to encode a larger set of common data providers and URL patterns. It's easier to put in positive includes rather than the arbitrary and unlimited set of excludes. It's sad for us that there is no single, or small, set of unifying features that definitively shows that something is a DAP URL. My hope was to cover ~90% of data URL cases, and ask the remaining 10% of dap users to add an explicit engine. Does that seem feasible/acceptable?

I'm not sure that is possible without inspecting the URL header response. Maybe even that may fail in some corner cases, but why isn't handling the .zarr ones enough? Definitely simpler. Leave the users to pass any other URL and deal with any possible errors themselves.

One thing that I used in the past was the more costly header fetch, after everything else fails. That helped users figure things out by themselves by giving some hint of what is in the URL. For example, this URL does have the opendap info in it:

import requests

url = "https://www.neracoos.org/erddap/griddap/WW3_EastCoast_latest"
r = requests.get(url)
r.headers["xdods-server"]  # returns 'dods/3.7'

Not sure if this is worth doing here though.

I'd amend this to: "break [anyone/most] using zarr", including geo users. For example, I know that some weather modellers are moving towards providing data via zarr. In those cases especially, remote access is critical. So I disagree that this is an edge case. It's a mainline use case for a growing number of users in multiple fields, and they similarly deserve the "just work" behavior.

Sure, as a heavy user of zarr<3 myself, I also use netcdf4 for reading those URLs 😬 . Anyway, we are in user land and things will always be complicated. Someone will be unsatisfied and mad b/c the keyboard key press is no longer heating the room.

This is a fine solution for officially supported backends, but anyone who defines a custom backend will not be able to add an exception. Although, maybe that's ok. But the workaround would be just naming their backend in a way that comes alphabetically ahead of netcdf, which is not amazing.
An alternate future proofing could be to have an actual precedence system for backends where they each report a number.

I'm not sure I follow that part. I'll believe you as the expert here on the topic.

I think that would be the easiest, yes. Just revert the original PR (if possible) and release, and then we can have a more relaxed conversation about the way forward.

+1. The way forward will leave a group unhappy. The best solution is the one where everybody is unhappy! No really, that is real diplomacy, look it up.

My 2 cents is that we need at least a deprecation warning. Then folks can prepare downstream libraries to keep using netcdf4 as their engine and not be surprised by a missing dependency or changed behavior. (I know that pydap has its fans and I do like its pure Python nature, but I've been bitten way too many times with datasets it cannot open that "just work" with netcdf4.)

@Mikejmnez
Contributor

Mikejmnez commented Nov 20, 2025

That is not OK IMO. Most users are not data providers and cannot change the URL. While they can choose an engine, we lose one of the best things we had before: if you have pydap or netcdf4 installed, that would just work. Sure, this seems to break in edge cases for biologists using zarr. That seems to be as easy as checking for .zarr b/c, different from opendap, remote zarr URLs are easily identified.

It's sad for us that there is no single, or small set, of unifying features that definitively show that something is a DAP url

+1 In many cases, it is the systems admin and not even the data producer who creates the redirect exposing the URL to the user, and there is no guarantee that the file extension or any of those patterns (dodsC, dap, opendap, hyrax, erddap) will appear in the URL (although they seem to do so in my experience). This is a facet of the service where the developer has little influence.

I think that would be the easiest, yes. Just revert the original PR (if possible) and release, and then we can have a more relaxed conversation about the way forward.

+1 Seems best

IMHO I am not against requiring an engine to be specified for remote urls; I never had an issue with engine="pydap" and, in fact, I am in favor of this because it brings awareness among data users of the different backend engines, useful when debugging or reporting a bug. I have regular meetings with the thredds folks, and I know that issues that have been brought to me, either via xarray's or pydap's issue tracker, have helped identify bugs and improve the tds's dap4 service. The one thing I am certain about is dap urls requiring a dap2 or dap4 specifier. Whether a url has dap2/dap4 endpoints can certainly be inspected with a requests approach, but I strongly advise against that or any similar header inspection as a pattern within xarray. That should belong on the user side. And perhaps this pattern will eventually cover ~90% of DAP data URL access cases.

I know that the original purpose of the PR was to make things more streamlined/faster, and that should always be welcomed. My only comment back then (and continues to be the same today) was that I was against breaking anyone's workflow.

@dopplershift
Contributor

I appreciate the work to roll things back while putting in some structure that makes things more sustainable going forward. If we're excluding common URL patterns (and can add more when we find them) from the warning, then I'm more ok with a warning.

I also want to note that (so it's not overlooked):

The netcdf-c library supports reading zarr. I certainly understand that xarray may want to rely on the Python implementation by default, but I want to make sure everyone is clear that netcdf4 should be a valid engine for reading zarr. (I'm also curious what was going wrong that netcdf-c was choking on the biology datasets.)

@ianhi
Contributor Author

ianhi commented Nov 20, 2025

as a heavy user of zarr<3 myself, I also use netcdf4 for reading those URLs

I want to make sure everyone is clear that netcdf4 should be a valid engine for reading zarr.

I truly did not know this. Although that said, I can't get it to work with engine=netcdf4

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "xarray[complete]",
#   "zarr<3.1.4",
#   "numcodecs<0.16.4"
# ]
# ///
#

import numpy as np
import pandas as pd

import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 5))},
    coords={
        "x": [10, 20, 30, 40],
        "y": pd.date_range("2000-01-01", periods=5),
        "z": ("x", list("abcd")),
    },
)
# same result for format 2 or 3
ds.to_zarr("test.zarr", zarr_format=2, consolidated=False, mode="w")
xr.open_dataset("test.zarr", engine="netcdf4")

fails with

  File "src/netCDF4/_netCDF4.pyx", line 2517, in netCDF4._netCDF4.Dataset.__init__
  File "src/netCDF4/_netCDF4.pyx", line 2154, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -51] NetCDF: Unknown file format: '/Users/ian/Documents/dev/xarray/test.zarr'

@dopplershift
Contributor

@ianhi Zarr is an optional flag to enable, so it's entirely possible your copy doesn't have it turned on (it's a recent addition and some kinks are still getting worked out). Where'd you get your copy of netcdf4 from?

@ianhi
Contributor Author

ianhi commented Nov 20, 2025

Where'd you get your copy of netcdf4 from?

Wherever uv sync pulls from. Which I suspect is PyPI by default. I ran that example with uv run file.py

@ocefpaf
Contributor

ocefpaf commented Nov 20, 2025

Wherever uv sync pulls from. Which I suspect is PyPI by default. I ran that example with uv run file.py

That is likely the PyPI wheel. If you are on macOS or Linux it doesn't have any of the plugins or zarr support yet, unless you build your own version. We are working on a new version of the wheel now that we are using ABI3 and the build matrix reduced significantly. (But that will take some time.)

Also, unfortunately the syntax is a bit awkward and could use some... Guessing ;-p
See https://docs.unidata.ucar.edu/netcdf-c/4.9.2/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html

@dopplershift
Contributor

@ianhi Good to know. Looks like it's not working for me on the conda-forge packages, though I've confirmed from logs that support was enabled.

I've opened Unidata/netcdf-c#3214 to work through some of the interop issues over there.

@shoyer
Member

shoyer commented Nov 20, 2025

Let me recap my comment from over in #10804 for visibility:

I appreciate that defaulting to netCDF4 for URL strings is long-standing in Xarray, but I don't think it's a good design in the long term. It's somewhat ambiguous what a URL means (is it a DAP endpoint? is it an HTTP server? is it some Zarr endpoint?), and even though netCDF-C can in principle handle most of these, I don't think using it as an intermediary in all cases will be a good experience for most users. In particular, I think Zarr stores are much better handled by default by the Zarr-Python library.

So I agree that we should restore netCDF4 grabbing most URLs for now (sudden breaking changes are bad!), but I think this behavior should be deprecated, and users should be encouraged to make an explicit choice of backend for HTTP URLs. The users should see something like FutureWarning: Xarray will no longer guess that netCDF4 can open all URLs in the future. Set engine='netcdf4' explicitly (or another xarray backend engine) to silence this message.

There are many ways we could allow users to be explicit about the sort of files they are opening, and in the long term I think this is a way better strategy than adding more clever "guessing" logic. Some possibilities, which hopefully are not too much worse than xr.open_dataset('https://myserver.org/dataset'):

  • Explicit protocol in the URI: xr.open_dataset('dap://myserver.org/dataset')
  • Explicit suffix in the URI: xr.open_dataset('https://myserver.org/dataset.dap')
  • Explicit engine keyword: xr.open_dataset('https://myserver.org/dataset', engine='netcdf4')
  • Explicit constructors: xr.open_dap_dataset('https://myserver.org/dataset') (and maybe also open_dap_datatree() and open_dap_array())
  • String matching anywhere in a URI: xr.open_dataset('https://myserver.org/erddap/dataset') (I really don't like this option, it seems very error prone/surprising)

The explicit constructor option is probably my favorite. We would only need a few of these built into Xarray (e.g., open_netcdf_dataset(), open_dap_dataset(), open_zarr_dataset()) and it would be entirely explicit for both users and type-systems, with just a few extra characters typed.
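The explicit-constructor idea reduces to thin wrappers that pin the engine. A minimal sketch, assuming nothing beyond the current xarray API (`make_explicit_opener` and the derived names are hypothetical, not a proposed interface):

```python
# Hypothetical sketch of the "explicit constructors" option: thin wrappers
# that pin the engine choice so no guessing is needed. `opener` stands in
# for a function like xr.open_dataset.
def make_explicit_opener(engine: str, opener):
    """Build an opener that always passes a fixed engine= through."""
    def open_explicit(filename_or_obj, **kwargs):
        return opener(filename_or_obj, engine=engine, **kwargs)
    return open_explicit

# In xarray terms this would give, e.g.:
#   open_dap_dataset = make_explicit_opener("pydap", xr.open_dataset)
#   open_zarr_dataset = make_explicit_opener("zarr", xr.open_dataset)
```

The point is that each constructor is fully explicit for both users and type checkers, at the cost of only a few extra characters.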

@ocefpaf
Contributor

ocefpaf commented Nov 20, 2025

@ianhi Good to know. Looks like it's not working for me on the conda-forge packages, though I've confirmed from logs that support was enabled.

@dopplershift try the previous libnetcdf version, I'm pretty sure latest broke this feature.

@ianhi
Contributor Author

ianhi commented Nov 20, 2025

netCDF-C can in principle handle most of these, I don't think using it as an intermediary in all cases will be a good experience for most users

I agree. Even knowing that it can, and especially in light of the issues we saw here, I will continue to prefer zarr-python or icechunk as my backend for zarr stores in xarray.

String matching anywhere in a URI: xr.open_dataset('https://myserver.org/erddap/dataset') (I really don't like this option, it seems very error prone/surprising)

In general yes, but are you suggesting also to pare back the extra string matching added here? Or to be thoughtful in the future as we move towards more explicit semantics?

Conclusion here

I am going to modify this PR to remove the warning for now, but leave in the exception for zarr from netcdf. Then I will open a follow-up where we can be more precise with the warning language and spend more time considering what the optimal deprecation path and end state is.

Comment on lines +466 to +474
# Replace DAP protocol prefixes with https:// - netCDF4 library can't handle them
# These prefixes may be added by users to explicitly indicate DAP protocol
# Following pydap's convention, we convert to https://
# See: https://github.com/pydap/pydap/blob/0a2b0892611abaf0a9762ffd4f2f082cb8e497c2/src/pydap/handlers/dap.py#L103-L107
if isinstance(filename, str):
    filename_lower = filename.lower()
    if filename_lower.startswith(("dap2://", "dap4://")):
        filename = "https://" + filename[7:]

Contributor Author
note for review: a sneaky thing this PR does is allow the dap4:// syntax to work with engine="netcdf4". This makes it cleaner to have one dap url detection function, but may have downsides I'm unaware of.
