Mostly restore netcdf backend behavior with URLs
#10931
Conversation
xarray/backends/netCDF4_.py
```python
emit_user_level_warning(
    f"The NetCDF4 backend is guessing that {filename_or_obj!r} is a NetCDF file. "
    "In the future, xarray will require remote URLs to either have a .nc, .nc4, or .cdf "
    "extension, use a DAP protocol (dap2://, dap4://, or /dap/ in the path), or specify "
```
This isn't strictly true, but the full truth is, I think, too verbose for this already long warning.
I'll wait for @ocefpaf and @dopplershift to chime in before merging and issuing a bugfix release.
Co-authored-by: Deepak Cherian <[email protected]>
The plan going forward seems to be: That is not OK IMO. Most users are not data providers and cannot change the URL. While they can choose an engine, we lose one of the best things we had before: if you had pydap or netcdf4 installed, it would just work.

Sure, this seems to break in edge cases for biologists using zarr. That seems as easy to handle as checking for .zarr b/c, different from opendap, remote zarr URLs are easily identified.

At the end of the day it doesn't matter what we want; devs will do what they have to. Users will get a breaking change and will have to adapt. I'm biased here b/c I believe that the number of Met/Ocean users that will be on the "hard lifting part of this task" is much larger than those that will benefit from it, and identifying .zarr would probably make both sides happy.

TL;DR: that is my take as a user! As a developer I would deprecate and remove all guessing.
@ocefpaf thanks for the feedback. I'd really like to get this as right as possible, and this is really helpful. I have some follow-up questions below.
Thanks for the dual perspective :)
I take it that my effort to expand the possible scope of DAP URLs is inadequate to solve this problem? I'd be happy to try to encode a larger set of common data providers and URL patterns. It's easier to put in positive includes rather than the arbitrary and unlimited set of excludes. It's sad for us that there is no single feature, or small set of unifying features, that definitively shows that something is a DAP URL. My hope was to cover ~90% of data URL cases, and ask the remaining 10% of DAP users to add an explicit engine. Does that seem feasible/acceptable?
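A positive-include matcher along these lines is what I have in mind. This is only an illustrative sketch; the pattern list here is an assumption based on common provider URL shapes (THREDDS `dodsC`, ERDDAP `griddap`/`tabledap`, `dap2://`/`dap4://` schemes), not the actual list in this PR:

```python
import re

# Illustrative provider markers (an assumption, not the PR's actual list):
# dap2:// / dap4:// schemes, THREDDS dodsC paths, generic /dap/ segments,
# and ERDDAP griddap/tabledap endpoints.
_DAP_HINTS = re.compile(
    r"(^dap[24]://|/thredds/dodsC/|/dap/|/erddap/(grid|table)dap/)",
    re.IGNORECASE,
)

def looks_like_dap_url(url: str) -> bool:
    """Heuristically decide whether a URL points at a DAP endpoint."""
    return bool(_DAP_HINTS.search(url))

print(looks_like_dap_url("dap4://example.org/data"))                  # True
print(looks_like_dap_url("https://host/thredds/dodsC/model/latest"))  # True
print(looks_like_dap_url("https://host/files/output.zarr"))           # False
```

The remaining ~10% of DAP users whose URLs match none of these would pass `engine=` explicitly.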
I'd amend this to "break [anyone/most] using zarr", including geo users. For example, I know that some weather modellers are moving towards providing data via zarr. In those cases especially, remote access is critical. So I disagree that this is an edge case. It's a mainline use case for a growing number of users in multiple fields, and they similarly deserve the "just work" behavior.
This is a fine solution for officially supported backends, but anyone who defines a custom backend will not be able to add an exception. Although, maybe that's OK. The only out would be naming their backend in a way that sorts alphabetically ahead of netcdf, which is not amazing. An alternate future-proofing could be an actual precedence system for backends, where each one reports a number.

**Fixing this bug**

Maybe the move here is to punt on the deprecation/future-proofing in order to release a bugfix, then open a new PR with possibilities for thinking about the future.
I think that would be the easiest, yes. Just revert the original PR (if possible) and release, and then we can have a more relaxed conversation about the way forward.
I'm not sure that is possible without inspecting the URL header response, and maybe even that may fail in some corner cases. One thing that I used in the past was the more costly header fetch, after everything else fails. That helped users figure it out by themselves by giving some hint of what is in the URL. For example, this URL does have the opendap info in it:

```python
import requests

url = "https://www.neracoos.org/erddap/griddap/WW3_EastCoast_latest"
r = requests.get(url)
r.headers["xdods-server"]
```

which returns:

```
'dods/3.7'
```

Not sure if this is worth doing here though.
Sure, as a heavy user of zarr<3 myself, I also use netcdf4 for reading those URLs 😬 . Anyway, we are in user land and things will always be complicated. Someone will be unsatisfied and mad b/c the keyboard key press is no longer heating the room.
I'm not sure I follow that part. I'll believe you as the expert here on the topic.
+1. The way forward will leave a group unhappy. The best solution is the one where everybody is unhappy! No really, that is real diplomacy, look it up. My 2 cents is that we need at least a deprecation warning. Then folks can prepare downstream libraries to keep using netcdf4 as their engine and not be surprised by a missing dependency or changed behavior. (I know that pydap has its fans and I do like its pure Python nature, but I've been bitten way too many times by datasets it cannot open that "just work" with netcdf4.)
+1 In many cases, it is the systems admin, and not even the data producer, who creates the redirect exposing the URL to the user, and there is no guarantee that the file extension or any of those patterns (dodsC, dap, opendap, hyrax, erddap) will appear in the URL (although they seem to in my experience). This is a facet of the service where the developer has little influence.
+1 Seems best IMHO. I am not against requiring an engine for remote URLs to be specified; never had an issue with that. I know that the original purpose of the PR was to make things more streamlined/faster, and that should always be welcomed. My only comment back then (and it continues to be the same today) was that I was against breaking anyone's workflow.
I appreciate the work to roll things back while putting in some structure that makes things more sustainable going forward. If we're excluding common URL patterns from the warning (and can add more when we find them), then I'm more OK with a warning.

I also want to note, so it's not overlooked, that the netcdf-c library supports reading zarr. I certainly understand that xarray may want to rely on the Python implementation by default, but I want to make sure everyone is clear that netcdf4 should be a valid engine for reading zarr. (I'm also curious what was going wrong that netcdf-c was choking on the biology datasets.)
I truly did not know this. Although, that said, I can't get it to work with this script:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "xarray[complete]",
#     "zarr<3.1.4",
#     "numcodecs<0.16.4"
# ]
# ///
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {"foo": (("x", "y"), np.random.rand(4, 5))},
    coords={
        "x": [10, 20, 30, 40],
        "y": pd.date_range("2000-01-01", periods=5),
        "z": ("x", list("abcd")),
    },
)

# same result for format 2 or 3
ds.to_zarr("test.zarr", zarr_format=2, consolidated=False, mode="w")
xr.open_dataset("test.zarr", engine="netcdf4")
```

The last line fails.
@ianhi Zarr is an optional flag to enable, so it's entirely possible your copy doesn't have it turned on (it's a recent addition and some kinks are still getting worked out). Where'd you get your copy of netcdf4 from? |
Wherever
That is likely the PyPI wheel. If you are on macOS or Linux, it doesn't have any of the plugins or zarr support yet, unless you build your own version. We are working on a new version of the wheel now that we are using ABI3 and the build matrix reduced significantly. (But that will take some time.) Also, unfortunately the syntax is a bit awkward and could use some... guessing ;-p
@ianhi Good to know. Looks like it's not working for me on the conda-forge packages, though I've confirmed from logs that support was enabled. I've opened Unidata/netcdf-c#3214 to work through some of the interop issues over there.
Let me recap my comment from over in #10804 for visibility: I appreciate that defaulting to netCDF4 for URL strings is long-standing in Xarray, but I don't think it's a good design in the long term. It's somewhat ambiguous what a URL means (is it a DAP endpoint? is it an HTTP server? is it some Zarr endpoint?), and even though netCDF-C can in principle handle most of these, I don't think using it as an intermediary in all cases will be a good experience for most users. In particular, I think Zarr stores are much better handled by default by the Zarr-Python library.

So I agree that we should restore netCDF4 grabbing most URLs for now (sudden breaking changes are bad!), but I think this behavior should be deprecated, and users should be encouraged to make an explicit choice of backend for HTTP URLs.

There are many ways we could allow users to be explicit about the sort of files they are opening, and in the long term I think this is a way better strategy than adding more clever "guessing" logic.
The explicit constructor option is probably my favorite. We would only need a few of these built into Xarray.
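To make the idea concrete, here is a purely hypothetical sketch of what explicit constructors could look like; none of these names (`DapURL`, `ZarrURL`, `pick_engine`) exist in xarray today, and the real design would surely differ:

```python
# Hypothetical sketch only. A tiny marker type records the user's intent,
# so a dispatcher can act on the wrapper instead of guessing from the string.
from dataclasses import dataclass

@dataclass(frozen=True)
class DapURL:
    url: str
    engine: str = "pydap"

@dataclass(frozen=True)
class ZarrURL:
    url: str
    engine: str = "zarr"

def pick_engine(target) -> str:
    """Toy dispatcher: explicit wrappers win; bare strings fall back to guessing."""
    if isinstance(target, (DapURL, ZarrURL)):
        return target.engine
    return "guess"

print(pick_engine(DapURL("https://example.org/thredds/dodsC/ds")))  # pydap
print(pick_engine("https://example.org/data.nc"))                   # guess
```

The appeal is that ambiguity disappears at the call site without any URL heuristics.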
@dopplershift try the previous libnetcdf version, I'm pretty sure latest broke this feature. |
I agree. Even knowing that it can, and especially in light of the issues we saw here, I will continue to prefer zarr-python or icechunk as my backend for zarr stores in xarray.
In general yes, but are you suggesting also to pare back the extra string matching added here? Or to be thoughtful in the future as we move towards more explicit semantics?

**Conclusion here**

I am going to modify this PR to remove the warning for now, but leave in the exception for zarr from netcdf. Then I will open a follow-up where we can be more precise with the warning language and spend more time considering what the optimal deprecation path and end state is.
```python
# Replace DAP protocol prefixes with https:// - netCDF4 library can't handle them
# These prefixes may be added by users to explicitly indicate DAP protocol
# Following pydap's convention, we convert to https://
# See: https://github.com/pydap/pydap/blob/0a2b0892611abaf0a9762ffd4f2f082cb8e497c2/src/pydap/handlers/dap.py#L103-L107
if isinstance(filename, str):
    filename_lower = filename.lower()
    if filename_lower.startswith(("dap2://", "dap4://")):
        filename = "https://" + filename[7:]
```
Note for review: a sneaky thing this PR does is allow the dap4:// syntax to work with `engine="netcdf4"`. This makes it cleaner to have one DAP URL detection function, but may have downsides I'm unaware of.
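The prefix rewrite can be exercised in isolation with a hypothetical standalone helper mirroring the diff above (the function name is mine, not from the PR):

```python
def strip_dap_prefix(filename: str) -> str:
    # Mirrors the diff above: "dap2://" and "dap4://" are both 7 characters,
    # so slicing off the first 7 and prepending https:// handles either prefix.
    if isinstance(filename, str) and filename.lower().startswith(("dap2://", "dap4://")):
        return "https://" + filename[7:]
    return filename

print(strip_dap_prefix("dap4://www.neracoos.org/erddap/griddap/WW3"))
# https://www.neracoos.org/erddap/griddap/WW3
print(strip_dap_prefix("https://example.org/data.nc"))  # unchanged
```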
In #10804 we made a breaking change without warning. That PR significantly restricted the URLs that the `netcdf4` backend claims to be able to open (`guess_can_open`). The `netcdf4` engine previously claimed to be able to open any URL, which is the wrong behavior: it results in bugs (#10801) for anyone using a non-netcdf data format, and it adds a complexity burden for users who want to open remote zarr, or any other custom xarray backend (examples).

Even though that PR was a bugfix, it was also a breaking change that introduced new bugs for users who relied on `xr.open("some_url")` defaulting to netcdf4. See #10804 (comment).

This PR restores backward compatibility while fixing the specific case of zarr, and adds a deprecation path toward stricter URL detection.
**Goals**

**1. Existing workflows are not broken**

We need to make sure that all the workflows that relied on the prior behavior are not suddenly broken. This PR restores the behavior of netcdf4 grabbing almost all URLs, with a small exception for zarr.
**2. Do not re-introduce the bug for zarr users**

zarr users are an important and growing demographic of xarray users. I don't want to re-introduce the original bug where, when you tried to open a remote zarr store, you got an error about netcdf being unable to read it. Very confusing!

So I add an exception in the netcdf4 URL guessing: if `zarr` is present anywhere in the URL, it passes on it. This might have a consequence for anyone using netcdf4 for reading zarr, but I think the tradeoff of those users being forced to use `engine="netcdf4"` vs the convenience for people who want the zarr backend is a worthwhile one.

**3. Future-proof the guess_can_open**
zarr will not be the last format that gets invented, and people will continue to write custom xarray backends. This means that the `zarr` exception in the netcdf4 `guess_can_open` in this PR will be inadequate to protect custom backends, or new formats that want to be accessed remotely. So we can't just continue to add exceptions to the netcdf4 URL guessing. To future-proof here I have:

- added a FutureWarning explaining what users should do
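Putting goals 2 and 3 together, the guessing logic could be sketched roughly as follows. This is a hedged illustration, not the PR's actual code; the function name, the pattern list, and the warning text are all simplifications:

```python
import re
import warnings

def guess_can_open_url(url: str) -> bool:
    """Hypothetical sketch of the netcdf4 URL guessing described above."""
    if not url.startswith(("http://", "https://", "dap2://", "dap4://")):
        return False
    # Goal 2: pass on anything that looks like a zarr store, so the
    # zarr backend can claim it instead.
    if "zarr" in url.lower():
        return False
    # Goal 3: still claim the URL, but warn when nothing about it
    # (extension or DAP marker) confirms it is a NetCDF/DAP endpoint.
    if not re.search(r"\.(nc|nc4|cdf)$|dap", url, re.IGNORECASE):
        warnings.warn(
            "The NetCDF4 backend is guessing that this URL is a NetCDF file. "
            "In the future, pass engine= explicitly.",
            FutureWarning,
        )
    return True
```

(Per the conclusion above, the warning itself is being dropped from this PR and revisited in a follow-up; only the zarr exception stays.)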
attn: @ocefpaf @dopplershift @Mikejmnez
- whats-new.rst
- api.rst