Skip to content

Conversation

@jauy123
Copy link
Contributor

@jauy123 jauy123 commented Mar 3, 2025

Fixes #4907

Changes made in this Pull Request:

This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes: DownloaderBase and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.
  • Added requests as a dependency
  • mda.fetch_pdb() is implemented as a wrapper to commonly used option in 'PDBDownloader'

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 3, 2025

I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere

Some stuff which I like to still add (besides tests and docs):

  1. Verbose option for PdbDownloader.download() (I think tqdm was a dependency last time I saw?)
  2. Integration with Mdanalysis' logger
  3. Probably could wrap the cache logic into a separate function

@BradyAJohnston BradyAJohnston self-assigned this Mar 4, 2025
@BradyAJohnston
Copy link
Member

BradyAJohnston commented Mar 4, 2025

I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).

Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).

@codecov
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.73%. Comparing base (5c7c480) to head (04275e9).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4943      +/-   ##
===========================================
+ Coverage    92.68%   92.73%   +0.05%     
===========================================
  Files          180      180              
  Lines        22452    22474      +22     
  Branches      3186     3190       +4     
===========================================
+ Hits         20809    20842      +33     
- Misses        1169     1176       +7     
+ Partials       474      456      -18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 5, 2025

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 10, 2025

Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.

@BradyAJohnston
Copy link
Member

You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython:

try:
import Bio.AlignIO
import Bio.Align
import Bio.Align.Applications
except ImportError:
HAS_BIOPYTHON = False
else:
HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

Thanks for the comment!

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

I happen to have another question!

Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.

The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.

Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

@orbeckst
Copy link
Member

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.

@orbeckst
Copy link
Member

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

@jauy123
Copy link
Contributor Author

jauy123 commented Jun 12, 2025

I have time again. I was busy starting the end of spring break with comps, classes, and you know what.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main thig: make your docs appear :-)


from .due import due, Doi, BibTeX

from .topology.PDBParser import fetch_pdb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IAlibay @BradyAJohnston are we sure that we want the import at the top level?

If we do more fetch_xxx() in the future then we may have to deprecate it again, e.g. in favor of a mda.fetch.pdb(...) or Universe.from_fetched.

I think it's ok to leave it here for now because we don't have anything else. If we get more before 3.0, we still have time to deprecate and remove in 3.0.

If it is left in then does it need to be documented at the top level, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree an mda.fetch module would be nicer than having this at the top-level. And it makes more sense for the fetching code to be in its own module rather than PDBParser as there's no parsing happening here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@p-j-smith thank you for commenting. If you have strong opinions, make it a blocking a review. That's the best way to drive a discussion.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More minor corrections & suggestions.

'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz' file formats and can therefore be
downloaded. Not all of these formats can be currently read with MDAnalysis.
Cache, controlled by the `cache_patch` parameter, is handled internally by pooch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix:

Suggested change
Cache, controlled by the `cache_patch` parameter, is handled internally by pooch.
Caching, controlled by the `cache_path` parameter, is handled internally by :mod:`pooch`.

downloaded. Not all of these formats can be currently read with MDAnalysis.
Cache, controlled by the `cache_patch` parameter, is handled internally by pooch.
The default None arguments stores the data files in the platform dependent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

markup

Suggested change
The default None arguments stores the data files in the platform dependent
The default ``None`` arguments stores the data files in the platform dependent

Cache, controlled by the `cache_patch` parameter, is handled internally by pooch.
The default None arguments stores the data files in the platform dependent
`Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following could also be deleted if it's already explained in the parameter docs

Suggested change
`Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache
`Pooch Default Cache Path`_ under the folder :data:`DEFAULT_CACHE_NAME_DOWNLOADER`. To clear cache

The default None arguments stores the data files in the platform dependent
`Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache
(and subsquently force re-fetching), it is required to delete the cache folder
as specified by cache_path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

markup and be explicit

Suggested change
as specified by cache_path.
as specified by `cache_path`; if you used the default then the path to the pooch cache
is ``pooch.os_cache(MDAnalysis.topology.PDBParser.DEFAULT_CACHE_NAME_DOWNLOADER)``.

Comment on lines 606 to 607
.. _`Pooch Default Cache Path`:
https://www.fatiando.org/pooch/latest/api/generated/pooch.os_cache.html
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use intersphix then you can replace the explict link with

:func:`Pooch Default Cache Path<pooch.os_cache>`_ under the folder

(actually... test it first)

>>> mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True)
['./MDAnalysis_pdbs/1AKE.pdb.gz', './MDAnalysis_pdbs/4BWZ.pdb.gz']
Download a single PDB file and converting it to a universe:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converting -> convert

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix grammar

jauy123 and others added 3 commits October 28, 2025 08:28
…nx stuff

	modified:   package/MDAnalysis/topology/PDBParser.py
	modified:   package/doc/sphinx/source/conf.py
@jauy123 jauy123 requested a review from orbeckst October 28, 2025 16:44
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor doc fixes

>>> mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True)
['./MDAnalysis_pdbs/1AKE.pdb.gz', './MDAnalysis_pdbs/4BWZ.pdb.gz']
Download a single PDB file and converting it to a universe:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix grammar

@jauy123 jauy123 requested a review from orbeckst October 28, 2025 20:44
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check formatting first on RTD

Here the SeeAlso in https://mdanalysis--4943.org.readthedocs.build/en/4943/documentation_pages/topology/PDBParser.html somehow includes the classes:

image

I actually liked "Classes and Functions" as heading so change "Classes" to "Classes and Functions". Make sure that the reST looks right.

@jauy123
Copy link
Contributor Author

jauy123 commented Oct 29, 2025

Ok, I think everything is order. I checked the html docs build on firefox, and it looks good from my end.

@jauy123 jauy123 requested a review from orbeckst October 29, 2025 15:20
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the last comments. All looking good to me now – ✅ !

Let's keep the top level import mda.fetch_pdb() for now. If we expand "fetchers" then we can revisit. Given that we currently do not document our top-level imports it is sufficient to document topology.PDBParser.fetch_pdb. Perhaps I could encourage you to add an entry to the UserGuide (see Contributing to the User Guide)?

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances. I do not propose to change fetch_pdb() as this can be the bare-bones workhorse function that we may want to wrap with a generic fetch() function or Universe.from_fetched() static method.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 1, 2025

Thanks for the feedback!

Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances.

I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. os.path). Unless the only way to work with paths in python are pathlib Path objects, I prefer to return paths as a string as that has been historical python behavior, and leave users to convert to a pathlib object if they want to.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

@BradyAJohnston Nice to meet you at the UGM!

Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase.

@orbeckst
Copy link
Member

@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

@orbeckst
Copy link
Member

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.

Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer.

I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD.

@jauy123
Copy link
Contributor Author

jauy123 commented Nov 12, 2025

Regarding pathlib: Using Path objects is forward looking and anything that os.path can do, you can do with a Path, but in a more abstract (and thus more robust) manner. I'd argue the opposite of you: if you really need a string representation of a Path for legacy code, use str(pth).

I see the point, and I personally use pathlib.Path for all my scripts. But is there like an official PEP saying that pathlib.Path is the preferred way to deal with paths over strings? Or is it like just a community practice since pathlib has been in the standard library for so long?

My next question if anything in the standard library which only take strings, or that they all of them take Paths in python 3.10+, the minimum supported version of python that MDAnalysis supports? If pathlib.Path are basically first class objects on the same level of strings in python 3.10+ (like you mentioned in the os.path), then I see no reason to not return paths as pathlib.Path objects instead of strings.

@p-j-smith
Copy link
Member

the standard library still has modules that still takes paths as string (i.e. os.path)

Functions in os.path will take anything that is path-like, including a pathlib.Path

@orbeckst
Copy link
Member

I didn't know that os.path was so flexible. I learned something new.

Copy link
Member

@p-j-smith p-j-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @jauy123!

Blocking whilst we consider moving fetch_pdb into a new module e.g. mda.fetch. It feels like it would be better placed there, rather than in topology.PDBParser as it's not doing any topology parsing



def fetch_pdb(
pdb_ids=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if pdb_ids is None? I think this should be a required argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function would throw a requests.exceptions.HTTPError since it would try to query wwPDB with a None string. I don't think that guarding or throwing a custom exception is necessary since I think it would be obvious that is no None PDB code for any structure.

def test_one_file_download(self, tmp_path, pdb_id):
path = mda.fetch_pdb(pdb_id, cache_path=tmp_path, file_format="cif")
assert isinstance(path, str)
assert true_basename(path) == pdb_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert true_basename(path) == pdb_id
assert pathlib.Path(path).name == f"{pdb_id}.cif"

It would be better to include the extension in the comparison, that way you know the correct format was downloaded

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we want to still return the path as a string or pathlib.Path? I agree with the change otherwise.

assert all(isinstance(pdb_id, str) for pdb_id in list_of_path_strings)
assert all(
[
true_basename(path) == name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
true_basename(path) == name
pathlib.Path(path).name == f"{name}.pdb.gz"

Same here

"h5py>=2.10",
"chemfiles>=0.10",
"parmed",
"pooch",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could go into a new group, e.g. data as it's not adding support for extra formats

Copy link
Contributor Author

@jauy123 jauy123 Nov 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just stuck in here because none of the other categories (analysis, doc, parallel) fit, and a downloader's purpose is to allow the user access to "extra formats" from the web which made sense to me. So if you were to put make a new data category would it just like this at the bottom?

data = [
    "pooch",
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@IAlibay @hmacdope @fiona-naughton (as the other 3/4 of the release team) could you weigh in on how to specify extra dependencies?

@marinegor
Copy link
Contributor

Small suggestion here: could we perhaps make the default format not pdb.gz but cif.gz, in line with #5142, its fixes for #5089 and the fact that:

as of April 2025 3.7% of the PDB archive is not available in legacy PDB format

source

Or at least make it default if gemmi is installed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mda.fetch_pdb() to generate Universe from Protein Databank structures

7 participants