Implementation of fetch_pdb() #4943
Changes from 105 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -56,14 +56,17 @@ | |
| * :class:`MDAnalysis.coordinates.PDB.PDBReader` | ||
| * :class:`MDAnalysis.core.universe.Universe` | ||
|
|
||
|
|
||
|
||
| Classes | ||
|
||
| ------- | ||
|
|
||
| .. autoclass:: PDBParser | ||
| :members: | ||
| :inherited-members: | ||
|
|
||
| .. autofunction:: fetch_pdb | ||
|
|
||
| .. autodata:: DEFAULT_CACHE_NAME_DOWNLOADER | ||
|
|
||
| """ | ||
| import numpy as np | ||
| import warnings | ||
|
|
@@ -95,6 +98,31 @@ | |
| # Set up a logger for the PDBParser | ||
| logger = logging.getLogger("MDAnalysis.topology.PDBParser") | ||
|
|
||
| try: | ||
| import pooch | ||
| except ImportError: | ||
| HAS_POOCH = False | ||
| else: | ||
| HAS_POOCH = True | ||
|
|
||
|
||
| #: Name of the :mod:`pooch` cache directory ``pooch.os_cache(DEFAULT_CACHE_NAME_DOWNLOADER)``; | ||
| #: see :func:`pooch.os_cache` for further details. | ||
|
||
| DEFAULT_CACHE_NAME_DOWNLOADER = "MDAnalysis_pdbs" | ||
|
||
|
|
||
| # These file formats are listed at https://www.rcsb.org/docs/programmatic-access/file-download-services under "PDB entry files". | ||
|
||
| SUPPORTED_FILE_FORMATS_DOWNLOADER = ( | ||
| "cif", | ||
| "cif.gz", | ||
| "bcif", | ||
| "bcif.gz", | ||
| "xml", | ||
| "xml.gz", | ||
| "pdb", | ||
| "pdb.gz", | ||
| "pdb1", | ||
| "pdb1.gz", | ||
| ) | ||
|
|
||
|
|
||
| def float_or_default(val, default): | ||
| try: | ||
|
|
@@ -515,3 +543,131 @@ def _parse_conect(conect): | |
| bond_atoms = (int(conect[11 + i * 5: 16 + i * 5]) for i in | ||
| range(n_bond_atoms)) | ||
| return atom_id, bond_atoms | ||
|
|
||
|
|
||
| def fetch_pdb( | ||
| pdb_ids=None, | ||
|
Member
What happens if
Contributor
Author
The function would throw a
Member
If there's no useful behaviour when
||
| cache_path=None, | ||
| progressbar=False, | ||
| file_format="pdb.gz", | ||
| ): | ||
| """ | ||
| Download one or more PDB files from the RCSB Protein Data Bank and cache | ||
| them locally. | ||
|
|
||
| Given one or multiple PDB IDs, downloads the corresponding structure files | ||
| in the requested format and stores them in a local cache directory. If files | ||
| are already cached on disk, *fetch_pdb* skips the download and uses the | ||
| cached version instead. | ||
|
|
||
| Returns the path(s) to the downloaded file(s) as string(s). | ||
|
|
||
| Parameters | ||
| ---------- | ||
| pdb_ids : str or sequence of str | ||
| A single PDB ID as a string, or a sequence of PDB IDs to fetch. | ||
| cache_path : str or pathlib.Path | ||
|
||
| Directory where downloaded file(s) will be cached. | ||
|
||
| The default ``None`` argument uses the :mod:`pooch` default cache with | ||
| project name :data:`DEFAULT_CACHE_NAME_DOWNLOADER`. | ||
| file_format : str | ||
| The file extension/format to download (e.g., "cif", "pdb"). | ||
| See the Notes section below for a list of all supported file formats. | ||
| progressbar : bool, optional | ||
| If True, display a progress bar during file downloads. Default is False. | ||
|
|
||
| Returns | ||
| ------- | ||
| str or list of str | ||
| The path(s) to the downloaded file(s). Returns a single string if | ||
| one PDB ID is given, or a list of strings if multiple PDB IDs are | ||
| provided. | ||
|
|
||
| Raises | ||
| ------ | ||
| ValueError | ||
| For an invalid file format. Supported file formats are under Notes. | ||
|
|
||
| :class:`requests.exceptions.HTTPError` | ||
| If an invalid PDB code is specified. Note that this is :mod:`requests`, not the | ||
| standard library :mod:`urllib.request`. | ||
|
||
|
|
||
| Notes | ||
| ----- | ||
| This function uses the `RCSB File Download Services`_ for directly downloading | ||
| structure files via https. | ||
|
|
||
| .. _`RCSB File Download Services`: | ||
| https://www.rcsb.org/docs/programmatic-access/file-download-services | ||
|
|
||
| The RCSB currently provides data in the 'cif', 'cif.gz', 'bcif', 'bcif.gz', | ||
| 'xml', 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', and 'pdb1.gz' file formats, all of | ||
| which can be downloaded with this function. Not all of these formats can | ||
| currently be read with MDAnalysis. | ||
|
|
||
| Caching, controlled by the `cache_path` parameter, is handled internally by | ||
| :mod:`pooch`. The default cache name is taken from | ||
| :data:`DEFAULT_CACHE_NAME_DOWNLOADER`. To clear the cache (and thereby force | ||
| re-fetching), delete the cache folder specified by `cache_path`. | ||
|
|
||
| Examples | ||
| -------- | ||
| Download a single PDB file: | ||
|
|
||
| >>> mda.fetch_pdb("1AKE", file_format="cif") | ||
| './MDAnalysis_pdbs/1AKE.cif' | ||
|
|
||
| Download multiple PDB files with a progress bar: | ||
|
|
||
| >>> mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True) | ||
| ['./MDAnalysis_pdbs/1AKE.pdb.gz', './MDAnalysis_pdbs/4BWZ.pdb.gz'] | ||
|
|
||
| Download a single PDB file and convert it to a universe: | ||
|
|
||
| >>> mda.Universe(mda.fetch_pdb("1AKE", file_format="pdb.gz")) | ||
| <Universe with 3816 atoms> | ||
|
|
||
| Download multiple PDB files and convert each of them into a universe: | ||
|
|
||
| >>> [mda.Universe(pdb) for pdb in mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True)] | ||
| [<Universe with 3816 atoms>, <Universe with 2824 atoms>] | ||
|
|
||
|
||
|
|
||
| .. versionadded:: 2.11.0 | ||
| """ | ||
|
|
||
|
||
| if not HAS_POOCH: | ||
| raise ModuleNotFoundError( | ||
| "pooch is needed as a dependency for fetch_pdb()" | ||
| ) | ||
| elif file_format not in SUPPORTED_FILE_FORMATS_DOWNLOADER: | ||
| raise ValueError( | ||
| "Invalid file format. Supported file formats " | ||
| f"are {SUPPORTED_FILE_FORMATS_DOWNLOADER}" | ||
| ) | ||
|
|
||
| if isinstance(pdb_ids, str): | ||
| _pdb_ids = (pdb_ids,) | ||
| else: | ||
| _pdb_ids = pdb_ids | ||
|
|
||
| if cache_path is None: | ||
|
||
| cache_path = pooch.os_cache(DEFAULT_CACHE_NAME_DOWNLOADER) | ||
|
|
||
| # Have to do this dictionary approach instead of using pooch.retrieve in order | ||
| # to prevent the hardcoded known_hash warning from showing up. | ||
| registry_dictionary = { | ||
| f"{pdb_id}.{file_format}": None for pdb_id in _pdb_ids | ||
| } | ||
|
|
||
| downloader = pooch.create( | ||
| path=cache_path, | ||
| base_url="https://files.wwpdb.org/download/", | ||
|
||
| registry=registry_dictionary, | ||
| ) | ||
|
|
||
| paths = [ | ||
| downloader.fetch(fname=file_name, progressbar=progressbar) | ||
| for file_name in registry_dictionary.keys() | ||
| ] | ||
|
|
||
| return paths if not isinstance(pdb_ids, str) else paths[0] | ||
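The docstring notes say that the cache is cleared by deleting the cache folder. A minimal sketch of that step follows, using only the standard library and a throwaway directory. The helper `clear_cache` is hypothetical and not part of this PR. For the default location, the folder to pass in would be whatever `pooch.os_cache(DEFAULT_CACHE_NAME_DOWNLOADER)` returns.

```python
import shutil
import tempfile
from pathlib import Path


def clear_cache(cache_path):
    """Hypothetical helper: delete the cache folder and everything in it.

    Returns True if a folder was removed and False if there was nothing to do.
    """
    cache_dir = Path(cache_path)
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True


# Demonstration on a temporary directory standing in for a real cache.
demo_cache = Path(tempfile.mkdtemp()) / "MDAnalysis_pdbs"
demo_cache.mkdir()
(demo_cache / "1AKE.pdb.gz").write_bytes(b"placeholder, not a real structure")

removed = clear_cache(demo_cache)
```

After the folder is gone, the next `fetch_pdb` call with the same `cache_path` has no cached copy to reuse and downloads the files again.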
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -72,6 +72,7 @@ extra_formats = [ | |
| "h5py>=2.10", | ||
| "chemfiles>=0.10", | ||
| "parmed", | ||
| "pooch", | ||
|
Member
this could go into a new group, e.g.
Contributor
Author
I just stuck it in here because none of the other categories (analysis, doc, parallel) fit, and a downloader's purpose is to allow the user access to "extra formats" from the web which made sense to me. So if you were to put make a new data = [
"pooch",
]
Member
@IAlibay @hmacdope @fiona-naughton (as the other 3/4 of the release team) could you weigh in on how to specify extra dependencies?
||
| "pyedr>=0.7.0", | ||
| "pytng>=0.2.3", | ||
| "gsd>3.0.0", | ||
|
|
||
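Whichever extras group ends up listing `pooch`, the calling code still has to cope with the package being absent. The diff does this with a module-level `HAS_POOCH` flag checked inside `fetch_pdb()`. The sketch below shows an alternative form of the same guard as a decorator. The names `module_available` and `requires` are hypothetical and not MDAnalysis API. The stand-in functions exist only to exercise both branches.

```python
import functools
import importlib.util


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def requires(module_name):
    """Hypothetical decorator: fail at call time if an optional module is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not module_available(module_name):
                raise ModuleNotFoundError(
                    f"{module_name} is needed as a dependency for {func.__name__}()"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires("json")  # always present in the standard library
def uses_stdlib():
    return "ok"


@requires("surely_not_an_installed_module_xyz")  # deliberately missing
def uses_missing():
    return "unreachable"
```

Checking at call time rather than import time keeps `import MDAnalysis` working for users who never fetch anything, which matches the behaviour of the flag-based check in the diff.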
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,7 @@ networkx | |
| numpy>=1.23.2 | ||
| packaging | ||
| parmed | ||
| pooch | ||
| pytest | ||
| scikit-learn | ||
| scipy | ||
|
|
||
@IAlibay @BradyAJohnston are we sure that we want the import at the top level?
If we do more fetch_xxx() in the future then we may have to deprecate it again, e.g. in favor of a mda.fetch.pdb(...) or Universe.from_fetched.
I think it's ok to leave it here for now because we don't have anything else. If we get more before 3.0, we still have time to deprecate and remove in 3.0.
If it is left in then does it need to be documented at the top level, too?
I agree an mda.fetch module would be nicer than having this at the top-level. And it makes more sense for the fetching code to be in its own module rather than PDBParser as there's no parsing happening here.
@p-j-smith thank you for commenting. If you have strong opinions, make it a blocking review. That's the best way to drive a discussion.