-
Notifications
You must be signed in to change notification settings - Fork 736
Implementation of fetch_pdb() #4943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere Some stuff which I like to still add (besides tests and docs):
|
|
I think others will have to confirm, but likely we'll want to have Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #4943 +/- ##
===========================================
+ Coverage 92.68% 92.73% +0.05%
===========================================
Files 180 180
Lines 22452 22474 +22
Branches 3186 3190 +4
===========================================
+ Hits 20809 20842 +33
- Misses 1169 1176 +7
+ Partials 474 456 -18 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I'm ok with that. I can make the code raise an exception if |
|
Assuming that |
|
You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython: mdanalysis/package/MDAnalysis/analysis/align.py Lines 200 to 207 in dcaa087
I'm not an expert on the pipelines so someone else would have to pitch in more on that. |
|
Thanks for the comment! |
|
I happen to have another question! Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit. The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours. Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests) |
|
In principle, tests should pass everywhere. The Azure tests time out in the test which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows. |
|
@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10! |
|
I have time again. I was busy starting the end of spring break with comps, classes, and you know what. |
…since text files are just binary files with special encoding
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main thig: make your docs appear :-)
|
|
||
| from .due import due, Doi, BibTeX | ||
|
|
||
| from .topology.PDBParser import fetch_pdb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IAlibay @BradyAJohnston are we sure that we want the import at the top level?
If we do more fetch_xxx() in the future then we may have to deprecate it again, e.g. in favor of a mda.fetch.pdb(...) or Universe.from_fetched.
I think it's ok to leave it here for now because we don't have anything else. If we get more before 3.0, we still have time to deprecate and remove in 3.0.
If it is left in then does it need to be documented at the top level, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree an mda.fetch module would be nicer than having this at the top-level. And it makes more sense for the fetching code to be in its own module rather than PDBParser as there's no parsing happening here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@p-j-smith thank you for commenting. If you have strong opinions, make it a blocking a review. That's the best way to drive a discussion.
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More minor corrections & suggestions.
| 'xml.gz', 'pdb', 'pdb.gz', 'pdb1', 'pdb1.gz' file formats and can therefore be | ||
| downloaded. Not all of these formats can be currently read with MDAnalysis. | ||
| Cache, controlled by the `cache_patch` parameter, is handled internally by pooch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix:
| Cache, controlled by the `cache_patch` parameter, is handled internally by pooch. | |
| Caching, controlled by the `cache_path` parameter, is handled internally by :mod:`pooch`. |
| downloaded. Not all of these formats can be currently read with MDAnalysis. | ||
| Cache, controlled by the `cache_patch` parameter, is handled internally by pooch. | ||
| The default None arguments stores the data files in the platform dependent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
markup
| The default None arguments stores the data files in the platform dependent | |
| The default ``None`` arguments stores the data files in the platform dependent |
| Cache, controlled by the `cache_patch` parameter, is handled internally by pooch. | ||
| The default None arguments stores the data files in the platform dependent | ||
| `Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following could also be deleted if it's already explained in the parameter docs
| `Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache | |
| `Pooch Default Cache Path`_ under the folder :data:`DEFAULT_CACHE_NAME_DOWNLOADER`. To clear cache |
| The default None arguments stores the data files in the platform dependent | ||
| `Pooch Default Cache Path`_ under the folder MDAnalysis_pdbs. To clear cache | ||
| (and subsquently force re-fetching), it is required to delete the cache folder | ||
| as specified by cache_path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
markup and be explicit
| as specified by cache_path. | |
| as specified by `cache_path`; if you used the default then the path to the pooch cache | |
| is ``pooch.os_cache(MDAnalysis.topology.PDBParser.DEFAULT_CACHE_NAME_DOWNLOADER)``. |
| .. _`Pooch Default Cache Path`: | ||
| https://www.fatiando.org/pooch/latest/api/generated/pooch.os_cache.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use intersphix then you can replace the explict link with
:func:`Pooch Default Cache Path<pooch.os_cache>`_ under the folder(actually... test it first)
| >>> mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True) | ||
| ['./MDAnalysis_pdbs/1AKE.pdb.gz', './MDAnalysis_pdbs/4BWZ.pdb.gz'] | ||
| Download a single PDB file and converting it to a universe: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
converting -> convert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix grammar
Co-authored-by: Oliver Beckstein <[email protected]>
…nx stuff modified: package/MDAnalysis/topology/PDBParser.py modified: package/doc/sphinx/source/conf.py
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor doc fixes
| >>> mda.fetch_pdb(["1AKE", "4BWZ"], progressbar=True) | ||
| ['./MDAnalysis_pdbs/1AKE.pdb.gz', './MDAnalysis_pdbs/4BWZ.pdb.gz'] | ||
| Download a single PDB file and converting it to a universe: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix grammar
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check formatting first on RTD
Here the SeeAlso in https://mdanalysis--4943.org.readthedocs.build/en/4943/documentation_pages/topology/PDBParser.html somehow includes the classes:
I actually liked "Classes and Functions" as heading so change "Classes" to "Classes and Functions". Make sure that the reST looks right.
|
Ok, I think everything is order. I checked the html docs build on firefox, and it looks good from my end. |
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the last comments. All looking good to me now – ✅ !
Let's keep the top level import mda.fetch_pdb() for now. If we expand "fetchers" then we can revisit. Given that we currently do not document our top-level imports it is sufficient to document topology.PDBParser.fetch_pdb. Perhaps I could encourage you to add an entry to the UserGuide (see Contributing to the User Guide)?
Another thought for the future (and a possible fetchers module) is that fetch() should perhaps return pathlib.Path instances. I do not propose to change fetch_pdb() as this can be the bare-bones workhorse function that we may want to wrap with a generic fetch() function or Universe.from_fetched() static method.
|
Thanks for the feedback!
I would disagree with this idea simply because the standard library still has modules that still takes paths as string (i.e. |
|
@BradyAJohnston Nice to meet you at the UGM! Just want to remind you about this PR. It deviate quite a bit from the initial conception back in March, but I believe this is robust enough to be accepted within the main codebase. |
|
@jauy123 I think @BradyAJohnston is currently here, maybe catch him ;-). Lesson for the future: if you have the opportunity to corner a reviewer of one of your PRs, have them review/approve right while they are sitting in front of you ;-). Potentially bribe them with a beer. |
|
Regarding |
|
Can't go to the botanical garden today xD. I got a car, but I walk everyday to campus. It would take too long for me to get back and drive there.
I forgot! I just remember when I was walking home last night after dinner and brain farted it until five minutes ago. Am taking bribing people as alcohol as official sanctioned Ph.D advice in the future xD. |
I see the point, and I personally use My next question if anything in the standard library which only take strings, or that they all of them take |
Functions in |
|
I didn't know that |
p-j-smith
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @jauy123!
Blocking whilst we consider moving fetch_pdb into a new module e.g. mda.fetch. It feels like it would be better placed there, rather than in topology.PDBParser as it's not doing any topology parsing
|
|
||
|
|
||
| def fetch_pdb( | ||
| pdb_ids=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if pdb_ids is None? I think this should be a required argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function would throw a requests.exceptions.HTTPError since it would try to query wwPDB with a None string. I don't think that guarding or throwing a custom exception is necessary since I think it would be obvious that is no None PDB code for any structure.
| def test_one_file_download(self, tmp_path, pdb_id): | ||
| path = mda.fetch_pdb(pdb_id, cache_path=tmp_path, file_format="cif") | ||
| assert isinstance(path, str) | ||
| assert true_basename(path) == pdb_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert true_basename(path) == pdb_id | |
| assert pathlib.Path(path).name == f"{pdb_id}.cif" |
It would be better to include the extension in the comparison, that way you know the correct format was downloaded
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we want to still return the path as a string or pathlib.Path? I agree with the change otherwise.
| assert all(isinstance(pdb_id, str) for pdb_id in list_of_path_strings) | ||
| assert all( | ||
| [ | ||
| true_basename(path) == name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| true_basename(path) == name | |
| pathlib.Path(path).name == f"{name}.pdb.gz" |
Same here
| "h5py>=2.10", | ||
| "chemfiles>=0.10", | ||
| "parmed", | ||
| "pooch", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could go into a new group, e.g. data as it's not adding support for extra formats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just stuck in here because none of the other categories (analysis, doc, parallel) fit, and a downloader's purpose is to allow the user access to "extra formats" from the web which made sense to me. So if you were to put make a new data category would it just like this at the bottom?
data = [
"pooch",
]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@IAlibay @hmacdope @fiona-naughton (as the other 3/4 of the release team) could you weigh in on how to specify extra dependencies?
Fixes #4907
Changes made in this Pull Request:
This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.
DownloaderBaseand 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.requestsas a dependencymda.fetch_pdb()is implemented as a wrapper to commonly used option in 'PDBDownloader'PR Checklist
package/CHANGELOGfile updated?Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/