A software suite that enables remote extraction, transformation and loading of data.
This repository is geared heavily towards drawing articles from PubMed, identifying scientific articles containing information about biological pathways, and loading the records into a data store.
- Python (version >=3.8, <3.10)
- Poetry (version >=1.5.0)
- Docker (version 20.10.14) and Docker Compose (version 2.5.1)
  - We use Docker to create a RethinkDB (v2.3.6) instance for loading data.
- Miniconda (Optional)
  - For creating virtual environments. Any other solution will work.
- Graphics Processing Unit (GPU) (Optional)
  - The pipeline classifier can be sped up by an order of magnitude by running on a system with a GPU. We have been using a system running Ubuntu 18.04.5 LTS with a 24-core Intel(R) Xeon(R) CPU E5-2687W and an NVIDIA GP102 [TITAN Xp] GPU.
Create a conda environment, here named pipeline:
$ conda create --name pipeline python=3.8 --yes
$ conda activate pipeline

Clone the remote repository:
$ git clone https://github.com/jvwong/classifier-pipeline
$ cd classifier-pipeline

Install the dependencies:
$ poetry install

To start up the server:
$ uvicorn classifier_pipeline.main:app --port 8000 --reload

- uvicorn options
  - `--reload`: Enable auto-reload.
  - `--port INTEGER`: Bind socket to this port (default 8000).
Now go to http://127.0.0.1:8000/redoc (swap out the port if necessary) to see the automatic documentation.
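As a quick sanity check, you can fetch the documentation page from the command line (assuming the default host and port shown above); any HTML response means the server is up:

$ curl http://127.0.0.1:8000/redoc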
Launch a pipeline to process daily updates from PubMed and dump the RethinkDB database:
$ ./scripts/cron/install.sh

The `scripts` directory contains Python files that chain functions in `classifier_pipeline` to:
- read in data from
  - csv, via stdin (`csv2dict_reader`)
  - daily PubMed updates (`updatefiles_extractor`)
- retrieve records/files from PubMed (`pubmed_transformer`)
- apply various filters on the individual records (`citation_pubtype_filter`, `citation_date_filter`)
- apply a deep-learning classifier to text fields (`classification_transformer`)
- load the formatted data into a RethinkDB instance (`db_loader`)
- Pipelines are launched through bash scripts that retrieve PubMed article records in two ways:
  - `./scripts/cron/cron.sh`: retrieves all new content via the FTP file server
  - `./scripts/csv/pmids.sh`: retrieves records using the NCBI E-Utilities given a set of PubMed IDs
- Variables
  - `DATA_DIR`: root directory where your data files exist
  - `DATA_FILE`: name of the csv file in your `DATA_DIR`
  - `ARG_IDCOLUMN`: the csv header column name containing either
    - a list of update files to extract (`dailyupdates.sh`)
    - a list of PubMed IDs to extract (`pmids.sh`)
  - `JOB_NAME`: the name of this pipeline job
  - `CONDA_ENV`: should be the environment name you declared in the first steps
  - `ARG_TYPE`:
    - use `fetch` for downloading individual PubMed IDs
    - use `download` to retrieve FTP update files
  - `ARG_MINYEAR`: articles published in years before this will be filtered out (optional)
  - `ARG_TABLE`: the name of the table to dump results into
  - `ARG_THRESHOLD`: set the lowest probability to classify an article as 'positive' using pathway-abstract-classifier
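For illustration, a `pmids.sh` run could be configured roughly as follows. This is only a sketch: the data file, column name, job name, table name, and threshold are placeholder values, and it assumes the variables are picked up from the shell environment; check the script itself for how the variables are actually declared.

$ export DATA_DIR=~/data
$ export DATA_FILE=pmids.csv
$ export ARG_IDCOLUMN=pmid
$ export JOB_NAME=pmids-run
$ export CONDA_ENV=pipeline
$ export ARG_TYPE=fetch
$ export ARG_TABLE=articles
$ export ARG_THRESHOLD=0.5
$ ./scripts/csv/pmids.sh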
There is a convenience script that can be launched:
$ ./test.sh

This will run the tests in ./tests, lint with flake8, and type check with mypy.
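If you want to run the steps individually, the standard tool invocations should be close to what test.sh does (a sketch; the exact paths and flags live in test.sh):

$ poetry run pytest ./tests
$ poetry run flake8 .
$ poetry run mypy .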