This is the code for detecting tandem gene amplifications in ultra-deep Nanopore long read sequencing data at frequencies as low as
All the input files are described in config/config.yaml:
- path to the TSV file with sample names and corresponding FASTQ files;
- path to the TSV file with the analysis' parameters (see the section below)
- path to the FASTA file with the plasmid sequence;
- path to the FASTA file with the blaSHV gene sequence;
- path to the output file name (without extension);
The output is a TSV text file with
- observed, expected (theoretical) and corrected gene counts,
- observed, expected and corrected gene frequency,
- detection limit for each sample and
- copy number variant of the gene.
If you don't have Snakemake installed, install it.
You can run the analysis using containers (recommended) or conda environments.
To use containers you need to have both Apptainer and Docker installed.
If you want to use containers, read here how to build them.
If you want to use conda environments, Snakemake will take care of installing all the dependencies automatically.
snakemake --use-singularity --configfile config/config.yaml --cores 2 --singularity-args "--bind ${PWD}/config/,${PWD}/results/"or simply
snakemake --configfile config/config.yaml --profile profiles/apptainerin this case you may want to edit the profiles/apptainer/config.yaml file to include the paths to the test dataset and apptainer's cache and tmp directories.
snakemake --use-conda --cores 2 --configfile config/config.yamlOn a moderately powerful desktop computer, this process takes a couple of minutes to finish.
The results of this run will be saved in two files under results/tables/ directory: frequencies_all_test.tsv and frequencies_all_test.xlsx. You can compare them to the expected results in results/test_dataset/ directory available in this repository.
The Nanopore long sequencing data can be found in the NCBI SRA database under the accession number PRJNA1299340.
To run the pipeline on these samples, edit config/samples.tsv to include absolute paths to the FASTQ files on your computer, then replace
sample_table: "config/samples_test.tsv"with
sample_table: "config/samples.tsv"and replace
output_name: "results/tables/frequencies_test"with
output_name: "results/tables/frequencies"Commands to run the pipeline with Apptainer and conda are the same as for the quick run, except for the --singularity-args option: you might need to bind directory with the actual data to the container.
Parameters for the analysis and their values are specified in the config/params.tsv file. These parameters are:
- minimum read length: reads shorter than this are discarded from the analysis.
- fr_red_start and fr_red_end: the start and end positions of the flanking region 1.
- fr_green_start and fr_green_end: the start and end positions of the flanking region 2.
- ru_start and ru_end: the start and end positions of the repeat unit (IS element).
- bla_start and bla_end: the start and end positions of the blaSHV gene.
- format: blast output format.
- n_fr_aligns: the number of flanking region alignments to consider.
- n_bla_aligns: the number of blaSHV gene alignments to consider.
- min_fr_len: the minimum length of the flanking region to consider.
- min_identity: the minimum identity BLAST hits to consider.
- max_e_value: the maximum E-value of BLAST hits to consider.
- min_ru_len: the minimum length of the repeat unit to consider.
- max_dist: the maximum distance between the BLAST hits to consider.
- dist_to_end: the distance from the end of the read to the end of the repeat unit to consider.
- max_cn: the maximum copy number of the repeat unit to consider.
- increment: increase in length of the DNA segment with each new blaSHV copy.
- base_len: length of a blaSHV gene for expected copy number calculation.
- dist: the distance between BLAST hits to use in
bedtools merge.
snakevision variant
OS: Ubuntu 22.04.5 LTS
Software:
