-
Notifications
You must be signed in to change notification settings - Fork 822
Description
Description of feature
Authors: Juliana Assis, Sebastian Schulz and Albert Palleja
We would like to emphasize the importance to facilitate the processing of prokaryotic bulk RNAseq data using the v3.x nf-core/rnaseq pipelines, as proposed by Matthias Zepper previously (#1085).
Given the growing importance of (large-scale) transcriptomic analyses in microbial biology and biotechnology, there is a need for standardized and thus reproducible processing of prokaryotic RNAseq data (bacterial and archaeal).
Although bacterial RNAseq data have been processed with nf-core/rnaseq pipelines previously, this was – to the best of our knowledge – either performed using the older v1.4.2 pipeline (Muehler et al., 2022 and Muehler et al., 2024) or using a workaround that required prior manipulation of input files in combination with setting specific pipeline parameters (summarized by Marine Cambon in a Slack thread).
Our recent attempts to run the nf-core/rnaseq pipeline v3.16.1, launched from the Seqera platform, using said workaround, however, failed (for details see section "Pipeline testing").
Conclusion
Neither the use of old pipeline versions nor the manipulation of input files is desirable from the perspective of using most updated software pipelines and standardized data processing workflows, respectively.
Pipeline testing
We followed the suggested workaround procedure for processing prokaryotic RNAseq data sets by using a genome fasta, a modified gtf and a self-generated transcript fasta file as input to the pipeline (v3.16.1) and by setting the following pipeline parameters:
--extra_star_align_args "--sjdbGTFfeatureExon CDS"(suggested in slack post)--featurecounts_feature_type "CDS"(modified by us to use CDS from the gtf file; pipeline default setting is "exon").
The pipeline failed at Salmon quantification stage. The error message reports a mismatch of sequence lengths between entries of the user-provided transcript fasta file and the pipeline-generated SAM file.
Error message
Workflow execution completed unsuccessfully
The exit status of the task that caused the workflow execution to fail was: 1
Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (SRR9681149)'
Caused by:
Process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (SRR9681149)` terminated with an error exit status (1)
Command executed:
salmon quant \
--geneMap GCF_000005845.2_ASM584v2_genomic.filtered.gtf \
--threads 6 \
--libType=ISR \
-t GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna \
-a SRR9681149.Aligned.toTranscriptome.out.bam \
\
-o SRR9681149
if [ -f SRR9681149/aux_info/meta_info.json ]; then
cp SRR9681149/aux_info/meta_info.json "SRR9681149_meta_info.json"
fi
if [ -f SRR9681149/lib_format_counts.json ]; then
cp SRR9681149/lib_format_counts.json "SRR9681149_lib_format_counts.json"
fi
cat <<-END_VERSIONS > versions.yml
"NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT":
salmon: $(echo $(salmon --version) | sed -e "s/salmon //g")
END_VERSIONS
Command exit status:
1
Command output:
(empty)
Command error:
Version Info: This is the most recent version of salmon.
# salmon (alignment-based) v1.10.1
# [ program ] => salmon
# [ command ] => quant
# [ geneMap ] => { GCF_000005845.2_ASM584v2_genomic.filtered.gtf }
# [ threads ] => { 6 }
# [ libType ] => { ISR }
# [ targets ] => { GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna }
# [ alignments ] => { SRR9681149.Aligned.toTranscriptome.out.bam }
# [ output ] => { SRR9681149 }
Logs will be written to SRR9681149/logs
Library format { type:paired end, relative orientation:inward, strandedness:(antisense, sense) }
[2025-01-27 14:07:36.152] [jointLog] [info] setting maxHashResizeThreads to 6
[2025-01-27 14:07:36.152] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.
[2025-01-27 14:07:36.152] [jointLog] [info] numQuantThreads = 3
parseThreads = 3
Checking that provided alignment files have consistent headers . . . done
Populating targets from aln = "SRR9681149.Aligned.toTranscriptome.out.bam", fasta = "GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna" . . .[0;31m
SAM file says target gnl|b0001|mrna.NP_414542 has length 63, but the FASTA file contains a sequence of length [66 or 65]
Work dir:
az://seqera/scratch/18YRRZhBVsZxN6/0e/7701dec8788b9f8519918c4a03f516
Container:
quay.io/biocontainers/salmon:1.10.1--h7e5ed60_0
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
Steps to reproduce the error
-
Download the Escherichia coli RNAseq data set PRJNA554579 published by Rychel et al., 2023.
-
Download the reference genome fasta and the annotation gtf file for Escherichia coli str. K-12 substr. MG1655 (accession: GCF_000005845.2).
-
Extract and modify the gtf file by deleting all lines with an empty "transcript_id" field.
gunzip -k annotation.gtf.gz
sed '/transcript_id "" /d' annotation.gtf > annotation_modified.gtf- Generate a custom transcript file from the extracted genome fasta and in 3. modified gtf file using gffread (v0.12.7).
gunzip -k genome.fna
gffread -w transcript.fna -g genome.fna annotation_modified.gtf- Run the nf-core/rnaseq pipeline v3.16.1 using (an appropriately adapted) command and configuration (only the essential part of our configuration is shown below).
Command
nextflow run 'https://github.com/nf-core/rnaseq'
-name Ecoli_gtf_transcript_files_modified
-params-file 'https://api.cloud.seqera.io/ephemeral/mDXJ21XZtCJ8MH6N6Jumgg.json'
-with-tower
-r 3.16.1Parameters
params {
input = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/samplesheet.csv'
genome = null
splicesites = null
gtf_extra_attributes = 'gene_name'
gtf_group_features = 'gene_id'
skip_gtf_filter = false
skip_gtf_transcript_filter = false
featurecounts_feature_type = 'CDS'
featurecounts_group_type = 'gene_biotype'
gencode = false
save_reference = false
igenomes_base = 's3://ngi-igenomes/igenomes/'
igenomes_ignore = false
with_umi = false
skip_umi_extract = false
umitools_extract_method = 'string'
umitools_grouping_method = 'directional'
umitools_dedup_stats = false
umitools_bc_pattern = null
umitools_bc_pattern2 = null
umitools_umi_separator = null
umi_discard_read = null
save_umi_intermeds = false
trimmer = 'trimgalore'
min_trimmed_reads = 10000
extra_trimgalore_args = null
extra_fastp_args = null
save_trimmed = false
skip_trimming = false
bbsplit_fasta_list = null
save_bbsplit_reads = false
skip_bbsplit = true
remove_ribo_rna = false
save_non_ribo_reads = false
ribo_database_manifest = '/.nextflow/assets/nf-core/rnaseq/workflows/rnaseq/assets/rrna-db-defaults.txt'
aligner = 'star_salmon'
pseudo_aligner = null
pseudo_aligner_kmer_size = 31
seq_center = null
bam_csi_index = false
star_ignore_sjdbgtf = false
salmon_quant_libtype = null
hisat2_build_memory = '200.GB'
stringtie_ignore_gtf = false
min_mapped_reads = 5
extra_star_align_args = '--sjdbGTFfeatureExon CDS'
extra_salmon_quant_args = null
extra_kallisto_quant_args = null
kallisto_quant_fraglen = 200
kallisto_quant_fraglen_sd = 200
save_merged_fastq = false
save_unaligned = false
save_align_intermeds = false
skip_markduplicates = false
skip_alignment = false
skip_pseudo_alignment = false
stranded_threshold = 0.8
unstranded_threshold = 0.1
skip_qc = false
skip_bigwig = false
skip_stringtie = false
skip_fastqc = false
skip_preseq = true
skip_dupradar = false
skip_qualimap = false
contaminant_screening = null
kraken_db = null
save_kraken_assignments = false
save_kraken_unassigned = false
bracken_precision = 'S'
skip_rseqc = false
skip_biotype_qc = false
skip_deseq2_qc = false
skip_multiqc = false
deseq2_vst = true
rseqc_modules = 'bam_stat,inner_distance,infer_experiment,junction_annotation,junction_saturation,read_distribution,read_duplication'
multiqc_config = null
multiqc_title = null
multiqc_logo = null
max_multiqc_email_size = '25.MB'
multiqc_methods_description = null
outdir = 'az://seqera/sebastian_tests/project2_ecoli_partial/results_gtf_transcipt_files_modified/'
publish_dir_mode = 'copy'
email = null
email_on_fail = null
plaintext_email = false
monochrome_logs = false
hook_url = null
help = false
help_full = false
show_hidden = false
version = false
pipelines_testdata_base_path = 'https://raw.githubusercontent.com/nf-core/test-datasets/7f1614baeb0ddf66e60be78c3d9fa55440465ac8/'
config_profile_name = null
config_profile_description = null
custom_config_version = 'master'
umi_dedup_tool = 'umitools'
extra_fqlint_args = '--disable-validator P001'
custom_config_base = 'https://raw.githubusercontent.com/nf-core/configs/master'
skip_linting = false
fasta = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic.fna.gz'
gtf = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved.gtf.gz'
transcript_fasta = 'az://seqera/sebastian_tests/project2_ecoli_partial/data/GCF_000005845.2_ASM584v2_genomic_emptyTIDremoved_transcript.fna.gz'
config_profile_contact = null
config_profile_url = null
validate_params = true
genomes {
…
Recommendations for v3.x pipeline development
The following points are subject to discussion with the nf-core community.
Prokaryotic vs. eukaryotic mode
It might be beneficial to have a parameter to run the pipeline in either --prokaryotic or --eukaryotic mode, as suggested previously (#1085).
Integration of featureCounts
Re-introducing featureCounts into v3.x pipelines might be beneficial for processing prokaryotic datasets. This has been suggested recently (Perelo et al., 2024):
- “Regarding pipeline development, it would be worth discussing to include the tool featureCounts as a quantification tool option also in the v3.x pipelines. The advantage of featureCounts on datasets with a high number of one-isoform genes has been shown here and in a previous study (11) and should also be kept in mind for the analysis of prokaryotic transcriptomes.”
(text snippet from Perelo et al., 2024)
Alignment and Quantification Tools
Using STAR might not be optimal for prokaryotic datasets. A splice-unaware aligner might be the better choice since bacterial transcripts lack introns and are therefore not (alternatively) spliced (Mahmud et al., 2021):
- “Prokaryotic RNA-Seq analysis is challenging because most available RNA-Seq packages assume the input data reflect eukaryotic gene structures, which in many aspects differ from those of prokaryotes (Johnson et al., 2016). Bacterial transcripts do not have introns and are not alternatively spliced; therefore, using an aligner developed to consider splice junctions often increases falsely assigned reads in the genome (Magoc et al., 2013). Moreover, unlike in eukaryotes, under specific stresses, the expression of almost all prokaryotic genes can change (Creecy and Conway, 2015).”
(text snippet from Mahmud et al., 2021)
We suggest integrating Bowtie/Bowtie2 as a splice-unaware aligner and featureCounts for quantification. This combination is used in other prokaryotic RNAseq processing pipelines (see the iModulon pipeline and the ProkSeq pipeline).
Metadata
Metadata
Labels
Type
Projects
Status