Skip to content

Commit 1a70697

Browse files
Develop3 (katsikora#9)
* default correctRTA false * rm kept genes * rename DTE folders * edgeR gene * comma * Snakefile * typo * cleanup * rn rmd * more gene level rmds * full support for gene-level de * comma * fix * unique name --------- Co-authored-by: [email protected] <[email protected]>
1 parent edffa51 commit 1a70697

File tree

12 files changed

+578
-35
lines changed

12 files changed

+578
-35
lines changed

Snakefile

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -48,7 +48,7 @@ if not fromBam:
4848
expand("oarfish_output/{sample}.meta_info.json",sample=samples),
4949
expand("oarfish_output/{sample}.infreps.pq",sample=samples)]
5050
if config["sampleSheet"]:
51-
req_files.append("edgeR_output/report.html")
51+
req_files.append(["edgeR_transcript_output/report.html","edgeR_gene_output/report.html"])
5252
else:
5353
include: "snakefiles/make_allelic_reads.smk"
5454
include: "snakefiles/oarfish_allelic.smk"
@@ -62,8 +62,8 @@ else:
6262
expand("oarfish_output/{sample}_{allele}.infreps.pq",sample=samples,allele=alleles)
6363
]
6464
if config["sampleSheet"]:
65-
req_files.append(["edgeR_allele_output/report.html","edgeR_allele_condition_output/report.html"])
66-
65+
req_files.append(["edgeR_transcript_allele_output/report.html","edgeR_transcript_allele_condition_output/report.html"])
66+
req_files.append(["edgeR_gene_allele_output/report.html","edgeR_gene_allele_condition_output/report.html"])
6767

6868

6969
rule all:

configs/config.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,4 @@ fromBam: False
33
phasedVcf:
44
sampleSheet: /data/manke/sikora/oarfish_miniworkshop/oarfish_test_data/sampleSheet.tsv
55
organism: hg38
6-
6+
correctRTA: False

organisms/hg38.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -93,3 +93,4 @@ ignoreForNormalization: chrX chrY chrM GL000008.2 GL000009.2 GL000194.1 GL000195
9393
known_splicesites: /data/repository/organisms/GRCh38_gencode_40/gencode/release-40/HISAT2/genome.ss
9494
star_index: /data/repository/organisms/GRCh38_gencode_40/Indices/STAR_2.7.10
9595
rmsk_file: /data/repository/organisms/GRCh38_gencode_40/repeatMasker/genome.fa.tbl
96+
t2g: /data/repository/organisms/GRCh38_gencode_40/gencode/release-40/transcript2gene.txt

organisms/mm39.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,3 +21,4 @@ ignoreForNormalization: MT X Y JH584299.1 GL456233.2 JH584301.1 GL456211.1 GL456
2121
known_splicesites: /data/repository/organisms/GRCm39_ensembl_106/ensembl/release-106/HISAT2/genome.ss
2222
rmsk_file: /data/repository/organisms/GRCm39_ensembl_106/repeatMasker/genome.fa.tbl
2323
star_index: /data/repository/organisms/GRCm39_ensembl_106/Indices/STAR_2.7.10
24+
t2g: /data/repository/organisms/GRCm39_ensembl_106/ensembl/release-106/transcript2gene.txt

rscripts/edgeR_gene.Rmd

Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
1+
---
2+
title: "edgeR"
3+
author: "Katarzyna Sikora"
4+
output: html_document
5+
---
6+
7+
```{r setup, include=FALSE}
8+
wd<-file.path(snakemake@params[["basedir"]],snakemake@params[["outdir"]])
9+
system(paste0('mkdir -p ',wd))
10+
knitr::opts_chunk$set(echo = TRUE)
11+
knitr::opts_knit$set(root.dir = wd)
12+
```
13+
14+
## Libs
15+
16+
```{r libs}
17+
library(tximport)
18+
library(edgeR)
19+
library(GenomicRanges)
20+
```
21+
22+
## Import counts and summarize gene-wise
23+
24+
Original oarfish output will be used.
25+
26+
The output files are read using tximport function, which imports the transcript-level counts and summarizes them per gene using a transcript to gene mapping file.
27+
28+
```{r tximport}
29+
input_files<-unlist(snakemake@params[["input_files"]])
30+
names(input_files)<-gsub(".quant","",basename(input_files))
31+
tx2gene_file<-snakemake@input[["t2g"]]
32+
tx2gene<-read.delim(tx2gene_file,header=FALSE)[,c(1,2)]
33+
txi<-tximport(input_files, type = "oarfish", tx2gene = tx2gene)
34+
35+
```
36+
37+
## Prep the DGEList object
38+
39+
40+
```{r sampleinfo}
41+
data.counts<-txi$counts
42+
write.table(data.counts,"data.counts.tsv",sep="\t",quote=FALSE)
43+
44+
sampleSheet<-snakemake@input[["sampleSheet"]]
45+
sampleInfo<-read.table(sampleSheet,sep="\t",header=TRUE)
46+
rownames(sampleInfo)<-sampleInfo$name
47+
write.table(sampleInfo,"sampleInfo.tsv",sep="\t",quote=FALSE)
48+
data.counts<-data.counts[,sampleInfo$name]
49+
50+
51+
y <- DGEList(counts = data.counts, samples = sampleInfo, group = sampleInfo$condition)
52+
53+
head(y$genes)
54+
55+
#gtf_file<-snakemake@input[["gtf_file"]]
56+
#gtf<-rtracklayer::import(gtf_file)
57+
head(y$genes)
58+
```
59+
60+
61+
## Filter and normalize
62+
63+
Lowly expressed genes are filtered out prior to the downstream analysis.
64+
65+
Scaling factors can computed using the TMM method to convert the resulting library sizes to effective library sizes.
66+
67+
```{r filt norm}
68+
keep <- filterByExpr(y)
69+
table(keep)
70+
y <- y[keep, , keep.lib.sizes=FALSE]
71+
72+
73+
y <- normLibSizes(y)
74+
y$samples
75+
```
76+
77+
78+
## Calculate MDS
79+
80+
MDS plots can also be used to visualize differences between the expression profiles of different samples with gene-level counts.
81+
82+
```{r MDS}
83+
plotMDS(y,col = c(1:2)[y$samples$condition],labels = y$samples$name,xlim = c(-4,4))
84+
```
85+
86+
## Design matrix
87+
88+
We create the design matrix to compare HEK293 cells against HAP1 cells.
89+
90+
```{r design}
91+
design <- model.matrix(~ condition, data = y$samples)
92+
design
93+
```
94+
95+
96+
## Dispersion estimation
97+
98+
Estimate and visualize NB dispersions.
99+
100+
```{r disp}
101+
y <- estimateDisp(y, design, robust=TRUE)
102+
saveRDS(y,"y.RDS")
103+
y$common.dispersion
104+
plotBCV(y)
105+
106+
```
107+
108+
The NB dispersion estimates will not be used further under the latest quasi-likelihood (QL) pipeline.
109+
For DGE analyses, we're going to use the quasi-likelihood (QL) pipeline for stricter error rate control by accounting for the uncertainty associated with the dispersion estimation.
110+
111+
```{r quasil}
112+
fit <- glmQLFit(y, design, robust=TRUE)
113+
plotQLDisp(fit)
114+
```
115+
116+
## Differential expression
117+
118+
Differentially expressed genes are tested between cell lines using the QL F-test.
119+
120+
```{r dte}
121+
qlf <- glmQLFTest(fit)
122+
is.de <- decideTests(qlf, p.value=0.05)
123+
summary(is.de)
124+
125+
126+
tt <- as.data.frame(topTags(qlf, n = Inf,p.value=0.05))
127+
head(tt)
128+
length(unique(tt$gene_id))
129+
```
130+
## MA plot
131+
132+
```{r ma}
133+
plotMD(qlf)
134+
```
135+
136+
## Save and export results
137+
138+
```{r save}
139+
saveRDS(qlf,"qlf.RDS")
140+
write.table(tt,"topTags_pval0.05.tsv",sep="\t",quote=FALSE)
141+
142+
```
143+
144+
145+
## Session Info
146+
147+
```{r session info}
148+
sessionInfo()
149+
```
150+
151+

rscripts/edgeR_gene_allele.Rmd

Lines changed: 162 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,162 @@
1+
---
2+
title: "edgeR"
3+
author: "Katarzyna Sikora"
4+
output: html_document
5+
---
6+
7+
```{r setup, include=FALSE}
8+
wd<-file.path(snakemake@params[["basedir"]],snakemake@params[["outdir"]])
9+
system(paste0('mkdir -p ',wd))
10+
knitr::opts_chunk$set(echo = TRUE)
11+
knitr::opts_knit$set(root.dir = wd)
12+
```
13+
14+
## Libs
15+
16+
```{r libs}
17+
library(tximport)
18+
library(edgeR)
19+
library(GenomicRanges)
20+
```
21+
22+
## Import counts and summarize gene-wise
23+
24+
Original oarfish output will be used.
25+
26+
The output files are read using tximport function, which imports the transcript-level counts and summarizes them per gene using a transcript to gene mapping file.
27+
28+
```{r tximport}
29+
input_files<-unlist(snakemake@params[["input_files"]])
30+
names(input_files)<-gsub(".quant","",basename(input_files))
31+
tx2gene_file<-snakemake@input[["t2g"]]
32+
tx2gene<-read.delim(tx2gene_file,header=FALSE)[,c(1,2)]
33+
txi<-tximport(input_files, type = "oarfish", tx2gene = tx2gene)
34+
35+
```
36+
37+
## Prep the DGEList object
38+
39+
```{r sampleinfo}
40+
data.counts<-txi$counts
41+
write.table(data.counts,"data.counts.tsv",sep="\t",quote=FALSE)
42+
43+
sampleSheet<-snakemake@input[["sampleSheet"]]
44+
sampleInfo<-read.table(sampleSheet,sep="\t",header=TRUE)
45+
sampleInfo_h1<-sampleInfo
46+
sampleInfo_h2<-sampleInfo
47+
sampleInfo_h1$allele<-"h1"
48+
sampleInfo_h2$allele<-"h2"
49+
allelic_sampleInfo<-as.data.frame(rbind(sampleInfo_h1,sampleInfo_h2))
50+
#rownames(sampleInfo)<-sampleInfo$name
51+
allelic_sampleInfo$unique_name<-with(allelic_sampleInfo,paste0(name,"_",allele))
52+
53+
write.table(allelic_sampleInfo,"allelic_sampleInfo.tsv",sep="\t",quote=FALSE)
54+
data.counts<-data.counts[,allelic_sampleInfo$unique_name]
55+
56+
57+
y <- DGEList(counts = data.counts, samples = allelic_sampleInfo, group = allelic_sampleInfo$allele)
58+
59+
head(y$genes)
60+
61+
#gtf_file<-snakemake@input[["gtf_file"]]
62+
63+
64+
#gtf<-rtracklayer::import(gtf_file)
65+
66+
#ginfo <- mcols(gtf)[match(rownames(y),mcols(gtf)$transcript_id),c("transcript_type","gene_id","gene_name")]
67+
#y$genes <- cbind(y$genes,ginfo)
68+
#head(y$genes)
69+
```
70+
71+
72+
## Filter and normalize
73+
74+
Lowly expressed genes are filtered out prior to the downstream analysis.
75+
Scaling factors can computed using the TMM method to convert the resulting library sizes to effective library sizes.
76+
77+
```{r filt norm}
78+
keep <- filterByExpr(y)
79+
table(keep)
80+
y <- y[keep, , keep.lib.sizes=FALSE]
81+
82+
83+
y <- normLibSizes(y)
84+
y$samples
85+
```
86+
87+
88+
## Calculate MDS
89+
90+
MDS plots can also be used to visualize differences between the expression profiles of different samples with gene-level counts.
91+
92+
```{r MDS}
93+
plotMDS(y,col = c(1:2)[y$samples$allele],labels = y$samples$unique_name,xlim = c(-4,4))
94+
```
95+
96+
## Design matrix
97+
98+
We create the design matrix to compare HEK293 cells against HAP1 cells.
99+
100+
```{r design}
101+
design <- model.matrix(~ allele, data = y$samples)
102+
design
103+
```
104+
105+
106+
## Dispersion estimation
107+
108+
Estimate and visualize NB dispersions.
109+
110+
```{r disp}
111+
y <- estimateDisp(y, design, robust=TRUE)
112+
saveRDS(y,"y.RDS")
113+
y$common.dispersion
114+
plotBCV(y)
115+
116+
```
117+
118+
The NB dispersion estimates will not be used further under the latest quasi-likelihood (QL) pipeline.
119+
For DGE analyses, we're going to use the quasi-likelihood (QL) pipeline for stricter error rate control by accounting for the uncertainty associated with the dispersion estimation.
120+
121+
```{r quasil}
122+
fit <- glmQLFit(y, design, robust=TRUE)
123+
plotQLDisp(fit)
124+
```
125+
126+
## Differential expression
127+
128+
Differentially expressed genes are tested between cell lines using the QL F-test.
129+
130+
```{r dte}
131+
qlf <- glmQLFTest(fit)
132+
is.de <- decideTests(qlf, p.value=0.05)
133+
summary(is.de)
134+
135+
136+
tt <- as.data.frame(topTags(qlf, n = Inf,p.value=0.05))
137+
head(tt)
138+
table(tt$transcript_type)
139+
length(unique(tt$gene_id))
140+
```
141+
## MA plot
142+
143+
```{r ma}
144+
plotMD(qlf)
145+
```
146+
147+
## Save and export results
148+
149+
```{r save}
150+
saveRDS(qlf,"qlf.RDS")
151+
write.table(tt,"topTags_pval0.05.tsv",sep="\t",quote=FALSE)
152+
153+
```
154+
155+
156+
## Session Info
157+
158+
```{r session info}
159+
sessionInfo()
160+
```
161+
162+

0 commit comments

Comments
 (0)