Commit 5c28b51

Update README.md
1 parent e78012f commit 5c28b51

1 file changed

+45
-1
lines changed

README.md

Lines changed: 45 additions & 1 deletion
@@ -1 +1,45 @@
1-
# compendium-expression-matrix
1+
# compendium-expression-matrix
2+
Build a matrix of samples vs genes out of individual `rsem_genes.results` files generated by RSEM.
3+
4+
```
5+
usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR]
6+
[--mapfile ENSEMBL_HUGO_MAPPING_FILE]
7+
```
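The usage line above implies two required options, one required choice between two metric flags, and two optional paths. As an illustration only (this is not taken from the script's source), an interface of that shape could be declared with `argparse` roughly as follows. The `dest` names and both default values are assumptions.

```python
import argparse


def build_parser():
    """Sketch of a parser matching the usage line; not the script's actual code."""
    parser = argparse.ArgumentParser(prog="build_compendium_matrix.py")
    parser.add_argument("--name", dest="compendium_name", required=True,
                        metavar="COMPENDIUM_NAME")
    parser.add_argument("--input", dest="inputfile", required=True,
                        metavar="INPUTFILE")

    # Exactly one of --tpm / --expected-count must be given.
    metric = parser.add_mutually_exclusive_group(required=True)
    metric.add_argument("--tpm", action="store_true")
    metric.add_argument("--expected-count", action="store_true")

    # Assumed defaults: current directory for both optional paths.
    parser.add_argument("--output", dest="outputdir", default=".",
                        metavar="OUTPUTDIR")
    parser.add_argument("--mapfile",
                        default="EnsGeneID_Hugo_Observed_Conversions.txt",
                        metavar="ENSEMBL_HUGO_MAPPING_FILE")
    return parser
```

With a parser like this, passing both `--tpm` and `--expected-count`, or neither, is rejected before any files are read.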
8+
9+
### Usage:
10+
- Gather a list of paths to the `rsem_genes.results` files for your samples.
11+
- Format them as a headerless tab-separated file (e.g. `samples.txt`) with the sample name in the first column and the path in the second. For example:
12+
13+
| | |
14+
| - | - |
15+
| MySample1 | /home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results |
16+
| MySample2 | /home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results |
17+
| Downloaded_XYZ | /shared/downloads/expression/RSEM_XYZ/rsem_genes.results |
18+
19+
- Choose your output format: **Hugo gene names and the log2(TPM+1) metric** (`--tpm`) or **Ensembl gene IDs and the Expected Count metric** (`--expected-count`).
20+
- Optionally, specify the output path with `--output`.
21+
- If the Ensembl-to-Hugo mapping file (`EnsGeneID_Hugo_Observed_Conversions.txt`, provided) is not in the current directory, specify its path with `--mapfile`.
22+
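The sample list described in the steps above can also be written programmatically. The following is a minimal sketch, assuming you already have a mapping of sample names to paths; the helper name `write_sample_list` is hypothetical and not part of this repository.

```python
import csv


def write_sample_list(samples, out_path="samples.txt"):
    """Write a headerless TSV: sample name in column 1, path in column 2.

    `samples` is a dict of sample name -> path to rsem_genes.results.
    """
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name, path in samples.items():
            writer.writerow([name, path])
```

For instance, `write_sample_list({"MySample1": "/home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results"})` would produce a one-line `samples.txt` suitable for `--input`.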
23+
Example:
24+
```
git clone https://github.com/UCSC-Treehouse/compendium-expression-matrix.git
25+
./compendium-expression-matrix/build_compendium_matrix.py \
26+
--name MyCompendium \
27+
--input samples.txt \
28+
--tpm \
29+
--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt
30+
```
31+
32+
### Output
33+
Running this script generates a gzipped TSV named `NAME_METRIC_DATE.tsv.gz`, where `DATE` is the date of the run.
34+
For example, `MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz`.
35+
It also generates an intermediate JSON file, `samples_vs_rgr_files_NAME_DATE.json`, which can be deleted.
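If you need to locate the output from another script, the file name can be reconstructed from the pattern above. This is a sketch based only on the example file name shown here; the helper `output_filename` is hypothetical, and `hugo_log2tpm` is the only metric label confirmed by this README.

```python
from datetime import date


def output_filename(name, metric_label, run_date=None):
    """Build NAME_METRIC_DATE.tsv.gz with an ISO-formatted date (YYYY-MM-DD)."""
    run_date = run_date or date.today()
    return f"{name}_{metric_label}_{run_date.isoformat()}.tsv.gz"
```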
36+
37+
The first column of the TSV is named `Gene` and consists of the gene names (either Hugo or Ensembl) in alphabetical order.
38+
Each subsequent column is named after the corresponding sample and contains that sample's expression value for each gene.
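Given the layout described above (a `Gene` column followed by one column per sample), the output can be read back with the Python standard library. The sketch below is illustrative; `read_matrix` and `log2tpm_to_tpm` are hypothetical helper names, and it assumes every expression cell parses as a number.

```python
import csv
import gzip


def read_matrix(path):
    """Return (sample_names, {gene: [value per sample]}) from a gzipped TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        if header[0] != "Gene":
            raise ValueError(f"expected first column 'Gene', got {header[0]!r}")
        samples = header[1:]
        matrix = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, matrix


def log2tpm_to_tpm(value):
    """Invert log2(TPM + 1) to recover TPM."""
    return 2 ** value - 1
```

The inverse transform only applies to files produced with `--tpm`; Expected Count output is not log-transformed according to the option descriptions above.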
39+
40+
### Mapping note
41+
Input `rsem_genes.results` files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using the `EnsGeneID_Hugo_Observed_Conversions.txt` mapping file. In some cases, multiple Ensembl IDs map to the same Hugo name. To merge these rows within a sample,
42+
the values of all matching Ensembl IDs are summed to produce the value associated with the Hugo name.
43+
In addition, Ensembl IDs that map to `NA` are dropped from the Hugo output file.
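The collapsing rule described in this note can be sketched for a single sample as follows. This is an illustration, not the script's implementation: the function name and the gene IDs in the usage note are hypothetical, the mapping is assumed to be already loaded as a dict, and treating IDs absent from the mapping like `NA` is an assumption. The README does not state whether the sum is taken before or after the log2 transform, so the sketch simply sums whatever values it is given.

```python
from collections import defaultdict


def collapse_to_hugo(ensembl_values, ensembl_to_hugo):
    """Sum one sample's values over Ensembl IDs that share a Hugo name.

    IDs mapped to "NA" are dropped; IDs missing from the mapping are
    also dropped (an assumption made for this sketch).
    """
    totals = defaultdict(float)
    for ensembl_id, value in ensembl_values.items():
        hugo = ensembl_to_hugo.get(ensembl_id, "NA")
        if hugo == "NA":
            continue
        totals[hugo] += value
    # Return genes in alphabetical order, matching the output description.
    return dict(sorted(totals.items()))
```

For example, if the hypothetical IDs `ENSG_X1` and `ENSG_X2` both map to `GENE_A`, their values are added into a single `GENE_A` entry, while an ID mapped to `NA` contributes nothing.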
44+
45+
