|
1 | | -# compendium-expression-matrix |
| 1 | +# compendium-expression-matrix |
| 2 | +Build a matrix of samples vs genes out of individual `rsem_genes.results` files generated by RSEM. |
| 3 | + |
| 4 | +``` |
| 5 | +usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR] |
| 6 | + [--mapfile ENSEMBL_HUGO_MAPPING_FILE] |
| 7 | +``` |
| 8 | + |
| 9 | +### Usage
| 10 | +- Gather a list of paths to the `rsem_genes.results` files for your samples. |
| 11 | +- Format them as a headerless tab-separated file (e.g. `samples.txt`) with the sample name in the first column and the path in the second. For example:
| 12 | + |
| 13 | +| | | |
| 14 | +| - | - | |
| 15 | +| MySample1 | /home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results | |
| 16 | +| MySample2 | /home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results | |
| 17 | +| Downloaded_XYZ | /shared/downloads/expression/RSEM_XYZ/rsem_genes.results | |
| 18 | + |
| 19 | +- Choose your output format: **Hugo gene names and the log2(TPM+1) metric** (`--tpm`) or **Ensembl gene IDs and the Expected Count metric** (`--expected-count`).
| 20 | +- Optionally, specify the output path with `--output`. |
| 21 | +- If the Ensembl-to-Hugo mapping file (`EnsGeneID_Hugo_Observed_Conversions.txt`, provided) is not in the current directory, specify its path with `--mapfile`. |
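The sample list described above can also be written programmatically. The following is a minimal sketch, not part of this repository: the helper name `write_sample_list` is hypothetical, and it assumes only what the steps above state (no header, tab-separated, sample name first, path second).

```python
import csv


def write_sample_list(samples, path):
    """Write a headerless two-column TSV: sample name, then results path.

    `samples` maps sample name -> path to that sample's rsem_genes.results.
    (Hypothetical helper for illustration; not provided by the repository.)
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_name, results_path in samples.items():
            writer.writerow([sample_name, results_path])


if __name__ == "__main__":
    write_sample_list(
        {
            "MySample1": "/home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results",
            "MySample2": "/home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results",
        },
        "samples.txt",
    )
```

The resulting `samples.txt` can then be passed to `--input`.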
| 22 | + |
| 23 | +Example: |
| 24 | +```
| | +git clone https://github.com/UCSC-Treehouse/compendium-expression-matrix.git
| 25 | +./compendium-expression-matrix/build_compendium_matrix.py \ |
| 26 | +--name MyCompendium \ |
| 27 | +--input samples.txt \ |
| 28 | +--tpm \ |
| 29 | +--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt |
| 30 | +``` |
| 31 | + |
| 32 | +### Output |
| 33 | +Running this script generates a gzipped TSV named in the format `NAME_METRIC_DATE.tsv.gz`, where `DATE` is today's date.
| 34 | +For example, `MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz`.
| 35 | +It also generates an intermediate JSON file, `samples_vs_rgr_files_NAME_DATE.json`, which can be deleted.
| 36 | + |
| 37 | +The first column of the TSV is named `Gene` and consists of the gene names (either Hugo or Ensembl) in alphabetical order. |
| 38 | +Each subsequent column is named after the corresponding sample and contains that sample's expression value for each gene.
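As an illustration of this layout, the output can be read back with the Python standard library. This is a sketch only: the function name `read_compendium_matrix` is hypothetical, and it assumes every expression cell parses as a number.

```python
import csv
import gzip


def read_compendium_matrix(path):
    """Read a gzipped samples-vs-genes TSV.

    Returns (sample_names, expression), where expression maps each gene
    name to a list of values in the same order as sample_names.
    Assumes the first column is `Gene` and all other cells are numeric.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)      # "Gene", then one column per sample
        sample_names = header[1:]
        expression = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return sample_names, expression
```

For a large compendium, a dataframe library may be more convenient than nested lists; the standard-library version is used here only to keep the example dependency-free.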
| 39 | + |
| 40 | +### Mapping note |
| 41 | +Input `rsem_genes.results` files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using the `EnsGeneID_Hugo_Observed_Conversions.txt` mapping file. In some cases, multiple Ensembl IDs map to the same Hugo name. To merge these rows within a sample,
| 42 | +the values of all matching Ensembl IDs are summed to produce the value associated with the Hugo name.
| 43 | +In addition, Ensembl IDs that map to `NA` are dropped from the Hugo output file.
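The collapsing rule described in this note can be sketched for a single sample as follows. This is an illustration, not the script's actual code: the function name `collapse_to_hugo` and the placeholder IDs are hypothetical, and two points are assumptions because the note does not settle them: the sketch sums whatever values it is given (it does not decide whether summing happens before or after any log transform), and it treats Ensembl IDs absent from the mapping the same as those mapped to `NA`.

```python
from collections import defaultdict


def collapse_to_hugo(ensembl_values, ensembl_to_hugo):
    """Sum one sample's values over Ensembl IDs that share a Hugo name.

    ensembl_values:  Ensembl ID -> expression value for one sample
    ensembl_to_hugo: Ensembl ID -> Hugo name, or the string "NA"
    """
    hugo_values = defaultdict(float)
    for ensembl_id, value in ensembl_values.items():
        # Assumption: an ID missing from the mapping is treated like "NA".
        hugo_name = ensembl_to_hugo.get(ensembl_id, "NA")
        if hugo_name == "NA":
            continue  # dropped from the Hugo output
        hugo_values[hugo_name] += value
    # Sort by gene name to mirror the alphabetical order of the output.
    return dict(sorted(hugo_values.items()))


if __name__ == "__main__":
    # Placeholder IDs and names, not real genes.
    sample = {"ENSG_A": 1.5, "ENSG_B": 2.0, "ENSG_C": 4.0, "ENSG_D": 3.0}
    mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "NA", "ENSG_D": "GENE0"}
    print(collapse_to_hugo(sample, mapping))  # {'GENE0': 3.0, 'GENE1': 3.5}
```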
| 44 | + |
| 45 | + |