Update README.md

e-t-k · web-flow · commit 5c28b5115380 · 2025-03-07T13:25:57.000-08:00
diff --git a/README.md b/README.md
@@ -1 +1,45 @@
-# compendium-expression-matrix
+# compendium-expression-matrix
+Build a matrix of samples vs genes out of individual `rsem_genes.results` files generated by RSEM.
+
+```
+usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR]
+                                  [--mapfile ENSEMBL_HUGO_MAPPING_FILE]
+```
+
+### Usage
+- Gather a list of paths to the `rsem_genes.results` files for your samples.
+- List them in a headerless tab-separated file (e.g. `samples.txt`) with the sample name in the first column and the path in the second. For example:
+
+| | |
+| - | - |
+| MySample1 | /home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results |
+| MySample2 | /home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results |
+| Downloaded_XYZ | /shared/downloads/expression/RSEM_XYZ/rsem_genes.results |
+
+- Choose your output format: **Hugo gene names and the log2(TPM+1) metric** (`--tpm`) or **Ensembl gene IDs and the Expected Count metric** (`--expected-count`).
+- Optionally, specify the output path with `--output`.
+- If the Ensembl-to-Hugo mapping file (`EnsGeneID_Hugo_Observed_Conversions.txt`, provided) is not in the current directory, specify its path with `--mapfile`.
+
+Example:
+```
+git clone https://github.com/UCSC-Treehouse/compendium-expression-matrix.git
+./compendium-expression-matrix/build_compendium_matrix.py \
+--name MyCompendium \
+--input samples.txt \
+--tpm \
+--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt
+```
+
+### Output
+Running this script generates a gzipped TSV named in the format `NAME_METRIC_DATE.tsv.gz`, where `DATE` is today's date.
+For example, `MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz`.
+It also generates an intermediate JSON file, `samples_vs_rgr_files_NAME_DATE.json`, which can be deleted afterward.
+
+The first column of the TSV is named `Gene` and consists of the gene names (either Hugo or Ensembl) in alphabetical order.
+Each subsequent column is named after the corresponding sample and contains that sample's expression value for each gene.
+
+### Mapping note
+Input `rsem_genes.results` files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using the `EnsGeneID_Hugo_Observed_Conversions.txt` mapping file. In some cases, multiple Ensembl IDs map to the same Hugo name. To collapse these rows within a sample,
+the values of all matching Ensembl IDs are summed together to produce the value associated with the Hugo name.
+In addition, Ensembl IDs that map to `NA` are dropped from the Hugo output file.
+
+