ExRNA Data Analysis Using the exceRpt small RNA-seq Pipeline

Much of the material on this page has been taken from the exceRpt GitHub page.
Many of the files generated by exceRpt for a given sample will include that sample's name.
We use the term [sampleName] to refer to this name in a general sense.

Step 1: Look at Your .stats File

The first place to look when trying to understand your results is the .stats file.
Your .stats file can be found directly inside the directory associated with your exceRpt run on Genboree.
It can also be found in the CORE_RESULTS archive generated from your exceRpt run (further described below).
This file summarizes the number of reads that map to each class of targets in the pipeline.
To better understand how the pipeline maps reads to each class of targets (endogenous miRNAs, tRNAs, exogenous miRNAs, etc.), click here.
The linked page contains a useful infographic and an explanation of how the pipeline works.

Here is an example output for a .stats file:

#STATS from the exceRpt smallRNA-seq pipeline v.4.3.2 for sample sample_C5_non_pregnant5_SRR822437_fastq. Run started at 2016-07-12--12:03:16
Stage    ReadCount
input    1291553
successfully_clipped    1291498
failed_quality_filter    138130
failed_homopolymer_filter    49
calibrator    NA
UniVec_contaminants    32636
rRNA    23020
reads_used_for_alignment    1097663
genome    72996
miRNA_sense    26683
miRNA_antisense    0
miRNAprecursor_sense    386
miRNAprecursor_antisense    1
tRNA_sense    1685
tRNA_antisense    0
piRNA_sense    0
piRNA_antisense    0
gencode_sense    7189
gencode_antisense    3555
circularRNA_sense    0
circularRNA_antisense    0
not_mapped_to_genome_or_libs    1024667
#END OF STATS from the exceRpt smallRNA-seq pipeline. Run completed at 2016-07-12--15:15:46

The .stats file above was generated by an exceRpt run with exogenous mapping disabled (endogenous-only).
Your .stats file will have more information if you choose a different exogenous mapping setting.
If you choose the "endogenous + exogenous (miRNA)" setting, your .stats file will have the following extra lines:

repetitiveElements    2080
endogenous_gapped    11627
input_to_exogenous_miRNA    1010951
exogenous_miRNA    9
input_to_exogenous_rRNA    1010942
exogenous_rRNA    852

Finally, if you choose the most extensive exogenous option, "endogenous + exogenous (miRNA + Genome)", your .stats file will also have these lines:

input_to_exogenous_genomes    1010090
exogenous_genomes    2807

You can learn more about the different exogenous settings by viewing the tutorial on exceRpt's settings.
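
As an illustration of how the .stats file shown in this step can be read programmatically, here is a minimal Python sketch. It assumes the layout seen in the example: '#' lines at the top and bottom, a "Stage ReadCount" header, and one whitespace-separated stage/count pair per line, with NA for unavailable values. The function names parse_stats and fraction_of_input are hypothetical and are not part of exceRpt.

```python
def parse_stats(text):
    """Parse the text of a [sampleName].stats file into {stage: count or None}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#STATS' / '#END OF STATS' comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Skip the column header and anything that is not a stage/count pair.
        if len(fields) != 2 or fields[0] == "Stage":
            continue
        stage, value = fields
        counts[stage] = None if value == "NA" else int(value)
    return counts


def fraction_of_input(counts, stage):
    """Return counts[stage] / counts['input'], or None if either is unavailable."""
    total = counts.get("input")
    value = counts.get(stage)
    if not total or value is None:
        return None
    return value / total


# A few lines copied from the example .stats file above.
example = """\
#STATS from the exceRpt smallRNA-seq pipeline v.4.3.2 for sample sample_C5_non_pregnant5_SRR822437_fastq. Run started at 2016-07-12--12:03:16
Stage    ReadCount
input    1291553
calibrator    NA
reads_used_for_alignment    1097663
miRNA_sense    26683
"""

counts = parse_stats(example)
for stage in ("reads_used_for_alignment", "miRNA_sense"):
    print(stage, counts[stage], fraction_of_input(counts, stage))
```

Expressing each stage as a fraction of the input reads makes it easier to compare samples that were sequenced to different depths.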

Step 2: Look at the Contents of your CORE_RESULTS Archive

The next place to look for your analysis is the CORE_RESULTS archive uploaded with every successful exceRpt run.
This archive will be sufficient for most analyses.
We decompress the archive for you in the Genboree Workbench; you can find its contents in the CORE_RESULTS sub-folder associated with a particular run.

When you decompress your CORE_RESULTS archive (or look at the contents on Genboree), you will immediately see the following files in the base directory:

File Name                Description of File
[sampleName].log         Text file containing logging information for this run
[sampleName].qcResult    Text file containing a variety of QC metrics for this sample
[sampleName].stats       Text file containing a variety of alignment statistics for this sample

In general, you shouldn't need to look at the .log file. It contains a detailed log of the different steps performed during the course of the pipeline.
We are happy to look at the .log file to help you if something goes wrong with exceRpt or if you have a question for us.
The .qcResult file will contain data for the QC metrics discussed here.
In addition, the transcriptome complexity provided in the .qcResult file is calculated by dividing the total number of unique sequence alignments by the total number of sequence alignments when aligning to the transcriptome.
The alignments used in this calculation are taken from the endogenousAlignments_Accepted.txt.gz file (described in more detail below and only available in the full results archive).
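
The complexity ratio described above can be sketched in Python as follows. This is only an illustration of the arithmetic (unique divided by total). It assumes that "unique sequence alignments" means distinct aligned sequences, and it takes a plain list of sequences as input; how exceRpt extracts these from endogenousAlignments_Accepted.txt.gz is not shown here. The function name transcriptome_complexity is hypothetical.

```python
def transcriptome_complexity(aligned_sequences):
    """Number of distinct aligned sequences divided by the total number aligned."""
    total = 0
    unique = set()
    for seq in aligned_sequences:
        total += 1
        unique.add(seq)
    if total == 0:
        return None  # no alignments, so the ratio is undefined
    return len(unique) / total


# Toy input: four aligned reads, two distinct sequences.
reads = ["ACGT", "ACGT", "TTGA", "ACGT"]
print(transcriptome_complexity(reads))  # prints 0.5
```

Under this reading, a value close to 1 means most aligned reads have distinct sequences, while a value close to 0 means a few sequences account for most of the alignments.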

The .stats file was discussed above.

There will be a folder in your CORE_RESULTS archive that matches the name of your sample. That folder will contain the following files:

File Name                         Description of File
readCounts_*_sense.txt            Read counts of each annotated RNA using sense alignments
readCounts_*_antisense.txt        Read counts of each annotated RNA using antisense alignments
*.coverage.txt                    Contains read-depth across all gencode transcripts
*.CIGARstats.txt                  Summary of the alignment characteristics for genome-mapped reads
[sampleName].*_fastqc.zip         FastQC output both before and after UniVec/rRNA contaminant removal
[sampleName].*.readLengths.txt    Counts of the number of reads of each length following adapter removal
[sampleName].*.counts             Read counts mapped to UniVec & rRNA (and calibrator oligo, if used) sequences
[sampleName].*.knownAdapterSeq    3' adapter sequence guessed (from known adapters) in a given sample
[sampleName].*.adapterSeq         3' adapter used to clip the reads in a given run
[sampleName].*.qualityEncoding    PHRED encoding guessed for the input sequence reads

If you chose the "endogenous + exogenous (miRNA)" setting (mapping to exogenous miRNA / rRNA),
there will be an additional subfolder named EXOGENOUS_miRNA which will include some additional readCounts files
for exogenous miRNA. There are no readCounts files for exogenous rRNA.

Finally, if you chose the "endogenous + exogenous (miRNA + Genome)" setting (mapping to all of the above as well as exogenous genomes),
there will be an additional subfolder named EXOGENOUS_genomes which will include a taxonomy tree file
named ExogenousGenomicAlignments.result.taxaAnnotated.txt.
This text file provides taxonomy information about the different taxa found in your sample.

When looking at the files above, you'll probably be most interested in the readCounts files.
An example of how these files are formatted can be seen below:

ReferenceID uniqueReadCount totalReadCount multimapAdjustedReadCount multimapAdjustedBarcodeCount
hsa-miR-143-3p:MIMAT0000435:Homo:sapiens:miR-143-3p 1235 4147219 4147219.0 0.0
hsa-miR-10b-5p:MIMAT0000254:Homo:sapiens:miR-10b-5p 1430 2420500 2420241.0 0.0
hsa-miR-10a-5p:MIMAT0000253:Homo:sapiens:miR-10a-5p 1115 784863 784600.5 0.0
hsa-miR-192-5p:MIMAT0000222:Homo:sapiens:miR-192-5p 759 559068 558542.5 0.0

Below, you can see a description of each column:

  • ReferenceID is the ID of each annotated RNA.
  • uniqueReadCount is the number of unique insert sequences attributed to each annotated RNA.
  • totalReadCount is the total number of reads attributable to each annotated RNA.
  • multimapAdjustedReadCount is the count after adjusting for multi-mapped reads.
  • multimapAdjustedBarcodeCount (available only for samples prepped with randomly barcoded 5' and/or 3' adapters such as Bioo) is the number of unique N-mer barcodes
    adjusted for multimapping ambiguity in the insert sequence.

If no reads from your exceRpt run mapped to a given library, there will be no corresponding readCounts file in your CORE_RESULTS archive.
For example, if your sample had no tRNA sense reads, there will be no [sampleName].readCounts_tRNA_sense.txt file.
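
As a rough illustration of working with the readCounts format described above, the Python sketch below parses the five columns and sorts the rows by multimapAdjustedReadCount. It assumes the fields are separated by whitespace (tabs or spaces) and that ReferenceID contains no spaces, as in the example. The function name parse_read_counts is hypothetical.

```python
def parse_read_counts(text):
    """Parse the text of a readCounts_*.txt file into a list of row dictionaries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            continue  # skip lines that do not match the header
        raw = dict(zip(header, fields))
        rows.append({
            "ReferenceID": raw["ReferenceID"],
            "uniqueReadCount": int(raw["uniqueReadCount"]),
            "totalReadCount": int(raw["totalReadCount"]),
            "multimapAdjustedReadCount": float(raw["multimapAdjustedReadCount"]),
            "multimapAdjustedBarcodeCount": float(raw["multimapAdjustedBarcodeCount"]),
        })
    return rows


# Header and two rows copied from the example above.
example = """\
ReferenceID uniqueReadCount totalReadCount multimapAdjustedReadCount multimapAdjustedBarcodeCount
hsa-miR-10a-5p:MIMAT0000253:Homo:sapiens:miR-10a-5p 1115 784863 784600.5 0.0
hsa-miR-143-3p:MIMAT0000435:Homo:sapiens:miR-143-3p 1235 4147219 4147219.0 0.0
"""

rows = parse_read_counts(example)
# List the most abundant annotated RNAs first.
rows.sort(key=lambda row: row["multimapAdjustedReadCount"], reverse=True)
for row in rows:
    print(row["ReferenceID"], row["multimapAdjustedReadCount"])
```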

Step 3 (Optional): Look at the Contents of Your Full Results Archive

If the files given above are not sufficient, you can select the "Upload Full Results" option when launching your exceRpt job.
This will make your exceRpt job upload an archive containing all files created by exceRpt during the processing of your sample(s).
This means that your full results archive will also contain all of the files located in your CORE_RESULTS archive.
Because some of the files inside this archive can be large, we do not recommend choosing this option unless you absolutely need these files.

When you open this archive, you will see a folder with your sample name (just like the CORE_RESULTS archive).
Inside that folder, you will see the following types of files:

Intermediate files containing reads 'surviving' each stage

In order of the exceRpt workflow, these files include the reads remaining after:
  1. 3' adapter clipping
  2. 5'/3' end trimming
  3. read-quality and homopolymer filtering
  4. UniVec contaminant removal
  5. rRNA removal
  6. Transcriptome alignments (ungapped) of reads mapped to the genome
  7. Transcriptome alignments (ungapped) of reads not mapped to the genome
  8. Repetitive elements (only present if exogenous mapping is being done)
  9. Genome allowing gaps / novel splices (only present if exogenous mapping is being done)
  10. Exogenous miRNA (only present if exogenous mapping is being done)
  11. Exogenous rRNA (only present if exogenous mapping is being done)

The names of these files will look like the following:

File Name                  Description of File
[sampleName].*.fastq.gz    Reads remaining after each QC / filtering / alignment step

The one exception is the read file associated with reads remaining after exogenous rRNA alignment.
This file ends in .fq.gz.
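
If you want to check how many reads survive a given stage, one simple approach is to count the records in the corresponding intermediate file. The Python sketch below assumes standard four-line FASTQ records in a gzip-compressed file; the function name count_fastq_reads is hypothetical, and the same function should work for both the .fastq.gz and .fq.gz files.

```python
import gzip


def count_fastq_reads(path):
    """Count the records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        # Ignore blank lines so a trailing newline does not affect the count.
        n_lines = sum(1 for line in handle if line.strip())
    return n_lines // 4
```

Calling count_fastq_reads on each intermediate file in workflow order gives a per-stage view of read loss that can be compared with the counts reported in the .stats file.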

Reads aligned at each step of the pipeline

In order of the exceRpt workflow, these files include reads aligned at the following stages:
  1. UniVec
  2. rRNA
  3. endogenous genome
  4. endogenous transcriptome

The names of these files will look like the following:

File Name                                                 Description of File
filteringAlignments_*.bam                                 Alignments to the UniVec and rRNA sequences
endogenousAlignments_genome*.bam                          Alignments (ungapped) to the endogenous genome
endogenousAlignments_genomeMapped_transcriptome*.bam      Transcriptome alignments (ungapped) of reads mapped to the genome
endogenousAlignments_genomeUnmapped_transcriptome*.bam    Transcriptome alignments (ungapped) of reads not mapped to the genome

Alignment summary information obtained after invoking the library priority

By default, the library priority will choose a miRBase alignment over any other alignment.
For example, if a read is aligned to both a miRNA in miRBase and a miRNA in Gencode, the miRBase alignment is kept and all others discarded.
It is especially important for tRNAs to be chosen in favour of piRNAs, as the latter have quite a large number of misannotations compared to the former.

The names of these files will look like the following:

File Name                               Description of File
endogenousAlignments_Accepted.txt.gz    All compatible alignments against the transcriptome after invoking the library priority
endogenousAlignments_Accepted.dict      Contains the ID(s) of the RNA annotations indexed in the fifth column of the .txt.gz file above
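
To make the idea of a library priority more concrete, here is a small Python sketch. It is not exceRpt's implementation. The only ordering facts taken from this page are that miRBase miRNA alignments win over all others and that tRNAs are preferred over piRNAs; the full LIBRARY_PRIORITY list, the library labels, the (read, library, reference) tuple layout, and the function name apply_library_priority are all assumptions made for illustration.

```python
# Hypothetical priority order, highest priority first. Only "miRNA first" and
# "tRNA before piRNA" come from the text; the rest is assumed for illustration.
LIBRARY_PRIORITY = ["miRNA", "tRNA", "piRNA", "gencode", "circRNA"]


def apply_library_priority(alignments, priority=LIBRARY_PRIORITY):
    """Keep, for each read, only the alignments to its highest-priority library.

    alignments is an iterable of (read_id, library, reference_id) tuples.
    Returns {read_id: [(library, reference_id), ...]}.
    """
    rank = {library: i for i, library in enumerate(priority)}
    best = {}
    for read_id, library, reference_id in alignments:
        if library not in rank:
            continue  # ignore libraries that are not in the priority list
        r = rank[library]
        current = best.get(read_id)
        if current is None or r < current[0]:
            # A higher-priority library replaces everything seen so far.
            best[read_id] = (r, [(library, reference_id)])
        elif r == current[0]:
            # Several alignments within the same library are all kept.
            current[1].append((library, reference_id))
    return {read_id: hits for read_id, (_, hits) in best.items()}


# Toy alignments with made-up reference IDs.
toy = [
    ("read1", "gencode", "G1"),
    ("read1", "miRNA", "hsa-miR-143-3p"),
    ("read2", "piRNA", "P1"),
    ("read2", "tRNA", "T1"),
]
print(apply_library_priority(toy))
```

In this toy input, read1 keeps only its miRNA alignment and read2 keeps only its tRNA alignment, mirroring the behaviour described above.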

Step 4 (Optional): Look at Post-processed Results

If you submitted dozens or even hundreds of samples for processing, you might not want to crawl through each sample's read count files.
In this situation, we recommend looking at your submission's post-processed results.
The post-processing tool combines the information from each sample into summary files that cover all of the samples.
For example, if you submitted 100 samples for processing, you could look at the [analysisName]_miRNA_ReadCounts.txt file
in your post-processed results to see miRNA read counts for all 100 samples at the same time.

You can learn more about these results here.
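
For a sense of what combining per-sample counts into one table involves, here is a minimal Python sketch. It is not the post-processing tool itself and says nothing about the exact layout of its output files. The function name combine_counts, the sample names, and the counts are made up for illustration, and filling missing entries with 0 is an assumption.

```python
def combine_counts(per_sample):
    """Combine {sample: {reference_id: count}} into one table.

    Returns (samples, table), where samples is a sorted list of sample names and
    table maps each reference_id to a list of counts in the same sample order.
    """
    samples = sorted(per_sample)
    references = sorted({ref for counts in per_sample.values() for ref in counts})
    table = {
        # Assumption: an RNA that is absent from a sample gets a count of 0.
        ref: [per_sample[sample].get(ref, 0) for sample in samples]
        for ref in references
    }
    return samples, table


# Made-up counts for two hypothetical samples.
per_sample = {
    "sampleA": {"hsa-miR-143-3p": 10, "hsa-miR-10b-5p": 4},
    "sampleB": {"hsa-miR-143-3p": 7},
}
samples, table = combine_counts(per_sample)
print("ReferenceID", *samples)
for ref, counts in table.items():
    print(ref, *counts)
```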
