Overview
- ExRNA Data Analysis Using the exceRpt small RNA-seq Pipeline
- Step 1: Look at Your .stats File
- Step 2: Look at the Contents of your CORE_RESULTS Archive
- Step 3 (Optional): Look at the Contents of Your Full Results Archive
- Step 4 (Optional): Look at Post-processed Results
ExRNA Data Analysis Using the exceRpt small RNA-seq Pipeline
Much of the material on this page has been taken from the exceRpt GitHub page.
Many of the files generated by exceRpt for a given sample will include that sample's name.
We use the term [sampleName] to refer to this name in a general sense.
Step 1: Look at Your .stats File
The first place to look when trying to understand your results is the .stats file.
Your .stats file can be found directly inside the directory associated with your exceRpt run on Genboree.
It can also be found in the CORE_RESULTS archive generated from your exceRpt run (further described below).
This file summarizes the number of reads that map to each class of targets in the pipeline.
In order to better understand how the pipeline maps to each class of targets (endogenous miRNAs, tRNAs, exogenous miRNAs, etc.), click here.
This link contains a useful infographic and an explanation of how the pipeline works.
Here is an example output for a .stats file:
#STATS from the exceRpt smallRNA-seq pipeline v.4.3.2 for sample sample_C5_non_pregnant5_SRR822437_fastq. Run started at 2016-07-12--12:03:16
Stage ReadCount
input 1291553
successfully_clipped 1291498
failed_quality_filter 138130
failed_homopolymer_filter 49
calibrator NA
UniVec_contaminants 32636
rRNA 23020
reads_used_for_alignment 1097663
genome 72996
miRNA_sense 26683
miRNA_antisense 0
miRNAprecursor_sense 386
miRNAprecursor_antisense 1
tRNA_sense 1685
tRNA_antisense 0
piRNA_sense 0
piRNA_antisense 0
gencode_sense 7189
gencode_antisense 3555
circularRNA_sense 0
circularRNA_antisense 0
not_mapped_to_genome_or_libs 1024667
#END OF STATS from the exceRpt smallRNA-seq pipeline. Run completed at 2016-07-12--15:15:46
The .stats file above was generated by an exceRpt run with exogenous mapping disabled (endogenous-only).
Your .stats file will have more information if you choose a different exogenous mapping setting.
If you choose the "endogenous + exogenous (miRNA)" setting, your .stats file will have the following extra lines:
repetitiveElements 2080
endogenous_gapped 11627
input_to_exogenous_miRNA 1010951
exogenous_miRNA 9
input_to_exogenous_rRNA 1010942
exogenous_rRNA 852
Finally, if you choose the most extensive exogenous option, "endogenous + exogenous (miRNA + Genome)", your .stats file will also have these lines:
input_to_exogenous_genomes 1010090
exogenous_genomes 2807
You can learn more about the different exogenous settings by viewing the tutorial on exceRpt's settings.
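The .stats layout described above lends itself to simple scripted checks. The following is a minimal Python sketch, not part of exceRpt itself. It assumes the file holds one whitespace-separated "stage count" pair per line, with '#' comment lines at the top and bottom and a "Stage ReadCount" header; the function name parse_stats is hypothetical. The numbers are a subset of the example shown earlier.

```python
def parse_stats(text):
    """Map each stage name to its read count (None when the count is 'NA').

    Assumption: one whitespace-separated "stage count" pair per line.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, '#' comment lines, and the "Stage ReadCount" header.
        if line.startswith("#") or len(fields) != 2 or fields == ["Stage", "ReadCount"]:
            continue
        stage, value = fields
        counts[stage] = None if value == "NA" else int(value)
    return counts


# A subset of the example .stats output from this page.
example = """#STATS from the exceRpt smallRNA-seq pipeline v.4.3.2 for sample sample_C5_non_pregnant5_SRR822437_fastq. Run started at 2016-07-12--12:03:16
Stage ReadCount
input 1291553
successfully_clipped 1291498
calibrator NA
reads_used_for_alignment 1097663
genome 72996
miRNA_sense 26683
not_mapped_to_genome_or_libs 1024667
#END OF STATS from the exceRpt smallRNA-seq pipeline. Run completed at 2016-07-12--15:15:46
"""

stats = parse_stats(example)

# Fraction of alignment-ready reads that were assigned to sense miRNAs.
mirna_fraction = stats["miRNA_sense"] / stats["reads_used_for_alignment"]
print(f"miRNA_sense fraction of reads used for alignment: {mirna_fraction:.4f}")
```

Dividing each class by reads_used_for_alignment rather than by input is a choice made for this sketch; use whichever denominator suits your comparison.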
Step 2: Look at the Contents of your CORE_RESULTS Archive
The next place to look for your analysis is the CORE_RESULTS archive uploaded with every successful exceRpt run.
This archive will be sufficient for most analyses.
We decompress the archive for you in the Genboree Workbench; you can find its contents in the CORE_RESULTS sub-folder associated with a particular run.
When you decompress your CORE_RESULTS archive (or look at the contents on Genboree), you will immediately see the following files in the base directory:
File Name | Description of File |
[sampleName].log | Text file containing logging information for this run |
[sampleName].qcResult | Text file containing a variety of QC metrics for this sample |
[sampleName].stats | Text file containing a variety of alignment statistics for this sample |
In general, you shouldn't need to look at the .log file. It contains a detailed log of the different steps performed during the course of the pipeline.
We are happy to look at the .log file to help you if something goes wrong with exceRpt or if you have a question for us.
The .qcResult file will contain data for the QC metrics discussed here.
In addition, the transcriptome complexity provided in the .qcResult file is calculated by dividing the total number of unique sequence alignments by the total number of sequence alignments when aligning to the transcriptome.
The alignments used in this calculation are taken from the endogenousAlignments_Accepted.txt.gz file (described in more detail below and only available in the full results archive).
The .stats file was discussed above.
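As a worked illustration of the transcriptome complexity ratio described above, here is a small Python sketch. The function name and the input numbers are invented for illustration; they do not come from a real exceRpt run, and exceRpt computes this value for you in the .qcResult file.

```python
def transcriptome_complexity(unique_alignments, total_alignments):
    """Ratio of unique sequence alignments to all sequence alignments.

    Returns 0.0 when there are no alignments at all (a choice made for
    this sketch, to avoid dividing by zero).
    """
    if total_alignments == 0:
        return 0.0
    return unique_alignments / total_alignments


# Invented numbers: 2,500 unique sequences among 10,000 transcriptome alignments.
complexity = transcriptome_complexity(2500, 10000)
print(complexity)
```

A value near 1 means most aligned sequences are distinct, while a value near 0 means a few sequences account for most of the alignments.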
There will be a folder in your CORE_RESULTS archive that matches the name of your sample. That folder will contain the following files:
File Name | Description of File |
readCounts_*_sense.txt | Read counts of each annotated RNA using sense alignments |
readCounts_*_antisense.txt | Read counts of each annotated RNA using antisense alignments |
*.coverage.txt | Contains read-depth across all gencode transcripts |
*.CIGARstats.txt | Summary of the alignment characteristics for genome-mapped reads |
[sampleName].*_fastqc.zip | FastQC output both before and after UniVec/rRNA contaminant removal |
[sampleName].*.readLengths.txt | Counts of the number of reads of each length following adapter removal |
[sampleName].*.counts | Read counts mapped to UniVec & rRNA (and calibrator oligo, if used) sequences |
[sampleName].*.knownAdapterSeq | 3' adapter sequence guessed (from known adapters) in a given sample |
[sampleName].*.adapterSeq | 3' adapter used to clip the reads in a given run |
[sampleName].*.qualityEncoding | PHRED encoding guessed for the input sequence reads |
If you chose the "endogenous + exogenous (miRNA)" setting (mapping to exogenous miRNA / rRNA), there will be an additional subfolder named EXOGENOUS_miRNA, which will include some additional readCounts files for exogenous miRNA. There are no readCounts files for exogenous rRNA.
Finally, if you chose the "endogenous + exogenous (miRNA + Genome)" setting (mapping to all of the above as well as exogenous genomes), there will be an additional subfolder named EXOGENOUS_genomes, which will include a taxonomy tree file named ExogenousGenomicAlignments.result.taxaAnnotated.txt.
This text file provides taxonomy information about the different taxa found in your sample.
When looking at the files above, you'll probably be most interested in the readCounts files.
An example of how these files are formatted can be seen below:
ReferenceID | uniqueReadCount | totalReadCount | multimapAdjustedReadCount | multimapAdjustedBarcodeCount |
hsa-miR-143-3p:MIMAT0000435:Homo:sapiens:miR-143-3p | 1235 | 4147219 | 4147219.0 | 0.0 |
hsa-miR-10b-5p:MIMAT0000254:Homo:sapiens:miR-10b-5p | 1430 | 2420500 | 2420241.0 | 0.0 |
hsa-miR-10a-5p:MIMAT0000253:Homo:sapiens:miR-10a-5p | 1115 | 784863 | 784600.5 | 0.0 |
hsa-miR-192-5p:MIMAT0000222:Homo:sapiens:miR-192-5p | 759 | 559068 | 558542.5 | 0.0 |
Below, you can see a description of each column:
- ReferenceID is the ID of each annotated RNA.
- uniqueReadCount is the number of unique insert sequences attributed to each annotated RNA.
- totalReadCount is the total number of reads attributable to each annotated RNA.
- multimapAdjustedReadCount is the count after adjusting for multi-mapped reads.
- multimapAdjustedBarcodeCount (available only for samples prepped with randomly barcoded 5' and/or 3' adapters such as Bioo) is the number of unique N-mer barcodes adjusted for multimapping ambiguity in the insert sequence.
If your exceRpt run mapped no reads to a given library, there will be no corresponding readCounts file in your CORE_RESULTS archive.
For example, if you didn't have any tRNA sense reads, there will be no [sampleName].readCounts_tRNA_sense.txt file.
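The readCounts columns described above can be loaded with a few lines of Python. This is a minimal sketch that assumes the files on disk are tab-delimited with the header row shown in the example; the function name read_counts is hypothetical. The sample rows are copied from the example table on this page.

```python
import csv
import io


def read_counts(handle):
    """Parse a readCounts table (assumed tab-delimited) into typed records."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "ReferenceID": row["ReferenceID"],
            "uniqueReadCount": int(row["uniqueReadCount"]),
            "totalReadCount": int(row["totalReadCount"]),
            "multimapAdjustedReadCount": float(row["multimapAdjustedReadCount"]),
            "multimapAdjustedBarcodeCount": float(row["multimapAdjustedBarcodeCount"]),
        })
    return rows


# Rows from the example table, joined with tabs to mimic the assumed layout.
SAMPLE_LINES = [
    ["ReferenceID", "uniqueReadCount", "totalReadCount",
     "multimapAdjustedReadCount", "multimapAdjustedBarcodeCount"],
    ["hsa-miR-143-3p:MIMAT0000435:Homo:sapiens:miR-143-3p", "1235", "4147219", "4147219.0", "0.0"],
    ["hsa-miR-10b-5p:MIMAT0000254:Homo:sapiens:miR-10b-5p", "1430", "2420500", "2420241.0", "0.0"],
    ["hsa-miR-10a-5p:MIMAT0000253:Homo:sapiens:miR-10a-5p", "1115", "784863", "784600.5", "0.0"],
    ["hsa-miR-192-5p:MIMAT0000222:Homo:sapiens:miR-192-5p", "759", "559068", "558542.5", "0.0"],
]
sample_text = "\n".join("\t".join(fields) for fields in SAMPLE_LINES) + "\n"

rows = read_counts(io.StringIO(sample_text))

# Most abundant RNA by multimap-adjusted count; the short name is the text
# before the first ':' in ReferenceID.
top = max(rows, key=lambda r: r["multimapAdjustedReadCount"])
top_name = top["ReferenceID"].split(":")[0]
total_reads = sum(r["totalReadCount"] for r in rows)
print(top_name, total_reads)
```

When reading real files, check that the path exists before opening it, since a library with no mapped reads produces no readCounts file.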
Step 3 (Optional): Look at the Contents of Your Full Results Archive
If the files given above are not sufficient, you can select the "Upload Full Results" option when launching your exceRpt job.
This will make your exceRpt job upload an archive containing all files created by exceRpt during the processing of your sample(s).
This means that your full results archive will contain all of the files located in your CORE_RESULTS archive.
Because some of the files inside this archive can be large, it is not recommended to choose this option unless you absolutely need these files.
When you open this archive, you will see a folder with your sample name (just like the CORE_RESULTS archive).
Inside that folder, you will see the following types of files:
Intermediate files containing reads 'surviving' each stage
In order of the exceRpt workflow, these files include the reads remaining after:
- 3' adapter clipping
- 5'/3' end trimming
- read-quality and homopolymer filtering
- UniVec contaminant removal
- rRNA removal
- Transcriptome alignments (ungapped) of reads mapped to the genome
- Transcriptome alignments (ungapped) of reads not mapped to the genome
- Repetitive elements (only present if exogenous mapping is being done)
- Genome allowing gaps / novel splices (only present if exogenous mapping is being done)
- Exogenous miRNA (only present if exogenous mapping is being done)
- Exogenous rRNA (only present if exogenous mapping is being done)
The names of these files will look like the following:
File Name | Description of File |
[sampleName].*.fastq.gz | Reads remaining after each QC / filtering / alignment step |
The one exception is the read file associated with reads remaining after exogenous rRNA alignment.
This file ends in .fq.gz.
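To see how many reads survive each stage, you can count the records in these intermediate files. The Python sketch below assumes standard four-line FASTQ records and a local, decompressed copy of the sample folder; the function names and the directory name are hypothetical. It matches both the .fastq.gz and the .fq.gz endings mentioned above.

```python
import glob
import gzip
import os


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def surviving_reads(sample_dir):
    """Map each intermediate read file in sample_dir to its read count."""
    paths = []
    for pattern in ("*.fastq.gz", "*.fq.gz"):
        paths.extend(glob.glob(os.path.join(sample_dir, pattern)))
    return {os.path.basename(p): count_fastq_reads(p) for p in sorted(paths)}


if __name__ == "__main__":
    # "my_sample" is a placeholder for your own [sampleName] folder.
    for name, n_reads in surviving_reads("my_sample").items():
        print(f"{name}\t{n_reads}")
```

Because these files can be large, counting every file may take a while; the counts in the .stats file are the quicker source when they cover the stage you need.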
Reads aligned at each step of the pipeline
In order of the exceRpt workflow, these files include reads aligned at the following stages:
- UniVec
- rRNA
- endogenous genome
- endogenous transcriptome
The names of these files will look like the following:
File Name | Description of File |
filteringAlignments_*.bam | Alignments to the UniVec and rRNA sequences |
endogenousAlignments_genome*.bam | Alignments (ungapped) to the endogenous genome |
endogenousAlignments_genomeMapped_transcriptome*.bam | Transcriptome alignments (ungapped) of reads mapped to the genome |
endogenousAlignments_genomeUnmapped_transcriptome*.bam | Transcriptome alignments (ungapped) of reads not mapped to the genome |
Alignment summary information obtained after invoking the library priority
By default, the library priority will choose a miRBase alignment over any other alignment.
For example, if a read is aligned to both a miRNA in miRBase and a miRNA in Gencode, the miRBase alignment is kept and all others discarded.
It is especially important that tRNA alignments are chosen over piRNA alignments, because piRNA annotations contain considerably more misannotations than tRNA annotations.
The names of these files will look like the following:
File Name | Description of File |
endogenousAlignments_Accepted.txt.gz | All compatible alignments against the transcriptome after invoking the library priority |
endogenousAlignments_Accepted.dict | Contains the ID(s) of the RNA annotations indexed in the fifth column of the .txt.gz file above |
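The library priority described above can be pictured as a simple filter applied to each read's alignments. The Python sketch below is an illustration only, not exceRpt's implementation. The text confirms only that miRBase comes first and that tRNAs rank above piRNAs; the full ordering in LIBRARY_PRIORITY, the library labels, and the function name are assumptions made for this example.

```python
# Assumed ordering for illustration; only "miRNA first" and
# "tRNA before piRNA" are stated on this page.
LIBRARY_PRIORITY = ["miRNA", "tRNA", "piRNA", "gencode", "circRNA"]


def apply_library_priority(alignments, priority=LIBRARY_PRIORITY):
    """Keep only the alignments from the highest-priority library.

    alignments: list of (library, reference_id) tuples for ONE read.
    Alignments to libraries not listed in priority are dropped.
    """
    rank = {library: i for i, library in enumerate(priority)}
    ranked = [a for a in alignments if a[0] in rank]
    if not ranked:
        return []
    best = min(rank[library] for library, _ in ranked)
    return [a for a in ranked if rank[a[0]] == best]


# A read aligned to a miRBase miRNA and to a Gencode annotation
# (the Gencode ID here is a made-up placeholder).
kept = apply_library_priority([
    ("gencode", "example_gencode_transcript"),
    ("miRNA", "hsa-miR-143-3p"),
])
print(kept)
```

In this example only the miRBase alignment is kept, matching the behaviour described for a read that aligns to both miRBase and Gencode.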
Step 4 (Optional): Look at Post-processed Results
If you submitted dozens or even hundreds of samples for processing, you might not want to crawl through each sample's read count files.
In this situation, we recommend looking at your submission's post-processed results.
This tool combines the information from each sample into comprehensive files that cover all of the samples.
For example, if you submitted 100 samples for processing, you could look at the [analysisName]_miRNA_ReadCounts.txt file
in your post-processed results to see miRNA read counts for all 100 samples at the same time.
You can learn more about these results here.
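To make the idea of a combined read-count table concrete, here is a small Python sketch that merges per-sample counts into one table with a column per sample. It is not the post-processing tool's implementation, and the actual layout of the [analysisName]_miRNA_ReadCounts.txt file may differ; the function name, sample names, and counts are invented for illustration.

```python
def combine_counts(per_sample):
    """Merge {sample: {reference_id: count}} into one table.

    Returns (sample_names, table), where table maps each reference ID to a
    list of counts in sample_names order. Missing entries are filled with 0.
    """
    samples = sorted(per_sample)
    references = sorted({ref for counts in per_sample.values() for ref in counts})
    table = {ref: [per_sample[s].get(ref, 0) for s in samples] for ref in references}
    return samples, table


# Invented counts for two hypothetical samples.
per_sample = {
    "sampleA": {"hsa-miR-143-3p": 100, "hsa-miR-10b-5p": 40},
    "sampleB": {"hsa-miR-143-3p": 250},
}
samples, table = combine_counts(per_sample)

# Print a tab-delimited table: one row per miRNA, one column per sample.
print("\t".join(["ReferenceID"] + samples))
for ref, counts in table.items():
    print("\t".join([ref] + [str(c) for c in counts]))
```

Filling missing entries with 0 reflects the fact that a sample with no reads for a given RNA simply has no row for it in its own readCounts file.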