exceRpt Small RNA-seq Data Analysis Pipeline - ¶
v4.6.2 (Newest Version, Version associated with exRNA Atlas)¶
- Various bug fixes for exceRpt and makefile.
v4.4.1 through v4.6.1¶
- Various minor fixes and updated QC plots.
- Now sorts and resolves RDP alignments against the NCBI taxonomy.
- rRNA taxa counts now included in the CORE RESULTS.
- Added column headers to the *result.taxaAnnotated.txt files.
- New method that builds taxonomy trees in a more robust and much faster manner.
- Fixed incorrect counting of reads input to exogenous miRNA.
- Fixed incorrect parsing of piRNA identifiers.
v4.3.1 through v4.4.0¶
- exceRpt now outputs confidence of 3' adapter identification to the .qcResult file.
- Bug fix: circular RNA sense and antisense counts are now reported accurately in the .stats file.
- Fixed some memory issues and improved logging for transcriptome QC.
- Updated exogenous taxonomy plots (fixed node labels).
- Empty post-processing files are no longer generated.
- Calibrator counts file is generated by post-processing script (contains calibrator counts for each sample in submission).
- Various bug fixes and efficiency improvements.
- Enabled variable minimum adapter sequence length for 3' adapter clipping.
v4.2.0 through v4.3.0¶
- Parameter tweak to make exceRpt more likely to identify adapters in short (< 50nt) reads.
- Added metazoa to the exogenous genomes used by exceRpt.
- Streamlined the collection of exogenous alignments.
- Post-processing script now reads, combines, saves, and plots the QC metrics in the .qcResult files.
- Substantially faster exogenous taxonomy tree generation. The algorithm was redesigned to traverse the taxonomy from bottom to top (instead of top to bottom) to find the optimal alignment. At each iteration, traversal starts from the leaf node with the most reads aligned to it.
- Added exogenous taxonomy plots to post-processing script (if full exogenous mapping is selected) - uses NCBI taxonomy data.
- Post-processing script now writes to a file the adapter sequences used for each sample (handy for QC).
- Separated out the reading, normalizing, and saving of data from plotting for future improvements to sample groups.
- Added internal tool to remove duplicate FASTA entries (by header ID or sequence) to tidy up the piRNA references.
- Post-endogenous alignments now very strictly require end-to-end, 0/1 mismatch, of at least 18nt reads.
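The strict post-endogenous alignment requirement above (end-to-end, at most one mismatch, reads of at least 18nt) can be illustrated with a minimal sketch. This is not exceRpt's own code: the function name `passes_strict_filter`, the use of a SAM CIGAR string plus an edit-distance value (such as the `NM` tag) as inputs, and the reading of "end-to-end" as "no soft or hard clipping" are all assumptions made for illustration.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def passes_strict_filter(cigar, mismatches, min_len=18, max_mismatches=1):
    """Hypothetical filter: keep end-to-end alignments of reads >= min_len
    with at most max_mismatches mismatches."""
    ops = CIGAR_OP.findall(cigar)
    # Reject empty or malformed CIGAR strings (e.g. "*" for unmapped reads).
    if not ops or "".join(n + op for n, op in ops) != cigar:
        return False
    # Assumption: "end-to-end" means the read is not clipped at either end.
    if any(op in "SH" for _, op in ops):
        return False
    # Read length = sum of the operations that consume read bases.
    read_len = sum(int(n) for n, op in ops if op in "MI=X")
    return read_len >= min_len and mismatches <= max_mismatches
```

For example, `passes_strict_filter("22M", 1)` accepts a 22nt read with one mismatch, while `passes_strict_filter("20M2S", 0)` rejects a soft-clipped alignment.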
v4.1.0 through v4.1.9¶
- Minor update in adapter reporting.
- Fixed an issue with calculating aggregate mapping qualities over the read length.
- Improved axis labeling in post-processing script output when there are a large number of samples.
- Fixed N mismatches in calibrator oligo alignment.
- Added option to trim N bases from the 5' end of all reads after adapter removal.
- Added 'help' target to makefile to print the available options to the command line.
- Added option to downsample transcriptome alignments.
- Finished migrating UniVec + rRNA alignments to STAR (uses endogenous genome parameters).
v4.0.0 through v4.0.9¶
- Added support for CIGAR strings as an alignment QC option.
- Pipeline now computes read coverage (and entropy) over gencode transcripts.
- Started migrating UniVec + rRNA alignments to STAR.
- Added code to parse transcriptome alignments and calculate coverage over all gencode transcripts.
- Improved adapter identification code to more reliably distinguish similar adapter sequences (e.g., Illumina_1.5_smallRNA_3p and Illumina_1.0_smallRNA_3p).
- More updates to exogenous alignments.
- STAR aligner is now used for endogenous genome, transcriptome, repetitive elements, gapped genome, and miRBase alignment.
- Unified the transcriptome alignments, so there is no longer a (potentially very large) SAM file to be merged and sorted by readID.
- Number of genome mapped reads output to the .stats file now accounts for both genome AND transcriptome mapped reads.
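As an illustration of the per-transcript read coverage and entropy mentioned in this release, here is a minimal sketch. The changelog does not say how exceRpt defines entropy, so this sketch assumes the Shannon entropy (in bits) of the per-base coverage distribution. The function names and the 0-based, half-open alignment intervals are also assumptions.

```python
import math

def coverage(transcript_len, alignments):
    """Per-base read depth over one transcript.

    alignments: iterable of (start, end) intervals, 0-based and half-open
    (an assumed representation, not exceRpt's internal format).
    """
    cov = [0] * transcript_len
    for start, end in alignments:
        for i in range(max(start, 0), min(end, transcript_len)):
            cov[i] += 1
    return cov

def coverage_entropy(cov):
    """Shannon entropy (bits) of the coverage distribution.

    Higher values mean reads are spread more evenly along the transcript;
    0.0 means all coverage sits on a single base (or there is none).
    """
    total = sum(cov)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in cov if c > 0)
```

Under this definition, a transcript whose reads pile up on a few positions scores lower than one covered evenly, which is one way such a metric could flag degradation fragments or alignment artifacts.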
v3.4.1 (not installed on Genboree)¶
- Updated endogenous alignment processing to be more memory efficient by:
  - Splitting the tasks of choosing the best alignment and quantifying alignments into two separate tasks.
  - Updating the exogenous alignment taxonomy analysis to work in batches of reads when there are many alignments.
v3.4.0 (not installed on Genboree)¶
- Added code to better quantify exogenous read counts using the known taxonomy from NCBI.
- Added read-length distribution as fraction of total reads per sample to the post-processing output.
v3.3.0 (Version associated with exRNA Atlas v3 Snapshot)¶
- Now writes a new plain-text file ([sampleID].qcResult) containing quality control (QC) metrics used by the exRNA Communication Consortium.
- Evaluates each sample in terms of PASS/FAIL given the following criteria:
  - Minimum number of reads mapped (sense OR antisense) to the annotated transcriptome > 100,000
  - Minimum percentage of genome-mapped reads that must map (sense OR antisense) to the annotated transcriptome > 50%
- The first line in the .qcResult file is the PASS/FAIL result with the following lines containing information from this sample used to make this decision.
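The two PASS/FAIL criteria above can be expressed as a short check. This is a sketch of the stated thresholds only, not exceRpt's implementation; the function name `qc_result` and its parameter names are hypothetical, and it does not reproduce the layout of the .qcResult file.

```python
def qc_result(transcriptome_reads, genome_reads,
              min_reads=100_000, min_fraction=0.5):
    """Return "PASS" or "FAIL" from the two stated thresholds.

    transcriptome_reads: reads mapped (sense OR antisense) to the
        annotated transcriptome.
    genome_reads: total genome-mapped reads.
    Both comparisons are strict (>), as written in the criteria.
    """
    fraction = transcriptome_reads / genome_reads if genome_reads > 0 else 0.0
    passed = transcriptome_reads > min_reads and fraction > min_fraction
    return "PASS" if passed else "FAIL"
```

For example, a sample with 150,000 transcriptome reads out of 200,000 genome-mapped reads passes both checks, while one with 200,000 out of 500,000 fails the 50% criterion.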
- Improvements to the automatic adapter identification algorithm and added support for the IonTorrent (NEXTflex smallRNA) 3' adapter. Existing support for Illumina and SOLiD adapters is unchanged.
- In samples prepped with random barcodes, reads for which no 3' adapter can be detected/removed are now suppressed from downstream alignment as the 3' random barcode is not guaranteed to be correct for these reads.
- Bowtie (1&2) alignments now respect the phred-encoding of the input fastq.
- Changed options for maximum number of mismatches. Previously, users could select the maximum number of mismatches when mapping to miRNAs, as well as the maximum number of mismatches when mapping to other libraries. Now, users can select the maximum number of mismatches allowed during endogenous alignment (0-3) and exogenous alignment (0-1).
- Bowtie seed length is now alterable.
- Support for endogenous library mapping prioritization. For example, previously, mapping was always done in the same order: miRNA > tRNA > piRNA > Gencode > circRNA. Now, you can change the priority of these libraries, or even remove libraries if you don't want to map to them.
- Previously, the exceRpt small RNA-seq Pipeline used sRNAbench to map reads to the host genome and various small RNA libraries. This new, updated version of exceRpt has its own endogenous alignment and quantification engine which has the following benefits:
  - Much more reliable quantification of non-miRNA libraries
  - Full use of read qualities during alignment
  - Can prioritize alignments to different classes of RNA
  - Output genome alignments in BAM/WIG for viewing in a browser
  - Much better control over memory usage
  - Fully modular species databases
  - Faster for most samples
- In addition, this version of exceRpt adds support for *N random barcodes on the inner edges (3', 5', or both) of adapter sequences. These random barcodes help normalize the read-counts for amplification artifacts and serve as an alternative to the read-count for smallRNA quantitation (the final column in the "readCounts_*.txt" files supplied in your pipeline results).
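To illustrate how random barcodes can serve as an alternative to raw read counts, the sketch below splits a fixed number of barcode bases off the ends of an adapter-clipped read and then counts distinct barcodes per insert sequence. It is a simplified illustration rather than exceRpt's code: the function names, the barcode lengths, and the exact-match grouping of inserts are assumptions.

```python
from collections import defaultdict

def split_barcodes(read, n5=0, n3=4):
    """Split an adapter-clipped read into (barcode, insert).

    n5 / n3: assumed number of random bases at the 5' and 3' ends.
    """
    insert = read[n5:len(read) - n3]
    barcode = read[:n5] + read[len(read) - n3:]
    return barcode, insert

def count_with_barcodes(reads, n5=0, n3=4):
    """Map each insert to (read count, number of distinct barcodes)."""
    read_counts = defaultdict(int)
    barcodes = defaultdict(set)
    for read in reads:
        barcode, insert = split_barcodes(read, n5, n3)
        read_counts[insert] += 1
        barcodes[insert].add(barcode)
    return {seq: (read_counts[seq], len(barcodes[seq])) for seq in read_counts}
```

Reads that share both insert and barcode are likely amplification duplicates, so the distinct-barcode count grows more slowly than the read count for heavily amplified molecules.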
2nd Generation (Discontinued)¶
- We moved the alignment against endogenous repetitive elements (REs) to occur after the main smallRNA alignments performed by sRNAbench, because we noticed that the RE library was able to ‘compete’ for reads that would be better annotated/interpreted as coming from tRNAs, piRNAs, or other transcripts. This competitive alignment never affected miRNAs, as reads are always aligned to miRNAs before other annotated RNAs, but we expect that this update will more faithfully capture reads aligning to repetitive small RNAs, especially tRNAs, piRNAs, and snoRNAs.
- exceRpt still aligns to REs as a final step before aligning to exogenous sequences, as this is critical for removing highly repetitive endogenous sequences that might otherwise be mistaken for exogenous sequences.
- Alignment to all known exogenous genomes: the pipeline uses the STAR alignment tool for mapping reads to all genomes from NCBI and Ensembl.
- Tool UI Settings dialog now allows the user to choose between mapping reads to all exogenous genomes and miRNAs, mapping only to miRNAs from miRBase, or performing endogenous alignments only.
- The tool has a new name: exceRpt, short for extracellular RNA processing tool.
- The smallRNA pipeline now prepares a new results .zip archive with all .grouped files and uploads this to the user's db.
These files are unpacked in the user's db under a properly named directory "GROUPED_FILES".
- This version updates the rRNA libraries to include mitochondrial rRNA and updates the bowtie1 indices for human and mouse piRNAs.
- This version also has new libraries of human and mouse repetitive elements (REs).
- The tool can now process multiple FASTQ/SRA files at a time. Each file can be compressed individually, or the user can upload one or more compressed archive(s) of all (compressed) FASTQ/SRA files.
- The tool now supports new genome versions, namely hg38 and mm10.
- The pipeline also uses the latest version of miRBase (version 21) and the latest Gencode annotations for all supported genomes.
- Updated to latest version of sRNAbench.
- Contaminant removal using the UniVec contaminant database.
- Tool UI settings now include an option to upload a custom spike-in FASTA file or use previously uploaded spike-in libraries.
- Tool UI also has advanced options to set mapping parameters.
- The post-processing tool has been integrated with the latest version of the smallRNA pipeline tool, so the result files of all successful samples are automatically used for post-processing, and the resulting plots and other outputs are uploaded to the user database.
- Initial release of the tool to perform small RNA-seq data analysis of exRNA profiling datasets.
- Performs automatic detection and removal of 3' adapter sequences.
- Performs QC of sequence reads.
- Maps exRNA-seq reads to various small RNA libraries including miRNAs, piRNAs, tRNAs, rRNAs, etc.
- Explicit rRNA filtering and QC.
- Output data includes abundance estimates for each of the requested libraries, a variety of quality control metrics
such as read-length distribution, summaries of reads mapped to each library, and detailed mapping information for each read mapped to each library.