Overview

Long RNA-Seq Data Analysis Using RSEQtools in the Genboree Workbench


Preliminary Steps to Set Up Any Analysis in the Genboree Workbench

  • Create a Group for your analysis - FAQ. This step is optional. You can also use your default/existing group.
    What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
    You control access to your Group(s) and decide who is a member. You can also belong to multiple Groups (e.g. those of your collaborators).
  • Create a Database for your analysis - FAQ. This step is optional. You can also use your default/existing database.
    What is a Database? A Database contains Tracks, Lists, Sample Sets, Samples, and Files.
    Each database can be associated with a reference genome.
  • Create a Redmine Project for your analysis - FAQ. This step is REQUIRED.
    What is a Redmine Project? The Redmine Project holds files (HTML, plots, etc) that contain analysis results from your tool.
  • Upload your data file(s) - FAQ
    What type of files can be uploaded? The long RNA-Seq pipeline using RSEQtools accepts single-end or paired-end FASTQ files as input.
    The input files can be compressed.

Step-by-step Instructions to Set Up Long RNA-Seq Data Analysis

  1. Drag single-end or paired-end FASTQ files to the Input Data panel. The input files can be compressed.
  2. Drag a Database and a Project to Output Targets panel to store results.
  3. Select Transcriptome » Analyze RNA-Seq Data » Analyze RNA-Seq data by RSEQtools from the Toolset menu.
  4. Fill in the appropriate details in the Tool Settings dialog box.
  5. Submit your job. Upon completion of your job, you will receive an email.
  6. Download the results of your analysis from your Database. The results are stored under the RSEQtools folder in the Files area of your output database.
    Within that folder, your Analysis Name will be used as a sub-folder to hold the files generated by that run of the tool.
    • Click on your results file(s) in the Data Selector panel.
    • Select the link Click to Download File from the Details panel to download your results file(s).
  7. View plots from the Projects page.
    • Click on your project name in the Data Selector panel.
    • Click on Link to Project in the Details panel to view your Projects page.
  8. If you would like to visualize your signal tracks in the UCSC Genome Browser, follow these steps:
    • Drag your Database to Output Targets panel.
    • Select Data » Databases » Unlock/Lock Database from the Toolset menu.
    • Click Submit in the Setting Dialog box to unlock your database.
    • Clear the Output Targets panel.
    • Drag your Database to Input Data panel.
    • Select Visualization » UCSC Genome Browser from the Toolset menu.
    • Select the signal tracks with bigwig files (already made by the pipeline).
    • Click Submit in the Setting Dialog box to create the link to visualize the selected tracks in the UCSC Genome Browser.
    • Click Launch UCSC Genome Browser link in the dialog box.

Example Data for Running RSEQtools

This example uses a sample from a deep-sequencing study of the transcriptome changes that occur during the
differentiation of human embryonic stem cells into the neural lineage.

The sample consists of 27-nucleotide single-end reads, which are aligned with Bowtie 2 to the human reference
genome build hg18 and to a splice junction library generated from the UCSC Known Genes annotation set.
The mapped reads are then analyzed using various modules in RSEQtools.

Sample datasets with input and output files can be found here:

  • Under the group Examples and Test Data, select the database RSEQtools hg18 - Example Data
  • Input FASTQ file can be found under: Files » sample.fastq.gz
  • Outputs of RSEQtools pipeline can be found under Files » RSEQtools folder of this database
  • QC Plots from FastQC can be found in the Projects page
  • Custom Bowtie2 indexes can be found under Files » indexFiles » bowtie » [Your custom index folder]
  • Signal tracks are uploaded under the Tracks section of this database.

RSEQtools Pipeline - Workflow Implemented in the Genboree Workbench

  1. Input sequence import: The user uploads single-end or paired-end FASTQ input sequence files to their database in the Workbench
  2. QC FASTQ reads: Input FASTQ sequence reads are checked for quality using FastQC
  3. Map reads to reference genome: Sequence reads are mapped to the reference genome using Bowtie 2
  4. Sort alignments: Alignments in SAM format are sorted using SAMtools
  5. Convert to Mapped Read Format (MRF): Sorted alignments in SAM format are converted to MRF using RSEQtools
  6. Downstream analysis using modules in RSEQtools
    • Gene expression values: Calculate gene expression values using module mrfQuantifier
    • Annotation Coverage: Calculate annotation coverage value using module mrfAnnotationCoverage
    • Mapping Bias: Calculate mapping bias for a given annotation set using module mrfMappingBias
    • Signal Tracks: Generate signal tracks in WIG format using module mrf2wig

RSEQtools Modules used in Genboree Implementation

mrfQuantifier

This module calculates expression values as RPKM (read coverage normalized per million mapped nucleotides
and per kb of annotation model length). Given a set of mapped reads in MRF and an annotation set
(representing exons, transcripts, or gene models), mrfQuantifier calculates an expression value for each annotation entry
by counting all the nucleotides from the reads that overlap with that entry.
This count is then normalized per million mapped nucleotides and per kb of annotation length.
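
The normalization described above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not mrfQuantifier's source code; the function name and argument names are hypothetical.

```python
def rpkm_from_nucleotides(overlap_nt, total_mapped_nt, annotation_length_bp):
    """Nucleotide-based expression value for one annotation entry (hypothetical helper).

    overlap_nt:           nucleotides from reads that overlap the annotation entry
    total_mapped_nt:      nucleotides across all mapped reads in the sample
    annotation_length_bp: length of the annotation entry in bases
    """
    if total_mapped_nt <= 0 or annotation_length_bp <= 0:
        raise ValueError("total_mapped_nt and annotation_length_bp must be positive")
    per_million = overlap_nt / (total_mapped_nt / 1e6)   # per million mapped nucleotides
    return per_million / (annotation_length_bp / 1e3)    # per kb of annotation


# 54,000 overlapping nt, 27 million mapped nt, 2 kb annotation entry
print(rpkm_from_nucleotides(54_000, 27_000_000, 2_000))  # 1000.0
```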

mrfMappingBias

This module calculates mapping bias for a given annotation set. It aggregates mapped reads that overlap with
transcripts (specified in file.annotation) and outputs the counts over a standardized transcript
divided into 100 equally sized bins, where 0 represents the 5' end of the transcript and
1 denotes the 3' end. This analysis is done in a strand-specific way.
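
To make the binning idea concrete, here is a minimal Python sketch. It assumes a simplified input that is not the MRF format: each read is reduced to a single start coordinate, all features lie on one chromosome, and transcripts are given as hypothetical `(start, end, strand)` tuples with 0-based, half-open coordinates. Strand handling here means only that positions are measured from the 5' end of each transcript; the real module may treat strands differently.

```python
def mapping_bias_counts(transcripts, read_starts, num_bins=100):
    """Count reads per relative-position bin across all transcripts (sketch)."""
    counts = [0] * num_bins
    for start, end, strand in transcripts:
        length = end - start
        if length <= 0:
            continue  # skip malformed entries
        for pos in read_starts:
            if not (start <= pos < end):
                continue  # read does not fall inside this transcript
            # Distance from the 5' end: left edge on '+', right edge on '-'
            offset = pos - start if strand == "+" else end - 1 - pos
            counts[offset * num_bins // length] += 1
    return counts


transcripts = [(0, 1000, "+"), (2000, 3000, "-")]
counts = mapping_bias_counts(transcripts, [0, 999, 2000, 2999])
print(counts[0], counts[99])  # 2 2
```

A flat profile across the bins suggests uniform coverage along transcripts, while an excess in the last bins points to 3' bias.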

mrfAnnotationCoverage

This module calculates annotation coverage. It samples a set of mapped reads and determines the
fraction of transcripts (specified in the annotation file) that have at least k-times uniform coverage for a chosen coverage factor k.
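
One way to read this calculation is sketched below in Python. It rests on several assumptions that this page does not confirm: the threshold is a multiple of uniform coverage (called `coverage_factor` here, a hypothetical name), uniform coverage of a transcript is taken as overlapping read nucleotides divided by transcript length, and all intervals are 0-based, half-open, and on one chromosome.

```python
import random


def fraction_covered(transcripts, reads, sample_size, coverage_factor, seed=0):
    """Fraction of transcripts whose mean coverage reaches coverage_factor (sketch).

    transcripts: dict mapping a name to a (start, end) interval
    reads:       list of (start, end) intervals for mapped reads
    """
    if not transcripts:
        return 0.0
    rng = random.Random(seed)  # fixed seed so the sampling is repeatable
    sampled = rng.sample(reads, min(sample_size, len(reads)))
    covered = 0
    for start, end in transcripts.values():
        # Total read nucleotides that overlap this transcript
        overlap_nt = sum(
            max(0, min(end, r_end) - max(start, r_start))
            for r_start, r_end in sampled
        )
        if overlap_nt / (end - start) >= coverage_factor:
            covered += 1
    return covered / len(transcripts)


transcripts = {"a": (0, 100), "b": (200, 300)}
reads = [(0, 50), (50, 100), (0, 100)]
print(fraction_covered(transcripts, reads, sample_size=3, coverage_factor=2))  # 0.5
```

Repeating the call with increasing `sample_size` shows how quickly the annotation set saturates as sequencing depth grows.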

mrf2wig

Generates a signal track (WIG) of mapped reads from an MRF file. By default, the values in the
WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.
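
The Python sketch below illustrates the idea with a variableStep WIG track. It is not mrf2wig itself: the input is a hypothetical list of 1-based, inclusive read intervals on a single chromosome rather than an MRF file, and the two-decimal formatting is an arbitrary choice.

```python
from collections import Counter


def wig_lines(chrom, reads, total_mapped_reads):
    """Build variableStep WIG lines of read depth per million mapped reads (sketch)."""
    depth = Counter()
    for start, end in reads:
        for pos in range(start, end + 1):   # inclusive interval
            depth[pos] += 1
    scale = 1e6 / total_mapped_reads        # normalize per million mapped reads
    lines = [f"variableStep chrom={chrom} span=1"]
    for pos in sorted(depth):               # only covered positions are stored
        lines.append(f"{pos} {depth[pos] * scale:.2f}")
    return lines


for line in wig_lines("chr1", [(1, 3), (3, 4)], total_mapped_reads=2_000_000):
    print(line)
```

Because the counter only holds positions touched by at least one read, zero-coverage positions never appear in the output, which keeps the track compact.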

References and Attributions

  • Lukas Habegger, Andrea Sboner, Tara A. Gianoulis, Joel Rozowsky, Ashish Agarwal, Michael Snyder, Mark Gerstein.
    RSEQtools: A modular framework to analyze RNA-Seq data using compact, anonymized data summaries.
    Bioinformatics. 2010 Dec 5; 27(2): 281-283. [PubMed]
  • Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods.
    2012 Mar 4; 9: 357-359. [PubMed]
  • Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and
    1000 Genome Project Data Processing Subgroup. The Sequence alignment/map (SAM) format and SAMtools.
    Bioinformatics. 2009; 25: 2078-9. [PubMed]
  • RSEQtools was developed by the Gerstein Lab
    at Yale University
  • Integrated into the Genboree Workbench
    by Sai Lakshmi Subramanian
    at the Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX.
