Index by title

DMRR Demo at the ERCC 4th Investigators' Meeting and ISEV Annual Meeting, April 2015

exceRpt small RNA data analysis pipeline Demo

Date: April 20th & 23rd, 2015.

Presenters

Sai Lakshmi Subramanian (Primary Contact)
William Thistlethwaite
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium

Robert Kitchen
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT
Data Integration and Analysis Component (DiAC) of exRNA Communication Consortium


DMRR Workshop at the ERCC 6th Investigators' Meeting , April 2016

DMRR Data Analysis and Bioinformatics Workshop

Date: April 17th, 2016 - Sunday
Location: Bethesda, MD

Presenters

Sai Lakshmi Subramanian
William Thistlethwaite
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium


Overview

Computational Deconvolution Analysis for exRNA Data

Introduction

Tutorial, Part 1: Preliminary Steps

1) Create a Genboree Account

To use our computational deconvolution tool, you will first need a Genboree account.
You can create your Genboree account by visiting the New User Registration page.


2) Log into the Genboree Workbench

After creating your Genboree account, you will need to log into the Genboree Workbench to use the tool.
You can find the Workbench by going to the Genboree homepage and clicking the Workbench button at the top of the page.
Alternatively, you can bookmark the Genboree Workbench directly.


3) Understanding Groups

After logging into your Workbench account, you'll see a screen that looks like this:

On the left side of the screen, in the Data Selector panel, you'll see the different Genboree groups in which you're a member.
Some examples in the screenshot include: Groups are the top-level folder used for organization in the Workbench.

4) Creating a Database

If you look inside your group, you will see that it's currently empty:

Our next step is to create a database to store our data.

After clicking "Refresh" at the top of the Data Selector panel (see below), you should now be able to explore your database:

We're only interested in the Files area for this tool - you can ignore Tracks, Lists & Selections, Sample Sets, and Samples.


Tutorial, Part 2: Processing Raw Sequencing Data

1) Finding Tutorial Sequencing Data

Now that you've set up your Genboree Workbench account, the next step is to process your small RNA-seq data files through exceRpt, our small RNA-seq data processing pipeline.
After completing the tutorial, you'll want to upload your own data files to process and analyze.
For now, though, we've already uploaded a set of data files for you in the Examples and Test Data group:

The deconvolution_test_data.zip archive contains 40 FASTQ files (20 plasma and 20 urine, all healthy subjects) submitted by Alessio Naccarati's group to the exRNA Atlas.
We will drag this archive to the Input Data panel on the right side of the Workbench.
Then, we will drag the database that we created earlier to the Output Targets panel.
The output files from exceRpt will be uploaded to this database.

2) Submitting Sequencing Data for Processing

Next, we'll select exceRpt from the tool menu at the top of the Workbench:

exceRpt has many different options (see our tutorial for more information!), but we're only going to change three of them for this submission.
First, we will update the Analysis Name so that it includes some additional information about our submission.
The analysis name will be used to organize the output files from your submission.
We recommend that you always keep a timestamp of some kind in your analysis name, as it'll help you remember when you submitted each analysis.

Second, we will update 3' Adapter Sequence from "Auto-detect 3' adapter" to "MANUALLY SPECIFY 3' ADAPTER".
Then, in the Manual Input of 3' Adapter Sequence option that appears, we will put AGATCGGAAGAGCACACGTCT.
We are providing the 3' adapter sequence manually (as opposed to having exceRpt guess the sequence) because these samples have a 3' adapter sequence which is not in exceRpt's standard adapter sequence library.

Third, we will enable the Suppress Individual Sample Emails option. Normally, you will receive one email for each sample that is processed - since we are submitting 40 samples, we don't want to receive 40 emails!
This option will suppress these individual sample emails, but you will still receive a few other emails informing you about the progress of your submission.
The option can be found under the Other Advanced Options menu at the bottom of the tool dialog - you will need to expand the menu by clicking the *+*.

After changing these three settings, we'll click Submit.
You should see a notification informing you that your samples have been submitted:

Before proceeding to part 3 of the tutorial, you'll need to wait for your samples to be processed.
Depending on how busy our cluster is, this could take several hours.
If you don't want to wait, you can access the same results via the Examples and Test Data group:


Tutorial, Part 3: Performing Deconvolution

Deconvolution requires two different input files:

We'll describe how to find and/or create both files below.


1) Finding Your exceRpt Results and Input Data File

After your samples have been processed, you'll want to find the results created by exceRpt.
You'll find those results in your database organized by the analysis name you provided:

If you're interested in learning more about your results, you can read our data analysis tutorial.
However, for the deconvolution tool, we're really only interested in one file:

This archive contains RPM-normalized read counts for all of the different ncRNA species mapped by exceRpt (miRNA / piRNA / tRNA / GENCODE annotations / circular RNA).
These read counts are the input data for the deconvolution tool.
Your file will have a slightly different name than mine because your analysis name is different.


2) Creating Your Metadata Text File

The second file required by the deconvolution tool is a text file that contains metadata describing the samples.
You can find an example of this metadata text file in the Examples and Test Data group:

Download this file by clicking on it in the Data Selector and then clicking the "Link to Download File" link in the Details panel (highlighted above).
Upon opening the file in your word processor or Microsoft Excel, you'll notice that:

When you're working with your own samples, you'll create your own metadata file describing your samples and upload it to your database.

IMPORTANTLY, the sample names provided in your metadata file must match the output generated by exceRpt.
During processing, exceRpt will transform the names of your FASTQ files by inserting "sample_" at the beginning and substituting underscores ("_") for any periods, pipes ("|"), or spaces.
To verify that you are providing the correct sample names in your metadata file, you can download the _exceRpt_miRNA_ReadsPerMillion.txt file generated by exceRpt and double-check that the sample names in your metadata file match what is provided there:

For the tutorial, you can just drag our pre-made file into the Input Data panel.
Finally, make sure that you dragged your database to the Output Targets panel.
Your Workbench should now look something like this:


3) Running the Deconvolution Tool

To run the deconvolution tool, select exRNA Computational Deconvolution from the tool menu at the top of the Workbench:

This tool is much simpler than exceRpt - just provide an updated analysis name (much like you did when launching your exceRpt analysis) and click the Submit button.
The tool will likely only take a few minutes to run. Upon completion, you will receive an email informing you that your analysis is ready.


4) Downloading Your Deconvolution Results

Your deconvolution results will be uploaded to your database organized by the analysis name you provided:

You can select any of the output files (explained in more detail below) and then click the "Click to Download File" link in the Details panel to download the output file.
In particular, we recommend downloading the _deconvolutionResults archive, as it will contain all results generated by the tool.


5) Understanding Your Deconvolution Results

Output from the tool includes:

You can ignore the jobFile.json file. This file just contains various internal settings used to process your submission.

Troubleshooting

References and Attributions

  1. Onuchic, V., Hartmaier, R.J., Boone, D.N., Samuels, M.L., Patel, R.Y., White, W.M., Garovic, V.D., Oesterreich, S., Roth, M.E., Lee, A.V., et al. (2016). Epigenomic Deconvolution of Breast Tumors Reveals Metabolic Coupling between Constituent Cell Types. Cell Reports 17, 2075–2086.
  2. Tool designed and implemented by Oscar D. Murillo at the Bioinformatics Research Lab, Baylor College of Medicine, Houston, TX.
  3. Integrated into the Genboree Workbench by William Thistlethwaite at the Bioinformatics Research Lab, Baylor College of Medicine, Houston, TX.

Overview

Fold Change Calculation Using DESeq2

Introduction

This tool will test samples for differential expression using DESeq2 (version 1.6.3).
Currently, the tool allows you to test a given factor (disease, for example) across two different factor levels (control versus Alzheimer's disease, for example).
We will continue to develop this tool and will add new features (like allowing analysis over multiple factors) in the coming months.

Preliminary Steps for Setting Up Any Analysis in the Genboree Workbench

Create a Group for Your Analysis

FAQ

What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
You control access to your Group(s), and who is a member of your group. You can also belong to multiple Groups (i.e. collaborators).

This step is optional. You can also use your default/existing group.

Create a Database for Your Analysis

FAQ

What is a Database? A "Database" contains Tracks, Lists, Sample Sets, Samples, and Files.
Each database can be associated with a reference genome.

This step is required.
Make sure that you pick the proper reference sequence genome (hg19, for example) when creating your database.
Note that we do not have an entry for hg38 or mm10 in our "Reference Sequence" list.
If your data is associated with either of these reference sequence genomes, follow the directions below.

Create a Database for hg38 or mm10

In order to create a database associated with hg38 or mm10, select the "User Will Upload" option for "Reference Sequence"
and provide appropriate values for the Species and Version text boxes as given below:

Your Genome of Interest Species Version
Human genome hg38 Homo sapiens hg38
Mouse genome mm10 Mus musculus mm10

If your genome of interest is not available, please contact William Thistlethwaite for help.

Upload your Data File(s)

FAQ

What types of files can be uploaded?

The Fold Change Calculating Using DESeq2 tool accepts exactly two text files as input.
One file should contain your miRNA read counts, with rows corresponding to miRNA identifiers and columns corresponding to individual sample names.
The other file should contain your sample descriptors, with rows corresponding to individual sample names and columns corresponding to factor names ("condition", "biofluid", etc.).

Step-by-step Instructions to Set Up Job

  1. Drag exactly two text files (with the formatting described above) into the Input Data panel. You can also drag a folder or file entity list if it contains both text files.
  2. Drag a Database to the Output Targets panel to store results.
  3. Select Transcriptome » Differential Expression Analysis » Fold Change Calculating Using DESeq2 from the Toolset menu.
  4. Fill in the analysis name for your tool job. We recommend keeping a timestamp in your analysis name!
  5. Fill in the factor name and the corresponding factor levels for your analysis. For example, if I was using the tutorial files and wanted to examine the "disease" factor and compare "AD" (Alzheimer's disease) to "CONTROL" (healthy controls), I would put the following values:
  6. Select the different ERCC-related submission settings if you are a member of the ERCC. If you are not, then ignore this section.
  7. Choose to upload your results to a remote storage area if you wish to do so. More information about this option can be found here.
  8. Submit your job. Upon completion of your job, you will receive an email.
  9. Download the results of your analysis from your Database. The results data will end up under the DESeq_v1.0.0 folder in the Files area of your output database.

Example Data for Running DESeq2

In this example, we have used a set of miRNA read counts processed by exceRpt for 181 different samples (found in exceRpt_miRNA_ReadCounts.txt).
We have also used a sample descriptor document which contains information about disease and biofluid for each of the 181 samples (found in exceRpt_sample_descriptors.txt).

The sample input files and output results can be found here:

Output Files Generated by Job

After your job successfully completes, you will be able to download 2 different output files:

References and Attributions

  1. M. I. Love, W. Huber, S. Anders: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 2014, 15:550. http://dx.doi.org/10.1186/s13059-014-0550-8
  2. Integrated into the Genboree Workbench by William Thistlethwaite and Sai Lakshmi Subramanian at the Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX.

This tool has been deployed in the context of the exRNA Communication Consortium (ERCC).
Please contact William Thistlethwaite with questions or comments, or for help using it on your own data.


Overview

ExRNA Data Analysis Using the exceRpt small RNA-seq Pipeline

Much of the material on this page has been taken from the exceRpt GitHub page.
Many of the files generated by exceRpt for a given sample will include that sample's name.
We use the term [sampleName] to refer to this name in a general sense.

Step 1: Look at Your .stats File

The first place to look when trying to understand your results is the .stats file.
Your .stats file can be found directly inside the directory associated with your exceRpt run on Genboree.
It can also be found in the CORE_RESULTS archive generated from your exceRpt run (further described below).
This file summarizes the number of reads that map to each class of targets in the pipeline.
In order to better understand how the pipeline maps to each class of targets (endogenous miRNAs, tRNAs, exogenous miRNAs, etc.), click here.
This link contains a useful infographic and an explanation of how the pipeline works.

Here is an example output for a .stats file:

#STATS from the exceRpt smallRNA-seq pipeline v.4.3.2 for sample sample_C5_non_pregnant5_SRR822437_fastq. Run started at 2016-07-12--12:03:16
Stage    ReadCount
input    1291553
successfully_clipped    1291498
failed_quality_filter    138130
failed_homopolymer_filter    49
calibrator    NA
UniVec_contaminants    32636
rRNA    23020
reads_used_for_alignment    1097663
genome    72996
miRNA_sense    26683
miRNA_antisense    0
miRNAprecursor_sense    386
miRNAprecursor_antisense    1
tRNA_sense    1685
tRNA_antisense    0
piRNA_sense    0
piRNA_antisense    0
gencode_sense    7189
gencode_antisense    3555
circularRNA_sense    0
circularRNA_antisense    0
not_mapped_to_genome_or_libs    1024667
#END OF STATS from the exceRpt smallRNA-seq pipeline. Run completed at 2016-07-12--15:15:46

The .stats file above was generated by an exceRpt run with exogenous mapping disabled (endogenous-only).
Your .stats file will have more information if you choose a different exogenous mapping setting.
If you choose the "endogenous + exogenous (miRNA)" setting, your .stats file will have the following extra lines:

repetitiveElements    2080
endogenous_gapped    11627
input_to_exogenous_miRNA    1010951
exogenous_miRNA    9
input_to_exogenous_rRNA    1010942
exogenous_rRNA    852

Finally, if you choose the most extensive exogenous option, "endogenous + exogenous (miRNA + Genome)", your .stats file will also have these lines:

input_to_exogenous_genomes    1010090
exogenous_genomes 2807

You can learn more about the different exogenous settings by viewing the tutorial on exceRpt's settings.

Step 2: Look at the Contents of your CORE_RESULTS Archive

The next place to look for your analysis is the CORE_RESULTS archive uploaded with every successful exceRpt run.
This archive will be sufficient for most analyses.
We decompress the archive for you in the Genboree Workbench - you can find its contents in the CORE_RESULTS sub-folder associated with a particular run.

When you decompress your CORE_RESULTS archive (or look at the contents on Genboree), you will immediately see the following files in the base directory:

File Name Description of File
[sampleName].log Text file containing logging information for this run
[sampleName].qcResult Text file containing a variety of QC metrics for this sample
[sampleName].stats Text file containing a variety of alignment statistics for this sample

In general, you shouldn't need to look at the .log file. It contains a detailed log of the different steps performed during the course of the pipeline.
We are happy to look at the .log file to help you if something goes wrong with exceRpt or if you have a question for us.
The .qcResult file will contain data for the QC metrics discussed here.
In addition, the transcriptome complexity provided in the .qcResult file is calculated by dividing the total number of unique sequence alignments by the total number of sequence alignments when aligning to the transcriptome.
The alignments used in this calculation are taken from the endogenousAlignments_Accepted.txt.gz file (described in more detail below and only available in the full results archive).

The .stats file was discussed above.

There will be a folder in your CORE_RESULTS archive that matches the name of your sample. That folder will contain the following files:

File Name Description of File
readCounts_*_sense.txt Read counts of each annotated RNA using sense alignments
readCounts_*_antisense.txt Read counts of each annotated RNA using antisense alignments
*.coverage.txt Contains read-depth across all gencode transcripts
*.CIGARstats.txt Summary of the alignment characteristics for genome-mapped reads
[sampleName].*_fastqc.zip FastQC output both before and after UniVec/rRNA contaminant removal
[sampleName].*.readLengths.txt Counts of the number of reads of each length following adapter removal
[sampleName].*.counts Read counts mapped to UniVec & rRNA (and calibrator oligo, if used) sequences
[sampleName].*.knownAdapterSeq 3' adapter sequence guessed (from known adapters) in a given sample
[sampleName].*.adapterSeq 3' adapter used to clip the reads in a given run
[sampleName].*.qualityEncoding PHRED encoding guessed for the input sequence reads

If you chose the "endogenous + exogenous (miRNA)" setting (mapping to exogenous miRNA / rRNA),
there will be an additional subfolder named EXOGENOUS_miRNA which will include some additional readCounts files
for exogenous miRNA. There are no readCounts files for exogenous rRNA.

Finally, if you chose the "endogenous + exogenous (miRNA + Genome)" setting (mapping to all of the above as well as exogenous genomes),
there will be an additional subfolder named EXOGENOUS_genomes which will include a taxonomy tree file
named ExogenousGenomicAlignments.result.taxaAnnotated.txt.
This text file will provide taxonomy information about the different taxons found in your sample.

When looking at the files above, you'll probably be most interested in the readCounts files.
An example of how these files are formatted can be seen below:

ReferenceID uniqueReadCount totalReadCount multimapAdjustedReadCount multimapAdjustedBarcodeCount
hsa-miR-143-3p:MIMAT0000435:Homo:sapiens:miR-143-3p 1235 4147219 4147219.0 0.0
hsa-miR-10b-5p:MIMAT0000254:Homo:sapiens:miR-10b-5p 1430 2420500 2420241.0 0.0
hsa-miR-10a-5p:MIMAT0000253:Homo:sapiens:miR-10a-5p 1115 784863 784600.5 0.0
hsa-miR-192-5p:MIMAT0000222:Homo:sapiens:miR-192-5p 759 559068 558542.5 0.0

Below, you can see a description of each column:

If your exceRpt run didn't map to a given library, there will be no corresponding readCounts file in your CORE_RESULTS archive.
For example, if you didn't have any tRNA sense reads, there will be no [sampleName].readCounts_tRNA_sense.txt file.

Step 3 (Optional): Look at the Contents of Your Full Results Archive

If the files given above are not sufficient, you can select the "Upload Full Results" option when launching your exceRpt job.
This will make your exceRpt job upload an archive containing all files created by exceRpt during the processing of your sample(s).
This means that your full results archive will contain all of the files located in your CORE_RESULTS archive.
Because some of the files inside this archive can be large, it is not recommended to choose this option unless you absolutely need these files.

When you open this archive, you will see a folder with your sample name (just like the CORE_RESULTS archive).
Inside that folder, you will see the following types of files:

Intermediate files containing reads 'surviving' each stage

In order of the exceRpt workflow, these files include the reads remaining after:
  1. 3' adapter clipping
  2. 5'/3' end trimming
  3. read-quality and homopolymer filtering
  4. UniVec contaminant removal
  5. rRNA removal
  6. Transcriptome alignments (ungapped) of reads mapped to the genome
  7. Transcriptome alignments (ungapped) of reads not mapped to the genome
  8. Repetitive elements (only present if exogenous mapping is being done)
  9. Genome allowing gaps / novel splices (only present if exogenous mapping is being done)
  10. Exogenous miRNA (only present if exogenous mapping is being done)
  11. Exogenous rRNA (only present if exogenous mapping is being done)

The names of these files will look like the following:

File Name Description of File
[sampleName].*.fastq.gz Reads remaining after each QC / filtering / alignment step

The one exception is the read file associated with reads remaining after exogenous rRNA alignment.
This file ends in .fq.gz.

Reads aligned at each step of the pipeline

In order of the exceRpt workflow, these files include reads aligned at the following stages:
  1. UniVec
  2. rRNA
  3. endogenous genome
  4. endogenous transcriptome

The names of these files will look like the following:

File Name Description of File
filteringAlignments_*.bam Alignments to the UniVec and rRNA sequences
endogenousAlignments_genome*.bam Alignments (ungapped) to the endogenous genome
endogenousAlignments_genomeMapped_transcriptome*.bam Transcriptome alignments (ungapped) of reads mapped to the genome
endogenousAlignments_genomeUnmapped_transcriptome*.bam Transcriptome alignments (ungapped) of reads not mapped to the genome

Alignment summary information obtained after invoking the library priority

By default, the library priority will choose a miRBase alignment over any other alignment.
For example, if a read is aligned to both a miRNA in miRBase and a miRNA in Gencode, the miRBase alignment is kept and all others discarded.
It is especially important for tRNAs to be chosen in favour of piRNAs, as the latter have quite a large number of misannotations compared to the former.

The names of these files will look like the following:

File Name Description of File
endogenousAlignments_Accepted.txt.gz All compatible alignments against the transcriptome after invoking the library priority
endogenousAlignments_Accepted.dict Contains the ID(s) of the RNA annotations indexed in the fifth column of the .txt.gz file above

Step 4 (Optional): Look at Post-processed Results

If you submitted dozens or even hundreds of samples for processing, you might not want to crawl through each sample's read count files.
In this situation, we recommend looking at your submission's post-processed results.
This tool combines the information from each sample into comprehensive files that cover all of the samples.
For example, if you submitted 100 samples for processing, you could look at the [analysisName]_miRNA_ReadCounts.txt file
in your post-processed results to see miRNA read counts for all 100 samples at the same time.

You can learn more about these results here.


Overview

Detecting Circular and Linear Isoforms from RNA-seq Data Using KNIFE

Introduction

This tool performs statistically based splicing detection for circular and linear isoforms from RNA-Seq data.
The tool's statistical algorithm increases the sensitivity and specificity of circularRNA detection from RNA-Seq data by quantifying circular and linear RNA splicing events at both annotated and un-annotated exon boundaries.

Preliminary Steps for Setting Up Any Analysis in the Genboree Workbench

Create a Group for Your Analysis

FAQ

What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
You control access to your Group(s), and who is a member of your group. You can also belong to multiple Groups (i.e. collaborators).

This step is optional. You can also use your default/existing group.

Create a Database for Your Analysis

FAQ

What is a Database? A "Database" contains Tracks, Lists, Sample Sets, Samples, and Files.
Each database can be associated with a reference genome.

This step is required.
Make sure that you pick the proper reference sequence genome (hg19, for example) when creating your database.

The KNIFE tool currently supports the following genomes: hg19, mm10, rn5, and dm3.
However, we do not have entries for mm10, rn5, or dm3 in the reference sequence list within our Create Database tool.

If you want to create a Database associated with mm10, rn5, or dm3, select the "User Will Upload" option for "Reference Sequence".
Then, provide appropriate values for the Species and Version text boxes, as given below:

Your Genome of Interest Species Version
Mouse genome mm10 Mus musculus mm10
Rat genome rn5 Rattus norvegicus rn5
Fly genome dm3 Drosophila melanogaster dm3

All data (inputs and outputs) associated with a given genome should go into that Database.
For example, if you create an hg19 Database, then all of your hg19 data belongs in that Database.
If you have other data that corresponds to another reference genome (mm10, for example), then you should create a second Database to hold that data.

Upload your Data File(s)

FAQ

What types of files can be uploaded?

The KNIFE Workbench tool accepts any number of single-end or paired-end FASTQ files as input for a single submission.
Each single-end FASTQ file (or pair of paired-end FASTQ files) will be processed separately, with a subfolder created for each input (or pair of inputs).
Paired-end FASTQ files must follow a certain naming convention: read1 must end in _1 or _R1, and read2 must end in the accompanying suffix (_2 or _R2).

Your inputs can be compressed in one large archive, multiple archives, etc.
All that matters for processing is that your paired-end FASTQ file names (not the archives containing the FASTQ files!) must follow the naming convention above.
Single-end files should either NOT include a suffix or should end in _1 or _R1.

Step-by-step Instructions to Set Up KNIFE Submission

  1. Drag one or more FASTQ files to the Input Data panel. The input file(s) can be compressed.
    Optionally, you can also upload one or more compressed archives of multiple FASTQ files. Each FASTQ file can also be compressed inside these archives.
    Then, you can drag these multiple input file archives to the Input Data panel.
    Check out the Notes below for more info on preparing your input data files.
  2. Drag a Database to the Output Targets panel to store results.
  3. Select Transcriptome » Analyze Small RNA-Seq Data » exRNA Data Analysis » KNIFE (Known and Novel IsoForm Explorer) from the Toolset menu.
  4. Fill in the analysis name for your tool job. We recommend keeping a timestamp in your analysis name!
  5. Select the different ERCC-related submission settings if you are a member of the ERCC. If you are not, then ignore this section.
  6. Choose to upload your results to a remote storage area if you wish to do so. More information about this option can be found here.
  7. Submit your job. Upon completion of your job, you will receive an email.
  8. Download the results of your analysis from your Database. The results data will end up under the KNIFE_v1.2 folder in the Files area of your output database.

Notes for Preparing Input Data Files

RECOMMENDATION:

IMPORTANT NOTES:

Example Data for Running KNIFE

In this example, we have used a single sample with paired end read data from a human sample. Raw reads were grabbed from SRA, and
TrimGalore (wrapper for cutadapt) was used to trim poor quality ends and the adapter sequence. The original reads were 60nt - after
trimming, we kept all reads where both mates were at least 50nt. The FASTQ files use phred64 encoding (which is required).

The sample input FASTQ files and output results can be found here:

Summary of Output Files Generated by KNIFE

After your job successfully completes, you will be able to download 3 different output files:

Detailed Explanation of Output Files Generated by KNIFE

Within the full results archive, you will find four different subdirectories located in /outputs/[sample name]. These subdirectories are detailed in full below:

  1. circReads: The primary output files you will be interested in looking at are in the following subdirectories.
  2. sampleStats: Contains 2 txt files with high-level alignment statistics per sample (read1 and read2 reported separately).
  3. orig: contains all sam/bam file output and information used to assign reads to categories.
    In general there is no reason to dig into these files since the results, including the ids of reads that aligned to each junction, are output in report files under circReads as described above,
    but sometimes it is useful to dig back through if you want to trace what happened to a particular read.
  4. denovo_script_out: debugging output generated during creation of de novo index.

The core results archive will contain all of the files above except for the orig directory (which contains all sam/bam file output and is very large).

References and Attributions

  1. Szabo L, Morey R, Palpant NJ, Wang PL, Afari N, Jiang C, Parast MM, Murry CE, Laurent LC, Salzman J. Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biology. 2015, 16:126. http://www.ncbi.nlm.nih.gov/pubmed/26076956
  2. Integrated into the Genboree Workbench by William Thistlethwaite and Sai Lakshmi Subramanian at the Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX.

This tool has been deployed in the context of the exRNA Communication Consortium (ERCC).
Please contact Emily LaPlante with questions or comments, or for help using it on your own data.


Overview

Long RNA-Seq Data Analysis Using RSEQtools in the Genboree Workbench

View Screencast (no audio)

Preliminary Steps to Set Up Any Analysis in the Genboree Workbench

Step-by-step Instructions to Set Up Long RNA-Seq Data Analysis

  1. Drag Single or paired-end FASTQ files to Input Data panel. The input files can be compressed.
  2. Drag a Database and a Project to Output Targets panel to store results.
  3. Select Transcriptome » Analyze RNA-Seq Data » Analyze RNA-Seq data by RSEQtools from the Toolset menu.
  4. Fill in appropriate details in the Tool Settings dialog box
  5. Submit your job. Upon completion of your job, you will receive an email.
  6. Download the results of your analysis from your Database. The results data will end up under the RSEQtools folder in the Files area of your output database.
    Within that folder, your Analysis Name will be used as a sub-folder to hold the files generated by that run of the tool.
  7. View plots from the Projects page.
  8. If you would like to visualize your signal tracks in the UCSC Genome Browser, follow these steps:

Example Data for Running RSEQtools

A sample from a deep-sequencing study to analyze the transcriptome changes that occur during the
differentiation of human embryonic stem cells into the neural lineage has been used in this example.

The sample consists of 27 nucleotide single-end reads, that are aligned to human reference genome build hg18
and to a splice junction library generated from the UCSC Known Genes annotation set using Bowtie2.
The mapped reads are then analyzed using various modules in RSEQtools.

Sample datasets with input and output files can be found here:

RSEQTools Pipeline - Workflow Implemented in the Genboree Workbench

  1. Input Sequence import: User uploads single or paired-end FASTQ input sequence files to their database in the workbench
  2. QC FastQ reads: Input FastQ sequence reads are checked for quality using FastQC
  3. Map reads to reference genome: Sequence reads are mapped to reference genome using Bowtie 2
  4. Sort alignments: Alignments in SAM format are sorted using Samtools
  5. Convert to Mapped Read Format (MRF): Sorted Alignments in SAM format are converted to MRF using RSEQtools
  6. Downstream analysis using modules in RSEQtools

RSEQtools Modules used in Genboree Implementation

mrfQuantifier

This module calculates expression values (RPKM; read coverage normalized per million mapped nucleotides
and the length of the annotation model per kb). Given a set of mapped reads in MRF and an annotation set
(representing exons, transcripts, or gene models) mrfQuantifier calculates an expression value for each annotation entry.
This is done by counting all the nucleotides from the reads that overlap with a given annotation entry.
Subsequently, this value is normalized per million mapped nucleotides and the length of the annotation item per kb.

mrfMappingBias

Module to calculate mapping bias for a given annotation set. Aggregates mapped reads that overlap with
transcripts (specified in file.annotation) and outputs the counts over a standardized transcript
(divided into 100 equally sized bins) where 0 represents the 5' end of the transcript and
1 denotes the 3' end of the transcripts. This analysis is done in a strand specific way.

mrfAnnotationCoverage

Module to calculate annotation coverage. Sample a set of mapped reads and determine the
fraction of transcripts (specified in annotation file) that have at least -times uniform coverage.

mrf2wig

Generates signal track (WIG) of mapped reads from a MRF file. By default, the values in the
WIG file are normalized by the total number of mapped reads per million.
Only positions with non-zero values are reported.

References and Attributions


Demo of small and long RNA-Seq pipelines at the ERCC 2nd Investigator's Meeting, May 2014

Live demos of small and long RNA-Seq pipeline given at the break-out sessions.

Date: Monday, May 19th, 2014.
Time: 10.35 a.m. and 3.15 p.m.
Location: Conf Room C

Presenters

Sai Lakshmi Subramanian – sailakss@bcm.edu (Primary Contact)
Kevin Riehle – riehle@bcm.edu
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium

Robert Kitchen - rob.kitchen@yale.edu
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT
Data Integration and Analysis Component (DiAC) of exRNA Communication Consortium

You can download a copy of the presentation in PDF format from here: Presentation


CIBR RNA-seq workshop - Demo of exceRpt small RNA processing pipeline, May 2015

Small RNA-Seq Data Analysis Tools in the Genboree Workbench

Date: Thursday, May 14th, 2015.
Time: 12.00 p.m. to 4.00 p.m.
Location: M321/323, DeBakey Building, Baylor College of Medicine

Presenter

Sai Lakshmi Subramanian (Primary Contact)
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium


Poster Presentation at the ISEV Annual Meeting , May 2016

ISEV Annual Meeting 2016

Dates: 4th-7th May, 2016
Location: Rotterdam, The Netherlands

Presenters

Rocco Lucero (Presenter at the Meeting)
Sai Lakshmi Subramanian
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium


DMRR Demo at the ERCC 3rd Investigator's Meeting, November 2014

Live demos of DMRR Data and Metadata Submission Infrastructure given at the break-out sessions.

Date: November 6th & 7th, 2014.

Presenters

William Thistlethwaite
Aaron Baker
Kevin Riehle
Sai Lakshmi Subramanian (Primary Contact)
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium

Robert Kitchen
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT
Data Integration and Analysis Component (DiAC) of exRNA Communication Consortium


DMRR Talk at the ERCC 5th Investigators' Meeting , November 2015

exRNA Profiling Data Submission & Analysis Infrastructure for the ERC Consortium

Date: November 9th, 2015 - Monday
Location: Rockville, MD

Presenters

Sai Lakshmi Subramanian
Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX
Data Coordination Component (DCC) of exRNA Communication Consortium


Overview

Introduction to Pathway and Interaction Analysis

We have two different tools available on Genboree for pathway and interaction analysis.

The first, Target Interaction Finder, generates miRNA-protein target interaction files for a set of miRNA identifiers,
which can be imported into downstream tools, such as Cytoscape, for network analysis and visualization.

The second, Pathway Finder, performs a search for pathways either containing miRNAs of interest
or protein targets of those miRNAs.

You can visit each tool's tutorial page below.

Target Interaction Finder

You can view the tutorial page for the Target Interaction Finder tool here.

Pathway Finder

You can view the tutorial page for the Pathway Finder tool here.


Overview

Using Pathway Finder to Perform a Search for Pathways Either Containing miRNAs of Interest or Protein Targets of Those miRNAs

Introduction

The Pathway Finder tool takes a column of miRNA identifiers from an input text file and performs a search for pathways either containing miRNAs of interest or protein targets of those miRNAs.
A table of pathway results and an interactive pathway viewer are displayed in an output window after the tool successfully processes the input text file.

Instructional Video for Pathway Finder

Below, you can view an instructional video for using the Pathway Finder Tool on the Genboree Workbench:

Preliminary Steps for Setting Up Any Analysis in the Genboree Workbench

Create a Group for Your Analysis

FAQ

What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
You control access to your Group(s), and who is a member of your group. You can also belong to multiple Groups (i.e. collaborators).

This step is optional. You can also use your default/existing group.

Create a Database for Your Analysis

FAQ

What is a Database? A "Database" contains Tracks, Lists, Sample Sets, Samples, and Files.
Each database can be associated with a reference genome.

This step is optional.
Because this tool doesn't create any on-disk output, you only need to create a Database if you want to upload your own input file(s) for processing.
If you are using an input file in a previously existing Database (like the "Pathway Finder - Example Data" Database in the "Examples and Test Data" Group),
then you do not need to create a new Database.

If you do create a Database, make sure that you pick the proper reference sequence genome (hg19, for example) when creating your database.
Note that we do not have an entry for hg38 or mm10 in our "Reference Sequence" list.
If your data is associated with either of these reference sequence genomes, follow the directions below.

Create a Database for hg38 or mm10

In order to create a database associated with hg38 or mm10, select the "User Will Upload" option for "Reference Sequence"
and provide appropriate values for the Species and Version text boxes as given below:

Your Genome of Interest Species Version
Human genome hg38 Homo sapiens hg38
Mouse genome mm10 Mus musculus mm10

Upload your Data File(s)

FAQ

What types of files can be uploaded? The Pathway Finder tool accepts a single text file as input.
This text file should have a column of miRNA identifiers as its first column.

Step-by-step Instructions to Set Up Pathway Finder Submission

  1. Drag a single text file (with a first column consisting of miRNA identifiers) into the Input Data panel.
  2. Select Visualization » Pathway Finder from the Toolset menu.
  3. Click Submit. When we finish processing your input file, an output window will pop up with pathway results and an interactive pathway viewer.

Output Generated by Pathway Finder

After your input file is successfully processed, you will be able to view a table of pathway results and an interactive pathway viewer.
The first column of the table lists a clickable pathway title that updates the viewer.
The second column lists pathway identifers that link to WikiPathways.org.
The list is sorted by the number of "miRNAs" (primary) and by "miRNA Targets" (secondary) found on each pathway.
The top 20 results are listed.

References and Attributions

  1. WikiPathways, source for curated pathways and miRNA content: Pico AR, et al. (2008) WikiPathways: Pathway Editing for the People. PLoS Biol 6(7). http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0060184
  2. miRTarBase source database for experimentally validated miRNA-protein target interactions: Chou et al. miRTarBase 2016: updates to the experimentally validated miRNA-target interactions database. NAR, Database Issue, Vol 44(D1). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702890/
  3. Pathway Finder tool was designed and implemented by Anders Riutta and Alexander Pico, and the video tutorial was produced by Kristina Hanspers, all at the Gladstone Institutes, San Francisco, CA.
  4. Integrated into the Genboree Workbench by William Thistlethwaite at the Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX.

This tool has been deployed in the context of the exRNA Communication Consortium (ERCC).
Please contact William Thistlethwaite with questions or comments, or for help using it on your own data.


Overview

Small RNA-Seq Data Analysis for exRNA Profiling Using the exceRpt Small RNA-seq Pipeline

Version Updates

Current version: v4.6.2 (as of 10/12/2016)

The newest version of exceRpt is 4.6.2, which contains many updates compared to the previous, 3rd generation version on Genboree (3.3.0).
We currently still give users the option to run jobs using 3rd gen. exceRpt.
Note that some images below may have slightly outdated version numbers, but the content of the images remains otherwise accurate.

To read more about recent updates to exceRpt, view the Version Updates.

Preliminary Steps for Setting Up Any Analysis in the Genboree Workbench

Create a Group for Your Analysis

FAQ

What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
You control access to your Group(s), and who is a member of your group. You can also belong to multiple Groups (i.e. collaborators).

This step is optional. You can also use your default/existing group.

Create a Database for Your Analysis

FAQ

What is a Database? A "Database" contains Tracks, Lists, Sample Sets, Samples, and Files.
Each Database can be associated with a reference genome.

This step is required.
You can leave "Species" and "Version" blank, as those fields are not used by exceRpt.
You can also leave "Reference Sequence" as "** User Will Upload **" - exceRpt uses its own reference files, so it doesn't need to consult your Database's reference sequence.
You will receive a warning message when creating your Database - just ignore this warning and proceed.

Upload your Data File(s)

FAQ

What types of files can be uploaded? The exceRpt Small RNA-Seq Pipeline accepts archives containing one or more single-end FASTQ/SRA file(s) as input.
Your input files for the job MUST be compressed or else the tool will reject your job.
Each submitted archive can contain multiple FASTQ/SRA files,
and within those archives, each FASTQ/SRA can also be compressed.

Step-by-step Instructions for Setting Up Your exceRpt small RNA-Seq Data Analysis

  1. Drag one or more archives containing single-end FASTQ or SRA files to the Input Data panel. The input file(s) must be compressed.
    Each FASTQ/SRA file can also be compressed inside these archives.
    Check out the Notes below for more info on preparing your input data files.
  2. Drag a Database to the Output Targets panel to store results.
  3. Select Transcriptome » Analyze Small RNA-Seq Data » exRNA Data Analysis » exceRpt small RNA-seq Pipeline from the tool menu.
  4. Fill in appropriate details in the Tool Settings dialog box. See Tool Settings for more details.
  5. Submit your job. You will receive several emails as we process your samples.
  6. Download the results of your analysis from your Database. The results data will end up under the exceRptPipeline_v4.6.2 folder in the Files area of your output Database.

Notes for Preparing Input Data Files

RECOMMENDATION:

IMPORTANT NOTES:

Tool Settings

Example Data for Running exceRpt Small RNA-seq Pipeline

In this example, we have used four samples from deep sequencing experiments of barcoded small RNA cDNA libraries to profile microRNAs in
cell-free serum and plasma from human volunteers. These samples were analyzed using the exceRpt Small RNA-seq Pipeline.

The sample input FASTQ files and output results and plots can be found here:

exceRpt Small RNA-seq Pipeline Workflow (Implemented in the Genboree Workbench)

The exceRpt Small RNA-seq Pipeline is for the processing and analysis of RNA-seq data generated to profile small-exRNAs.
The pipeline is highly modular, allowing the user to define the libraries containing smallRNA sequences that are used
during RNA-seq read-mapping, including an option to provide a library of spike-in sequences to allow absolute quantitation
of small-RNA molecules. It also performs automatic detection and removal of 3' adapter sequences.

The output data includes abundance estimates for each of the requested libraries, a variety of quality control metrics
such as read-length distribution, summaries of reads mapped to each library, and detailed mapping information for each read mapped to each library.

exRNA Data Analysis Results

To better understand the results generated by exceRpt, check out the exRNA Data Analysis page.

Finally, after the pipeline finishes processing all submitted samples, a separate post-processing tool (processPipelineRuns)
is run on all successful pipeline outputs. This tool generates useful summary plots and tables that can be used to compare
and contrast different samples.

Currently Supported Genomes

Sources of smallRNA Libraries

Workflow

Example of Core Results from Workflow

Post-processing of Samples

After all samples have been processed through this pipeline, the Generate Summary Report for exceRpt Results tool will take
successful samples and perform post-processing on them.

This post-processing step will generate useful plots and tables that will allow you to compare and contrast samples.
A description of all generated files can be found in the table below (partially taken from the exceRpt GitHub page):

File Name Description of File
QC Data
exceRpt_DiagnosticPlots.pdf All diagnostic plots automatically generated by the tool
exceRpt_readMappingSummary.txt Read-alignment summary including total counts for each library
exceRpt_ReadLengths.txt Read-lengths (after 3' adapters/barcodes are removed)
Raw Transcriptome Quantifications
exceRpt_miRNA_ReadCounts.txt miRNA read-counts quantifications
exceRpt_tRNA_ReadCounts.txt tRNA read-counts quantifications
exceRpt_piRNA_ReadCounts.txt piRNA read-counts quantifications
exceRpt_gencode_ReadCounts.txt gencode read-counts quantifications
exceRpt_circularRNA_ReadCounts.txt circularRNA read-counts quantifications
exceRpt_biotypeCounts.txt read-counts quantified for different biotypes
Normalized Transcriptome Quantifications
exceRpt_miRNA_ReadsPerMillion.txt miRNA RPM quantifications
exceRpt_tRNA_ReadsPerMillion.txt tRNA RPM quantifications
exceRpt_piRNA_ReadsPerMillion.txt piRNA RPM quantifications
exceRpt_gencode_ReadsPerMillion.txt gencode RPM quantifications
exceRpt_circularRNA_ReadsPerMillion.txt circularRNA RPM quantifications
Exogenous Output
exceRpt_exogenousGenomes_TaxonomyTrees_aggregateSamples.pdf aggregate taxonomy tree for exogenous genomes
exceRpt_exogenousGenomes_TaxonomyTrees_perSample.pdf per-sample taxonomy trees for exogenous genomes
exceRpt_exogenousGenomes_taxonomyCumulative_ReadCounts.txt descendant read counts for exogenous genomes
exceRpt_exogenousGenomes_taxonomyCumulative_ReadsPerMillion.txt descendant RPM counts for exogenous genomes
exceRpt_exogenousGenomes_taxonomySpecific_ReadCounts.txt direct read counts for exogenous genomes
exceRpt_exogenousGenomes_taxonomySpecific_ReadsPerMillion.txt direct RPM counts for exogenous genomes
exceRpt_exogenousRibosomal_TaxonomyTrees_aggregateSamples.pdf aggregate taxonomy tree for exogenous rRNAs
exceRpt_exogenousRibosomal_TaxonomyTrees_perSample.pdf per-sample taxonomy trees for exogenous rRNAs
exceRpt_exogenousRibosomal_taxonomyCumulative_ReadCounts.txt descendant read counts for exogenous rRNAs
exceRpt_exogenousRibosomal_taxonomyCumulative_ReadsPerMillion.txt descendant RPM counts for exogenous rRNAs
exceRpt_exogenousRibosomal_taxonomySpecific_ReadCounts.txt direct read counts for exogenous rRNAs
exceRpt_exogenousRibosomal_taxonomySpecific_ReadsPerMillion.txt direct RPM counts for exogenous rRNAs
R Objects
exceRpt_smallRNAQuants_ReadCounts.RData All raw data (binary R object)
exceRpt_smallRNAQuants_ReadsPerMillion.RData All normalized data (binary R object)
Misc.
exceRpt_adapterSequences.txt 3' adapter sequences associated with each sample
exceRpt_sampleGroupDefinitions.txt sample group associated with each sample (not used in Genboree)

Explanation of Exogenous Output

The taxonomy specific read counts file contains the number of reads that map directly to each node in the sample's taxonomy tree.
The taxonomy cumulative read counts file contains the number of reads that map to the descendants of each node in the sample's taxonomy tree.
Importantly, the reads counted for a given node in the taxonomy specific read counts file are NOT counted as part of the cumulative counts in the other file
(all of the node's descendants are counted, but not the node itself).

A couple of examples may be useful here. Say we have this line in the cumulative file: And then this line in the specific file:

This means that no reads mapped directly to the Viruses node, but a considerable number of reads mapped to descendants of Viruses.

Similarly, say we have this line in the cumulative file: And then this line in the specific file:

This means that a considerable number of reads mapped both directly to the cellular organisms node as well as descendants of that node.

In other words, if you want to get a full count of all of the reads that aligned to a given node and its descendants,
you need to ADD TOGETHER the numbers in both the taxonomy cumulative and taxonomy specific files for that node.

You can also find plots of the exogenous trees in the .pdf plot files.
First, the TaxonomyTrees_perSample.pdf file contains a plot for each sample.
The percentage within each node is the ratio of the node's summed cumulative + specific reads to the root node's summed cumulative + specific reads.
Second, the TaxonomyTrees_aggregateSamples.pdf file contains a single plot that condenses all samples into a single, averaged tree.
The percentage within each node is the ratio of that node's summed cumulative + specific reads averaged across all samples to the root node's summed cumulative + specific reads averaged across all samples.

Example of Post-processed Result Files

Below, we can see what the post-processing files look like on the Genboree Workbench (found in the Examples and Test Data Group):

Comparative Plots

We can see some examples below of the comparative plots available in the DiagnosticPlots.pdf generated by the post-processing tool.
Each plot contains data from four different samples (the example data).

Bioinformatics Tools Used in This Pipeline (4th Gen)

References and Attributions


This tool has been deployed in the context of the exRNA Communication Consortium (ERCC).
Please contact Emily LaPlante with questions or comments, or for help using it on your own data.


Overview

Generating miRNA-Protein Target Interaction Files for a Set of miRNA Identifiers Using Target Interaction Finder

Introduction

This tool takes a column of miRNA identifiers from an input TXT file and generates miRNA-protein target interaction files in both a tabular CSV format and a network XGMML format.
The target interactions are sourced from miRTarBase 1, an experimentally validated miRNA-target interaction database.
The CSV and XGMML files can be imported into downstream tools, such as Cytoscape 2, for network analysis and visualization.

From a column of identifiers... to a network of interactions

Instructional Video for Target Interaction Finder

Below, you can view an instruction video for using the Target Interaction Finder Tool on the Genboree Workbench:

Preliminary Steps for Setting Up Any Analysis in the Genboree Workbench

Create a Group for Your Analysis

FAQ

What is a Group? A "Group" contains Databases and Projects and controls access to all content within.
You control access to your Group(s), and who is a member of your group. You can also belong to multiple Groups (i.e. collaborators).

This step is optional. You can also use your default/existing group.

Create a Database for Your Analysis

FAQ

What is a Database? A "Database" contains Tracks, Lists, Sample Sets, Samples, and Files.
Each database can be associated with a reference genome.

This step is required.
Make sure that you pick the proper reference sequence genome (hg19, for example) when creating your database.
Note that we do not have an entry for hg38 or mm10 in our "Reference Sequence" list.
If your data is associated with either of these reference sequence genomes, follow the directions below.

Create a Database for hg38 or mm10

In order to create a database associated with hg38 or mm10, select the "User Will Upload" option for "Reference Sequence"
and provide appropriate values for the Species and Version text boxes as given below:

Your Genome of Interest Species Version
Human genome hg38 Homo sapiens hg38
Mouse genome mm10 Mus musculus mm10

Upload your Data File(s)

FAQ

What types of files can be uploaded? The Target Interaction Finder tool accepts one or more text files as input. Each text file should have a column of miRNA identifiers as its first column.
Your input text files can also be compressed, with each archive containing containing one or more input text files.
For example, you could submit one archive containing 100 different input text files, or even three different archives containing different numbers of input text files.

Step-by-step Instructions to Set Up Target Interaction Finder Submission

  1. Drag one or more text files (each with a first column consisting of miRNA identifiers) into the Input Data panel. These input file(s) can also be compressed, as mentioned above.
  2. Drag a Database to the Output Targets panel to store results.
  3. Select Visualization » Target Interaction Finder from the Toolset menu.
  4. Fill in the analysis name for your tool job. We recommend keeping a timestamp in your analysis name!
  5. Submit your job. Upon completion of your job, you will receive an email.
  6. Download the results of your analysis from your Database. The results data will end up under the targetInteractionFinder folder in the Files area of your output database.

Output Files Generated by Target Interaction Finder

After your job successfully completes, you will be able to download 3 different output files:

References and Attributions

  1. miRTarBase source database for experimentally validated miRNA-protein target interactions: Chou et al. miRTarBase 2016: updates to the experimentally validated miRNA-target interactions database. NAR, Database Issue, Vol 44(D1). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702890/
  2. Cytoscape tool for network visualization and analysis: Shannon, P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research, 13(11). http://genome.cshlp.org/content/13/11/2498.full
  3. Target Interaction Finder tool written by Anders Riutta and Alexander Pico at the UCSF Gladstone Institute, San Francisco, CA.
  4. Integrated into the Genboree Workbench by William Thistlethwaite and Sai Lakshmi Subramanian at the Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, TX.

This tool has been deployed in the context of the exRNA Communication Consortium (ERCC).
Please contact William Thistlethwaite with questions or comments, or for help using it on your own data.


Overview

Understanding Your Storage Options with exceRpt

We have recently implemented some storage quotas for using exceRpt.
In particular, these quotas apply when you select the "Upload Full Results" option.
Please read the information below to better understand the storage quotas
and how you can clear up space for your Genboree Group.

Basics on Storage Quotas

Each Genboree Group has two different storage quotas associated with it.
These storage quotas include local space and FTP-backed space.
Note that the quotas are associated with a Group and not a user!
Please do not create multiple Groups as a single user to bypass the quota restriction.
You should only use a single Group for your exceRpt output (unless discussed with us previously).

By default, a Genboree Group is allotted 100 GB of local space and 100 GB of FTP-backed space.
Local space is defined as any space outside of the virtual FTP areas in your Group.
You can learn more about creating virtual FTP areas on the Workbench here.

As mentioned above, these space requirements only apply if you launch an exceRpt job with
the "Upload Full Results" option enabled. This option will upload a large results archive for each
sample containing alignment .bam files (and other similar files).
You can learn more about the specific files present in the full results archive on the exRNA Data Analysis page.

The default storage quotas should be sufficient for most users, especially if you normally run your samples
through exceRpt without the "Upload Full Results" option.

What Files Contribute to Your Quotas?

Any files in your Genboree Group will contribute to either the local or FTP-backed quota.
For example, if I uploaded a large number of FASTQ files that I wanted to process, those would contribute to one of the quotas.
This is one reason why it is very important to compress your files on the Workbench.
You can learn more about this idea below.

When you submit your files for processing through exceRpt (with "uploadFullResults" enabled),
the following factors will be considered:

1) How much space you're currently taking up in your Group (discussed above)
2) An estimation of how much space your current submission will take up when fully uploaded
3) An estimation of how much space your other, currently-running submissions will take up when fully uploaded

In other words, if I launch a huge job with 200 samples, I will likely get an error message when I try to launch another job,
even if my first job hasn't finished yet.

If you do receive an error message, it will clearly indicate how much space is taken up by each of the enumerated factors above.

Recommendations for Meeting the Quotas

Because we only recently instituted these quotas, many Genboree Groups have far exceeded the numbers given above
and cannot launch exceRpt jobs until space is freed up.
We have a number of suggestions for meeting the quotas:

Deleting Files

If you have any files that you no longer need on the Workbench, you can delete those files to free up space.
You should use the Remove File(s) tool (found under Data -> Files) to do so.
Simply drag your files into the Input Data panel and then run the tool to delete them.
The tool also accepts folders as inputs, so if you want to delete an entire folder at once, you can do that also.
Remember that exceRpt is constantly being worked on and improved, so if you have some old results, you could
potentially delete those and re-run your samples through the newest version of exceRpt.

Compressing Files

Another way to make space is to compress your files if they are currently uncompressed.
For example, many users have uncompressed FASTQ files stored on Genboree.
These FASTQ files can be very large, but they get much smaller when they're compressed.

Furthermore, we recently added a restriction to exceRpt so that it no longer accepts uncompressed FASTQ files as input.
This means that if you want to use FASTQ files in exceRpt, you will need to compress them first.
You can compress the files on the Genboree Workbench using the Prepare Archive tool (found under Data -> Files).
Simply drag your uncompressed files into the Input Data panel and the Database where you want your archive created
into the Output Targets panel.

IMPORTANT: After you run the tool and receive an email informing you that the job was successful,
you must manually delete your old uncompressed files. We will implement an option in the near future
that will do this deletion automatically, but for now, you'll need to do it using the Remove File(s) tool (discussed above).

Moving Files to Different Storage Type

As mentioned above, we have two different quotas: one for local storage and one for FTP-backed storage.
Thus, if one quota is completely full, it might make sense to move some files to a different storage type to free up space.
You can accomplish this by using the "Copy / Move File" tool (found under Data -> Files).
In most cases, users will want to move their files from local storage to FTP-backed storage, so we'll explain how to do that below.

  1. Create a virtual FTP area if you haven't already. You can follow the directions here.
  2. Drag the files you want to move into the Input Data panel.
  3. Drag the folder associated with the virtual FTP area into the Output Targets panel.
  4. Use the Copy/Move File tool and select the MOVE option (we do not want to copy the files).

If you do have files in your FTP-backed area that you want to move to local storage, the method is very similar.
Drag the files you want to move into the Input Data panel, and then drag the Database where you want your files to be stored
into the Output Targets panel.

Requesting More Space for Your Group

Before requesting more space, you should try all of the methods given above.
If you still need more space, email Emily and let him know your Genboree Group and why you need more space.
For example, if your Group stores files associated with a collaborative effort between several different labs,
then it might make sense to increase that Group's storage quota.


Using Remote (FTP) Storage for exceRpt

We have recently implemented a new feature that allows users to deposit their result files onto our FTP server, as opposed to our cluster.

Below, we'll go step-by-step through the process of setting up your remote storage area on Genboree and then downloading your exceRpt result files.

Step 1: Creating Your Remote Storage Area

The first step in the process is creating a remote storage area in your Database of choice.
If you're unfamiliar with Genboree, you should first learn about Groups and then learn about creating Databases.
Note that your account comes with a default group named after your user login.

After you have created a Database, you should drag it from the Data Selector panel to the Output Targets panel.
Then, you can select the Create Remote Storage Area tool in the tool menu at the top of the screen:

When you click the Create Remote Storage Area button, a window like the following will appear:

There are two different settings you can change:

  1. Name of Remote Storage Area
  2. Remote Storage Type

You can name the remote storage area anything you want, so long as the name is unique among folders in the Files area of your Database.
This is because the remote storage area is represented as a folder in the Files area.

Under "Remote Storage Type", you can select the particular type of remote storage area that you want to create.
Currently, we only support the Genboree FTP server - thus, you can go ahead and keep the default option of "Genboree Virtual FTP".

After you click "Submit", you will receive notification that your remote storage area was successfully created (or an informative error message if something went wrong).
Your remote storage area will now be available in the Files area under the Database you chose:

You can use this tool to create as many remote storage areas as you like, so long as they all have different names.

Step 2: Submitting Your Samples to exceRpt

Now that you've created your remote storage area, the next step is to submit your files for processing through exceRpt.
This tutorial will provide guidance on submitting your samples.
In particular, you will need to select your newly created remote stage area in the Remote Storage Area menu under Advanced Options.
The default option for this menu is None Selected - you should choose the remote storage area where you want to store your exceRpt result files.

After you click "Submit", your files will be processed through exceRpt.
If any issue arises during processing, you will receive an email notifying you about the issue.
If you have additional questions about your submission, you can always email "Emily for help.

Step 3: Accessing Your Results

After we've finished processing your samples, you'll be able to access your results by both the Genboree Workbench and your FTP client.

Accessing Your Results via the Genboree Workbench

If you want to use the Genboree Workbench, accessing your samples is just like any other exceRpt job, with one small difference.
With normal exceRpt submissions, the base directory for your exceRpt pipeline runs can be found in the Files area.
However, if you use a remote storage area, the base directory for your exceRpt pipeline runs will be located in that remote storage folder.
You can see an example below:

Accessing Your Results via Your FTP Client

Before you can access your results, you will need an account on our FTP server.
You should email Emily and ask her to create a personal FTP account for you so that you can log onto our FTP server.
Please note that creating a Genboree account does not create your FTP account as well.
You will always need to contact Emily in order to create your FTP account.

Please include the following information in your email to Emily:
  1. Your Group name. Ex: william_thistle2_group
  2. The PI name. This is important if you will be doing data submissions. Ex: Dr. Milosavljevic
  3. The name of the remote storage area you created following the above instructions so we know which folder to give FTP access. Ex: virtualFTP-Genboree
  4. Your Genboree username. Ex: william_thistle

Note: If you would like multiple people to have download permissions for this FTP folder you can include the usernames of the other people here

After we have confirmed that your FTP account is created, you can log into our FTP server at ftp.genboree.org with your Genboree username and password.
When you log in, you should see a directory named genboree:genboree.org. You should then follow a series of nested directories that include your
Group name, Database name, and remote storage area name. This final directory is where you will be able to find your exceRpt result files.
If I wanted to navigate to the result files given in the example above, I would go to the following path:

Please note that you must send an email to us for each new Database you want to use for remote storage.
For example, say you create a remote storage area in Database A and then ask us to expose that area to your FTP username.
We will do so, and then you will be able to see that area via your FTP client.
If you create more remote storage areas in the same Database (Database A), then you will also be able to see those areas without emailing us again.
However, if you create a new Database on Genboree (Database B) and then create a remote storage area in that Database,
you will need to email us again with the name of the new Database so we can expose it to your FTP username.


exceRpt Small RNA-seq Data Analysis Pipeline - Version Updates

4th Generation

v4.6.3 (Version available on GitHub)

v4.6.2 (Version associated with exRNA Atlas and currently available on Genboree Workbench)

v4.4.1 through v4.6.1

v4.3.1 through v4.4.0

v4.2.0 through v4.3.0

v4.1.0 through v4.1.9

v4.0.0 through v.4.0.9

3rd Generation

v3.4.1 (not installed on Genboree)

v3.4.0 (not installed on Genboree)

v3.3.0 (Version associated with exRNA Atlas v3 Snapshot)

v3.2.6

v3.1.9

v3.1.5

v3.1.1

2nd Generation (Discontinued)

v2.2.8

v2.2.6

v2.2.2

v1.3.3

v1.0


Overview

Introduction to the Genboree Workbench

Genboree Workbench Home Page

Video Tutorial - Introduction to the Genboree Workbench

FAQs Related to Genboree Workbench Basics

Bioinformatics Tools for Analysis of exRNA Atlas Data

Running Analyses and Viewing Analysis Results Using the exRNA Atlas

Tutorial

Bioinformatics Tools for Analysis of Your Own exRNA Data in the Genboree Workbench

Small RNA-Seq Data Analysis for exRNA Profiling Using exceRpt Small RNA-seq Pipeline

Tutorial

Understanding exceRpt Results

Tool Version Updates

Long RNA-Seq Data Analysis Using RSEQtools

View Screencast (no audio)

Tutorial

Detecting Circular and Linear Isoforms from RNA-seq Data Using KNIFE

Tutorial

Performing Differential Expression Analysis (Fold Change Calculation) Using DESeq2

Tutorial

Performing Pathway and Interaction Analysis Using the Genboree Workbench

Target Interaction Finder

This tool generates miRNA-protein target interaction files for a set of miRNA identifiers,
which can be imported into downstream tools, such as Cytoscape, for network analysis and visualization.

View Screencast below (no audio):

Tutorial

Pathway Finder

This tool performs a search for pathways either containing miRNAs of interest
or protein targets of those miRNAs.

View Screencast below (no audio):

Tutorial

Demos at ExRNA Communication Consortium (ERCC) Meetings