Genboree Microbiome Toolset - Tutorial

This tutorial walks through the Genboree Microbiome Toolset using publicly available data.

Create Sample Meta Data

The first step in the Genboree Microbiome Workbench is to produce the sample meta data. The sample meta data records the attributes of each sample (e.g. health, body site, BMI) as well as the information required to extract the sequence data from the original SFF or SRA sequence file.

Requirements:
  • Tab-delimited
  • The first line of the file contains the column headers, as a comment-line. It must start with a '#'.
  • One of the fields MUST be 'name' which should be unique for all Sample records.
  • All records MUST have the same number of fields/columns.
  • Fields:
    • name - [Required] Unique name associated with the Sample.
    • barcode - [Required] The Sample-specific sequence used to barcode the sequences in multiplex sequencing. Will be used to identify which sequence records go with which Samples.
    • region - [optional] The name of the 16S region amplified. Defaults to V3V5 if no 'region', 'proximal', or 'distal' primer is included. The proximal and distal primer pair should amplify the region mentioned here.
    • proximal - [optional] The upstream primer used to amplify the microbial 16S rRNA region. If not provided, then a standard primer pair will be looked up based on the 'region' column. For example, if the user does not know the proximal primer, they can list V3V5 in the 'region' column and the stored primer used to amplify the V3V5 region is assumed; the upstream primer in that case is CCGTCAATTCMTTTRAGT.
    • distal - [optional] The downstream primer used to amplify the microbial 16S rRNA region. If not provided, then a standard primer pair will be looked up based on the 'region' column. For example, if the user does not know the distal primer, they can list V3V5 in the 'region' column and the stored primer used to amplify the V3V5 region is assumed; the downstream primer in that case is CTGCTGCCTCCCGTAGG.
  • Avoid spaces and any characters other than a-z, A-Z, 0-9, '-', and '_'.
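
The requirements above can be checked before upload. The following is a minimal Python sketch, not part of the Genboree toolset: the function name validate_metadata is hypothetical, and applying the character restriction to every field value (rather than only to names) is an assumption.

```python
import re

# Characters permitted by the requirements list: a-z, A-Z, 0-9, '-', '_'
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")

def validate_metadata(text):
    """Return a list of problems found in tab-delimited sample meta data."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#"):
        return ["first line must be a header that starts with '#'"]
    header = lines[0][1:].split("\t")
    problems = [f"missing required column '{col}'"
                for col in ("name", "barcode") if col not in header]
    if problems:
        return problems
    name_idx = header.index("name")
    seen = set()
    for n, line in enumerate(lines[1:], start=1):
        fields = line.split("\t")
        # Every record must have the same number of fields as the header
        if len(fields) != len(header):
            problems.append(f"record {n}: expected {len(header)} fields, got {len(fields)}")
            continue
        for value in fields:
            if not ALLOWED.fullmatch(value):
                problems.append(f"record {n}: illegal characters in '{value}'")
        # The 'name' field must be unique across all records
        if fields[name_idx] in seen:
            problems.append(f"record {n}: duplicate name '{fields[name_idx]}'")
        seen.add(fields[name_idx])
    return problems
```

An empty list means the file satisfied every check in this sketch; the importer itself may apply further checks that are not modeled here.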

Sample meta data
  • 10 samples
  • 2 body sites
    • Stool
    • Throat
  • 1 primer region
    • V3V5

#name barcode proximal distal region body_site
S_700033665 CCGTTCCTC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Stool
S_700035861 ACCGGCGTTC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Stool
S_700095543 ACGAATTAAC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Stool
S_700095850 AACCGGATAC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Stool
S_700101600 AACGGAACGC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Stool
T_700016994 AATAACCGTC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Throat
T_700095565 TTAATGGAAC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Throat
T_700095872 CGGACCGGAAC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Throat
T_700101388 CCGAACGAC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Throat
T_700101622 TTCGTTCTTC CCGTCAATTCMTTTRAGT CTGCTGCCTCCCGTAGG V3V5 Throat
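
Every row above lists the stored V3V5 primer pair explicitly, but the field descriptions say these columns may be omitted. The sketch below illustrates that fallback in Python. It is an illustration only: resolve_primers and STANDARD_PRIMERS are hypothetical names, and the lookup table contains just the V3V5 pair given in this tutorial.

```python
# Only the V3V5 pair appears in this tutorial; other regions are not listed here.
STANDARD_PRIMERS = {
    "V3V5": {"proximal": "CCGTCAATTCMTTTRAGT", "distal": "CTGCTGCCTCCCGTAGG"},
}

def resolve_primers(record):
    """Return a copy of one sample record (a dict) with primer defaults filled in."""
    resolved = dict(record)
    # With no region, proximal, or distal given, the region defaults to V3V5
    if not any(resolved.get(key) for key in ("region", "proximal", "distal")):
        resolved["region"] = "V3V5"
    defaults = STANDARD_PRIMERS.get(resolved.get("region"), {})
    # A primer supplied by the user is kept; a missing one is looked up by region
    for key in ("proximal", "distal"):
        if not resolved.get(key) and key in defaults:
            resolved[key] = defaults[key]
    return resolved
```

For example, resolve_primers({"name": "S_700033665", "barcode": "CCGTTCCTC"}) yields a record with region V3V5 and both primers from the table.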

Create Group

  • Login or create an account on http://www.genboree.org
  • Click the Groups tab
  • Click the Create tab
  • Enter a Name for the Group (e.g. GMT_Tutorial)
  • Optionally enter a description
  • Click the 'Create' button

Create Database

  • Click the Databases tab
  • Select your newly created Group GMT_Tutorial
  • Click the Create tab
  • Enter your Database Name (e.g. gmtDB)
  • Click the 'Create' button

Create Project

  • Click the Projects tab
  • Select your newly created Group GMT_Tutorial
  • Click the Create tab
  • Enter your New Project Name (e.g. gmtProject)
  • Click the 'Create' button

Upload Files

  • Click the Workbench tab
  • Within the Data Selector window expand the Groups -> GMT_Tutorial -> Databases -> gmtDB
  • Drag the gmtDB database into the Output Targets window
  • Click the Data tab, the Files tab, and then the Transfer File tab
  • Browse to the location of tutorial_meta_data.tsv
  • Click 'Submit'
  • Click the Data tab, the Files tab, and then the Transfer File tab
  • Browse to the location of tutorial_sequence_files.tar.gz
  • Check 'Unpack Multi-File Archive'
  • Click 'Submit'

View Uploaded Files

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files to see that your files have been uploaded and decompressed from the multi-file archive

Import Samples

  • Drag over the tutorial_meta_data.tsv file from the Data Selector window to the Input Data window
  • Drag over the gmtDB database from the Data Selector window to the Output Targets window
  • Click the Data tab, the Samples tab, and finally the Import Samples tab
  • Create a new sample set by entering "tutorial_sample_set" into the 'Assign Samples to new Sample Set' field
  • Click the 'Submit' button
  • Wait for confirmation email

Hello Tutorial IMT,

Your Samples Importer job has completed successfully.

JOB SUMMARY:
 JobID          : wbJob-samplesimporter-1312569590_101768
 File Name      : tutorial_meta_data.tsv

The following file(s) has been uploaded as samples:
 tutorial_meta_data.tsv

The Genboree Team

View Imported Samples

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Samples to see that your samples have been uploaded

Link Samples To Sequence Files

  • Remove any items from the Input Data window by selecting the items and clicking the red X
  • Remove any items from the Output Targets window by selecting the items and clicking the red X
  • Expand the Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files
    • Drag the tutorial_sequence_file.sff.gz file from the Data Selector window to the Input Data window
  • Expand the Groups -> GMT_Tutorial -> Databases -> gmtDB -> SampleSets
    • Drag the tutorial_sample_set from the Data Selector window to the Input Data window below the tutorial_sequence_file.sff.gz entry
      • Note: Each sequence file must be immediately followed by the sample, sample set, or sample folder that is to be linked to it. Multiple data sets can be linked at once, provided the order is always sequence file, then its sample data, then the next sequence file, then its sample data, and so on.

  • Click the Data tab, the Samples tab, and finally the Sample - File Linker tab
  • Verify that you have correctly ordered your SFF/SRA files followed by the appropriate Samples and click the 'Submit' button
  • Wait for the confirmation email

Hello Tutorial IMT,

Your Sample - File Linker job has completed successfully.

JOB SUMMARY:
 JobID          : wbJob-samplefilelinker-1312574026_369118

The following file(s) and samples(s) has been linked:
 tutorial_sequence_file.sff.gz(File) -> S_700033665(Sample)
 tutorial_sequence_file.sff.gz(File) -> S_700035861(Sample)
 tutorial_sequence_file.sff.gz(File) -> S_700095543(Sample)
 tutorial_sequence_file.sff.gz(File) -> S_700095850(Sample)
 tutorial_sequence_file.sff.gz(File) -> S_700101600(Sample)
 tutorial_sequence_file.sff.gz(File) -> T_700016994(Sample)
 tutorial_sequence_file.sff.gz(File) -> T_700095565(Sample)
 tutorial_sequence_file.sff.gz(File) -> T_700095872(Sample)
 tutorial_sequence_file.sff.gz(File) -> T_700101388(Sample)
 tutorial_sequence_file.sff.gz(File) -> T_700101622(Sample)

The Genboree Team

Import Sequences

  • Drag over the SampleSet tutorial_sample_set from the Data Selector window into the Input Data window
    • Note: You can drag over multiple samples, SampleSets, or Sample folders (that have been properly linked) into the Input Data window. This allows users to combine interesting data sets without having to import samples, link samples with files, etc. multiple times.
  • Drag over the gmtDB database from the Data Selector window to the Output Targets window

  • After you have your samples in the Input Data window and your database in the Output Targets window, proceed forward
  • Click the Analysis tab, followed by the Microbiome Workbench tab, followed by the Microbiome Sequence Import tab
  • Select your options for sequence import
    • At this time you can sub-select a set of sequences that you wish to import in the 'Select Samples' window. The default action is to select all samples
    • Set a custom 'Sample Set Name' or leave the default (which includes a time stamp)
    • Optionally choose to Trim At Distal Primer, Trim at N/n, or Remove sequences which contain an N, and set the minimum read length, the minimum average quality, and the minimum sequence count
  • Click 'Submit'
  • Wait for confirmation email

Hello Tutorial Imt

Your Microbiome Sequence Import job is complete successfully.

Job Summary:
  JobID                  : wbJob-seqimport-1312574238_616904
  Analysis Name          : Sequence-Import-2011-08-05-14:56:46

Settings:
  minAvgQuality           : 20
  minSeqCount             : 1000
  minSeqLength            : 200
  blastDistalPrimer       : true
  cutAtEnd                : true
  trimLowQualityRun       : false
  removeNSequences        : false

Result File Location in the Genboree Workbench:
  Group : GMT_Tutorial
  DataBase : gmtDB
  Path to File:
     Files
     * MicrobiomeData
        * Sequence-Import-2011-08-05-14:56:46

The Genboree Team
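
To make the settings reported in the email concrete, here is a rough Python sketch of per-read filtering with the same thresholds (minSeqLength 200, minAvgQuality 20, removeNSequences false). It does not reproduce the Genboree pipeline: the function names are hypothetical, the Phred+33 quality encoding is an assumption, and primer trimming and the minSeqCount threshold are not modeled.

```python
def mean_quality(quality_line, offset=33):
    """Average Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)

def passes_filters(sequence, quality_line,
                   min_seq_length=200, min_avg_quality=20,
                   remove_n_sequences=False):
    """Return True if a read meets the length, quality, and N-content thresholds."""
    if len(sequence) < min_seq_length:
        return False
    if mean_quality(quality_line) < min_avg_quality:
        return False
    # Optionally discard any read containing an ambiguous base (N or n)
    if remove_n_sequences and "n" in sequence.lower():
        return False
    return True
```

With these defaults, a 250-base read with an average quality of 40 is kept, while a 100-base read is dropped regardless of quality.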

View Imported Sequences

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeData -> Sequence-Import-2011-08-05-14:56:46 to see that your sequences have been imported
    • fastq
      • fastq files for each uploaded SFF/SRA file
      • fastq is a file format that represents the combination of the fasta and quality score files
    • sample.metadata
      • Sample meta data file representing all samples used for analysis (appended with sequence import parameters, flags, etc. that are used for the pipeline)
    • settings.json
      • Settings in json format for sequence import pipeline
    • fasta.result.tar.gz
      • fasta file for each uploaded SFF/SRA file
    • filtered_fasta.result.tar.gz
      • Final quality filtered fasta file for each sample
    • stats.result.tar.gz
      • Sequence metrics for each sample
    • jobFile.json
      • See settings.json
    • sequences_metrics_summary.xls
      • Sequence metrics broken down into individual samples, summary for all samples, and each meta data label.

sampleName   Average_read_length  total_sequence_counts_after_filter  body_site
S_700033665  505                  7008                                Stool
S_700101600  506                  6716                                Stool
T_700101622  515                  4658                                Throat
T_700016994  512                  6794                                Throat
S_700035861  511                  6819                                Stool
S_700095850  500                  5879                                Stool
T_700095872  516                  2543                                Throat
S_700095543  503                  6191                                Stool
T_700101388  510                  7527                                Throat
T_700095565  516                  6294                                Throat

Average Sequence Length  Total Sequences
508                      60429
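
The summary figures can be rechecked from the per-sample rows. The short Python sketch below recomputes the total and the per-body-site totals from the values shown above. The read-weighted mean length is included only as a plausibility check: how the pipeline computes or rounds its reported average is not stated, so that part is an assumption.

```python
from collections import defaultdict

# (sampleName, Average_read_length, total_sequence_counts_after_filter, body_site)
ROWS = [
    ("S_700033665", 505, 7008, "Stool"),
    ("S_700101600", 506, 6716, "Stool"),
    ("T_700101622", 515, 4658, "Throat"),
    ("T_700016994", 512, 6794, "Throat"),
    ("S_700035861", 511, 6819, "Stool"),
    ("S_700095850", 500, 5879, "Stool"),
    ("T_700095872", 516, 2543, "Throat"),
    ("S_700095543", 503, 6191, "Stool"),
    ("T_700101388", 510, 7527, "Throat"),
    ("T_700095565", 516, 6294, "Throat"),
]

# Total sequences across all samples
total_sequences = sum(count for _, _, count, _ in ROWS)

# Totals grouped by the body_site meta data label
by_site = defaultdict(int)
for _, _, count, site in ROWS:
    by_site[site] += count

# Mean read length weighted by each sample's sequence count
weighted_mean_length = sum(length * count for _, length, count, _ in ROWS) / total_sequences

print(total_sequences)   # 60429
print(dict(by_site))     # {'Stool': 32613, 'Throat': 27816}
```

The weighted mean comes to roughly 508.9, in line with the reported 508.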

RDP - Taxonomic Abundance Pipeline

  • Drag Sequence-Import-2011-08-05-14:56:46 from the Data Selector window to the Input Data window
  • Drag over the gmtDB into the Output Targets window
  • Drag over the gmtProject into the Output Targets window
    • This project is visible if you expand Groups -> GMT_Tutorial -> Projects -> gmtProject
  • Click the Analysis tab, followed by the Microbiome Workbench tab, followed by the RDP tab
  • You can optionally fill in a 'Study Name' to organize your individual runs. We will use 'Tutorial_Study' here.
  • Click 'Submit'
  • Wait for confirmation email

  Hello Tutorial Imt

   Your RDP job is complete successfully.

   Job Summary:

   JobID                  : wbJob-rdp-1312579221_728029
   Study Name             : GMT_Tutorial_Study
   Job Name               : RDP-Job-2011-08-05-16:19:52

   Settings:
   rdpVersion: 2.2
   rdpBootstrapCutoff: 0.8

   Result File Location in the Genboree Workbench:
   Group : GMT_Tutorial
   DataBase : gmtDB
   Path to File:
   Files
   * MicrobiomeWorkBench
     * GMT_Tutorial_Study
       *RDP
         *RDP-Job-2011-08-05-16:19:52

Plots URL (click or paste in browser to access file):
   Prj: gmtProject
   URL:
http://genboree.org/java-bin/project.jsp?projectName=gmtProject

   The Genboree Team

RDP Results

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> RDP -> RDP-Job-2011-08-05-16:19:52
  • Domain/Phyla/Class/Order/Family/Genus/Species.result.tar.gz
    • Individual samples separated into results based on separate taxonomic depth
  • counts.xlsx
    • Raw counts of each taxon at each taxonomic depth (per sample), weighted by the RDP bootstrap classification score (e.g. a classification at 85% counts as 0.85 of an occurrence, one at 100% counts as 1.00 of an occurrence, etc.)
  • normalized.xlsx
    • Normalized counts of each taxon at each taxonomic depth (per sample) that sum to approximately 1.00.
  • Heatmaps of each taxonomic depth are accessible via the Tutorial_Study project page
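
The weighting and normalization described for counts.xlsx and normalized.xlsx can be illustrated with a small Python sketch. It follows only the description above: the function names and the example taxa are hypothetical, and the role of the rdpBootstrapCutoff setting is not modeled.

```python
from collections import defaultdict

def weighted_counts(assignments):
    """Sum bootstrap scores per taxon; a read classified at 0.85 adds 0.85."""
    counts = defaultdict(float)
    for taxon, score in assignments:
        counts[taxon] += score
    return dict(counts)

def normalize(counts):
    """Scale the weighted counts of one sample so that they sum to 1.00."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: value / total for taxon, value in counts.items()}

# Hypothetical classifications for one sample at one taxonomic depth
reads = [("Bacteroides", 0.85), ("Bacteroides", 1.00), ("Prevotella", 0.90)]
raw = weighted_counts(reads)   # Bacteroides about 1.85, Prevotella 0.90
fractions = normalize(raw)     # values sum to approximately 1.00
```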

QIIME Pipeline - OTU Table, Phylogenetic Tree, and Beta Diversity

  • Drag Sequence-Import-2011-08-05-14:56:46 from the Data Selector window to the Input Data window
  • Drag over the gmtDB into the Output Targets window
  • Drag over the gmtProject into the Output Targets window
    • This project is visible if you expand Groups -> GMT_Tutorial -> Projects -> gmtProject
  • Click the Analysis tab, followed by the Microbiome Workbench tab, followed by the QIIME tab
  • You can optionally fill in a 'Study Name' to organize your individual runs. We will use 'Tutorial_Study' here.
  • You can optionally choose to remove chimeras with Chimera Slayer
  • Click 'Submit'
  • Wait for confirmation email

Hello Tutorial Imt

Your QIIME job is completed successfully.

Job Summary:
  JobID                  : wbJob-qiime-1312579361_942462
  Study Name             : GMT_Tutorial_Study
  Job Name               : Qiime-Job-2011-08-05-16:22:08

Result File Location in the Genboree Workbench:
  Group : GMT_Tutorial
  DataBase : gmtDB
  Path to File:
     Files
     * MicrobiomeWorkBench
        * GMT_Tutorial_Study
           *QIIME
              *Qiime-Job-2011-08-05-16:22:08

Plots URL (click or paste in browser to access file):
   Prj: gmtProject
   URL:
http://genboree.org/java-bin/project.jsp?projectName=gmtProject

The Genboree Team

QIIME Results

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> QIIME -> Qiime-Job-2011-08-05-16:22:08
    • mapping.txt
      • QIIME sample meta data mapping file
    • raw.results.tar.gz
      • Full compressed results from the pipeline
    • sample.metadata
    • settings.json
    • plots.result.tar.gz
      • 2D and 3D plots
    • fasta.result.tar.gz
      • Aligned representative sequence files
    • taxonomy.result.tar.gz
      • OTU tables separated by taxonomic depth
    • otu.table
    • phylogenetic.result.tar.gz
      • Representative sequence files: aligned, datafile, tree file, itol tree file, and tree file parsed
    • jobFile.json
  • 2D and 3D plots can be viewed at the project page

Alpha Diversity

  • Drag Qiime-Job-2011-08-05-16:22:08 into the Input Data window
    • Accessible via Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> QIIME -> Qiime-Job-2011-08-05-16:22:08
  • Drag over the gmtDB into the Output Targets window
  • Drag over the gmtProject into the Output Targets window
  • Click the Analysis tab, followed by the Microbiome Workbench tab, followed by the Alpha Diversity tab
  • Optionally fill in a 'Study Name', here we'll use 'Tutorial_Study'
  • Select one or more feature lists, which are taken from the user-provided sample meta data
  • Optionally remove singletons
    • Singletons are entries in the OTU table that occur only once across all samples. They can falsely raise diversity and are known to affect alpha diversity curves.
  • Click 'Submit'
  • Wait for confirmation email

Hello Tutorial Imt

Your Alpha Diversity job is complete successfully.

Job Summary:
  JobID                  : wbJob-alphadiversity-1312812652_756847
  Study Name             : GMT_Tutorial_Study
  Job Name               : AD-Job-2011-08-08-09_09_58

Result File Location in the Genboree Workbench:
  Group : GMT_Tutorial
  DataBase : gmtDB
  Path to File:
     Files
     * MicrobiomeData
        * GMT_Tutorial_Study
           *AlphaDiversity
              *AD-Job-2011-08-08-09:09:58

Plots URL (click or paste in browser to access file):
   Prj: gmtProject
   URL:
http://genboree.org/java-bin/project.jsp?projectName=gmtProject

The Genboree Team
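
The optional singleton removal in the steps above can be pictured with the following Python sketch. It applies the definition given there (an OTU observed exactly once across all samples) to a small hypothetical OTU table. The function name and the table layout are assumptions and do not reflect the pipeline's internal format.

```python
def remove_singletons(otu_table):
    """Drop OTUs whose total count across all samples is exactly one.

    otu_table maps an OTU id to a dict of {sample name: count}.
    """
    return {
        otu: counts
        for otu, counts in otu_table.items()
        if sum(counts.values()) != 1
    }

# Hypothetical table: OTU_2 occurs once in total, so it is a singleton
table = {
    "OTU_1": {"S_700033665": 12, "T_700016994": 3},
    "OTU_2": {"S_700033665": 0, "T_700016994": 1},
    "OTU_3": {"S_700033665": 1, "T_700016994": 1},
}
filtered = remove_singletons(table)   # keeps OTU_1 and OTU_3
```

OTU_3 is kept because it occurs twice in total, even though each sample contains it only once.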

Alpha Diversity Results

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> AlphaDiversity -> AD-Job-2011-08-08-09:09:58
    • rankAbundancePlots.result.tar.gz
      • Rank abundance plots for all meta data features selected
    • renyiProfilePlots.result.tar.gz
      • Renyi profile plots for all meta data features selected
    • sample.mapping.txt
    • settings.json
    • raw.result.tar.gz
      • Full output data set including R scripts used to generate plots
    • richnessPlots.result.tar.gz
      • Richness plots for all meta data features selected
    • jobFile.json
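
For readers unfamiliar with the plot types listed above, the Python sketch below shows the standard definitions behind a rank abundance curve and a Renyi diversity profile. It is independent of the R scripts shipped in raw.result.tar.gz, which may differ in detail, and the function names are hypothetical.

```python
import math

def rank_abundance(counts):
    """Relative abundances of the observed OTUs, sorted from most to least abundant."""
    total = sum(counts)
    return sorted((c / total for c in counts if c > 0), reverse=True)

def renyi_entropy(counts, alpha):
    """Renyi entropy H_alpha = log(sum(p_i ** alpha)) / (1 - alpha)."""
    p = rank_abundance(counts)
    if alpha == 1:
        # The limit alpha -> 1 is the Shannon entropy
        return -sum(x * math.log(x) for x in p)
    if math.isinf(alpha):
        # The limit alpha -> infinity depends only on the most abundant OTU
        return -math.log(p[0])
    return math.log(sum(x ** alpha for x in p)) / (1 - alpha)

# A Renyi profile evaluates the entropy over a range of alpha values
otu_counts = [8, 1, 1]
profile = [renyi_entropy(otu_counts, a) for a in (0, 1, 2, math.inf)]
```

At alpha = 0 the value is the log of the number of observed OTUs (richness), and the profile is non-increasing as alpha grows, so an uneven community yields a steeper curve than an even one.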

Machine Learning

  • Drag Qiime-Job-2011-08-05-16:22:08 into the Input Data window
    • Accessible via Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> QIIME -> Qiime-Job-2011-08-05-16:22:08
  • Drag over the gmtDB into the Output Targets window
  • Click the Analysis tab, followed by the Microbiome Workbench tab, followed by the Machine Learning tab
  • Optionally fill in a 'Study Name', here we'll use 'Tutorial_Study'
  • Select one or more feature lists, which are taken from the user-provided sample meta data
  • Click 'Submit'
  • Wait for confirmation email

Hello Tutorial Imt

Your Machine Learning job is complete successfully.

Job Summary:
  JobID                  : wbJob-machinelearning-1312812804_274522
  Study Name             : GMT_Tutorial_Study
  Job Name               : ML-Job-2011-08-08-09_13_04

Result File Location in the Genboree Workbench:
  Group : GMT_Tutorial
  DataBase : gmtDB
  Path to File:
     Files
     * MicrobiomeData
        * GMT_Tutorial_Study
           *MachineLearning
              *ML-Job-2011-08-08-09:13:04

Plots URL (click or paste in browser to access file):
   Prj: gmtProject
   URL:
http://genboree.org/java-bin/project.jsp?projectName=gmtProject

The Genboree Team

Machine Learning Results

  • Click the Refresh button in the Data Selector window
  • Expand Groups -> GMT_Tutorial -> Databases -> gmtDB -> Files -> MicrobiomeWorkBench -> Tutorial_Study -> MachineLearning -> ML-Job-2011-08-08-09_13_04
    • jobFile.json
    • sample.mapping.txt
    • settings.json
    • otu_abundance_cutoff_(5/25/100/500).result.tar.gz
      • (5/25/100/500)_bag.txt
        • randomForest classification result
      • (5/25/100/500)_sortedImportance.txt
        • randomForest importance sorted by 'MeanDecreaseGini'
    • raw.result.tar.gz
      • Full results from machine learning pipeline
      • Summary reports exist within raw.result -> RF_Boruta -> body_site -> RandomForest -> (5/25/100/500)_sortedImportanceforcombine.gini_trends_3sorted
      • Alternatively, the summary sheet RF_Summary.xls summarizes the OOB (out-of-bag) error estimate

            5    25   100  500
body_site   0.0  0.0  0.0  0.0