Tab Separated Value Formats

GenboreeKB documents can be uploaded or downloaded in a tab separated value format.
There are two variants of the tab-delimited format, which differ mainly in how nested property names are represented.

The tabbed formats carry the same information as the JSON formats; it is just represented differently.

Both models and documents can be represented in the tab separated value format.

Format Definition for Data Documents

  • One line per property.
  • Property name column - Column name should be #property.
  • Additional columns for one or more property-definition fields supported by the modeling schema.
  • One or more columns named value, containing the data values for each property.
    • Each value column corresponds to a single data document in GenboreeKB.

Format Definition for Data Models

  • One line per property.
  • Property name column - Column name should be #name.
  • Additional columns for each property-definition field supported by the modeling schema.
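
As a rough illustration of the two definitions above, the first header column is enough to tell a data-document file from a data-model file. The sketch below is not part of GenboreeKB; the function name describe_header and the shape of the returned dictionary are hypothetical, and it assumes the header columns are separated by single tab characters.

```python
def describe_header(header_line):
    """Classify a header line as a data-document or data-model header.

    Hypothetical helper: assumes columns are separated by single tabs.
    """
    columns = header_line.rstrip("\r\n").split("\t")
    first = columns[0]
    if first == "#property":
        # Data document: count the value columns (one per document).
        return {
            "kind": "document",
            "value_columns": columns.count("value"),
            "other_columns": [c for c in columns[1:] if c != "value"],
        }
    if first == "#name":
        # Data model: every remaining column is a property-definition field.
        return {"kind": "model", "value_columns": 0, "other_columns": columns[1:]}
    raise ValueError(f"unrecognized header column: {first!r}")
```

For example, a header of #property, value, value, domain would be reported as a document header with two value columns.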

Nesting Property Path Format

NOTE: Currently, this is the preferred and more comprehensively supported tabbed format.

Data Document - Nesting Property Path

Two Columns

Description

  1. The property column will contain all property paths in the document, using the "nesting-prefix" followed by a space and then the property name.
    • The characters - and * are used to indicate property depth/nesting within the document.
      • A - in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
      • A * in the nesting prefix indicates that this property can contain a sub-items list.
    • There is one nesting character per depth or nesting-level.
  2. The value column will contain associated values for each property in the document.

Example

You can download the example given below here.

#property value
Run exRNARun0000001
- Experiment Type
-- Directionality non-strand-specific
-- Run Type Single end
- Sequencing Instrument Illumina2000
* Related Documents
*- Related Document exRNASample0000001
*-- Type Sample
*-- DocURL coll/Samples/doc/exRNASample0000001
*- Related Document exRNAStudy0000001
*-- Type Study
*-- DocURL coll/Studies/doc/exRNAStudy0000001
- Experimental Design barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG
* Raw data files 1
*- Raw data file SRR822433.fastq.gz
*-- MD5 Checksum e2a01d56815f13c0d6c723a211a738e01cd34031
*-- Type FASTQ
- Maximum Read length 51
* Aliases
*- Alias SRR822433
*-- URL http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433
*-- dbName DDBJ
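
The nesting prefix in an example like the one above can be decoded mechanically. The following is a minimal sketch, assuming the property and value columns are separated by a tab and that the root property name does not itself begin with - or *. The names parse_property_cell and parse_rows are made up for illustration and are not a GenboreeKB API.

```python
import re

# One or more nesting characters, a single space, then the property name.
_PREFIX_RE = re.compile(r"^([-*]+) (.*)$")


def parse_property_cell(cell):
    """Return (depth, is_item_list, name) for one property-column cell."""
    match = _PREFIX_RE.match(cell)
    if match is None:
        # No prefix: this is the root property at depth 0.
        return 0, False, cell
    prefix, name = match.groups()
    # One nesting character per level; the last character describes
    # this property itself ("*" means it can contain a sub-items list).
    return len(prefix), prefix.endswith("*"), name


def parse_rows(text):
    """Parse a two-column document into (depth, is_item_list, name, value) rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the "#property value" header
        prop, _, value = line.partition("\t")
        depth, is_list, name = parse_property_cell(prop)
        rows.append((depth, is_list, name, value))
    return rows
```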

More than one document

This representation can provide more than one document, if appropriate.
  • Simply repeat the #property value header line to begin the next document.
    • i.e. the documents appear one after the other, separated by the #property value header.
  • But also consider the multi-column version of this format, as it may be more convenient.
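
A reader of such a file could split it back into individual documents by watching for the repeated header. This is only a sketch under the assumption that every document starts with a line beginning with #property; split_documents is a hypothetical name.

```python
def split_documents(text, header_prefix="#property"):
    """Split a multi-document file into one list of data lines per document."""
    documents = []
    current = None
    for line in text.splitlines():
        if line.startswith(header_prefix):
            # A header line starts the next document.
            current = []
            documents.append(current)
        elif current is not None and line.strip():
            current.append(line)
    return documents
```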

Multi-column

This format can have more than one value column. Each value column holds the contents of a different document.

  1. The property column will contain all property paths in the document, using the "nesting-prefix" followed by a space and then the property name.
    • The characters - and * are used to indicate property depth/nesting within the document.
      • A - in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
      • A * in the nesting prefix indicates that this property can contain a sub-items list.
    • There is one nesting character per depth or nesting-level.
  2. There can be multiple value columns - each will correspond to one data document.
  3. We also require a domain column, which will indicate the domain of each property. This column can be copied from the data model.
  4. Your property column will contain all properties that occur in any document in your tab separated file. If some documents contain properties that other documents do not, please adhere to the following rules to make sure that your documents are processed correctly:
    • If a given document (corresponding to a value column) is missing a property and that property's domain allows for a blank value, then you must put #MISSING# as the associated value for that property.
      • Domains that qualify: "string", "[valueless]", "regexp()", "url", "fileUrl", "autoID"
      • It is only necessary to do this for parent properties. For example, if the parent property "Biosample" has a value of #MISSING#, you do not need to put #MISSING# for "Biosample.Tissue Type", even if that property's domain appears in the list above.
    • If a given document (corresponding to a value column) is missing a property and that property does not allow for a blank value, then you can just leave the associated value blank.
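
When generating a multi-column file, the rules above reduce to a small decision per cell. The sketch below is an illustration only: missing_cell and allows_blank are hypothetical names, and matching a domain by the text before its first "(" (so that a parameterized regexp domain is recognized) is an assumption, not documented GenboreeKB behavior.

```python
# Domains listed above as allowing a blank value (base names only).
BLANK_ALLOWED = {"string", "[valueless]", "regexp", "url", "fileUrl", "autoID"}
MISSING = "#MISSING#"


def allows_blank(domain):
    """True if the domain's base name is in the blank-allowed list."""
    base = domain.split("(", 1)[0].strip()
    return base in BLANK_ALLOWED


def missing_cell(domain, parent_missing=False):
    """Cell text to write for a property that a document does not have."""
    if parent_missing:
        # The parent is already marked #MISSING#, so nothing is needed here.
        return ""
    if allows_blank(domain):
        # A blank would be ambiguous for this domain, so mark it explicitly.
        return MISSING
    # Blank is not a legal value for this domain; leave the cell empty.
    return ""
```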

Simple

First, we will look at a simple example where two documents contain the exact same properties but have different values for some of those properties. You can download this example here.

#property value value domain
Run EXR-TEST00-RU EXR-TEST01-RU string
Run.Experiment Details string
Run.Experiment Type.Directionality non-strand-specific non-strand-specific enum(Strand-specific, Non-strand-specific)
Run.Experiment Type.Run Type Single end Single end enum(Single-end, Paired-End)
Run.Sequencing Instrument Illumina2000 Illumina2000 bioportalTerm(http://data.bioontology.org/search?ontology=EFO&subtree_root=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000001)
Run.Related Documents [valueless]
Run.Related Documents.Related Document exRNASample0000001 exRNASample0000002 string
Run.Related Documents.Related Document.Type Sample Sample string
Run.Related Documents.Related Document.DocURL coll/Samples/doc/exRNASample0000001 coll/Samples/doc/exRNASample0000002 url
Run.Related Documents.Related Document exRNAStudy0000001 exRNAStudy0000002 string
Run.Related Documents.Related Document.Type Study Study string
Run.Related Documents.Related Document.DocURL coll/Studies/doc/exRNAStudy0000001 coll/Studies/doc/exRNAStudy0000002 url
Run.Experimental Design barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG string
Run.Raw data files 1 1 posInt
Run.Raw data files.Raw data file SRR822433.fastq.gz SRR822434.fastq.gz string
Run.Raw data files.Raw data file.MD5 Checksum e2a01d56815f13c0d6c723a211a738e01cd34031 a1a23a56815f13c0e6d723a211a738e01fd35022 string
Run.Raw data files.Raw data file.Type FASTQ FASTQ string
Run.Maximum Read length 51 54 posInt
Run.Aliases [valueless]
Run.Aliases.Alias SRR822433 SRR822434 string
Run.Aliases.Alias.URL http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822434 url
Run.Aliases.Alias.dbName DDBJ DDBJ enum(SRA, GEO, DDBJ, ENCODE, dbGaP)

Since these documents are very similar, compiling them into one multi-column document is very efficient.

Complex

Now, imagine that one document, EXR-TEST01-RU, is missing the "Run.Aliases" property. You can download this example here.

#property value value domain
Run EXR-TEST00-RU EXR-TEST01-RU string
Run.Experiment Details string
Run.Experiment Type.Directionality non-strand-specific non-strand-specific enum(Strand-specific, Non-strand-specific)
Run.Experiment Type.Run Type Single end Single end enum(Single-end, Paired-End)
Run.Sequencing Instrument Illumina2000 Illumina2000 bioportalTerm(http://data.bioontology.org/search?ontology=EFO&subtree_root=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000001)
Run.Related Documents [valueless]
Run.Related Documents.Related Document exRNASample0000001 exRNASample0000002 string
Run.Related Documents.Related Document.Type Sample Sample string
Run.Related Documents.Related Document.DocURL coll/Samples/doc/exRNASample0000001 coll/Samples/doc/exRNASample0000002 url
Run.Related Documents.Related Document exRNAStudy0000001 exRNAStudy0000002 string
Run.Related Documents.Related Document.Type Study Study string
Run.Related Documents.Related Document.DocURL coll/Studies/doc/exRNAStudy0000001 coll/Studies/doc/exRNAStudy0000002 url
Run.Experimental Design barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG string
Run.Raw data files 1 1 posInt
Run.Raw data files.Raw data file SRR822433.fastq.gz SRR822434.fastq.gz string
Run.Raw data files.Raw data file.MD5 Checksum e2a01d56815f13c0d6c723a211a738e01cd34031 a1a23a56815f13c0e6d723a211a738e01fd35022 string
Run.Raw data files.Raw data file.Type FASTQ FASTQ string
Run.Maximum Read length 51 54 posInt
Run.Aliases #MISSING# [valueless]
Run.Aliases.Alias SRR822433 string
Run.Aliases.Alias.URL http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 url
Run.Aliases.Alias.dbName DDBJ enum(SRA, GEO, DDBJ, ENCODE, dbGaP)

We can see that we put #MISSING# in for the value of "Run.Aliases" in our second value column. Furthermore, note that we did not put #MISSING# for the value of "Run.Aliases.Alias" or "Run.Aliases.Alias.URL" for that document, even though those properties have domains ("string" and "url") that allow blank values. This is because the parent property, "Run.Aliases", has already been declared as #MISSING#.
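
To make the parent rule concrete, here is a hedged sketch of how a reader might extract one document from a multi-column table, dropping a #MISSING# property together with everything beneath it. It assumes dot-delimited paths as in the example above and that rows have already been split into a path plus a list of per-document cells; rows_for_document is a hypothetical helper, not a GenboreeKB function.

```python
MISSING = "#MISSING#"


def rows_for_document(rows, column):
    """Return the (path, value) pairs that belong to one value column.

    rows   -- list of (path, values), where values holds one cell per document
    column -- zero-based index of the value column to extract
    """
    kept = []
    missing_parent = None
    for path, values in rows:
        if missing_parent is not None and path.startswith(missing_parent + "."):
            continue  # descendant of a property already declared missing
        missing_parent = None
        if values[column] == MISSING:
            missing_parent = path  # skip this property and its descendants
            continue
        kept.append((path, values[column]))
    return kept
```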


Data Model - Nesting Property Path

Multi-column

  1. The property column will contain all property paths in the model, using the "nesting-prefix" followed by a space and then the property name.
    • The characters - and * are used to indicate property depth/nesting within the model.
      • A - in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
      • A * in the nesting prefix indicates that this property can contain a sub-items list.
    • There is one nesting character per depth or nesting-level.
  2. In addition, the model will contain columns that define specific attributes for each property (domain, required, description, category, etc.).

You can download the example given below here.

#name domain default identifier required unique units category fixed index description
Run string true true Document describing information about the sequencing run including raw data files
- Experimental Design string true Description of experimental design
- Sequencing Instrument enum(Illumina2500, Illumina2000, IonTorrent, IonProton, MiSeq, Solid, 454, other) true Name, model of the sequencing instrument
-- Other string If 'other', can provide name of instrument here.
- Experiment Type string true true Category -- Experiment Type
-- Directionality enum(Strand-specific, non-strand-specific) true Strand specificity of the run
-- Run Type enum(Single end, Paired End) true Type of run, single end or paired end
- Maximum Read length int true Length of reads in base pairs or nt
* Raw data files int 0 true Raw data files - Items
*- Raw data file string true Name of file
*-- Type enum(FASTA, FASTQ, SFF) true File type -- FASTA, FASTQ, SFF
*-- MD5 Checksum string true MD5 checksum value of file
* Related Documents string true true Category -- Related documents -- Run is related to Sample, Study and Experiment
*- Related Document string Name or ID of related document
*-- Type string Type of related document
*-- DocURL url Relative ID of doc, provide Document URL
* Aliases string true true Aliases - Items
*- Alias string true Alias of this run in other databases, say SRA, GEO
*-- dbName string Database name of alias -- SRA, GEO
*-- URL url URL that points to this run in alias db
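
A model file like the one above can be read with Python's standard csv module. This is a minimal sketch, assuming the real file is tab-delimited with one header line; read_model is a hypothetical name. Quoting is disabled so that apostrophes and quotation marks in descriptions are kept literally, and short rows are padded so every property gets every column.

```python
import csv
import io


def read_model(text):
    """Read a tab-separated model into a list of {column: cell} dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Drop the leading "#" so the first column is simply "name".
    fields = [header[0].lstrip("#")] + header[1:]
    properties = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Pad rows whose trailing empty cells were trimmed.
        row = row + [""] * (len(fields) - len(row))
        properties.append(dict(zip(fields, row)))
    return properties
```

The name cell still carries its nesting prefix (for example "*- Raw data file"), so depth and item-list information can be recovered from it afterwards.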


Full Property Path Format

NOTE: This format is currently not fully supported. Please consider the Nesting Property Path Format instead.

Data Document - Full Property Path Format

Two Columns

  1. The property column will contain all dot-delimited property paths in the document.
  2. The value column will contain associated values for each property in the document.

You can download the example given below here.

#property value
Run exRNARun0000001
Run.Experiment Type
Run.Experiment Type.Directionality non-strand-specific
Run.Experiment Type.Run Type Single end
Run.Sequencing Instrument Illumina2000
Run.Related Documents
Run.Related Documents.Related Document exRNASample0000001
Run.Related Documents.Related Document.Type Sample
Run.Related Documents.Related Document.DocURL coll/Samples/doc/exRNASample0000001
Run.Related Documents.Related Document exRNAStudy0000001
Run.Related Documents.Related Document.Type Study
Run.Related Documents.Related Document.DocURL coll/Studies/doc/exRNAStudy0000001
Run.Experimental Design barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG
Run.Raw data files 1
Run.Raw data files.Raw data file SRR822433.fastq.gz
Run.Raw data files.Raw data file.MD5 Checksum e2a01d56815f13c0d6c723a211a738e01cd34031
Run.Raw data files.Raw data file.Type FASTQ
Run.Maximum Read length 51
Run.Aliases
Run.Aliases.Alias SRR822433
Run.Aliases.Alias.URL http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433
Run.Aliases.Alias.dbName DDBJ
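
The two document formats carry the same paths, so dot-delimited paths like those above can be derived from nesting-prefix rows by keeping a stack of ancestor names. A minimal sketch, assuming each row has already been reduced to a depth, a name, and a value, and that property names contain no dots; to_full_paths is a hypothetical helper.

```python
def to_full_paths(rows):
    """Convert (depth, name, value) rows into (full_path, value) rows."""
    stack = []
    result = []
    for depth, name, value in rows:
        # Keep only the ancestors above this depth, then add this property.
        del stack[depth:]
        stack.append(name)
        result.append((".".join(stack), value))
    return result
```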

More than one document

This representation can provide more than one document, if appropriate.
  • Simply repeat the #property value header line to begin the next document.
    • i.e. the documents appear one after the other, separated by the #property value header.

Multi-column

Multi-column support is not yet incorporated for the full property path format. Coming soon!


Data Model - Full Property Path Format

Multi-column

  1. The property column will contain all dot-delimited property paths in the model.
  2. In addition, the model will contain columns that define specific attributes for each property (domain, required, description, category, etc.).
  3. Note the special isItemList column, which indicates whether the property contains a list of sub-items rather than sub-properties.
    • This column is needed in neither the format=json nor the format=tabbed_prop_nesting format.

You can download the example given below here.

#name domain default identifier required unique units category fixed index description isItemList
Run string true true Document describing information about the sequencing run including raw data files false
Run.Experimental Design string true Description of experimental design false
Run.Sequencing Instrument enum(Illumina2500, Illumina2000, IonTorrent, IonProton, MiSeq, Solid, 454, other) true Name, model of the sequencing instrument false
Run.Sequencing Instrument.Other string If 'other', can provide name of instrument here. false
Run.Experiment Type string true true Category -- Experiment Type false
Run.Experiment Type.Directionality enum(Strand-specific, non-strand-specific) true Strand specificity of the run false
Run.Experiment Type.Run Type enum(Single end, Paired End) true Type of run, single end or paired end false
Run.Maximum Read length int true Length of reads in base pairs or nt false
Run.Raw data files int 0 true Raw data files - Items true
Run.Raw data files.Raw data file string true Name of file false
Run.Raw data files.Raw data file.Type enum(FASTA, FASTQ, SFF) true File type -- FASTA, FASTQ, SFF false
Run.Raw data files.Raw data file.MD5 Checksum string true MD5 checksum value of file false
Run.Related Documents string true true Category -- Related documents -- Run is related to Sample, Study and Experiment true
Run.Related Documents.Related Document string Name or ID of related document false
Run.Related Documents.Related Document.Type string Type of related document false
Run.Related Documents.Related Document.DocURL url Relative ID of doc, provide Document URL false
Run.Aliases string true true Aliases - Items true
Run.Aliases.Alias string true Alias of this run in other databases, say SRA, GEO false
Run.Aliases.Alias.dbName string Database name of alias -- SRA, GEO false
Run.Aliases.Alias.URL url URL that points to this run in alias db false
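
Since the isItemList column carries the same information as the * character in the nesting format, a full-path model can in principle be rewritten with nesting prefixes. The sketch below assumes property names contain no dots and that the set of paths whose isItemList is true has already been collected; nesting_prefix is a hypothetical helper, not part of GenboreeKB.

```python
def nesting_prefix(path, item_list_paths):
    """Return the nesting-format property cell for a dot-delimited path.

    item_list_paths -- set of full paths whose isItemList value is true
    """
    parts = path.split(".")
    chars = []
    # One character per level below the root: "*" where the property at
    # that level is an item list, "-" otherwise.
    for end in range(2, len(parts) + 1):
        ancestor = ".".join(parts[:end])
        chars.append("*" if ancestor in item_list_paths else "-")
    prefix = "".join(chars)
    return f"{prefix} {parts[-1]}" if prefix else parts[-1]
```

With the item-list paths from the example model (Run.Raw data files, Run.Related Documents, Run.Aliases), the path Run.Aliases.Alias.dbName maps to "*-- dbName", matching the nesting-format model shown earlier.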