Tab Separated Value Formats¶
GenboreeKB documents can be uploaded or downloaded in a tab-separated value format.
There are two variants of the tab-delimited format, which differ mainly in how nested property names are represented.
The tabbed formats carry the same information as the JSON formats; it is just represented differently.
Both models and documents can be represented in the tab-separated value format.
Format Definition for Data Documents¶
- One line per property.
- Property name column - Column name should be "#property".
- Additional columns for one or more property-definition fields supported by the modeling schema.
- One or more columns named "value", containing data values for each property.
  - Each "value" column corresponds to a single data document in GenboreeKB.
Format Definition for Data Models¶
- One line per property.
- Property name column - Column name should be "#name".
- Additional columns for each property-definition field supported by the modeling schema.
Nesting Property Path Format¶
NOTE: Currently, this is the preferred and more comprehensively supported tabbed format.
Data Document - Nesting Property Path¶
Two Columns¶
Description¶
- The "property" column will contain all property paths in the document, using the "nesting-prefix" followed by a space and then the property name.
  - The characters "-" and "*" are used to indicate property depth/nesting within the document.
    - A "-" in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
    - A "*" in the nesting prefix indicates that this property can contain a sub-items list.
  - There is one nesting character per depth or nesting-level.
- The "value" column will contain associated values for each property in the document.
Example¶
You can download the example given below here.
#property | value |
---|---|
Run | exRNARun0000001 |
- Experiment Type | |
-- Directionality | non-strand-specific |
-- Run Type | Single end |
- Sequencing Instrument | Illumina2000 |
* Related Documents | |
*- Related Document | exRNASample0000001 |
*-- Type | Sample |
*-- DocURL | coll/Samples/doc/exRNASample0000001 |
*- Related Document | exRNAStudy0000001 |
*-- Type | Study |
*-- DocURL | coll/Studies/doc/exRNAStudy0000001 |
- Experimental Design | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG |
* Raw data files | 1 |
*- Raw data file | SRR822433.fastq.gz |
*-- MD5 Checksum | e2a01d56815f13c0d6c723a211a738e01cd34031 |
*-- Type | FASTQ |
- Maximum Read length | 51 |
* Aliases | |
*- Alias | SRR822433 |
*-- URL | http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 |
*-- dbName | DDBJ |
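The nesting prefix can be unpacked mechanically. The following is a minimal Python sketch, not part of GenboreeKB itself: the function names `split_nesting_name` and `to_full_paths` are hypothetical, and it assumes that the root property has an empty prefix, that the prefix length equals the nesting depth (as in the example above), and that property names contain no "." characters. It turns nesting-prefix rows into dot-delimited paths.

```python
def split_nesting_name(cell):
    """Split a property cell into (nesting_prefix, property_name).

    The prefix is the leading run of '-' and '*' characters and must be
    followed by a space; the root property has an empty prefix.
    """
    i = 0
    while i < len(cell) and cell[i] in "-*":
        i += 1
    if i and cell[i:i + 1] == " ":
        return cell[:i], cell[i + 1:].strip()
    return "", cell.strip()


def to_full_paths(rows):
    """Convert (property_cell, value) rows into (dot_path, value) rows."""
    stack = []  # stack[d] holds the most recent property name at depth d
    out = []
    for cell, value in rows:
        prefix, name = split_nesting_name(cell)
        depth = len(prefix)  # one nesting character per level
        if depth > len(stack):
            raise ValueError(f"property {name!r} skips a nesting level")
        del stack[depth:]  # leave any deeper branch we were in
        stack.append(name)
        out.append((".".join(stack), value))
    return out


rows = [
    ("Run", "exRNARun0000001"),
    ("- Experiment Type", ""),
    ("-- Directionality", "non-strand-specific"),
    ("* Related Documents", ""),
    ("*- Related Document", "exRNASample0000001"),
    ("*-- Type", "Sample"),
]
for path, value in to_full_paths(rows):
    print(path, value, sep="\t")
```

Keeping a stack of the most recent name at each depth is enough here because rows appear in document order, so a property's parent is always the closest preceding row that is one level shallower.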
More than one document¶
This representation can provide more than one document, if appropriate.
- Simply repeat the "#property value" header line to begin the next document.
  - i.e. the documents appear one after the other, separated by the "#property value" header.
- But also consider the multi-column version of this format, as it may be more convenient.
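As an illustration of the repeated-header convention, here is a small Python sketch that splits the lines of a two-column file into separate documents. The function name `split_documents` is hypothetical, and the sketch assumes that blank lines can be ignored and that a row without a second cell has an empty value; neither is stated by the format itself.

```python
def split_documents(lines):
    """Group two-column rows into documents, one per '#property' header."""
    docs = []
    current = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines carry no meaning
        cells = line.split("\t")
        if cells[0] == "#property":
            current = []  # a header line starts the next document
            docs.append(current)
            continue
        if current is None:
            raise ValueError("data row found before the first #property header")
        value = cells[1] if len(cells) > 1 else ""
        current.append((cells[0], value))
    return docs


text = (
    "#property\tvalue\n"
    "Run\texRNARun0000001\n"
    "- Sequencing Instrument\tIllumina2000\n"
    "#property\tvalue\n"
    "Run\texRNARun0000002\n"
)
for number, doc in enumerate(split_documents(text.splitlines()), start=1):
    print(number, doc)
```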
Multi-column¶
This format can have more than one "value" column; each "value" column holds the contents of a different document.
- The "property" column will contain all property paths in the document, using the "nesting-prefix" followed by a space and then the property name.
  - The characters "-" and "*" are used to indicate property depth/nesting within the document.
    - A "-" in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
    - A "*" in the nesting prefix indicates that this property can contain a sub-items list.
  - There is one nesting character per depth or nesting-level.
- There can be multiple "value" columns - each will correspond to one data document.
- We also require a "domain" column, which will indicate the domain of each property. This column can be copied from the data model.
- Your "property" column will contain all properties that occur in any document in your tab separated file. If some documents contain properties that other documents do not, please adhere to the following rules to make sure that your documents are processed correctly:
  - If a given document (corresponding to a "value" column) is missing a property and that property's domain allows for a blank value, then you must put #MISSING# as the associated value for that property.
    - Domains that qualify: "string", "[valueless]", "regexp()", "url", "fileUrl", "autoID"
    - It is only necessary to complete this task for parent properties. For example, if I have a parent property named "Biosample" that has a value of #MISSING#, then I don't have to put #MISSING# for "Biosample.Tissue Type" even if that property has a domain that's present in the list above.
  - If a given document (corresponding to a "value" column) is missing a property and that property does not allow for a blank value, then you can just leave the associated value blank.
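The rules for missing properties can be summarized in a short Python sketch. The names `domain_allows_blank` and `cell_for_missing_property` are hypothetical, and matching parameterized domains by the text before the first "(" (so that "regexp(...)" counts as "regexp") is an assumption of this sketch rather than documented GenboreeKB behavior.

```python
# Domains listed above as allowing a blank value.
BLANK_OK_DOMAINS = {"string", "[valueless]", "regexp", "url", "fileUrl", "autoID"}


def domain_allows_blank(domain):
    """True if the domain is one of those listed as allowing a blank value."""
    # Assumption: compare only the domain name before any "(...)" arguments.
    base = domain.strip().split("(", 1)[0]
    return base in BLANK_OK_DOMAINS


def cell_for_missing_property(domain, parent_is_missing=False):
    """Return the cell text to write when a document lacks a property."""
    if parent_is_missing:
        return ""  # the parent already carries #MISSING#
    return "#MISSING#" if domain_allows_blank(domain) else ""


print(cell_for_missing_property("[valueless]"))
print(repr(cell_for_missing_property("posInt")))
print(repr(cell_for_missing_property("string", parent_is_missing=True)))
```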
Simple¶
First, we will look at a simple example where two documents contain the exact same properties but have different values for some of those properties. You can download this example here.
#property | value | value | domain |
---|---|---|---|
Run | EXR-TEST00-RU | EXR-TEST01-RU | string |
Run.Experiment Details | | | string |
Run.Experiment Type.Directionality | non-strand-specific | non-strand-specific | enum(Strand-specific, Non-strand-specific) |
Run.Experiment Type.Run Type | Single end | Single end | enum(Single-end, Paired-End) |
Run.Sequencing Instrument | Illumina2000 | Illumina2000 | bioportalTerm(http://data.bioontology.org/search?ontology=EFO&subtree_root=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000001) |
Run.Related Documents | | | [valueless] |
Run.Related Documents.Related Document | exRNASample0000001 | exRNASample0000002 | string |
Run.Related Documents.Related Document.Type | Sample | Sample | string |
Run.Related Documents.Related Document.DocURL | coll/Samples/doc/exRNASample0000001 | coll/Samples/doc/exRNASample0000002 | url |
Run.Related Documents.Related Document | exRNAStudy0000001 | exRNAStudy0000002 | string |
Run.Related Documents.Related Document.Type | Study | Study | string |
Run.Related Documents.Related Document.DocURL | coll/Studies/doc/exRNAStudy0000001 | coll/Studies/doc/exRNAStudy0000002 | url |
Run.Experimental Design | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG | string |
Run.Raw data files | 1 | 1 | posInt |
Run.Raw data files.Raw data file | SRR822433.fastq.gz | SRR822434.fastq.gz | string |
Run.Raw data files.Raw data file.MD5 Checksum | e2a01d56815f13c0d6c723a211a738e01cd34031 | a1a23a56815f13c0e6d723a211a738e01fd35022 | string |
Run.Raw data files.Raw data file.Type | FASTQ | FASTQ | string |
Run.Maximum Read length | 51 | 54 | posInt |
Run.Aliases | | | [valueless] |
Run.Aliases.Alias | SRR822433 | SRR822434 | string |
Run.Aliases.Alias.URL | http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 | http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822434 | url |
Run.Aliases.Alias.dbName | DDBJ | DDBJ | enum(SRA, GEO, DDBJ, ENCODE, dbGaP) |
Since these documents are very similar, compiling them into one multi-column document is very efficient.
Complex¶
Now, imagine that one document, EXR-TEST01-RU, is missing the "Run.Aliases" property. You can download this example here.
#property | value | value | domain |
---|---|---|---|
Run | EXR-TEST00-RU | EXR-TEST01-RU | string |
Run.Experiment Details | | | string |
Run.Experiment Type.Directionality | non-strand-specific | non-strand-specific | enum(Strand-specific, Non-strand-specific) |
Run.Experiment Type.Run Type | Single end | Single end | enum(Single-end, Paired-End) |
Run.Sequencing Instrument | Illumina2000 | Illumina2000 | bioportalTerm(http://data.bioontology.org/search?ontology=EFO&subtree_root=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0000001) |
Run.Related Documents | | | [valueless] |
Run.Related Documents.Related Document | exRNASample0000001 | exRNASample0000002 | string |
Run.Related Documents.Related Document.Type | Sample | Sample | string |
Run.Related Documents.Related Document.DocURL | coll/Samples/doc/exRNASample0000001 | coll/Samples/doc/exRNASample0000002 | url |
Run.Related Documents.Related Document | exRNAStudy0000001 | exRNAStudy0000002 | string |
Run.Related Documents.Related Document.Type | Study | Study | string |
Run.Related Documents.Related Document.DocURL | coll/Studies/doc/exRNAStudy0000001 | coll/Studies/doc/exRNAStudy0000002 | url |
Run.Experimental Design | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG | string |
Run.Raw data files | 1 | 1 | posInt |
Run.Raw data files.Raw data file | SRR822433.fastq.gz | SRR822434.fastq.gz | string |
Run.Raw data files.Raw data file.MD5 Checksum | e2a01d56815f13c0d6c723a211a738e01cd34031 | a1a23a56815f13c0e6d723a211a738e01fd35022 | string |
Run.Raw data files.Raw data file.Type | FASTQ | FASTQ | string |
Run.Maximum Read length | 51 | 54 | posInt |
Run.Aliases | | #MISSING# | [valueless] |
Run.Aliases.Alias | SRR822433 | | string |
Run.Aliases.Alias.URL | http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 | | url |
Run.Aliases.Alias.dbName | DDBJ | | enum(SRA, GEO, DDBJ, ENCODE, dbGaP) |
We can see that we put #MISSING# in for the value of "Run.Aliases" in our second "value" column. Furthermore, note that we did not put #MISSING# for the values of "Run.Aliases.Alias" or "Run.Aliases.Alias.URL" for that document, even though their domains ("string" and "url") allow blank values. This is because the parent property, "Run.Aliases", has already been declared as #MISSING#.
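To make the effect of #MISSING# concrete, the sketch below pulls a single document out of a multi-column table and drops any property marked #MISSING# together with its sub-properties. It is only an illustration: the function name `extract_document` is hypothetical, it assumes dot-delimited property paths as in the examples above, and it keeps blank cells as empty values because a blank cell alone does not say whether the property is absent.

```python
def extract_document(header, rows, doc_index):
    """Return (path, value) pairs for one "value" column (0-based index)."""
    value_cols = [i for i, name in enumerate(header) if name == "value"]
    col = value_cols[doc_index]
    out = []
    missing_parent = None  # path of the most recent #MISSING# property
    for row in rows:
        path = row[0]
        value = row[col] if col < len(row) else ""
        if missing_parent is not None and path.startswith(missing_parent + "."):
            continue  # sub-property of a missing parent
        missing_parent = None
        if value == "#MISSING#":
            missing_parent = path
            continue
        out.append((path, value))
    return out


header = ["#property", "value", "value", "domain"]
rows = [
    ["Run", "EXR-TEST00-RU", "EXR-TEST01-RU", "string"],
    ["Run.Maximum Read length", "51", "54", "posInt"],
    ["Run.Aliases", "", "#MISSING#", "[valueless]"],
    ["Run.Aliases.Alias", "SRR822433", "", "string"],
    ["Run.Aliases.Alias.dbName", "DDBJ", "", "enum(SRA, GEO, DDBJ, ENCODE, dbGaP)"],
]
print(extract_document(header, rows, 0))
print(extract_document(header, rows, 1))
```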
Data Model - Nesting Property Path¶
Multi-column¶
- The property name column ("#name") will contain all property paths in the model, using the "nesting-prefix" followed by a space and then the property name.
  - The characters "-" and "*" are used to indicate property depth/nesting within the model.
    - A "-" in the nesting prefix indicates a regular property, which may or may not have sub-property info. It does NOT contain a sub-items list.
    - A "*" in the nesting prefix indicates that this property can contain a sub-items list.
  - There is one nesting character per depth or nesting-level.
- In addition, the model will contain columns that define specific attributes for each property (domain, required, description, category, etc.).
You can download the example given below here.
#name | domain | default | identifier | required | unique | units | category | fixed | index | description |
---|---|---|---|---|---|---|---|---|---|---|
Run | string | true | true | Document describing information about the sequencing run including raw data files | ||||||
- Experimental Design | string | true | Description of experimental design | |||||||
- Sequencing Instrument | enum(Illumina2500, Illumina2000, IonTorrent, IonProton, MiSeq, Solid, 454, other) | true | Name, model of the sequencing instrument | |||||||
-- Other | string | If 'other', can provide name of instrument here. | ||||||||
- Experiment Type | string | true | true | Category -- Experiment Type | ||||||
-- Directionality | enum(Strand-specific, non-strand-specific) | true | Strand specificity of the run | |||||||
-- Run Type | enum(Single end, Paired End) | true | Type of run, single end or paired end | |||||||
- Maximum Read length | int | true | Length of reads in base pairs or nt | |||||||
* Raw data files | int | 0 | true | Raw data files - Items | ||||||
*- Raw data file | string | true | Name of file | |||||||
*-- Type | enum(FASTA, FASTQ, SFF) | true | File type -- FASTA, FASTQ, SFF | |||||||
*-- MD5 Checksum | string | true | MD5 checksum value of file | |||||||
* Related Documents | string | true | true | Category -- Related documents -- Run is related to Sample, Study and Experiment | ||||||
*- Related Document | string | Name or ID of related document | ||||||||
*-- Type | string | Type of related document | ||||||||
*-- DocURL | url | Relative ID of doc, provide Document URL | ||||||||
* Aliases | string | true | true | Aliases - Items | ||||||
*- Alias | string | true | Alias of this run in other databases, say SRA, GEO | |||||||
*-- dbName | string | Database name of alias -- SRA, GEO | ||||||||
*-- URL | url | URL that points to this run in alias db |
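A model file in this format can be read with Python's standard `csv` module. The sketch below is a rough illustration under stated assumptions: the function name `read_model_rows` and the shape of the returned dictionaries are hypothetical, empty cells are treated as "attribute not set", and a property is taken to be an item list when the last character of its nesting prefix is "*", which is a reading of the example above rather than a documented rule.

```python
import csv
import io


def read_model_rows(text):
    """Parse tab-separated model text into one dictionary per property."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        cell = record.pop("#name")
        rest = cell.lstrip("-*")
        prefix = cell[:len(cell) - len(rest)]
        # Keep only named columns that actually have a value in this row.
        attrs = {k: v for k, v in record.items() if k is not None and v}
        rows.append({
            "name": rest.strip(),
            "depth": len(prefix),
            "is_item_list": prefix.endswith("*"),
            "attrs": attrs,
        })
    return rows


text = (
    "#name\tdomain\trequired\tdescription\n"
    "Run\tstring\ttrue\tDocument describing the sequencing run\n"
    "* Raw data files\tint\t\tRaw data files - Items\n"
    "*- Raw data file\tstring\ttrue\tName of file\n"
)
for row in read_model_rows(text):
    print(row)
```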
Full Property Path Format¶
NOTE: This format currently is not fully supported. Please consider the Nesting Property Path Format instead.
Data Document - Full Property Path Format¶
Two Columns¶
- The "property" column will contain all dot-delimited property paths in the document.
- The "value" column will contain associated values for each property in the document.
You can download the example given below here.
#property | value |
---|---|
Run | exRNARun0000001 |
Run.Experiment Type | |
Run.Experiment Type.Directionality | non-strand-specific |
Run.Experiment Type.Run Type | Single end |
Run.Sequencing Instrument | Illumina2000 |
Run.Related Documents | |
Run.Related Documents.Related Document | exRNASample0000001 |
Run.Related Documents.Related Document.Type | Sample |
Run.Related Documents.Related Document.DocURL | coll/Samples/doc/exRNASample0000001 |
Run.Related Documents.Related Document | exRNAStudy0000001 |
Run.Related Documents.Related Document.Type | Study |
Run.Related Documents.Related Document.DocURL | coll/Studies/doc/exRNAStudy0000001 |
Run.Experimental Design | barcoded_small_RNA_cDNA_PMID_23440203-3' adapter: TGTGTTCGTATGCCGTCTTCTGCTTG |
Run.Raw data files | 1 |
Run.Raw data files.Raw data file | SRR822433.fastq.gz |
Run.Raw data files.Raw data file.MD5 Checksum | e2a01d56815f13c0d6c723a211a738e01cd34031 |
Run.Raw data files.Raw data file.Type | FASTQ |
Run.Maximum Read length | 51 |
Run.Aliases | |
Run.Aliases.Alias | SRR822433 |
Run.Aliases.Alias.URL | http://trace.ddbj.nig.ac.jp/DRASearch/run?acc=SRR822433 |
Run.Aliases.Alias.dbName | DDBJ |
More than one document¶
This representation can provide more than one document, if appropriate.
- Simply repeat the "#property value" header line to begin the next document.
  - i.e. the documents appear one after the other, separated by the "#property value" header.
Multi-column¶
Multi-column support is not yet incorporated for the full property path format. Coming soon!
Data Model - Full Property Path Format¶
Multi-column¶
- The property name column ("#name") will contain all dot-delimited property paths in the model.
- In addition, the model will contain columns that define specific attributes for each property (domain, required, description, category, etc.).
- Note the special "isItemList" column, which indicates whether the property contains a list of sub-items rather than some sub-properties.
  - This column is not needed in the format=json nor in the format=tabbed_prop_nesting formats.
You can download the example given below here.
#name | domain | default | identifier | required | unique | units | category | fixed | index | description | isItemList |
---|---|---|---|---|---|---|---|---|---|---|---|
Run | string | true | true | Document describing information about the sequencing run including raw data files | false | ||||||
Run.Experimental Design | string | true | Description of experimental design | false | |||||||
Run.Sequencing Instrument | enum(Illumina2500, Illumina2000, IonTorrent, IonProton, MiSeq, Solid, 454, other) | true | Name, model of the sequencing instrument | false | |||||||
Run.Sequencing Instrument.Other | string | If 'other', can provide name of instrument here. | false | ||||||||
Run.Experiment Type | string | true | true | Category -- Experiment Type | false | ||||||
Run.Experiment Type.Directionality | enum(Strand-specific, non-strand-specific) | true | Strand specificity of the run | false | |||||||
Run.Experiment Type.Run Type | enum(Single end, Paired End) | true | Type of run, single end or paired end | false | |||||||
Run.Maximum Read length | int | true | Length of reads in base pairs or nt | false | |||||||
Run.Raw data files | int | 0 | true | Raw data files - Items | true | ||||||
Run.Raw data files.Raw data file | string | true | Name of file | false | |||||||
Run.Raw data files.Raw data file.Type | enum(FASTA, FASTQ, SFF) | true | File type -- FASTA, FASTQ, SFF | false | |||||||
Run.Raw data files.Raw data file.MD5 Checksum | string | true | MD5 checksum value of file | false | |||||||
Run.Related Documents | string | true | true | Category -- Related documents -- Run is related to Sample, Study and Experiment | true | ||||||
Run.Related Documents.Related Document | string | Name or ID of related document | false | ||||||||
Run.Related Documents.Related Document.Type | string | Type of related document | false | ||||||||
Run.Related Documents.Related Document.DocURL | url | Relative ID of doc, provide Document URL | false | ||||||||
Run.Aliases | string | true | true | Aliases - Items | true | ||||||
Run.Aliases.Alias | string | true | Alias of this run in other databases, say SRA, GEO | false | |||||||
Run.Aliases.Alias.dbName | string | Database name of alias -- SRA, GEO | false | ||||||||
Run.Aliases.Alias.URL | url | URL that points to this run in alias db | false |
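The "isItemList" column carries the same information that the "*" character carries in the nesting property path format. As a hedged illustration, the Python sketch below rebuilds nesting-prefix names from full paths plus their "isItemList" flags. The function name `to_nesting_names` is hypothetical, and the sketch assumes that every ancestor path appears as its own row and that property names contain no "." characters.

```python
def to_nesting_names(model_rows):
    """model_rows: list of (full_path, is_item_list) pairs in model order."""
    flags = dict(model_rows)
    names = []
    for path, _ in model_rows:
        parts = path.split(".")
        prefix = ""
        # One prefix character per level below the root: '*' where the
        # property at that level is an item list, '-' otherwise.
        for depth in range(2, len(parts) + 1):
            ancestor = ".".join(parts[:depth])
            prefix += "*" if flags[ancestor] else "-"
        names.append(f"{prefix} {parts[-1]}" if prefix else parts[-1])
    return names


model_rows = [
    ("Run", False),
    ("Run.Experiment Type", False),
    ("Run.Experiment Type.Run Type", False),
    ("Run.Raw data files", True),
    ("Run.Raw data files.Raw data file", False),
    ("Run.Raw data files.Raw data file.Type", False),
]
for name in to_nesting_names(model_rows):
    print(name)
```

For these rows the result lines up with the "#name" column of the nesting property path model shown earlier, for example "*-- Type" for "Run.Raw data files.Raw data file.Type".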