Prepare your Manifest File¶

Prepare your Manifest File
Step 1. Download Template Manifest File
Step 2. Open Your Manifest File
Step 3. Compute the MD5 Checksum of your Data Archive
Step 4. Fill Out the Top Section of Your Manifest
Step 5. Fill Out the Sample-Specific Section of Your Manifest
Step 6. Fill Out the Settings Section of Your Manifest
Step 7. Validate and Save Your Manifest File
Summary

After you have finished preparing your data archive and metadata archive, you have to complete the third and final part of your submission: the manifest file.
The manifest file is the "glue" that links together all of your metadata and data. It also provides some important, additional information required to process your submission.

Your manifest file name will have the same prefix as your other files (data archive, metadata file) and will end in ".manifest.json".
For example, if my data archive were named "samples_data.zip", then my manifest file would be named "samples.manifest.json".
As you work on your manifest file, make sure that you save regularly so you don't lose your progress!
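
The naming rule can be sketched in a few lines of Python. This is only an illustration: the function name manifest_name_for is hypothetical, and it assumes (based on the single example above) that the shared prefix is whatever comes before "_data" in the archive name.

```python
def manifest_name_for(archive_name: str) -> str:
    """Derive the manifest file name from a data archive name.

    Assumption: the shared prefix is the text before the first "_data",
    as in the "samples_data.zip" example above.
    """
    prefix, separator, _rest = archive_name.partition("_data")
    if not separator or not prefix:
        raise ValueError(f"no prefix found before '_data' in {archive_name!r}")
    return f"{prefix}.manifest.json"


print(manifest_name_for("samples_data.zip"))  # samples.manifest.json
```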

Step 1. Download Template Manifest File¶

First, you will want to download a template of the manifest file.
You can find that template here.
You will complete your manifest file by filling in values between the quotation marks for each property.

Below, you can see what the template looks like:

{
  "studyName": "",
  "userLogin": "",
  "md5CheckSum": "",
  "runMetadataFileName": "",
  "submissionMetadataFileName": "",
  "studyMetadataFileName": "",
  "experimentMetadataFileName": "",
  "biosampleMetadataFileName": "",
  "donorMetadataFileName": "",
  "manifest":
  [
    {
      "dataFileName": "",
      "sampleName": ""
    }
  ],
  "settings":
  {
    "adapterSequence": "",
    "analysisName": ""
  }
}

Step 2. Open Your Manifest File¶

Next, you will need to open your manifest file in your favorite text editor.
You can find some recommendations below:

In Windows: Notepad++ or Wordpad (with "word wrap" turned off)
In Linux/Unix: gedit
In Mac OSX: "TextEdit" program
Command Line: You can also always use the terminal to edit files (vim, nano, etc.).

Step 3. Compute the MD5 Checksum of your Data Archive¶

You already know most of the information for your manifest file, but you'll need to compute the MD5 checksum of your data archive before you proceed.
An MD5 checksum can be computed for any file. The checksum is based on the exact contents of the file, so in practice two different files will almost never have the same MD5 checksum.
The data archive is normally a large file (sometimes many gigabytes). When you transfer the data archive over to our FTP server, it is possible that the transfer will fail for some reason.
That failure could occur due to a connection failure, a computer malfunction, or many other reasons.
By computing the MD5 checksum of your version of the data archive and then providing that checksum to us, you give us a way of checking that the file transfer completed successfully.
When processing your files, we compute our own MD5 checksum of your data archive and compare it to the checksum that you gave us.
If the checksums don't match, the file did not transfer to us intact (or you supplied the wrong checksum).

To compute the MD5 checksum for a given file, open a terminal and run the command for your operating system, where [fileName] is the path to your file:
- Linux/Unix: "md5sum [fileName]"
- OS X: "md5 [fileName]"
- Windows (Windows Command Processor, cmd): "certutil -hashfile [fileName] MD5"
The checksum will be displayed in the terminal, and you can copy and paste it into the appropriate field.
For example, on Linux/Unix:

cd /home/myHomeDir/myDataDir
md5sum samples_data.tar.gz

If you're using Windows or are uncomfortable with using the terminal, there are a number of different stand-alone programs that will help you
compute the MD5 checksum for a given file. You can see some examples here.
IMPORTANT NOTE: If you edit any files in your data archive, you will have to recompute your MD5 checksum
before submitting your files for processing (because the contents of the data archive have changed).
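
If you would rather compute the checksum from a script, the sketch below does the same job with Python's standard hashlib module. The function name md5_of_file is our own choice for illustration; the file is read in chunks so that a multi-gigabyte archive does not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 checksum of a file as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (the file name is a placeholder for your own data archive):
# print(md5_of_file("samples_data.tar.gz"))
```

The resulting string should match what md5sum, md5, or certutil reports for the same file, apart from letter case.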

Step 4. Fill Out the Top Section of Your Manifest¶

The top section of your manifest contains information that applies to all samples in your submission.
Below, we'll go through each property and tell you how to fill them all out.

studyName: This is the name of your study. Name your study something which captures the overall "feel" of the submission.
- EXAMPLE: Since I want to compare CSF versus serum samples for Parkinson's patients, I wrote "CSF vs. Serum Parkinson's June 2017".
userLogin: This is your Genboree user login.
- EXAMPLE: I wrote "william_thistle" because that's the name I use to log in to Genboree.
md5CheckSum: This is the MD5 checksum of the data archive (not the metadata archive and not the manifest file). We give directions above on how to compute the MD5 checksum.
- EXAMPLE: I wrote "b9355772f35516837a06666f7c56afdd" because I got that value when I computed the MD5 checksum of my data archive.
runMetadataFileName: This is the file name of your Runs metadata file.
- EXAMPLE: I wrote "testRun.metadata.tsv" because that's the name of my Runs metadata file.
submissionMetadataFileName: This is the file name of your Submissions metadata file.
- EXAMPLE: I wrote "testSubmissions.metadata.tsv" because that's the name of my Submissions metadata file.
studyMetadataFileName: This is the file name of your Studies metadata file.
- EXAMPLE: I wrote "testStudies.metadata.tsv" because that's the name of my Studies metadata file.
experimentMetadataFileName: This is the file name of your Experiments metadata file.
- EXAMPLE: I wrote "testExperiments.metadata.tsv" because that's the name of my Experiments metadata file.
donorMetadataFileName: This is the file name of your Donors metadata file.
- EXAMPLE: I wrote "testDonors.metadata.tsv" because that's the name of my Donors metadata file.
biosampleMetadataFileName: This is the file name of your Biosamples metadata file.
- EXAMPLE: I wrote "testBiosamples.metadata.tsv" because that's the name of my Biosamples metadata file.

IMPORTANT NOTE: Please make sure that each file name includes its extension (.tsv).

So far, our template should look something like this:

{
  "studyName": "CSF vs. Serum Parkinson's June 2017",
  "userLogin": "william_thistle",
  "md5CheckSum": "b9355772f35516837a06666f7c56afdd",
  "runMetadataFileName": "testRun.metadata.tsv",
  "submissionMetadataFileName": "testSubmissions.metadata.tsv",
  "studyMetadataFileName": "testStudies.metadata.tsv",
  "experimentMetadataFileName": "testExperiments.metadata.tsv",
  "biosampleMetadataFileName": "testBiosamples.metadata.tsv",
  "donorMetadataFileName": "testDonors.metadata.tsv",
  "manifest":
  [
    {
      "dataFileName": "",
      "sampleName": ""
    }
  ],
  "settings":
  {
    "adapterSequence": "",
    "analysisName": ""
  }
}
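
As a quick self-check of the top section, something like the following Python sketch could be used. It is not part of the submission pipeline; the function name check_top_section is hypothetical, and the rules it applies (every property filled in, metadata file names ending in .tsv, a 32-character hexadecimal checksum) are simply our reading of the instructions above.

```python
import string

# Property names taken from the template manifest.
TOP_LEVEL_PROPERTIES = [
    "studyName", "userLogin", "md5CheckSum",
    "runMetadataFileName", "submissionMetadataFileName",
    "studyMetadataFileName", "experimentMetadataFileName",
    "biosampleMetadataFileName", "donorMetadataFileName",
]


def check_top_section(manifest: dict) -> list:
    """Return a list of human-readable problems found in the top section."""
    problems = []
    for name in TOP_LEVEL_PROPERTIES:
        value = manifest.get(name, "")
        if not isinstance(value, str) or not value:
            problems.append(f"{name} is missing or empty")
        elif name.endswith("MetadataFileName") and not value.endswith(".tsv"):
            problems.append(f"{name} should include the .tsv extension")
    checksum = manifest.get("md5CheckSum", "")
    # An MD5 checksum is written as 32 hexadecimal characters.
    if isinstance(checksum, str) and checksum:
        if len(checksum) != 32 or any(c not in string.hexdigits for c in checksum):
            problems.append("md5CheckSum does not look like an MD5 checksum")
    return problems


example = {
    "studyName": "CSF vs. Serum Parkinson's June 2017",
    "userLogin": "william_thistle",
    "md5CheckSum": "b9355772f35516837a06666f7c56afdd",
    "runMetadataFileName": "testRun.metadata.tsv",
    "submissionMetadataFileName": "testSubmissions.metadata.tsv",
    "studyMetadataFileName": "testStudies.metadata.tsv",
    "experimentMetadataFileName": "testExperiments.metadata.tsv",
    "biosampleMetadataFileName": "testBiosamples.metadata.tsv",
    "donorMetadataFileName": "testDonors.metadata.tsv",
}
print(check_top_section(example))  # an empty list means no problems were found
```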

Step 5. Fill Out the Sample-Specific Section of Your Manifest¶

Next, we'll tackle the part of the manifest file that deals with your individual samples.
For each sample, you will need to fill out a dataFileName and a sampleName.
Currently, the template only has space to fill out information about one sample.
To add more samples, all you need to do is copy-paste the existing set of dataFileName and sampleName properties.
For example, this is what the (relevant part of the) template currently looks like:

{
  "manifest":
  [
    {
      "dataFileName": "",
      "sampleName": ""
    }
  ]
}

If I had five samples, it would look like this:

{
  "manifest":
  [
    {
      "dataFileName": "",
      "sampleName": ""
    },
    {
      "dataFileName": "",
      "sampleName": ""
    },
    {
      "dataFileName": "",
      "sampleName": ""
    },
    {
      "dataFileName": "",
      "sampleName": ""
    },
    {
      "dataFileName": "",
      "sampleName": ""
    }
  ]
}

IMPORTANT NOTE: I added a comma after the closing brace of every sample except the last one. These commas are required (or else your file will not be valid JSON).

Next, we'll go over how to fill out the dataFileName and sampleName for each sample.
It might be easiest to first see how this section will look when properly filled out:

{
  "manifest":
  [
    {
      "dataFileName": "test1.fastq.gz",
      "sampleName": "Test 1"
    },
    {
      "dataFileName": "test2.fastq.gz",
      "sampleName": "Test 2"
    },
    {
      "dataFileName": "test3.fastq.gz",
      "sampleName": "Test 3"
    },
    {
      "dataFileName": "test4.fastq.gz",
      "sampleName": "Test 4"
    },
    {
      "dataFileName": "test5.fastq.gz",
      "sampleName": "Test 5"
    }
  ]
}

The dataFileName property refers to a given sample's data file name in the data archive.

In the above example, I have 5 data files in my data archive, and their names are "test1.fastq.gz", "test2.fastq.gz", etc.
- Make sure that you provide the name of the data files directly placed into the data archive (and not their uncompressed names).
- For example, one of my data files is named "test1.fastq.gz". This file is a gzip-compressed version of a FASTQ file (test1.fastq).
  I want to write "test1.fastq.gz" and NOT "test1.fastq" for my dataFileName.
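
The rule above can be checked mechanically. The sketch below is an illustration only (check_data_file_names is a hypothetical helper, not something the submission pipeline provides): it compares each dataFileName with a list of the file names found in your data archive and hints when an uncompressed name seems to have been written by mistake. The list of archive members is passed in by you; for a .zip archive it could come from zipfile.ZipFile(path).namelist(), and for a tar archive from tarfile.open(path).getnames().

```python
def check_data_file_names(entries, archive_members):
    """Compare manifest dataFileName values with the files in the data archive.

    entries: the list stored under "manifest" in the manifest file.
    archive_members: file names found directly inside the data archive.
    """
    members = set(archive_members)
    problems = []
    for entry in entries:
        name = entry.get("dataFileName", "")
        if name in members:
            continue
        # Common mistake: the uncompressed name (test1.fastq) was written
        # instead of the name stored in the archive (test1.fastq.gz).
        candidates = sorted(m for m in members if name and m.startswith(name + "."))
        if candidates:
            problems.append(f"{name!r} not in archive; did you mean {candidates[0]!r}?")
        else:
            problems.append(f"{name!r} not in archive")
    return problems


entries = [{"dataFileName": "test1.fastq", "sampleName": "Test 1"}]
print(check_data_file_names(entries, ["test1.fastq.gz", "test2.fastq.gz"]))
```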

Next, we'll explain the sampleName property.

This property connects biosample metadata with biosample data.
Each data file you provided in your data archive has an accompanying column of metadata in the Biosamples metadata file.
For example, take the data file "test1.fastq.gz" referenced above. This data file has an accompanying column of metadata in the Biosamples metadata file,
and in that column of metadata, the "- Name" property has a value of "Test 1". Thus, we would write "Test 1" for the "sampleName".
You will need to link each data file to its biosample metadata column in this fashion (five times in total, for the above manifest).

Now, our manifest file looks like the following:

{
  "studyName": "CSF vs. Serum Parkinson's June 2017",
  "userLogin": "william_thistle",
  "md5CheckSum": "b9355772f35516837a06666f7c56afdd",
  "runMetadataFileName": "testRun.metadata.tsv",
  "submissionMetadataFileName": "testSubmissions.metadata.tsv",
  "studyMetadataFileName": "testStudies.metadata.tsv",
  "experimentMetadataFileName": "testExperiments.metadata.tsv",
  "biosampleMetadataFileName": "testBiosamples.metadata.tsv",
  "donorMetadataFileName": "testDonors.metadata.tsv",
  "manifest":
  [
    {
      "dataFileName": "test1.fastq.gz",
      "sampleName": "Test 1"
    },
    {
      "dataFileName": "test2.fastq.gz",
      "sampleName": "Test 2"
    },
    {
      "dataFileName": "test3.fastq.gz",
      "sampleName": "Test 3"
    },
    {
      "dataFileName": "test4.fastq.gz",
      "sampleName": "Test 4"
    },
    {
      "dataFileName": "test5.fastq.gz",
      "sampleName": "Test 5"
    }
  ],
  "settings":
  {
    "adapterSequence": "",
    "analysisName": ""
  }
}

Here is a manifest file filler helper that could help you create all of the sampleName and dataFileName pairs in JSON format.
Make sure you are in the smRNAseq tab and remember to remove the final comma "," after the last sampleName, dataFileName pair in the JSON file.
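
If you have many samples, the pairs can also be generated with a short script. The following Python sketch is one possible approach, not the helper mentioned above; build_manifest_entries is a hypothetical name, and the file and sample names are the ones from this tutorial. Because the JSON text is produced by json.dumps, commas are placed between entries automatically and never after the last one.

```python
import json


def build_manifest_entries(pairs):
    """Turn (dataFileName, sampleName) pairs into the "manifest" list."""
    return [
        {"dataFileName": data_file, "sampleName": sample}
        for data_file, sample in pairs
    ]


# The sample names must match the names used in your Biosamples metadata file.
pairs = [(f"test{i}.fastq.gz", f"Test {i}") for i in range(1, 6)]
print(json.dumps({"manifest": build_manifest_entries(pairs)}, indent=2))
```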

Step 6. Fill Out the Settings Section of Your Manifest¶

The "settings" section at the bottom of the manifest file provides some ability to customize how your submission is processed.
Below, we'll go over the different options and describe briefly what they do.

Setting Name	Description and Possible Values
adapterSequence	value of 3' adapter sequence. Default of "autoDetect" (will try to auto-detect adapter sequence). Other possible values include "none" (adapter sequence already clipped) and the actual value of the adapter sequence (for example, "AGATCGGAAGAGCACACGTCT"). Note that you can provide a different 3' adapter sequence for each sample by including the adapterSequence field with each sample's information (dataFileName / sampleName). If you do so, don't include the adapterSequence field in the general settings section.
randomBarcodeLength	indicates random barcode length used in samples. Default of "0" (no random barcodes).
randomBarcodeLocation	indicates location of random barcodes. Default of "-5p -3p". Other possible values include "-5p" and "-3p".
randomBarcodeStats	sets whether we should compute frequency and enrichment statistics for samples with random barcodes (useful for identifying ligation/amplification biases in some cases). Default of "false" (recommended). Other possible values include "true".
analysisName	analysis name - used for naming job-specific folder on Genboree and for naming certain files in your results. Default uses timestamp to indicate when the job was submitted (this is a good idea!).
genomeVersion	genome version of your output database / your data. Default is "hg19". The other supported genome is "mm10".
useLibrary	indicates whether you are using a spike-in library. Default value of "noOligo", which means no spike-in library. The other possible value is "uploadNewLibrary" (you included a FASTA file in your data archive).
suppressRunExceRptEmails	indicates whether you want to suppress all runExceRpt emails sent for successfully processed samples. Note that failure emails will be sent regardless. This setting will significantly reduce the number of emails you receive. Default of "false". The other possible value is "true".

IMPORTANT NOTES

You MUST specify an analysisName in your manifest file, as this setting provides valuable information for organizing your submission.
We recommend that you structure your analysisName in the following way:

First, put your PI ID followed by a hyphen ("-"). Your PI ID is the first letter of your PI's first name, followed by the first four letters of your PI's last name, followed by a 1.
- EXAMPLE: My PI ID is AMILO1, since my PI is Aleksandar MILOsavljevic.
Second, put some kind of label for your submission, followed by a hyphen ("-").
- EXAMPLE: I might put "Serum_vs_Plasma_Controls" if I were comparing healthy controls in serum and plasma.
Third, put the date of your submission in the format YYYY-MM-DD.
- EXAMPLE: I would put 2017-06-01 if I were submitting my files on June 1, 2017.
Our final analysisName would look like the following: AMILO1-Serum_vs_Plasma_Controls-2017-06-01.
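
The three parts of the recommended analysisName can be assembled as in the Python sketch below. The function names pi_id and analysis_name are hypothetical and exist only for this illustration; the rule itself is the one described above.

```python
from datetime import date


def pi_id(first_name: str, last_name: str) -> str:
    """First letter of the first name + first four letters of the last name + 1."""
    return f"{first_name[:1]}{last_name[:4]}".upper() + "1"


def analysis_name(first_name: str, last_name: str, label: str, submitted: date) -> str:
    """Join PI ID, label, and submission date (YYYY-MM-DD) with hyphens."""
    return f"{pi_id(first_name, last_name)}-{label}-{submitted.isoformat()}"


print(analysis_name("Aleksandar", "Milosavljevic",
                    "Serum_vs_Plasma_Controls", date(2017, 6, 1)))
# AMILO1-Serum_vs_Plasma_Controls-2017-06-01
```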

Make sure that you include "useLibrary": "uploadNewLibrary" if you are providing a spike-in library with your data files.

Make sure that you specify "genomeVersion": "mm10" if your samples use the mm10 reference genome instead of hg19 (the default).

Make sure that you specify randomBarcodeLength and randomBarcodeLocation if your samples have random barcodes (we recommend not using randomBarcodeStats).
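
The notes above can be turned into a small self-check. The Python sketch below is an illustration under our own assumptions: check_settings is a hypothetical helper, and the allowed values are copied from the settings table in this step. It only covers the settings that have an explicit list of values, the required analysisName, and the rule that adapterSequence should not appear both per sample and in the general settings.

```python
# Allowed values copied from the settings table above.
ALLOWED_VALUES = {
    "randomBarcodeLocation": ("-5p -3p", "-5p", "-3p"),
    "genomeVersion": ("hg19", "mm10"),
    "useLibrary": ("noOligo", "uploadNewLibrary"),
}


def check_settings(manifest: dict) -> list:
    """Return a list of human-readable problems found in the settings section."""
    problems = []
    settings = manifest.get("settings", {})
    if not settings.get("analysisName"):
        problems.append("settings.analysisName must be specified")
    for name, allowed in ALLOWED_VALUES.items():
        if name in settings and settings[name] not in allowed:
            problems.append(f"settings.{name} has unexpected value {settings[name]!r}")
    # adapterSequence may be given per sample or in settings, but not both.
    per_sample = any("adapterSequence" in entry for entry in manifest.get("manifest", []))
    if per_sample and "adapterSequence" in settings:
        problems.append("adapterSequence is set per sample; remove it from settings")
    return problems


manifest = {
    "manifest": [{"dataFileName": "test1.fastq.gz", "sampleName": "Test 1"}],
    "settings": {
        "adapterSequence": "AGATCGGAAGAGCACACGTCT",
        "analysisName": "AMILO1-Serum_vs_Plasma_Controls-2017-06-01",
    },
}
print(check_settings(manifest))  # an empty list means no problems were found
```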

Now, our (completed) manifest file looks like the following:

{
  "studyName": "CSF vs. Serum Parkinson's June 2017",
  "userLogin": "william_thistle",
  "md5CheckSum": "b9355772f35516837a06666f7c56afdd",
  "runMetadataFileName": "testRun.metadata.tsv",
  "submissionMetadataFileName": "testSubmissions.metadata.tsv",
  "studyMetadataFileName": "testStudies.metadata.tsv",
  "experimentMetadataFileName": "testExperiments.metadata.tsv",
  "biosampleMetadataFileName": "testBiosamples.metadata.tsv",
  "donorMetadataFileName": "testDonors.metadata.tsv",
  "manifest":
  [
    {
      "dataFileName": "test1.fastq.gz",
      "sampleName": "Test 1"
    },
    {
      "dataFileName": "test2.fastq.gz",
      "sampleName": "Test 2"
    },
    {
      "dataFileName": "test3.fastq.gz",
      "sampleName": "Test 3"
    },
    {
      "dataFileName": "test4.fastq.gz",
      "sampleName": "Test 4"
    },
    {
      "dataFileName": "test5.fastq.gz",
      "sampleName": "Test 5"
    }
  ],
  "settings":
  {
    "adapterSequence": "AGATCGGAAGAGCACACGTCT",
    "analysisName": "AMILO1-Serum_vs_Plasma_Controls-2017-06-01"
  }
}

If you remove or add a setting, make sure that your terms are still separated sensibly by commas.
For example, if I removed analysisName above, I would delete the comma after adapterSequence (because adapterSequence is now the final property).
Likewise, if I added another property like genomeVersion after analysisName, I would put a comma after analysisName (but no comma after genomeVersion).
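
One way to avoid comma mistakes altogether is to let a program add or remove the setting and rewrite the JSON for you. The Python sketch below illustrates the idea on the settings section only; the values are the ones used in this tutorial, and removing adapterSequence here is purely for demonstration.

```python
import json

manifest_text = """
{
  "settings": {
    "adapterSequence": "AGATCGGAAGAGCACACGTCT",
    "analysisName": "AMILO1-Serum_vs_Plasma_Controls-2017-06-01"
  }
}
"""

manifest = json.loads(manifest_text)
manifest["settings"]["genomeVersion"] = "mm10"  # add a setting
del manifest["settings"]["adapterSequence"]     # remove a setting

# json.dumps writes commas between properties and none after the last one.
updated_text = json.dumps(manifest, indent=2)
print(updated_text)
```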

You can download this example manifest file here.

Step 7. Validate and Save Your Manifest File¶

After you've finished working on your manifest file, you should make sure that the file is formatted correctly by using a JSON validator like JSONLint.
Simply copy-paste your manifest content into the text box and then click "Validate" to see if there are any errors in your manifest file.
If there are any errors, use the error messages provided by the JSON validator to fix your manifest file.
You're now done with creating your manifest file! Save it a final time and you're ready to upload your submission for processing.
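
If you prefer to check the file on your own computer, Python's standard json module can serve as a basic validator. The sketch below is an optional alternative to an online validator, not a replacement for the checks performed during processing; find_json_error is a hypothetical helper name.

```python
import json


def find_json_error(text: str):
    """Return None if text is valid JSON, otherwise a message with line and column."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A trailing comma after the last sample is a typical mistake.
broken = '{"manifest": [{"dataFileName": "a.gz", "sampleName": "A"},]}'
print(find_json_error(broken))

# To check a real file (the name is a placeholder for your own manifest):
# with open("samples.manifest.json") as handle:
#     print(find_json_error(handle.read()))
```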

Summary¶

1. Download the template manifest file
2. Open your manifest file
3. Compute the MD5 checksum of your data archive (not your manifest file, not your metadata archive)
4. Fill out the top section of your manifest
   - Make sure that each file name is typed exactly as the file is named, including the file extension.
5. Fill out the sample-specific section of your manifest
6. Fill out the settings section of your manifest
7. Validate and save your manifest file


exRNA Data Coordination Center