The file is an LFF table whose columns are tab-delimited. The 13th column contains attribute-value pairs for custom fields; the attribute-value pairs are semicolon (;) delimited. [ File format description ]
Each line describes the genomic location of and information about a methylation probe from a common Illumina array.
One of the pieces of information is how far the probed location is from the Transcription Start Site (TSS) of the nearby gene. This is provided by the Distance_to_TSS attribute-value pair within the 13th column (e.g. Distance_to_TSS=1135;).
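As a starting point, the attribute column can be turned into a Hash. The following is a minimal sketch, not a required approach: the method name parse_attributes and the sample line are made up for illustration, and the sketch assumes each pair has the form key=value with optional spaces after the semicolon.

```ruby
# Parse the 13th LFF column ("key=value; key=value; ...") into a Hash.
# parse_attributes is a hypothetical helper name, not part of any library.
def parse_attributes(field)
  field.split(';').each_with_object({}) do |pair, attrs|
    key, value = pair.split('=', 2)
    next if key.nil? || value.nil? # skip empty or malformed pairs
    attrs[key.strip] = value.strip
  end
end

# Invented sample line: 12 placeholder columns plus an attribute column.
sample = (Array.new(12, 'x') + ['Symbol=GENE_A; Distance_to_TSS=1135;']).join("\t")

attrs = parse_attributes(sample.chomp.split("\t")[12])
puts attrs['Distance_to_TSS'].to_i
```

Array indexes start at 0 in Ruby, so the 13th column is element 12 of the split line.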
Task: Write a Ruby program that:
Reads each line of this file and separates the probes into two files:
If the probe is within 250 bp of the nearby TSS, the program writes it to a file named close.probes.txt
Otherwise, if the probe is more than 250 bp from the TSS, the program writes it to a file named far.probes.txt
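One possible shape for such a program is sketched below. It rests on several assumptions that are not stated in the exercise: a probe at exactly 250 bp counts as "within", the distance is compared by absolute value in case the file uses signed distances, and lines without a Distance_to_TSS attribute (such as headers) are skipped. The method names are hypothetical.

```ruby
THRESHOLD_BP = 250

# Return the absolute Distance_to_TSS from an LFF line, or nil if the
# line has no 13th column or no Distance_to_TSS attribute.
def distance_to_tss(line)
  attributes = line.chomp.split("\t")[12]
  return nil if attributes.nil?

  match = attributes.match(/\bDistance_to_TSS=(-?\d+)/)
  match && match[1].to_i.abs
end

# Copy each probe line to the "close" or "far" file based on its distance.
def split_probes(input_path, close_path = 'close.probes.txt', far_path = 'far.probes.txt')
  File.open(close_path, 'w') do |close_out|
    File.open(far_path, 'w') do |far_out|
      File.foreach(input_path) do |line|
        distance = distance_to_tss(line)
        next if distance.nil? # assumption: skip headers and malformed lines

        target = distance <= THRESHOLD_BP ? close_out : far_out
        target.puts(line.chomp)
      end
    end
  end
end

# Run only when invoked as a script with the probe definition file as argument.
split_probes(ARGV[0]) if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Reading with File.foreach processes one line at a time, so the whole file never has to fit in memory.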
Exercise 2 - Computing Average Methylation Per Gene
Referring to the probe definition file for Exercise 1, notice that the attribute-value pairs also contain a Symbol attribute whose value is a gene name.
Task: Write a Ruby program that computes the average score per gene using both the probe data file and the probe definition file.
Remember that some (many) genes have multiple probes and thus will have multiple values in the probe data file.
For each unique gene, output a simple file with two tab-delimited columns:
The gene (Symbol) name
The average of the scores for the probes associated with that gene.
Obviously, each gene (Symbol) will appear only once in your output file.
Hint: Hashes are very useful for this kind of thing. If you determine how many Hashes you need to track the necessary information, this becomes an easy task.
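To illustrate the hint, here is a rough sketch that uses three Hashes: one mapping probe name to gene, and two accumulating the score sum and probe count per gene. The layout of the probe data file is not described above, so the sketch assumes it is tab-delimited with the probe name in the first column and a numeric score in the second. It also assumes the probe name is the second column of the LFF file. All method names are hypothetical.

```ruby
# Build a Hash of probe name => gene Symbol from the probe definition file.
# Assumption: the probe name is the 2nd LFF column.
def load_probe_genes(definition_path)
  probe_to_gene = {}
  File.foreach(definition_path) do |line|
    fields = line.chomp.split("\t")
    next if fields.length < 13

    symbol = fields[12][/\bSymbol=([^;]+)/, 1]
    probe_to_gene[fields[1]] = symbol.strip if symbol
  end
  probe_to_gene
end

# Return a Hash of gene => average score.
# Assumption: data file columns are probe name, then score.
def average_scores(data_path, probe_to_gene)
  sums = Hash.new(0.0)
  counts = Hash.new(0)
  File.foreach(data_path) do |line|
    probe, score = line.chomp.split("\t")
    gene = probe_to_gene[probe]
    next if gene.nil? || score.nil?

    value = Float(score, exception: false) # nil for non-numeric text
    next if value.nil?

    sums[gene] += value
    counts[gene] += 1
  end
  sums.to_h { |gene, total| [gene, total / counts[gene]] }
end

# Write one "gene<TAB>average" line per gene, sorted by gene name.
def write_averages(averages, output_path)
  File.open(output_path, 'w') do |out|
    averages.sort.each { |gene, average| out.puts("#{gene}\t#{average}") }
  end
end
```

Because the averages are keyed by gene, each Symbol appears exactly once in the output. Probes with no Symbol or no numeric score are silently skipped here; a stricter program might report them instead.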