Genboree Discovery System - Project: CAD Course 2011/7

Ruby Exercise - Lecture 1

Write the 2 Ruby programs described below.
Hand in only the Ruby code, not its output or results (email to: andrewj@bcm.edu).
I will run your program on the data file.
Comment your code in your own words using # comments.
Due Date: Feb 2, 2011 (but I don't recommend waiting; there will be exercises for the other labs in this course as well)

Exercise 1 - Read/Write a File, subsetting an Illumina probe definition file

Download and uncompress this methylation probe definition file.
The file is an LFF table whose columns are tab-delimited. The 13th column contains attribute-value pairs for custom fields; the attribute-value pairs are semicolon (;) delimited. [ File format description ]
Each line describes the genomic location of and information about a methylation probe from a common Illumina array.
One of the pieces of information is how far the probed location is from the Transcription Start Site (TSS) of the nearby gene. This is provided by the Distance_to_TSS attribute-value pair within the 13th column (e.g. Distance_to_TSS=1135;).
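
As a rough illustration of the column layout described above, the sketch below splits one line on tabs and turns the 13th column into a Hash. The method name parse_attributes and the example line are made up for illustration; real lines in the file will have real values in the first 12 columns.

```ruby
# Hypothetical helper: turn the 13th (attribute) column of one LFF line
# into a Hash of attribute name => value (both kept as Strings).
def parse_attributes(line)
  columns = line.chomp.split("\t")
  attr_field = columns[12] || ""        # 13th column is index 12 (zero-based)
  attrs = {}
  attr_field.split(";").each do |pair|
    name, value = pair.strip.split("=", 2)
    next if name.nil? || name.empty? || value.nil?   # skip empty or malformed pairs
    attrs[name] = value.strip
  end
  attrs
end

# Made-up example line (not taken from the real file):
# 12 placeholder columns followed by an attribute column.
example = (["x"] * 12 + ["Symbol=GENE1; Distance_to_TSS=1135;"]).join("\t")
attrs = parse_attributes(example)
puts attrs["Distance_to_TSS"].to_i
```

Values come back as Strings, so convert with to_i before comparing them to a number.
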
Task: Write a Ruby program that:
- Reads each line of this file and separates the probes into two files:
  1. If the probe is within 250 bp of the nearby TSS, it is output by the program to a file named close.probes.txt
  2. Otherwise, if the probe is more than 250 bp from the TSS, it is output by the program to a file named far.probes.txt
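
One possible shape for the read-and-split loop is sketched below. It is not the only way to do it. The method name split_probes, its parameters, the choice to skip lines that lack a Distance_to_TSS value, and the use of .abs on the distance are all assumptions made for this sketch, not requirements of the exercise.

```ruby
# Sketch only: method name, parameters, and edge-case handling are assumptions.
def split_probes(input_path, close_path, far_path, threshold = 250)
  File.open(close_path, "w") do |close_file|
    File.open(far_path, "w") do |far_file|
      File.foreach(input_path) do |line|
        # Pull the number out of e.g. "Distance_to_TSS=1135;" in the 13th column.
        attr_field = line.chomp.split("\t")[12].to_s
        match = attr_field.match(/Distance_to_TSS=(-?\d+)/)
        next if match.nil?             # assumption: skip lines with no usable distance
        distance = match[1].to_i.abs   # assumption: a negative value is treated as a plain distance
        # "Within 250 bp" is read here as distance <= 250.
        target = distance <= threshold ? close_file : far_file
        target.print(line)             # line still carries its own newline
      end
    end
  end
end

# Example call (the input file name is a placeholder):
#   split_probes("probes.lff", "close.probes.txt", "far.probes.txt")
```

The block form of File.open closes both output files automatically, even if an error occurs partway through.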

Exercise 2 - Computing Average Methylation Per Gene

Referring to the probe definition file for Exercise 1, notice that the attribute-value pairs also contain a Symbol attribute whose value is a gene name.
Some genes have multiple probes.
Say we have obtained some probe values from a run of the array. Here is a data file with a score value for each probe.
Task: Write a Ruby program that:
- Computes the average score per gene using both the probe data file and the probe definition file.
- Remember that some (many) genes have multiple probes and thus will have multiple values in the probe data file.
- For each unique gene, output a simple file with two tab-delimited columns:
  1. The gene (Symbol) name
  2. The average of the scores for the probes associated with that gene.
- Obviously, each gene (Symbol) will appear only once in your output file.
Hint: Hashes are very useful for this kind of thing. If you determine how many Hashes you need to track the necessary information, this becomes an easy task.
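
To make the hint concrete, here is a minimal sketch of one way to arrange the Hashes, using small in-memory data instead of the real files. The probe IDs, gene names, scores, and variable names are all made up, and the layout of the probe data file (a probe ID paired with a score) is an assumption. In a real solution the first Hash would be filled from the probe definition file and the scores read from the data file.

```ruby
# Sketch only: all names and values below are invented for illustration.

# Hash 1: which gene (Symbol) each probe belongs to.
gene_for_probe = { "probe1" => "GENE_A", "probe2" => "GENE_A", "probe3" => "GENE_B" }

# Made-up probe scores standing in for the probe data file.
scores = [["probe1", 0.25], ["probe2", 0.75], ["probe3", 0.5]]

# Hash 2 and Hash 3: running sum and probe count per gene.
# Hash.new(0) makes a missing key start at 0 instead of nil.
sum_for_gene = Hash.new(0.0)
count_for_gene = Hash.new(0)

scores.each do |probe_id, score|
  gene = gene_for_probe[probe_id]
  next if gene.nil?                 # probe not present in the definition data
  sum_for_gene[gene] += score
  count_for_gene[gene] += 1
end

# One tab-delimited line per gene: Symbol, then average score.
averages = {}
sum_for_gene.each do |gene, sum|
  averages[gene] = sum / count_for_gene[gene]
  puts "#{gene}\t#{averages[gene]}"
end
```

Because each gene is a Hash key, it can appear only once in the output regardless of how many probes it has.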