Ruby Exercise - Lecture 1
  • Write the 2 Ruby programs described below.
  • Hand in only the Ruby code, not the program output (email to: andrewj@bcm.edu).
  • I will run your program on the data file.
  • Comment your code in your own words using # comments.
  • Due Date: Feb 2, 2011 (but I don't recommend waiting; there will be exercises for other labs in this course alone)

 
Exercise 1 - Read/Write a File, subsetting an Illumina probe definition file
  • Download and uncompress this methylation probe definition file.
  • The file is an LFF table whose columns are tab-delimited. The 13th column contains attribute-value pairs for custom fields; the attribute-value pairs are semicolon (;) delimited. [ File format description ]
  • Each line describes the genomic location of and information about a methylation probe from a common Illumina array.
  • One of the pieces of information is how far the probed location is from the Transcription Start Site (TSS) of the nearby gene. This is provided by the Distance_to_TSS attribute-value pair within the 13th column (e.g. Distance_to_TSS=1135;).
  • Task: Write a Ruby program that:
    • Reads each line of this file and separates the probes into two files:
      1. If the probe is within 250 bp of the nearby TSS, the program writes it to a file named close.probes.txt
      2. Otherwise (the probe is more than 250 bp from the TSS), the program writes it to a file named far.probes.txt
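
As a starting point, here is a minimal sketch of pulling Distance_to_TSS out of the 13th column and deciding which file a line belongs to. The helper name parse_attributes and the sample line are made up for illustration, and the sketch assumes the attribute value is a non-negative whole number; check the real file before relying on that. Reading the input and writing the two output files is left to you.

```ruby
# Hypothetical helper (not part of any library): turns a string such as
# "Symbol=GENE1; Distance_to_TSS=1135;" into a Hash of attribute => value.
def parse_attributes(field)
  attrs = {}
  field.split(';').each do |pair|
    name, value = pair.split('=', 2)
    # Skip empty or malformed pieces (e.g. whitespace after the last ';')
    next if name.nil? || value.nil?
    attrs[name.strip] = value.strip
  end
  attrs
end

# Made-up line with 13 tab-delimited columns; only the 13th matters here.
line = (['x'] * 12 + ['Symbol=GENE1; Distance_to_TSS=1135;']).join("\t")

columns  = line.chomp.split("\t")
attrs    = parse_attributes(columns[12])   # 13th column has index 12
distance = attrs['Distance_to_TSS'].to_i   # assumes a non-negative integer

# "Within 250 bp" is read here as 250 or less.
bucket = distance <= 250 ? 'close' : 'far'
puts "#{distance}\t#{bucket}"
```

In your own program the line would come from the file (for example via File.foreach), and bucket would select which of the two open output files receives the line.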

 
Exercise 2 - Computing Average Methylation Per Gene
  • Referring to the probe definition file for Exercise 1, notice that the attribute-value pairs also contain a Symbol attribute whose value is a gene name.
  • Some genes have multiple probes.
  • Say we have obtained some probe values from a run of the array. Here is a data file with a score value for each probe.
  • Task: Write a Ruby program that:
    • Computes the average score per gene using both the probe data file and the probe definition file.
    • Remember that some (many) genes have multiple probes and thus will have multiple values in the probe data file.
    • For each unique gene, output a simple file with two tab-delimited columns:
      1. The gene (Symbol) name
      2. The average of the scores for the probes associated with that gene.
    • Obviously, each gene (Symbol) will appear only once in your output file.
  • Hint: Hashes are very useful for this kind of thing. If you determine how many Hashes you need to track the necessary information, this becomes an easy task.
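
To illustrate the hint, here is a small sketch of one way to accumulate per-gene averages with Hashes. It is only one possible arrangement, not the required one. The probe names, gene names, and scores are invented, and the data is held in memory because the layout of the probe data file is not described here; in your program these values would come from the two files.

```ruby
# Hypothetical probe => gene lookup, as it might be built from the
# Symbol attribute in the probe definition file.
probe_to_gene = {
  'probe1' => 'GENE_A',
  'probe2' => 'GENE_A',
  'probe3' => 'GENE_B'
}

# Hypothetical [probe, score] pairs, as they might be read from the data file.
probe_scores = [['probe1', 0.25], ['probe2', 0.75], ['probe3', 1.0]]

score_sums   = Hash.new(0.0)  # gene => running total of scores
probe_counts = Hash.new(0)    # gene => number of scores seen

probe_scores.each do |probe, score|
  gene = probe_to_gene[probe]
  next if gene.nil?           # probe has no known gene; skip it
  score_sums[gene]   += score
  probe_counts[gene] += 1
end

# One average per gene, so each gene appears exactly once.
averages = {}
score_sums.each do |gene, sum|
  averages[gene] = sum / probe_counts[gene]
end

# Two tab-delimited columns: gene name, average score.
averages.each { |gene, avg| puts "#{gene}\t#{avg}" }
```

Hash.new(0.0) and Hash.new(0) give missing keys a default value, which avoids checking whether a gene has been seen before adding to its total.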