Reproducible Research

  • Update: Tue Oct 15 11:03:17 CEST 2019

Exercise: Reformat a simple table. Download the table file from the course website, or fetch it with the terminal program curl.

# Open your terminal.
# Make sure you are in the right directory
pwd # Where am I?
# Change directory if needed
cd my/working/directory/
# Now download the table file; the -O option saves it under its remote name
curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/Table.txt
# Have a look at the file
cat Table.txt

First, use your favourite method (e.g. Excel) to reformat the table from the version you have into the format described below. In detail, you have to change the column order, exclude some columns (e.g. Size), zero-pad the sample ID (SID) of the first 9 samples (e.g. T1 > T01), convert the group names to capital letters, and rename the project (e.g. P777 > P779).

Write a short protocol describing what you did. Exchange your protocol with your neighbour and try to convert the table again following her/his description.

Was it easy to describe your steps, and how difficult was it to follow the instructions given to you?

There are many correct solutions to the reformatting problem, but not all of them are simple and easy to reproduce. Below is a simple terminal-based approach using the scripting language awk. Make sure the Table.txt file is in the current working directory.

awk '{
  if(NR==1) print $1,$4,$2,$5;                           # header line: keep columns 1, 4, 2, 5 in the new order
  else if(length($1)>2) print $1,$4,toupper($2),"p779";  # id longer than 2 characters: capitalise group, set project
  else print "T0"substr($1,2,3),$4,toupper($2),"p779"    # short id (sample < 10): also zero-pad the id
}' Table.txt > Table_new.txt
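To see what each branch of the awk program does before running it on the real file, the same program can be tried on a tiny made-up table. This is only a sketch: the file name sample.txt, the column names (SID, Group, Size, Count, Project) and the values are assumptions for illustration and may not match the real Table.txt.

```shell
# Hypothetical three-line table; the real Table.txt may have other columns.
printf '%s\n' \
  'SID Group Size Count Project' \
  'T1 a 10 5 P777' \
  'T12 b 20 7 P777' > sample.txt

# Same awk program as above, applied to the sample file.
awk '{
  if(NR==1) print $1,$4,$2,$5;
  else if(length($1)>2) print $1,$4,toupper($2),"p779";
  else print "T0"substr($1,2,3),$4,toupper($2),"p779"
}' sample.txt > sample_new.txt

cat sample_new.txt
# SID Count Group Project
# T01 5 A p779
# T12 7 B p779
```

The third column is dropped, the short sample id T1 becomes T01, and the group letters are capitalised, which makes it easy to compare the result with a protocol written by hand.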

A bash script is also available; you can download it from the server using curl.

curl --verbose -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/script/reformat_table.sh
# Make the script executable
chmod a+x reformat_table.sh
# Run the script; make sure the Table.txt file is in the same directory
./reformat_table.sh Table.txt
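The content of reformat_table.sh is not shown here. As a rough idea of how a one-liner can be turned into a reusable, documented script, the awk program from above could be wrapped in a shell function like the one below. The function name reformat_table, the optional second argument and the default output name are assumptions made for this sketch, not the course script.

```shell
# Hypothetical sketch; this is NOT the content of the course's reformat_table.sh.
# Usage: reformat_table INPUT [OUTPUT]
reformat_table() {
  local infile="${1:-}"
  # Default output name: Table.txt -> Table_new.txt
  local outfile="${2:-${infile%.txt}_new.txt}"

  # Stop early with a clear message if the input is missing or unreadable.
  if [ ! -r "$infile" ]; then
    echo "Error: cannot read input file '$infile'" >&2
    return 1
  fi

  # Same awk program as in the text above.
  awk '{
    if(NR==1) print $1,$4,$2,$5;
    else if(length($1)>2) print $1,$4,toupper($2),"p779";
    else print "T0"substr($1,2,3),$4,toupper($2),"p779"
  }' "$infile" > "$outfile"

  echo "Wrote $outfile"
}
```

After loading the function, reformat_table Table.txt would write Table_new.txt next to the input file. Checking the input first means a typo in the file name produces an error message instead of an empty output file.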

Provide Code for Simplicity

Reproducibility is important and sometimes easier to achieve than you might think. Describing all processing steps in great detail is often time-consuming. It might be easier to provide a well-documented script instead.

Exercise #2: Download the graph.zip file using curl from the server https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/ and unzip it (e.g. unzip graph.zip).

curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/graph.zip
unzip graph.zip
rm -fr __MACOSX/ # remove unwanted folder
cd graph

You should find 5 files:

  • data1.txt (data text file)
  • data2.txt (data text file)
  • figure1.xlsx (Excel file)
  • figure1.xls (Excel 97 format without figure)
  • figure1.R (R script)
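A quick way to confirm that the archive unpacked as expected is to check for each file by name. The helper below is a minimal sketch: the function name check_files is made up for this example, and the file list is copied from the list above (assuming the Excel 97 file is named figure1.xls).

```shell
# Hypothetical helper: report which expected files exist in a directory.
# Usage: check_files [DIRECTORY]   (defaults to the current directory)
check_files() {
  local dir="${1:-.}"
  local missing=0
  for f in data1.txt data2.txt figure1.xlsx figure1.xls figure1.R; do
    if [ -e "$dir/$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
      missing=1
    fi
  done
  # Return status 0 only if every file was found.
  return "$missing"
}
```

Run check_files from inside the graph directory, or check_files graph from its parent directory.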

Open one of the spreadsheet files (xls or xlsx) with Excel and have a look at the content. The data in columns A and B correspond to the data in data1.txt. A t-test was used to compare the two datasets, and the p-value is shown above the boxplot. Create a new boxplot and perform a t-test on data2, either using the existing Excel file as a template or creating a new one.

Now do the same using the R script as a template. You can either use the data2.txt file you just downloaded or download it again by un-commenting line 5 in the R script. At the end of the script there is an R function (compare2.boxplot) that automates the process even further. Load the function and execute it:

compare2.boxplot(d1$GroupA, d1$GroupB, "Group A", "Group B")

compare2.boxplot(d2$GroupA, d2$GroupB, "Group A", "Group B")