Reproducible Research


The idea behind reproducible research is that a data analysis is transparent enough for other researchers to verify the findings and build upon them. The need for reproducibility is increasing as data analyses become more complex. The focus is on the content of a data analysis, rather than on a (superficial) description alone.

We start with two simple problems. Although these might not be the most practical examples of reproducible research, they will show you some interesting aspects:

  • It is often a challenge to describe even a simple procedure in words so that others can understand it. A lot of effort can be wasted.
  • Click-manipulations are convenient but error-prone and tricky to reproduce. Once the manipulations are done, it is difficult to trace possible errors back and to troubleshoot.
  • A simple description together with a script is easy to understand and reproducible.
  • It might take time to find a reproducible solution for a problem, but once you have established one you can re-use it to solve a later problem faster and with more certainty. So reproducible research also helps you.
  • Sharing solutions can be fun and educational.

Exercise #1 - Reformat

Reformat a simple table. Download the table file either from the course website here or with the terminal program curl.

# Open your terminal.
# Make sure you are in the right directory
pwd # Where am I?
# Change directory if needed
cd my/working/directory/
# Now download the table file - note use option -O (use remote name)
curl -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/Table.txt
# Have a look at the file
cat Table.txt

First, use your favourite method (e.g. Excel) to reformat the table from the version you have into the format below. In detail, you have to change the column order, exclude some columns (e.g. Size), change the sample ID (SID) of the first 9 samples (e.g. T1 > T01), change the groups to capital letters, and rename the project (e.g. P777 > P779).

Write a short protocol to describe what you did. Exchange your protocol with your neighbour and try to convert the table again using their description.

Was it easy for you to describe your steps, and was it difficult to follow the instructions given to you?

There are many correct solutions to the reformatting problem, but not all of them are simple and easy to reproduce. Below is a simple terminal-based idea using the scripting language awk. Make sure the Table.txt file is in the working (current) directory.

awk '{
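  if(NR==1) print $1,$4,$2,$5;                           # header line: reorder and drop columns
  else if(length($1)>2) print $1,$4,toupper($2),"p779";  # long sample id (e.g. T10): keep id, capitalise group, set project
  else print "T0"substr($1,2,3),$4,toupper($2),"p779"    # short sample id (e.g. T1): pad to T01, capitalise group, set project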
}' Table.txt > Table_new.txt
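
To see what each branch of the command does, it can help to run it on a tiny made-up table first. The sketch below assumes a whitespace-separated layout with the columns SID, Group, Size, Weight and Project. These column names, values and file names are invented for illustration and may differ from the real Table.txt.

```shell
# Hypothetical input: column names and values are made up for illustration.
cat > demo_table.txt <<'EOF'
SID Group Size Weight Project
T1 a 10 1.5 p777
T2 b 12 2.5 p777
T10 a 11 3.5 p777
EOF

# Same logic as the command above, applied to the demo file.
awk '{
  if (NR == 1) print $1, $4, $2, $5                               # header: reorder, drop Size
  else if (length($1) > 2) print $1, $4, toupper($2), "p779"      # T10 stays T10
  else print "T0" substr($1, 2, 3), $4, toupper($2), "p779"       # T1 becomes T01
}' demo_table.txt > demo_table_new.txt

# Show the reformatted table.
cat demo_table_new.txt
```

Comparing demo_table.txt with demo_table_new.txt line by line shows which rule was applied to which row.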

There is also a bash script, which can be found here or downloaded from the server using curl.

curl --verbose -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/script/reformat_table.sh
# Make the script executable
chmod a+x reformat_table.sh
# Run the script; make sure the Table.txt file is in the same directory
./reformat_table.sh Table.txt

Provide Code for Simplicity

Reproducibility is important and sometimes easier to achieve than you might think. Describing all processing steps in great detail is often time-consuming. It might be easier to provide a well-documented script instead.

Keep The Original Safe

Keep the original file safe and work with a copy of the file you want to manipulate. Wrong manipulations cannot be undone, and the original data might be lost.
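
One possible way to follow this advice in the terminal is sketched below. The file names demo_original.txt and demo_work.txt are invented stand-ins for your own data file, and removing write permission is just one optional safeguard, not a requirement of the exercise.

```shell
# Start clean so the sketch can be re-run (file names are hypothetical).
rm -f demo_original.txt demo_work.txt

# Made-up stand-in for a data file such as Table.txt.
printf 'SID Group\nT1 a\nT2 b\n' > demo_original.txt

# Work on a copy and leave the original untouched.
cp demo_original.txt demo_work.txt

# Optional safeguard: remove write permission from the original.
chmod a-w demo_original.txt

# Manipulate only the copy, here by appending one line.
echo 'T3 c' >> demo_work.txt

# cmp -s is silent and returns non-zero when the two files differ.
cmp -s demo_original.txt demo_work.txt || echo "copy was modified, original is unchanged"
```

If something goes wrong with the copy, you can simply delete it and copy the original again.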

Exercise #2 - Recycling

Download the graph.zip file from the server https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/ using curl and unzip it (e.g. unzip graph.zip).

curl -O https://gdc-docs.ethz.ch/UniBS/HS2020/BioInf/data/graph.zip
unzip graph.zip
rm -fr __MACOSX/ # remove unwanted folder
cd graph

You should find 5 files:

  • data1.txt (data text file)
  • data2.txt (data text file)
  • figure1.xlsx (Excel file)
  • figure1.xls (Excel 97 format without figure)
  • figure1.R (R script)
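
A quick way to confirm that nothing is missing is a small loop over the expected file names. The sketch below is only an illustration: it builds a stand-in folder called demo_graph (an invented name) with some empty files so that it runs anywhere, and it assumes the Excel 97 file is named figure1.xls. In the exercise you would point the loop at the unzipped graph folder instead.

```shell
# Stand-in directory with invented, empty files so the sketch runs anywhere.
mkdir -p demo_graph
touch demo_graph/data1.txt demo_graph/data2.txt demo_graph/figure1.R

# Check each expected file and count the ones that are missing.
missing=0
for f in data1.txt data2.txt figure1.xlsx figure1.xls figure1.R; do
  if [ -e "demo_graph/$f" ]; then
    echo "found:   $f"
  else
    echo "missing: $f"
    missing=$((missing + 1))
  fi
done
echo "$missing file(s) missing"
```

In the stand-in folder the two Excel files are reported as missing because they were deliberately not created.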

Open one of the spreadsheet files (xls or xlsx) with Excel and have a look at the content. The data in columns A and B correspond to the data in data1.txt. A t-test was used to compare the two datasets, and the p-value is shown above the boxplot. Create a new boxplot and perform a t-test on data2, either using the existing Excel file as a template or creating a new one.

Now do the same using the R script as a template. You can either use the data2.txt file you just downloaded or download it again by un-commenting line 5 in the R script. At the end of the script there is an R function (compare2.boxplot) to automate the process even further. Load the function and execute it:

compare2.boxplot(d1$GroupA, d1$GroupB, "Group A", "Group B")

compare2.boxplot(d2$GroupA, d2$GroupB, "Group A", "Group B")