Reproducible Research
- Update: Tue Oct 15 11:03:17 CEST 2019
Exercise: Reformat a simple table. Download the table file either from the course website or with the terminal program curl.
    # Open your terminal.
    # Make sure you are in the right directory
    pwd # Where am I?
    # Change directory if needed
    cd my/working/directory/
    # Now download the table file - note use option -O (use remote name)
    curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/Table.txt
    # Have a look at the file
    cat Table.txt
First, use your favourite method (e.g. Excel) to reformat the table from the version you have into the format below. In detail, you have to change the column order, exclude some columns (e.g. Size), change the sample ID (SID) of the first 9 samples (e.g. T1 > T01), convert the group names to capital letters, and rename the project (e.g. P777 > P779).
Write a short protocol to describe what you did. Exchange your protocol with your neighbour and try to convert the table again using their description.
Was it easy for you to describe your steps, and was it difficult to follow the instructions given to you?
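One way to check whether the second conversion reproduced the first is to compare the two result files directly. The sketch below assumes both tables were saved as plain text; the function name compare_tables and the file names in the usage line are hypothetical.

```shell
# Report whether two text files are byte-for-byte identical.
# Returns 0 if they match, 1 otherwise (and prints the differing lines).
compare_tables() {
  if cmp -s "$1" "$2"; then
    echo "identical"
    return 0
  fi
  echo "different"
  diff "$1" "$2"
  return 1
}

# Hypothetical file names: your own result and the one rebuilt from the protocol.
# compare_tables Table_mine.txt Table_neighbour.txt
```

Invisible differences such as Windows line endings or trailing spaces also count as differences here, which can happen when one of the tables was exported from a spreadsheet program.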
There are many correct solutions to the reformatting problem, but not all of them are simple and easy to reproduce. Below is a simple terminal-based approach using the scripting language awk. Make sure the Table.txt file is in the working (current) directory.
    awk '{
      if (NR == 1)
        print $1, $4, $2, $5;                                 # header line: reorder and drop columns only
      else if (length($1) > 2)
        print $1, $4, toupper($2), "p779";                    # change project number
      else
        print "T0" substr($1, 2, 3), $4, toupper($2), "p779"  # rename sample < 10
    }' Table.txt > Table_new.txt
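To see what each branch of the awk command does, it can help to run it on a tiny stand-in table first. The sketch below is only an illustration: the five-column layout (SID, Group, Size, Value, Project), the values, and the file names Table_demo.txt and Table_demo_new.txt are assumptions, and the real Table.txt may look different.

```shell
# Use a scratch directory so no course files are overwritten.
demo_dir=$(mktemp -d)

# Hypothetical stand-in for Table.txt (column names and values are made up).
cat > "$demo_dir/Table_demo.txt" <<'EOF'
SID Group Size Value Project
T1 a 10 0.5 p777
T9 b 12 0.7 p777
T10 a 11 0.9 p777
EOF

# Same logic as the awk command above, applied to the stand-in table.
awk '{
  if (NR == 1) print $1, $4, $2, $5
  else if (length($1) > 2) print $1, $4, toupper($2), "p779"
  else print "T0" substr($1, 2, 3), $4, toupper($2), "p779"
}' "$demo_dir/Table_demo.txt" > "$demo_dir/Table_demo_new.txt"

cat "$demo_dir/Table_demo_new.txt"
# Prints:
# SID Value Group Project
# T01 0.5 A p779
# T09 0.7 B p779
# T10 0.9 A p779
```

The header keeps its original spelling because only the first branch handles it, while IDs with a single digit are padded by the last branch.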
There is also a bash script for this task. It can be found on the course website or downloaded from the server using curl.
    curl --verbose -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/script/reformat_table.sh
    # Make the script executable
    chmod a+x reformat_table.sh
    # Run the script, but make sure the Table.txt file is in the same directory
    ./reformat_table.sh Table.txt
Provide Code for Simplicity
Reproducibility is important and sometimes easier to achieve than you might think. Describing all processing steps in great detail is often time-consuming. It might be easier to provide a well-documented script instead.
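As a rough illustration of what a well-documented script could look like, the sketch below wraps the awk command from the first exercise in a shell function with a header comment and an input check. It is not the course's reformat_table.sh; the function name reformat_table_sketch and its two arguments are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch of a documented reformatting script (hypothetical, not the course script).
#
# Purpose : reorder columns, pad sample IDs below 10 (T1 > T01),
#           capitalise the group and set the project to p779.
# Input   : whitespace-separated table with a header line, laid out as
#           the awk command in the first exercise expects.
# Usage   : reformat_table_sketch Table.txt Table_new.txt

reformat_table_sketch() {
  local input=$1
  local output=$2

  # Stop early with a clear message instead of writing an empty output file.
  if [ ! -r "$input" ]; then
    echo "Error: cannot read input file: $input" >&2
    return 1
  fi

  awk '{
    if (NR == 1) print $1, $4, $2, $5
    else if (length($1) > 2) print $1, $4, toupper($2), "p779"
    else print "T0" substr($1, 2, 3), $4, toupper($2), "p779"
  }' "$input" > "$output"
}
```

The header comment takes over the role of the written protocol: it states what goes in, what comes out, and how to call the code, and anyone can rerun it and get the same result.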
Exercise #2: Download the graph.zip file using curl from the server https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/ and unzip it (e.g. unzip graph.zip).
    curl -O https://gdc-docs.ethz.ch/UniBS/HS2019/BioInf/data/graph.zip
    unzip graph.zip
    rm -fr __MACOSX/ # remove unwanted folder
    cd graph
You should find 5 files:
- data1.txt (data text file)
- data2.txt (data text file)
- figure1.xlsx (Excel file)
- figure1.xls (Excel 97 format without figure)
- figure1.R (R script)
Open one of the spreadsheet files (xls or xlsx) with Excel and have a look at the content. The data in columns A and B correspond to the data in data1.txt. A t-test was used to compare the two datasets, and the p-value is shown above the boxplot. Create a new boxplot and perform a t-test on data2, using the existing Excel file as a template or creating a new one.
Now do the same, but use the R script as a template. You can either use the data2.txt file you just downloaded or download it again by uncommenting line 5 in the R script. At the end of the script there is an R function (compare2.boxplot) to automate the process even further. Load the function and execute it:
compare2.boxplot(d1$GroupA, d1$GroupB, "Group A", "Group B")
compare2.boxplot(d2$GroupA, d2$GroupB, "Group A", "Group B")