Biocomputing

Get going with statistics in functional genomics

  • Date: Monday, 06.03.19
  • Version: 190219

The lecture notes can be opened here. You might use the right mouse button and choose the "Download linked files" option if you prefer to safe a copy on your computer.

1. The GDC

The Genetic Diversity Centre (GDC) is a well established knowledge and technology platform of the D-USYS Department. The team GDC provides scientific and technical support for research related to genetic and genomic diversity in a wide range of organisms and for a wide range of researcher. Visit our webpage to learn more about our services.

2. Biocomputing

Try to answer the following questions.

Q1: What do you need to know about a new application before you are using it? Q2: Where can you get help? Q3: Is it better to write your own script or use somebody elses?

3. Reproducible Research

Exercise: Reformat a simple table. Download the table file either from the course website here or use the terminal program curl.

# First make sure you are in the right directory
pwd # Where am I?
# Change directory if needed
cd my/working/directory/
# Now download the table file
curl https://gdc-docs.ethz.ch/UniZH/psc_graduate_course_FS_2019/data/Table.txt
# Have a look at the file
cat Table.txt

Assignment #1: Use your favourite method (e.g. Excel) to reformat the table from the version you have into the format below. In detail, you have to change the column order, exclude some columns (e.g. Size), change sample id (SID) for the first 9 samples (e.g. T1 > T01), change groups into capital letters, and rename project (e.g. P777 > P779). > Write a short protocol to describe what you did.

SID CC Group Project
T01 1 A p779
T02 1 A p779
T03 3 A p779
T04 5 A p779
T05 5 A p779
T06 1 B p779
T07 1 B p779
T08 1 B p779
T09 1 B p779
T10 3 B p779
T11 1 C p779
T12 2 C p779
T13 2 C p779
T14 4 C p779
T15 5 C p779

There are many correct solution to the reformatting task but not all might be simple and easy to reproduce. Below a simple terminal based idea using the scripting language awk. Make sure the Table.txt File is in the working (current) directory.

awk '{
  if(NR==1) print $1,$4,$2,$5;                           # exclude header line
  else if(length($1)>2) print $1,$4,toupper($2),"p779";  # change project number
  else print "T0"substr($1,2,3),$4,toupper($2),"p779"    # rename sample < 10
}' Table.txt > Table_new.txt

There is also a bash script and it can be found here or downloaded using curl.

curl --verbose https://gdc-docs.ethz.ch/UniZH/psc_graduate_course_FS_2019/data/reformat_table.sh
# Make the script is executable
chmod a+x reformat_table.sh
# run the script but make sure the Table.txt file is in the same directory
./reformat_table.sh Table.txt

Assignment #2: Download the graph.zip file using curl and unzip (e.g. unzip graph.zip) it. You should find 5 files:

  • data1.txt (data text file)
  • data2.txt (data text file)
  • figure1.xlsx (Excel file)
  • figure1.lsx (Excel 97 format without figure)
  • figure1.R (R script)

Open the Excel spreadsheet file and have a look at the content. The data in column A and B correspond to the data in data1.txt. A T-Test was used to compare the two datasets and the P-value is shown above the boxplot. Create a new boxplot and perform a T-Test on data2 using the existing excel file as a template or create a new one.

Now do the same but using the R script as a template. You can either use the data2.txt file you just downloaded or you download it again un-commenting line 5 in the R script. At the end of the script there is also an R function (compare2.boxplot) to automise the process even further. Load the function and execute it:

compare2.boxplot(d1$GroupA, d1$GroupB, "Group A", "Group B")

compare2.boxplot(d2$GroupA, d2$GroupB, "Group A", "Group B")

4. R Code with Style

4.1 Style Guides

4.2 R Code Folding

RStudio supports both automatic and user-defined folding for regions of code. Code folding allows you to easily show and hide blocks of code to make it easier to navigate your source file and focus on the coding task at hand.

To insert a new code section you can use the Code > Insert Section command. Alternatively, any comment line which includes at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section.

### -------------------------------------
### Setup
### -------------------------------------

## clean/reset environment
rm(list=ls())

## R and Bioconductor libraries
require("ggplot2")

### -------------------------------------
### Data Import in R
### -------------------------------------

## ZOTU_c99 data set

otufile.ZOTU_c99      <- "p451_run190204_16S_ZOTU_c99_Count_Sintax.txt"
mapfile.ZOTU_c99      <- "p451_run190204_16S_MapFile.txt"

## Import data into Phyloseq
d.ZOTU_c99 <- import_qiime(otufilename = otufile.ZOTU_c99, mapfilename = mapfile.ZOTU_c99)
d.ZOTU_c99

4.3 R Packages

5. Markdown