Reproducible Research


Lecture notes

Rep Research (pdf)

Demo Code (html)

Additional information

The basic philosophy of Tidyverse (Recommended by Séb)

Tidyverse book

Exercise

Please work in groups. Use the R markdown editor and reproducible R code. During the exercise fill your suggestions directly in the GoogleDoc file using the link here.

R Code with Style

Cheat-Sheets PDFs

1. Markdown

(1) Most of the primary data analysis is done in the linux environment and you will mostly likely use bash coding. To incrase reproducibility of your logs or scripts, we going to use a markdown editor. Download and install Haroopad or MacDown. Explore the markdown editor yourself. Try to generate titles of different sizes, add plain text and code and save the markdown files as pdf. Use either the features in the insert menu or the cheat sheet.

(2) Set up a new R project in Rstudio including a markdown document. Explore the markdown editor yourself using the example. We will need it later.

2. Data manipulation

Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.

The goal is to reformat the table from the version you have into the format below.

Genus Species Sexual_system
Silene acutifolia hermaphrodite
Silene adenocalyx hermaphrodite
Silene aegaea hermaphrodite
Silene aegyptiaca hermaphrodite
Silene aellenii hermaphrodite
Silene almolae hermaphrodite
Silene alpestris hermaphrodite
Silene acaulis non_hermaphrodite
Silene dioica non_hermaphrodite
Silene latifolia non_hermaphrodite
Silene vulgaris non_hermaphrodite
Silene ammophila NA
Silene amoena NA

(1) Write a short log file to describe what you did using your favourite editor or excel.

(2) Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Suggestion

## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

### correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)

## reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

###Tidyversion

library(tidyverse)

sex <- read_csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>%
  mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
  mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
  arrange(Sexual_system, Species)

(3) Write an R function to import and reformat the table.

Suggestion

reformat_sex <- function(data) {
  read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>%
    mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
    mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
    arrange(Sexual_system, Species)
}

reformat_sex("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv")

`

3. Data visualisation

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualise the data.

CHROM POS PI
SL_stacks5 26 0.213904
SL_stacks5 39 0.607843

(1) Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

Suggestion

#### Read in tables and add condition to each observation
SD_female     <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_female.sites.pi", header = T)
SD_female$stat  <-   "SD_female"

SD_male       <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_male.sites.pi", header = T)
SD_male$stat    <-   "SD_male"

SL_female     <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_female.sites.pi", header = T)
SL_female$stat  <-   "SL_female"

SL_male       <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_male.sites.pi", header = T)
SL_male$stat <-      "SL_male"

### combine dataframes
sex_species          <-      rbind(SD_female, SD_male, SL_female, SL_male)


###Tidyversion

####download data
R.url <- "https://www.gdc-docs.ethz.ch/MDA/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")
unzip("R_pi.zip")

sex_species <- list.files(pattern = ".sites.pi") %>%
  map(~ read_delim(., delim = "\t")) %>%
  set_names(list.files(pattern = ".sites.pi")) %>%
  bind_rows(., .id = "sex_species") 

(2) Generate a boxplot with each stat (boxplot). Are there any differences among the conditions? Advanced users can use ggplot2. What would be an appropriate test statistic?

Suggestion

### boxplots
boxplot(sex_species$PI ~ sex_species$stat)

### Tidyversion

sex_species %>%
  ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
  geom_boxplot()

(3) Write an R function to import the data and generate the boxplots directly.

Suggestion

sex_species_boxplot <- function(pat) {
  list.files(pattern = pat) %>%
    map(~ read_delim(., delim = "\t")) %>%
    set_names(list.files(pattern = ".sites.pi")) %>%
    bind_rows(., .id = "sex_species") %>%
    ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
    geom_boxplot()
}

sex_species_boxplot(".sites.pi")