Reproducible Research¶

Lecture notes¶

Additional information¶

The basic philosophy of Tidyverse (Recommended by Séb)

Exercise¶

Please work in groups. Use the R markdown editor and reproducible R code. During the exercise fill your suggestions directly in the GoogleDoc file using the link here.

R Code with Style

Cheat-Sheets PDFs

1. Markdown¶

(1) Most of the primary data analysis is done in the linux environment and you will mostly likely use bash coding. To incrase reproducibility of your logs or scripts, we going to use a markdown editor. Download and install Haroopad or MacDown. Explore the markdown editor yourself. Try to generate titles of different sizes, add plain text and code and save the markdown files as pdf. Use either the features in the insert menu or the cheat sheet.

(2) Set up a new R project in Rstudio including a markdown document. Explore the markdown editor yourself using the example. We will need it later.

2. Data manipulation¶

Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.

The goal is to reformat the table from the version you have into the format below.

Genus	Species	Sexual_system
Silene	acutifolia	hermaphrodite
Silene	adenocalyx	hermaphrodite
Silene	aegaea	hermaphrodite
Silene	aegyptiaca	hermaphrodite
Silene	aellenii	hermaphrodite
Silene	almolae	hermaphrodite
Silene	alpestris	hermaphrodite
Silene	acaulis	non_hermaphrodite
Silene	dioica	non_hermaphrodite
Silene	latifolia	non_hermaphrodite
Silene	vulgaris	non_hermaphrodite
Silene	ammophila	NA
Silene	amoena	NA

(1) Write a short log file to describe what you did using your favourite editor or excel.

(2) Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Suggestion

## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

### correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)

## reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

###Tidyversion

library(tidyverse)

sex <- read_csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>%
  mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
  mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
  arrange(Sexual_system, Species)

(3) Write an R function to import and reformat the table.

Suggestion

reformat_sex <- function(data) {
  read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>%
    mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
    mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
    arrange(Sexual_system, Species)
}

reformat_sex("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv")

`

3. Data visualisation¶

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualise the data.

CHROM	POS	PI
SL_stacks5	26	0.213904
SL_stacks5	39	0.607843

(1) Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

Suggestion

#### Read in tables and add condition to each observation
SD_female     <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_female.sites.pi", header = T)
SD_female$stat  <-   "SD_female"

SD_male       <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_male.sites.pi", header = T)
SD_male$stat    <-   "SD_male"

SL_female     <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_female.sites.pi", header = T)
SL_female$stat  <-   "SL_female"

SL_male       <-     read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_male.sites.pi", header = T)
SL_male$stat <-      "SL_male"

### combine dataframes
sex_species          <-      rbind(SD_female, SD_male, SL_female, SL_male)


###Tidyversion

####download data
R.url <- "https://www.gdc-docs.ethz.ch/MDA/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")
unzip("R_pi.zip")

sex_species <- list.files(pattern = ".sites.pi") %>%
  map(~ read_delim(., delim = "\t")) %>%
  set_names(list.files(pattern = ".sites.pi")) %>%
  bind_rows(., .id = "sex_species")

(2) Generate a boxplot with each stat (boxplot). Are there any differences among the conditions? Advanced users can use ggplot2. What would be an appropriate test statistic?

Suggestion

### boxplots
boxplot(sex_species$PI ~ sex_species$stat)

### Tidyversion

sex_species %>%
  ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
  geom_boxplot()

(3) Write an R function to import the data and generate the boxplots directly.

Suggestion

sex_species_boxplot <- function(pat) {
  list.files(pattern = pat) %>%
    map(~ read_delim(., delim = "\t")) %>%
    set_names(list.files(pattern = ".sites.pi")) %>%
    bind_rows(., .id = "sex_species") %>%
    ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
    geom_boxplot()
}

sex_species_boxplot(".sites.pi")