Reproducible Research¶
Lecture notes¶
Additional information¶
The basic philosophy of Tidyverse (Recommended by Séb)
Exercise¶
Please work in groups. Use the R markdown editor and reproducible R code. During the exercise fill your suggestions directly in the GoogleDoc file using the link here.
1. Markdown¶
(1) Most of the primary data analysis is done in the linux environment and you will mostly likely use bash coding. To incrase reproducibility of your logs or scripts, we going to use a markdown editor. Download and install Haroopad or MacDown. Explore the markdown editor yourself. Try to generate titles of different sizes, add plain text and code and save the markdown files as pdf. Use either the features in the insert menu or the cheat sheet.
(2) Set up a new R project in Rstudio including a markdown document. Explore the markdown editor yourself using the example. We will need it later.
2. Data manipulation¶
Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.
The goal is to reformat the table from the version you have into the format below.
Genus | Species | Sexual_system |
---|---|---|
Silene | acutifolia | hermaphrodite |
Silene | adenocalyx | hermaphrodite |
Silene | aegaea | hermaphrodite |
Silene | aegyptiaca | hermaphrodite |
Silene | aellenii | hermaphrodite |
Silene | almolae | hermaphrodite |
Silene | alpestris | hermaphrodite |
Silene | acaulis | non_hermaphrodite |
Silene | dioica | non_hermaphrodite |
Silene | latifolia | non_hermaphrodite |
Silene | vulgaris | non_hermaphrodite |
Silene | ammophila | NA |
Silene | amoena | NA |
(1) Write a short log file to describe what you did using your favourite editor or excel.
(2) Use R to reformat your table (read.table
,gsub
,order
). Advanced users can use tidyverse.
Suggestion
## import data sex <- read.csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System")) ### correct Genus names sex$Genus <- gsub("silene", "Silene", sex$Genus) ### correct Sex system sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System) sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System) sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System) ## reorder table sex_sort <- sex[order(sex$Sexual_System, sex$Species), ] ###Tidyversion library(tidyverse) sex <- read_csv("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>% mutate_if(is.character, str_replace_all, "silene", "Silene") %>% arrange(Sexual_system, Species)
(3) Write an R function to import and reformat the table.
Suggestion
reformat_sex <- function(data) { read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>% mutate_if(is.character, str_replace_all, "silene", "Silene") %>% arrange(Sexual_system, Species) } reformat_sex("https://www.gdc-docs.ethz.ch/MDA/data/R_sex.csv")
`
3. Data visualisation¶
Download the tables from the course website here. The zip file contains 4 different tables (SL_female
, SL_male
, SD_female
, SD_male
), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualise the data.
CHROM | POS | PI |
---|---|---|
SL_stacks5 | 26 | 0.213904 |
SL_stacks5 | 39 | 0.607843 |
(1) Import the tables into R (read.table
), add the condition (SL_female
, SL_male
, SD_female
, SD_male
) to each observation and concatenate the tables (rbind
).
Suggestion
#### Read in tables and add condition to each observation SD_female <- read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_female.sites.pi", header = T) SD_female$stat <- "SD_female" SD_male <- read.table("https://www.gdc-docs.ethz.ch/MDA/data/SD_male.sites.pi", header = T) SD_male$stat <- "SD_male" SL_female <- read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_female.sites.pi", header = T) SL_female$stat <- "SL_female" SL_male <- read.table("https://www.gdc-docs.ethz.ch/MDA/data/SL_male.sites.pi", header = T) SL_male$stat <- "SL_male" ### combine dataframes sex_species <- rbind(SD_female, SD_male, SL_female, SL_male) ###Tidyversion ####download data R.url <- "https://www.gdc-docs.ethz.ch/MDA/data/R_pi.zip" utils::download.file(R.url, destfile = "R_pi.zip") unzip("R_pi.zip") sex_species <- list.files(pattern = ".sites.pi") %>% map(~ read_delim(., delim = "\t")) %>% set_names(list.files(pattern = ".sites.pi")) %>% bind_rows(., .id = "sex_species")
(2) Generate a boxplot with each stat (boxplot
). Are there any differences among the conditions? Advanced users can use ggplot2. What would be an appropriate test statistic?
Suggestion
### boxplots boxplot(sex_species$PI ~ sex_species$stat) ### Tidyversion sex_species %>% ggplot(., aes(x = sex_species, y = PI, color = sex_species)) + geom_boxplot()
(3) Write an R function to import the data and generate the boxplots directly.
Suggestion
sex_species_boxplot <- function(pat) { list.files(pattern = pat) %>% map(~ read_delim(., delim = "\t")) %>% set_names(list.files(pattern = ".sites.pi")) %>% bind_rows(., .id = "sex_species") %>% ggplot(., aes(x = sex_species, y = PI, color = sex_species)) + geom_boxplot() } sex_species_boxplot(".sites.pi")