Learning Objectives
Main
◇ Be able to process data that haven been generated in the terminal.
◇ Be able to write reproducible R-scripts.
Minor
◇ Learn to use AIs.
◇ Learn to find "better" solution.
Lecture Notes
⬇︎ R
R code to get ready for the R demo
- Open Rstudio
- Paste the code below in the R console of Rstudio and execute it. Windows users might need to adjust the working directory path.
### set working directory
setwd("~/Desktop")
### generate folder
dir.create("Demo")
setwd("Demo")
### download scripts for Demo
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/Demo.R", destfile = "Demo.R")
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/My_functions.R", destfile = "My_functions.R")
### open R scripts
file.edit("My_functions.R")
file.edit("Demo.R")
First and Last Lines
This is a personal recommendation for starting and ending an R script. Before starting a new project, we should make sure that there are no unwanted environment variables.
rm(list = ls()) # clean/reset environment
Finally, all the information about the R session should be stored in a log file. It is important to record which version of R or which R package was used. This information is also useful for troubleshooting.
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
Challenges
In R Step-By-Step (for beginners) you will find some more basic information, including challenges that are not essential for the challenges below. Advanced users can use tidyverse, less advanced users should try to understand the commands. If you have your own scripts, use them.
Data manipulation
We will be working with the table here. It is a subset of the "Tree of Sex" database. For each plant species the sexual system is provided.
The aim is to reformat the table from the version you have to the format below.
Genus | Species | Sexual_system |
---|---|---|
Silene | acutifolia | hermaphrodite |
Silene | adenocalyx | hermaphrodite |
Silene | aegaea | hermaphrodite |
Silene | aegyptiaca | hermaphrodite |
Silene | aellenii | hermaphrodite |
Silene | almolae | hermaphrodite |
Silene | alpestris | hermaphrodite |
Silene | acaulis | non_hermaphrodite |
Silene | dioica | non_hermaphrodite |
Silene | latifolia | non_hermaphrodite |
Silene | vulgaris | non_hermaphrodite |
Silene | ammophila | NA |
Silene | amoena | NA |
## Set working directory
dir.create("Sex"); setwd("Sex")
## Install required libraries
install.packages("tidyverse")
## Remove all variables
rm(list = ls())
## set seed
set.seed(1000)
## load library
library(tidyverse)
❖ Challenge #1 Use R to reformat your table (read.table
,gsub
,order
). Advanced users can use tidyverse.
Suggestion #1: Base-version
## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))
### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)
## Correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)
## Reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]
Suggestion #1: Tidy-version
sex <- read_csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system"),show_col_types = FALSE) %>%
mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
arrange(Sexual_system, Species)
❖ Challenge #2 Write an R function to import and reformat the table.
Suggestion #2: Tidy-version
reformat_sex <- function(data) {
read_csv(data, col_names = c("Genus", "Species", "Sexual_system"),show_col_types = FALSE) %>%
mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
arrange(Sexual_system, Species)
}
source("functions.R")
reformat_sex("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv")
Data visualization
Download the tables from the course website here. The zip file contains 4 different tables (SL_female
, SL_male
, SD_female
, SD_male
) corresponding to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (chromosome and position) the feature (characteristic) nucleotide diversity \pi is shown. The aim is to visualize the data.
CHROM | POS | \pi |
---|---|---|
SL_stacks5 | 26 | 0.213904 |
SL_stacks5 | 39 | 0.607843 |
First, we need to generate a new directory and set it as working directory. Then we download the data and unzip the file.
## Create new working directory
dir.create("Pi"); setwd("Pi")
## Download data
R.url <- "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")
## Unzip file
unzip("R_pi.zip")
❖ Challenge #3 Import the tables into R (read.table
), add the condition (SL_female
, SL_male
, SD_female
, SD_male
) to each observation and concatenate the tables (rbind
).
Suggestion #3: Base-version
#### Read in tables and add condition to each observation
SD_female <- read.table("SD_female.sites.pi", header = T)
SD_female$stat <- "SD_female"
SD_male <- read.table("SD_male.sites.pi", header = T)
SD_male$stat <- "SD_male"
SL_female <- read.table("SL_female.sites.pi", header = T)
SL_female$stat <- "SL_female"
SL_male <- read.table("SL_male.sites.pi", header = T)
SL_male$stat <- "SL_male"
### combine dataframes
sex_species <- rbind(SD_female, SD_male, SL_female, SL_male)
Suggestion #3: Tidy-version
sex_species <- list.files(pattern = ".sites.pi") %>%
map(~ read_delim(., delim = "\t",show_col_types = FALSE)) %>%
set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" )) %>%
bind_rows(., .id = "sex_species")
### Save table as R data
save(sex_species, file = "sex_species.RData")
❖ Challenge #4 Generate a boxplot with each stat (boxplot
). Are there any differences among the conditions? Advanced users can use ggplot2.
Suggestion #4: Base-version
## boxplots
boxplot(sex_species$PI ~ sex_species$stat)
Suggestion #4: Tidy-version
sex_species %>%
ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
geom_boxplot()
❖ Challenge #5 Write an R function to import the data and generate the boxplots directly.
Suggestion #5: Tidy-version
sex_species_boxplot <- function(pat) {
list.files(pattern = pat) %>%
map(~ read_delim(., delim = "\t",show_col_types = FALSE)) %>%
set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>%
bind_rows(., .id = "sex_species") %>%
ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
geom_boxplot()
}
source("functions.R")
sex_species_boxplot(".sites.pi")
Data analysis
We use the data set from the previous challenge and try to statistically test if there are differences in nucleotide diversity pi between the 4 conditions.
❖ Challenge #6: Add to the table a factor sex (male, female) and species (SL, SD) to each observation.
Suggestion #6
###load data and add factor levels
load("sex_species.RData")
sex_species_added <- sex_species %>%
separate(col = "sex_species", into = c("species", "sex"), sep = "_") %>%
mutate(Species = as_factor(species)) %>%
mutate(Sex = as_factor(sex)) %>%
select(Species, Sex, PI)
❖ Challenge #7: Use a simple linear model (lm) to find out if there are sex and/or species effects?
Suggestion #7
##simple linear model
m_sex <- lm(PI~Sex, data=sex_species_added)
m_species <- lm(PI~Species,data=sex_species_added)
m_sex_species <- lm(PI~Sex*Species, data=sex_species_added)
##model summary
summary(m_sex)
summary(m_species)
summary(m_sex_species)
##if you like to extract easily coefficients use the broom package
library(broom)
tidy(m_sex)
tidy(m_species)
tidy(m_sex_species)
❖ Challenge #8: Have a look at the diagnostic plots. Is there a problem with the data?
Suggestion #8
##diagnostic plots
plot(m_sex)
2. Normal Q-Q: Residuals are normally distributed -> points on the diagonal
3. Scale-Location: equal variance (homoscedasticity). -> no patterns, randome scatter
4. Residuals vs Leverage. Influential observation -> no strong outliers
If you are violating these assumptions p-values are inflated. In our case all factors seems to be significant even based on the boxplot there are not so different. The larger number of data points, which are not independent, could lower the p-values even more.
Transforming the respond value (log, ...) or using another methods (e.g. Kruskal–Wallis or resampling methods) would be a more appropriate test statistic than a normal linear model.
I have a pretty pragmatic view as you rarely will be able to fulfil all assumptions but it is crucial that you know when the p-values are inflated.
Additional Information
R Packages
Repositories
- CRAN - official R repository
- Bioconductor - topic specific R repository
- devtools - Functions from Github
Add additional library paths
myPaths <- .libPaths()
myPaths <- c(myPaths, ‘C:/USER/RpackagesR4’)
.libPaths(myPaths) # add new path
Package installtion
The commands for the installation of R packages depends on the repository.
#### CRAN Repository
install.packages("package")
install.packages(c("packageA", "packageB"))
#### Bioconductor Repository
## R version 3.6+
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("package")
#### GitHub
library(devtools)
devtools::install_github("link/to/package")
Cheat-Sheets and other sources
- R Base (3/15)
- R Data Import (8/19)
- R Data Table (1/19)
- R Data Transformation (8/19)
- RMarkdown Reference Guide (10/14)
- RMarkdown Cheat Sheet (2/16)
- R Data Visualization with ggplot2 (12/19)
- R bloggers
- Advanced R
- R for Data Science
- BiocManager
- Installing R packages
- R Packages: A Beginner's Guide
- The basic philosophy of Tidyverse
- Installing R packages
- R Packages: A Beginner's Guide