Biocomputing with R
Lecture Notes
⬇︎ R
Helpful
⬇︎ R Step-By-Step (for beginners)
R code to get ready for the R demo
- Open Rstudio
- Paste the code below in the R console of Rstudio and execute it. Windows users might need to adjust the working directory path.
### set working directory setwd("~/Desktop") ### generate folder dir.create("Demo") setwd("Demo") ### download scripts for Demo utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/scripts/Demo.R", destfile = "Demo.R") utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/scripts/My_functions.R", destfile = "My_functions.R") ### open R scripts file.edit("My_functions.R") file.edit("Demo.R")
Info
If you have problems with the CA certificate use http instead of https.
Challenges
In "R Sep-By-Step" you find some more basic information including challenges, which are not essential for the challenges below. Advanced user can use tidyverse, less advanced user should try to understand the commands. If you have your own sripts use these.
Post your comments or questions directly on the white board.
Data manipulation
We are going to work with the table here. It is a subset of the sex of life database. For each plant species the sexual system is provided.
The goal is to reformat the table from the version you have into the format below.
Genus | Species | Sexual_system |
---|---|---|
Silene | acutifolia | hermaphrodite |
Silene | adenocalyx | hermaphrodite |
Silene | aegaea | hermaphrodite |
Silene | aegyptiaca | hermaphrodite |
Silene | aellenii | hermaphrodite |
Silene | almolae | hermaphrodite |
Silene | alpestris | hermaphrodite |
Silene | acaulis | non_hermaphrodite |
Silene | dioica | non_hermaphrodite |
Silene | latifolia | non_hermaphrodite |
Silene | vulgaris | non_hermaphrodite |
Silene | ammophila | NA |
Silene | amoena | NA |
## Set working directory
dir.create("Sex"); setwd("Sex")
## Install required libraries
install.packages("tidyverse")
## Remove all variables
rm(list = ls())
## set seed
set.seed(1000)
## load library
library(tidyverse)
(1) Use R to reformat your table (read.table
,gsub
,order
). Advanced users can use tidyverse.
Base-version
## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))
### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)
## Correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)
## Reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]
sessionInfo()
Tidy-version
sex <- read_csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>%
mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
arrange(Sexual_system, Species)
(3) Write an R function to import and reformat the table.
Tidy-version
reformat_sex <- function(data) {
read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>%
mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
arrange(Sexual_system, Species)
}
source("functions.R")
reformat_sex("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv")
Data visualization
Download the tables from the course website here. The zip file contains 4 different tables (SL_female
, SL_male
, SD_female
, SD_male
), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualize the data.
CHROM | POS | PI |
---|---|---|
SL_stacks5 | 26 | 0.213904 |
SL_stacks5 | 39 | 0.607843 |
(1) Import the tables into R (read.table
), add the condition (SL_female
, SL_male
, SD_female
, SD_male
) to each observation and concatenate the tables (rbind
).
First, we need to generate a new directory and set it as working directory. Then we download the data and unzip the file.
## Create new working directory
dir.create("Pi"); setwd("Pi")
R.url <- "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")
unzip("R_pi.zip")
Base-version
#### Read in tables and add condition to each observation
SD_female <- read.table("SD_female.sites.pi", header = T)
SD_female$stat <- "SD_female"
SD_male <- read.table("SD_male.sites.pi", header = T)
SD_male$stat <- "SD_male"
SL_female <- read.table("SL_female.sites.pi", header = T)
SL_female$stat <- "SL_female"
SL_male <- read.table("SL_male.sites.pi", header = T)
SL_male$stat <- "SL_male"
### combine dataframes
sex_species <- rbind(SD_female, SD_male, SL_female, SL_male)
Tidy-version
sex_species <- list.files(pattern = ".sites.pi") %>%
map(~ read_delim(., delim = "\t")) %>%
set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" )) %>%
bind_rows(., .id = "sex_species")
###Save table as R data
save(sex_species, file = "sex_species.RData")
(2) Generate a boxplot with each stat (boxplot
). Are there any differences among the conditions? Advanced users can use ggplot2.
Base-version
## boxplots
boxplot(sex_species$PI ~ sex_species$stat)
Tidy-version
sex_species %>%
ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
geom_boxplot()
(3) Write an R function to import the data and generate the boxplots directly.
Tidy-version
sex_species_boxplot <- function(pat) {
list.files(pattern = pat) %>%
map(~ read_delim(., delim = "\t")) %>%
set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>%
bind_rows(., .id = "sex_species") %>%
ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
geom_boxplot()
}
source("functions.R")
sex_species_boxplot(".sites.pi")
Data analysis
We use the dataset from the challenge before and try to statistically test if there are differences in nucleotide diversity pi among the 4 conditions.
(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation.
Suggestion
###load data and add factor levels
load("sex_species.RData")
sex_species_added <- sex_species %>%
separate(col = "sex_species", into = c("species", "sex"), sep = "_") %>%
mutate(Species = as_factor(species)) %>%
mutate(Sex = as_factor(sex)) %>%
select(Species, Sex, PI)
(2) Use a simple linear model (lm) to find out if there are sex and/or species effects?
Suggestion
##simple linear model
m_sex <- lm(PI~Sex, data=sex_species_added)
m_species <- lm(PI~Species,data=sex_species_added)
m_sex_species <- lm(PI~Sex*Species, data=sex_species_added)
##model summary
summary(m_sex)
summary(m_species)
summary(m_sex_species)
##if you like to extract easily coefficients use the broom package
library(broom)
tidy(m_sex)
tidy(m_species)
tidy(m_sex_species)
(3) Have a look at the diagnostic plots. Is there a problem with the data?
Suggestion
##diagnostic plots
plot(m_sex)
2. Normal Q-Q: Residuals are normally distributed -> points on the diagonal
3. Scale-Location: equal variance (homoscedasticity). -> no patterns, randome scatter
4. Residuals vs Leverage. Influential observation -> no strong outliers
If you are violating these assumptions p-values are inflated. In our case all factors seems to be significant even based on the boxplot there are not so different. The larger number of data points, which are not independent, could lower the p-values even more.
Transforming the respond value (log, ...) or using another methods (e.g. Kruskal–Wallis or resampling methods) would be a more appropriate test statistic than a normal linear model.
I have a pretty pragmatic view as you rarely will be able to fullfill all assumptions but it is crucial that you know when the p-values are inflated.
Additional Information
First and Last Lines
This is a personal recommendation for the start and end of an R script. Before starting a new project we should make sure that there are no unwanted environment variables.
rm(list = ls()) # clean/reset environment
Finally, all information about the R session should be stored in a log file. It is important to record which R version or which R package was used. This information is also important for troubleshooting.
project.log <- sessionInfo()
write("project.log")
Cheat-Sheets PDFs
- R Base (3/15)
- R Data Import (8/19)
- R Data Table (1/19)
- R Data Transformation (8/19)
- RMarkdown Reference Guide (10/14)
- RMarkdown Cheat Sheet (2/16)
- R Data Visualization with ggplot2 (12/19)
More information
- R bloggers
- Advanced R
- R for Data Science
- BiocManager
- Installing R packages
- R Packages: A Beginner's Guide
- The basic philosophy of Tidyverse
- Tidyverse book
R Packages
Repositories
Packages are the fundamental units of R code. They include R functions, documentation, and sample data. R packages can be found in repositories and these are the most popular ones:
- CRAN - official R repository
- Bioconductor - topic specific R repository
- Github - most popular repository for open source projects, not R specific
Installing Packages via devtools
The different repositories have unique ways to install a package. This might be a bit confusing if you are using packages from different repositories. You could use devtools package instead to unify this process. For more information, visit the devtools readme page.
Package Installation
The commands for the installation of R packages depends on the repository.
#### CRAN Repository
install.packages("package")
install.packages(c("packageA", "packageB"))
#### Bioconductor Repository
## R version 3.6+
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("package")
#### GitHub
library(devtools)
devtools::install_github("link/to/package")
Find Packages
# Searching for packages on CRAN
install.packages("packagefinder"); library(packagefinder)
findPackage("tea")
packageDetails("tea")
Package Info/Help
It is a superb idea to look at the package information.
packageDescription("package")
help(package = "package")
Manage Packages
# List all installed packages
installed.packages()
# Get Package version
packageVersion("fun")
# Update a package
update.packages("fun")
# Load a package
library("fun")
# Un-load a package
detach("package:fun", unload=TRUE)
# Remove a package
remove.packages("fun")
Example(s)
## Alternative package version
# Install dplyr by installing tidyverse (collection of data science tools):
install.packages("tidyverse")
search()
# Alternatively, install just dplyr:
install.packages("dplyr")
# Or the development version from GitHub:
install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
### Load multiple CRAN packages
# Package list
package.list = c("ggplot2","RColorBrewer","ggpubr")
package.manager <- lapply(
package.list,
FUN <- function(x) {
# Load multiple packages and
# install missing packages
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
}
)