Biocomputing with R
Introduction Notes¶
⬇︎ R Step-By-Step (for beginners)
⬇︎ Workflow
R code to get ready for the R demo¶
- Open Rstudio
- Paste the code below in the R console of Rstudio and execute it. Windows users might need to adjust the working directory.
## set working directory setwd("~/Desktop") ## generate folder dir.create("Demo") setwd("Demo") ## download scripts for Demo utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/scripts/Demo.R", destfile = "Demo.R") utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/scripts/My_functions.R", destfile = "My_functions.R") ## open R scripts file.edit("My_functions.R") file.edit("Demo.R")
Challenges¶
In R Sep-By-Step you find some more basic information including excercise, which are not essential for the challenges below. Advanced user can use tidyverse, less advanced user should try to understand the commands.
Post your comments or questions directly in the Google group. We are always interested in other solutions.
Data manipulation¶
We are going to work with the table here. It is a subset of the sex of life database. For each plant species the sexual system is provided.
The goal is to reformat the table from the version you have into the format below.
Genus | Species | Sexual_system |
---|---|---|
Silene | acutifolia | hermaphrodite |
Silene | adenocalyx | hermaphrodite |
Silene | aegaea | hermaphrodite |
Silene | aegyptiaca | hermaphrodite |
Silene | aellenii | hermaphrodite |
Silene | almolae | hermaphrodite |
Silene | alpestris | hermaphrodite |
Silene | acaulis | non_hermaphrodite |
Silene | dioica | non_hermaphrodite |
Silene | latifolia | non_hermaphrodite |
Silene | vulgaris | non_hermaphrodite |
Silene | ammophila | NA |
Silene | amoena | NA |
## Set working directory dir.create("Sex"); setwd("Sex")
## Install required libraries install.packages("tidyverse")
## Remove all variables rm(list = ls())
## set seed set.seed(1000)
## load library library(tidyverse)
(1) Use R to reformat your table (read.table
,gsub
,order
). Advanced users can use tidyverse.
Base-version
## import data sex <- read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System")) ### correct Genus names sex$Genus <- gsub("silene", "Silene", sex$Genus) ## Correct Sex system sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System) sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System) sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System) ## Reorder table sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]
sessionInfo()
Tidy-version
sex <- read_csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>% mutate_if(is.character, str_replace_all, "silene", "Silene") %>% arrange(Sexual_system, Species)
(3) Write an R function to import and reformat the table.
Tidy-version
reformat_sex <- function(data) { read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>% mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>% mutate_if(is.character, str_replace_all, "silene", "Silene") %>% arrange(Sexual_system, Species) }
source("functions.R") reformat_sex("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv")
Data visualisation¶
Download the tables from the course website here. The zip file contains 4 different tables (SL_female
, SL_male
, SD_female
, SD_male
), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualise the data.
CHROM | POS | PI |
---|---|---|
SL_stacks5 | 26 | 0.213904 |
SL_stacks5 | 39 | 0.607843 |
(1) Import the tables into R (read.table
), add the condition (SL_female
, SL_male
, SD_female
, SD_male
) to each observation and concatenate the tables (rbind
).
First, we need to generate a new directory and set it as working directory. Then we download the data and unzip the file.
## Create new working directory dir.create("Pi"); setwd("Pi") R.url <- "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_pi.zip" utils::download.file(R.url, destfile = "R_pi.zip") unzip("R_pi.zip")
Base-version
#### Read in tables and add condition to each observation SD_female <- read.table("SD_female.sites.pi", header = T) SD_female$stat <- "SD_female" SD_male <- read.table("SD_male.sites.pi", header = T) SD_male$stat <- "SD_male" SL_female <- read.table("SL_female.sites.pi", header = T) SL_female$stat <- "SL_female" SL_male <- read.table("SL_male.sites.pi", header = T) SL_male$stat <- "SL_male" ### combine dataframes sex_species <- rbind(SD_female, SD_male, SL_female, SL_male)
Tidy-version
sex_species <- sex_species <- list.files(pattern = ".sites.pi") %>% map(~ read_delim(., delim = "\t")) %>% set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>% bind_rows(., .id = "sex_species")
###Save table as R data save(sex_species, file = "sex_species.RData")
(2) Generate a boxplot with each stat (boxplot
). Are there any differences among the conditions? Advanced users can use ggplot2.
Base-version
## boxplots boxplot(sex_species$PI ~ sex_species$stat)
Tidy-version
sex_species %>% ggplot(., aes(x = sex_species, y = PI, color = sex_species)) + geom_boxplot()
(3) Write an R function to import the data and generate the boxplots directly.
Tidy-version
sex_species_boxplot <- function(pat) { list.files(pattern = pat) %>% map(~ read_delim(., delim = "\t")) %>% set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>% bind_rows(., .id = "sex_species") %>% ggplot(., aes(x = sex_species, y = PI, color = sex_species)) + geom_boxplot() }
source("functions.R") sex_species_boxplot(".sites.pi")
Data analysis¶
We use the dataset from the challenge before and try to statistically test if there are differences in nucleotide diversity pi among the 4 conditions.
(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation.
Suggestion
###load data and add factor levels load("sex_species.RData") sex_species <- sex_species %>% separate(col = "sex_species", into = c("species", "sex"), sep = "_") %>% mutate(Species = as_factor(species)) %>% mutate(Sex = as_factor(sex)) %>% select(Species, Sex, PI)
(2) Use a simple linear model (lm) to find out if there are sex and/or species effects?
Suggestionn
##simple linear model m_sex <- lm(PI~Sex, data=sex_species) m_species <- lm(PI~Species,data=sex_species) m_sex_species <- lm(PI~Sex*Species, data=sex_species) ##model summary summary(m_sex) summary(m_species) summary(m_sex_species) ##if you like to extract easily coefficients use the broom package library(broom) tidy(m_sex) tidy(m_species) tidy(m_sex_species)
(3) Have a look at the diagnostic plots. Is there a problem with the data?
Suggestion
##diagnostic plots plot(m_sex)
2. Normal Q-Q: Residuals are normally distributed -> points on the diagonal
3. Scale-Location: equal variance (homoscedasticity). -> no patterns, randome scatter
4. Residuals vs Leverage. Influential observation -> no strong outliers
If you are violating these assumptions the p-values are inflated. In our case all factors seems to be significant even based on the boxplot there are not so different. The larger number of data points, which are not independent, could even lower the p-values even more.
Transforming the respond value (log, ...) or using another methodis (e.g. Kruskal–Wallis Tests) would be a more appropriate test statistic than a normal linear model.
I have a pretty pragmatic view as you rarely will be able to full fill all assumptions but it is crucial that you know when the p-values are inflated.
Additional Information¶
Cheat-Sheets PDFs¶
- R Base (3/15)
- R Data Import (8/19)
- R Data Table (1/19)
- R Data Transformation (8/19)
- RMarkdown Reference Guide (10/14)
- RMarkdown Cheat Sheet (2/16)
- R Data Visualization with ggplot2 (12/19)
More information¶
- R bloggers
- Advanced R
- R for Data Science
- BiocManager
- Installing R packages
- R Packages: A Beginner's Guide
- The basic philosophy of Tidyverse
- Tidyverse book
R Packages¶
Repositories¶
Packages are the fundamental units of R code. They include R functions, documentation, and sample data. R packages can be found in repositories and these are the most popular ones:
- CRAN - official R repository
- Bioconductor - topic specific R repository
- Github - most popular repository for open source projects, not R specific
Installing Packages via devtools¶
The different repositories have unique ways to install a package. This might be a bit confusing if you are using packages from different repositories. You could use devtools package instead to unify this process. For more information, visit the devtools readme page.
Package Installation¶
The commands for the installation of R packages depends on the repository.
#### CRAN Repository install.packages("package") install.packages(c("packageA", "packageB")) #### Bioconductor Repository ## R version 3.6+ if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("package") #### GitHub library(devtools) devtools::install_github("link/to/package")
Find Packages¶
# Searching for packages on CRAN install.packages("packagefinder"); library(packagefinder) findPackage("tea") packageDetails("tea")
Package Info/Help¶
It is a superb idea to look at the package information.
packageDescription("package") help(package = "package")
Manage Packages¶
# List all installed packages installed.packages() # Get Package version packageVersion("fun") # Update a package update.packages("fun") # Load a package library("fun") # Un-load a package detach("package:fun", unload=TRUE) # Remove a package remove.packages("fun")
Example(s)¶
## Alternative package version # Install dplyr by installing tidyverse (collection of data science tools): install.packages("tidyverse") search() # Alternatively, install just dplyr: install.packages("dplyr") # Or the development version from GitHub: install.packages("devtools") devtools::install_github("tidyverse/dplyr") ### Load multiple CRAN packages # Package list package.list = c("ggplot2","RColorBrewer","ggpubr") package.manager <- lapply( package.list, FUN <- function(x) { # Load multiple packages and # install missing packages if (!require(x, character.only = TRUE)) { install.packages(x, dependencies = TRUE) library(x, character.only = TRUE) } } )