Biocomputing with R

Lecture Notes

⬇︎ R

Helpful

R code to get ready for the R demo

Open Rstudio

Paste the code below in the R console of Rstudio and execute it. Windows users might need to adjust the working directory path.

### set working directory
setwd("~/Desktop")

### generate folder
dir.create("Demo")
setwd("Demo")

### download scripts for Demo
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/scripts/Demo.R", destfile = "Demo.R")
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA21/scripts/My_functions.R", destfile = "My_functions.R")

### open R scripts
file.edit("My_functions.R")
file.edit("Demo.R")

Info

If you have problems with the CA certificate use http instead of https.

Challenges

In "R Sep-By-Step" you find some more basic information including challenges, which are not essential for the challenges below. Advanced user can use tidyverse, less advanced user should try to understand the commands. If you have your own sripts use these.

Post your comments or questions directly on the white board.

Data manipulation

We are going to work with the table here. It is a subset of the sex of life database. For each plant species the sexual system is provided.

The goal is to reformat the table from the version you have into the format below.

Genus	Species	Sexual_system
Silene	acutifolia	hermaphrodite
Silene	adenocalyx	hermaphrodite
Silene	aegaea	hermaphrodite
Silene	aegyptiaca	hermaphrodite
Silene	aellenii	hermaphrodite
Silene	almolae	hermaphrodite
Silene	alpestris	hermaphrodite
Silene	acaulis	non_hermaphrodite
Silene	dioica	non_hermaphrodite
Silene	latifolia	non_hermaphrodite
Silene	vulgaris	non_hermaphrodite
Silene	ammophila	NA
Silene	amoena	NA

## Set working directory
dir.create("Sex"); setwd("Sex")

## Install required libraries
install.packages("tidyverse")

## Remove all variables
rm(list = ls())

## set seed
set.seed(1000)

## load library
library(tidyverse)

(1) Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Base-version

## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

## Correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)


## Reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

sessionInfo()

Tidy-version

sex <- read_csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system")) %>%
        mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
        mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
        arrange(Sexual_system, Species)

(3) Write an R function to import and reformat the table.

Tidy-version

reformat_sex <- function(data) {
  read_csv(data, col_names = c("Genus", "Species", "Sexual_system")) %>%
    mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
    mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
    arrange(Sexual_system, Species)
}

source("functions.R")
reformat_sex("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv")

Data visualization

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualize the data.

CHROM	POS	PI
SL_stacks5	26	0.213904
SL_stacks5	39	0.607843

(1) Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

First, we need to generate a new directory and set it as working directory. Then we download the data and unzip the file.

## Create new working directory 
dir.create("Pi"); setwd("Pi")
R.url <- "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")
unzip("R_pi.zip")

Base-version

#### Read in tables and add condition to each observation
SD_female     <-     read.table("SD_female.sites.pi", header = T)
SD_female$stat  <-   "SD_female"

SD_male       <-     read.table("SD_male.sites.pi", header = T)
SD_male$stat    <-   "SD_male"

SL_female     <-     read.table("SL_female.sites.pi", header = T)
SL_female$stat  <-   "SL_female"

SL_male       <-     read.table("SL_male.sites.pi", header = T)
SL_male$stat <-      "SL_male"

### combine dataframes
sex_species  <-    rbind(SD_female, SD_male, SL_female, SL_male)

Tidy-version

sex_species <- list.files(pattern = ".sites.pi") %>%
    map(~ read_delim(., delim = "\t")) %>%
    set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" )) %>%
    bind_rows(., .id = "sex_species")

###Save table as R data 
save(sex_species, file = "sex_species.RData")

(2) Generate a boxplot with each stat (boxplot). Are there any differences among the conditions? Advanced users can use ggplot2.

Base-version

## boxplots
boxplot(sex_species$PI ~ sex_species$stat)

Tidy-version

sex_species %>%
  ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
  geom_boxplot()

(3) Write an R function to import the data and generate the boxplots directly.

Tidy-version

sex_species_boxplot <- function(pat) {
  list.files(pattern = pat) %>%
    map(~ read_delim(., delim = "\t")) %>%
    set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>%
    bind_rows(., .id = "sex_species") %>%
    ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
    geom_boxplot()
}

source("functions.R")
sex_species_boxplot(".sites.pi")

Data analysis

We use the dataset from the challenge before and try to statistically test if there are differences in nucleotide diversity pi among the 4 conditions.

(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation.

Suggestion

###load data and add factor levels
load("sex_species.RData")

sex_species_added <- sex_species %>%
        separate(col = "sex_species", into = c("species", "sex"), sep = "_") %>%
        mutate(Species = as_factor(species)) %>%
        mutate(Sex = as_factor(sex)) %>%
        select(Species, Sex, PI)

(2) Use a simple linear model (lm) to find out if there are sex and/or species effects?

Suggestion

##simple linear model
m_sex         <-  lm(PI~Sex, data=sex_species_added)
m_species     <-  lm(PI~Species,data=sex_species_added)
m_sex_species <-  lm(PI~Sex*Species, data=sex_species_added)

##model summary
summary(m_sex)
summary(m_species)
summary(m_sex_species)

##if you like to extract easily coefficients use the broom package

library(broom)
tidy(m_sex)
tidy(m_species)
tidy(m_sex_species)

(3) Have a look at the diagnostic plots. Is there a problem with the data?

Suggestion

##diagnostic plots
plot(m_sex)

1. Residuals vs Fitted:This plot shows if residuals have non-linear patterns. -> ideal scattered around 0
2. Normal Q-Q: Residuals are normally distributed -> points on the diagonal
3. Scale-Location: equal variance (homoscedasticity). -> no patterns, randome scatter
4. Residuals vs Leverage. Influential observation -> no strong outliers
If you are violating these assumptions p-values are inflated. In our case all factors seems to be significant even based on the boxplot there are not so different. The larger number of data points, which are not independent, could lower the p-values even more.
Transforming the respond value (log, ...) or using another methods (e.g. Kruskal–Wallis or resampling methods) would be a more appropriate test statistic than a normal linear model.
I have a pretty pragmatic view as you rarely will be able to fullfill all assumptions but it is crucial that you know when the p-values are inflated.

Additional Information

First and Last Lines

This is a personal recommendation for the start and end of an R script. Before starting a new project we should make sure that there are no unwanted environment variables.

rm(list = ls()) # clean/reset environment

Finally, all information about the R session should be stored in a log file. It is important to record which R version or which R package was used. This information is also important for troubleshooting.

project.log <- sessionInfo()
write("project.log")

Cheat-Sheets PDFs

More information

R Packages

Repositories

Packages are the fundamental units of R code. They include R functions, documentation, and sample data. R packages can be found in repositories and these are the most popular ones:

CRAN - official R repository
Bioconductor - topic specific R repository
Github - most popular repository for open source projects, not R specific

Installing Packages via devtools

The different repositories have unique ways to install a package. This might be a bit confusing if you are using packages from different repositories. You could use devtools package instead to unify this process. For more information, visit the devtools readme page.

Package Development - devtools - Cheat Sheet

Package Installation

The commands for the installation of R packages depends on the repository.

#### CRAN Repository

install.packages("package")
install.packages(c("packageA", "packageB"))

#### Bioconductor Repository

## R version 3.6+
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("package")

#### GitHub
library(devtools)
devtools::install_github("link/to/package")

Find Packages

# Searching for packages on CRAN
install.packages("packagefinder"); library(packagefinder)
findPackage("tea")
packageDetails("tea")

Package Info/Help

It is a superb idea to look at the package information.

packageDescription("package")
help(package = "package")

Manage Packages

# List all installed packages
installed.packages()
# Get Package version
packageVersion("fun")
# Update a package
update.packages("fun")
# Load a package
library("fun")
# Un-load a package
detach("package:fun", unload=TRUE)
# Remove a package
remove.packages("fun")

Example(s)

## Alternative package version

# Install dplyr by installing tidyverse (collection of data science tools):
install.packages("tidyverse")
search()

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the development version from GitHub:
install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

### Load multiple CRAN packages

# Package list
package.list = c("ggplot2","RColorBrewer","ggpubr")

package.manager <- lapply(
  package.list,
  FUN <- function(x) {
    # Load multiple packages and
    # install missing packages
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)