BioComputing with R - Genetic Diversity Centre (GDC)

Learning Objectives

Main
◇ Be able to process data that haven been generated in the terminal.
◇ Be able to write reproducible R-scripts.
Minor
◇ Learn to use AIs.
◇ Learn to find "better" solution.

R code to get ready for the R demo

Open Rstudio
Paste the code below in the R console of Rstudio and execute it. Windows users might need to adjust the working directory path.

### set working directory
setwd("~/Desktop")

### generate folder
dir.create("Demo")
setwd("Demo")

### download scripts for Demo
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/Demo.R", destfile = "Demo.R")
utils::download.file("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/scripts/My_functions.R", destfile = "My_functions.R")

### open R scripts
file.edit("My_functions.R")
file.edit("Demo.R")

First and Last Lines

This is a personal recommendation for starting and ending an R script. Before starting a new project, we should make sure that there are no unwanted environment variables.

rm(list = ls()) # clean/reset environment

Finally, all the information about the R session should be stored in a log file. It is important to record which version of R or which R package was used. This information is also useful for troubleshooting.

writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

Challenges

In R Step-By-Step (for beginners) you will find some more basic information, including challenges that are not essential for the challenges below. Advanced users can use tidyverse, less advanced users should try to understand the commands. If you have your own scripts, use them.

Data manipulation

We will be working with the table here. It is a subset of the "Tree of Sex" database. For each plant species the sexual system is provided.

The aim is to reformat the table from the version you have to the format below.

Genus	Species	Sexual_system
Silene	acutifolia	hermaphrodite
Silene	adenocalyx	hermaphrodite
Silene	aegaea	hermaphrodite
Silene	aegyptiaca	hermaphrodite
Silene	aellenii	hermaphrodite
Silene	almolae	hermaphrodite
Silene	alpestris	hermaphrodite
Silene	acaulis	non_hermaphrodite
Silene	dioica	non_hermaphrodite
Silene	latifolia	non_hermaphrodite
Silene	vulgaris	non_hermaphrodite
Silene	ammophila	NA
Silene	amoena	NA

## Set working directory
dir.create("Sex"); setwd("Sex")

## Install required libraries
install.packages("tidyverse")

## Remove all variables
rm(list = ls())

## set seed
set.seed(1000)

## load library
library(tidyverse)

❖ Challenge #1 Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Suggestion #1: Base-version

## import data
sex <- read.csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

## Correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)


## Reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

Suggestion #1: Tidy-version

sex <- read_csv("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/R_sex.csv", col_names = c("Genus", "Species", "Sexual_system"),show_col_types = FALSE) %>%
        mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
        mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
        arrange(Sexual_system, Species)

❖ Challenge #2 Write an R function to import and reformat the table.

Suggestion #2: Tidy-version

reformat_sex <- function(data) {
  read_csv(data, col_names = c("Genus", "Species", "Sexual_system"),show_col_types = FALSE) %>%
    mutate_if(is.character, str_replace_all, "gynodioecy|dioecy|trioecy", "non_hermaphrodite") %>%
    mutate_if(is.character, str_replace_all, "silene", "Silene") %>%
    arrange(Sexual_system, Species)
}

source("functions.R")
reformat_sex("https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA20/data/R_sex.csv")

Data visualization

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male) corresponding to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (chromosome and position) the feature (characteristic) nucleotide diversity $\pi$ is shown. The aim is to visualize the data.

CHROM	POS	$\pi$
SL_stacks5	26	0.213904
SL_stacks5	39	0.607843

First, we need to generate a new directory and set it as working directory. Then we download the data and unzip the file.

## Create new working directory 
dir.create("Pi"); setwd("Pi")

## Download data
R.url <- "https://www.gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA/data/R_pi.zip"
utils::download.file(R.url, destfile = "R_pi.zip")

## Unzip file
unzip("R_pi.zip")

❖ Challenge #3 Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

Suggestion #3: Base-version

#### Read in tables and add condition to each observation
SD_female      <- read.table("SD_female.sites.pi", header = T)
SD_female$stat <- "SD_female"

SD_male        <- read.table("SD_male.sites.pi", header = T)
SD_male$stat   <- "SD_male"

SL_female      <- read.table("SL_female.sites.pi", header = T)
SL_female$stat <- "SL_female"

SL_male        <- read.table("SL_male.sites.pi", header = T)
SL_male$stat   <- "SL_male"

### combine dataframes
sex_species    <- rbind(SD_female, SD_male, SL_female, SL_male)

Suggestion #3: Tidy-version

sex_species <- list.files(pattern = ".sites.pi") %>%
    map(~ read_delim(., delim = "\t",show_col_types = FALSE)) %>%
    set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" )) %>%
    bind_rows(., .id = "sex_species")

### Save table as R data 
save(sex_species, file = "sex_species.RData")

❖ Challenge #4 Generate a boxplot with each stat (boxplot). Are there any differences among the conditions? Advanced users can use ggplot2.

Suggestion #4: Base-version

## boxplots
boxplot(sex_species$PI ~ sex_species$stat)

Suggestion #4: Tidy-version

sex_species %>%
  ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
  geom_boxplot()

❖ Challenge #5 Write an R function to import the data and generate the boxplots directly.

Suggestion #5: Tidy-version

sex_species_boxplot <- function(pat) {
  list.files(pattern = pat) %>%
    map(~ read_delim(., delim = "\t",show_col_types = FALSE)) %>%
    set_names(set_names(str_remove(list.files(pattern =".sites.pi"),".sites.pi" ))) %>%
    bind_rows(., .id = "sex_species") %>%
    ggplot(., aes(x = sex_species, y = PI, color = sex_species)) +
    geom_boxplot()
}

source("functions.R")
sex_species_boxplot(".sites.pi")

Data analysis

We use the data set from the previous challenge and try to statistically test if there are differences in nucleotide diversity pi between the 4 conditions.

❖ Challenge #6: Add to the table a factor sex (male, female) and species (SL, SD) to each observation.

Suggestion #6

###load data and add factor levels
load("sex_species.RData")

sex_species_added <- sex_species %>%
        separate(col = "sex_species", into = c("species", "sex"), sep = "_") %>%
        mutate(Species = as_factor(species)) %>%
        mutate(Sex = as_factor(sex)) %>%
        select(Species, Sex, PI)

❖ Challenge #7: Use a simple linear model (lm) to find out if there are sex and/or species effects?

Suggestion #7

##simple linear model
m_sex         <- lm(PI~Sex, data=sex_species_added)
m_species     <- lm(PI~Species,data=sex_species_added)
m_sex_species <- lm(PI~Sex*Species, data=sex_species_added)

##model summary
summary(m_sex)
summary(m_species)
summary(m_sex_species)

##if you like to extract easily coefficients use the broom package

library(broom)
tidy(m_sex)
tidy(m_species)
tidy(m_sex_species)

❖ Challenge #8: Have a look at the diagnostic plots. Is there a problem with the data?

Suggestion #8

##diagnostic plots
plot(m_sex)

1. Residuals vs Fitted:This plot shows if residuals have non-linear patterns. -> ideal scattered around 0
2. Normal Q-Q: Residuals are normally distributed -> points on the diagonal
3. Scale-Location: equal variance (homoscedasticity). -> no patterns, randome scatter
4. Residuals vs Leverage. Influential observation -> no strong outliers
If you are violating these assumptions p-values are inflated. In our case all factors seems to be significant even based on the boxplot there are not so different. The larger number of data points, which are not independent, could lower the p-values even more.
Transforming the respond value (log, ...) or using another methods (e.g. Kruskal–Wallis or resampling methods) would be a more appropriate test statistic than a normal linear model.
I have a pretty pragmatic view as you rarely will be able to fulfil all assumptions but it is crucial that you know when the p-values are inflated.

Additional Information

R Packages

Repositories

CRAN - official R repository
Bioconductor - topic specific R repository
devtools - Functions from Github

Add additional library paths

myPaths <- .libPaths()
myPaths <- c(myPaths, ‘C:/USER/RpackagesR4’)
.libPaths(myPaths)  # add new path

Package installtion

The commands for the installation of R packages depends on the repository.

#### CRAN Repository

install.packages("package")
install.packages(c("packageA", "packageB"))

#### Bioconductor Repository

## R version 3.6+
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("package")

#### GitHub
library(devtools)
devtools::install_github("link/to/package")