R for Biologists

Lecture notes

R for biologists (pdf)

R code (.R)

Exercise

Solutions (.R)

Please work in groups of two. Use a markdown editor and an appropriate R-style. Advanced R-users can use tidyverse. During the exercise fill your comments directly in the GoogleDoc file using the link here.

1. Data manipulation

Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.

The goal is to reformat the table using R from the version you have into the format below.

Genus Species Sexual_system
Silene acutifolia hermaphrodite
Silene adenocalyx hermaphrodite
Silene aegaea hermaphrodite
Silene aegyptiaca hermaphrodite
Silene aellenii hermaphrodite
Silene almolae hermaphrodite
Silene alpestris hermaphrodite
Silene ammophila NA
Silene amoena NA
Silene acaulis non_hermaphrodite
Silene dioica non_hermaphrodite
Silene latifolia non_hermaphrodite
Silene vulgaris non_hermaphrodite

(1) Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Solution

## import data
sex <- read.csv("sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

### correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)

## reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

(2) Write an R function to reformat the table.

2. Data visualization

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualize the data.

CHROM POS PI
SL_stacks5 26 0.213904
SL_stacks5 39 0.607843

(1) Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

Solution

#### Read in tables
SD_female     <-     read.table("SD_female.sites.pi", header = T)
SD_female$stat  <-   "SD_female"
SD_male       <-     read.table("SD_male.sites.pi", header = T)
SD_male$stat    <-   "SD_male"
SL_female     <-     read.table("SL_female.sites.pi", header = T)
SL_female$stat  <-   "SL_female"
SL_male       <-     read.table("SL_male.sites.pi", header = T)
SL_male$stat <-      "SL_male"

### combine dataframes
pi          <-      rbind(SD_female, SD_male, SL_female, SL_male)

(2) Generate a boxplot with each stat (boxplot). Are there differences among the conditions? Advanced users can use ggplot2.

Solution

### boxplots
boxplot(pi$PI ~ pi$stat)

(3) Write an R function to generate the boxplots directly.

3. Data analysis

We use the dataset from exercise 2 and try to test if there are differences in nucleotide diversity pi among the 4 conditions.

(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation. Alternatively, you can load the already prepared dataset.

Solution

###load data table
pi <- read.csv("http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/pi_conc.csv")

pi$sex <-     as.factor(pi$sex)
pi$species <- as.factor(pi$species)

(2) Use a simple linear model (lm) to find out if there are sex and/or species effects?

Solution

##simple model
m_sex_species <-  lm(PI~sex*species,data=pi)
m_sex         <-  lm(PI~sex,data=pi)
m_species     <-  lm(PI~species,data=pi)

##model summary
summary(m_sex_species)
summary(m_sex)
summary(m_species)

(3) Have a look at the diagnostic plots. Are there problems with the dataset?

Solution

##diagnostic plots
plot(m_sex)

4. SimRAD

There are many R packages for biologists available. A manual sometimes even a tutorial is provided, which need to be sufficient to run the tool on our data. We going to use SimRAD a R-package that you can use to do an in silco restriction enzyme digestion. We will talk more about RAD on Friday.

SimRAD provides a number of functions to simulate restriction enzyme digestion, library construction and fragments size selection to predict the number of loci expected from most of the Restriction site Associated DNA (RAD) and Genotyping By Sequencing (GBS) approaches. SimRAD aims to provide an estimation of the number of loci expected from a given genome depending on protocol type and parameters allowing to assess feasibility, multiplexing capacity and the amount of sequencing required.

(1) Install "Biostrings" and "ShortRead" from bioconductor and the SimRAD package. Load then the package to make sure that it has been installed successfully.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings", "ShortRead"))

install.packages("SimRAD",dependencies=T)

(2) From the manual we got the following code. The task is now to modify it as following: We going to simulate 100,000 bp genome, use EcoRI and TaqI and select a range of 330-430. Find the information about the restriction enzymes below. How many fragments do you predict?

#Restriction Enzyme 1
#TaqI
cs_5p1 = "T"; cs_3p1 = "CGA"
#Restriction Enzyme 2
#MseI #
cs_5p2 = "T"; cs_3p2 = "TAA"

#simulate a genome
simseq=sim.DNAseq(size=10000, GCfreq=0.51)

#digest the genome 
simseq.dig=insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)

#select appropriate fragmnets
simseq.sel=adapt.select(simseq.dig, type="AB+BA", cs_5p1,cs_3p1, cs_5p2, cs_3p2)

# wide size selection (200-270):
wid.simseq=size.select(simseq.sel,min.size = 200,max.size = 270,graph=T,verbose=T)

EcoRI MseI.jpg

(3) How many fragments do you predict for a 500 Mbp genome?

Additional information