R for Biologists

Lecture notes¶

Exercise¶

Please work in groups of two. Use a markdown editor and an appropriate R-style. Advanced R-users can use tidyverse. During the exercise fill your comments directly in the GoogleDoc file using the link here.

1. Data manipulation¶

Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.

The goal is to reformat the table using R from the version you have into the format below.

Genus	Species	Sexual_system
Silene	acutifolia	hermaphrodite
Silene	adenocalyx	hermaphrodite
Silene	aegaea	hermaphrodite
Silene	aegyptiaca	hermaphrodite
Silene	aellenii	hermaphrodite
Silene	almolae	hermaphrodite
Silene	alpestris	hermaphrodite
Silene	ammophila	NA
Silene	amoena	NA
Silene	acaulis	non_hermaphrodite
Silene	dioica	non_hermaphrodite
Silene	latifolia	non_hermaphrodite
Silene	vulgaris	non_hermaphrodite

(1) Use R to reformat your table (read.table,gsub,order). Advanced users can use tidyverse.

Solution

## import data
sex <- read.csv("sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))

### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)

### correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)

## reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]

(2) Write an R function to reformat the table.

2. Data visualization¶

Download the tables from the course website here. The zip file contains 4 different tables (SL_female, SL_male, SD_female, SD_male), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualize the data.

CHROM	POS	PI
SL_stacks5	26	0.213904
SL_stacks5	39	0.607843

(1) Import the tables into R (read.table), add the condition (SL_female, SL_male, SD_female, SD_male) to each observation and concatenate the tables (rbind).

Solution

#### Read in tables
SD_female     <-     read.table("SD_female.sites.pi", header = T)
SD_female$stat  <-   "SD_female"
SD_male       <-     read.table("SD_male.sites.pi", header = T)
SD_male$stat    <-   "SD_male"
SL_female     <-     read.table("SL_female.sites.pi", header = T)
SL_female$stat  <-   "SL_female"
SL_male       <-     read.table("SL_male.sites.pi", header = T)
SL_male$stat <-      "SL_male"

### combine dataframes
pi          <-      rbind(SD_female, SD_male, SL_female, SL_male)

(2) Generate a boxplot with each stat (boxplot). Are there differences among the conditions? Advanced users can use ggplot2.

Solution

### boxplots
boxplot(pi$PI ~ pi$stat)

(3) Write an R function to generate the boxplots directly.

3. Data analysis¶

We use the dataset from exercise 2 and try to test if there are differences in nucleotide diversity pi among the 4 conditions.

(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation. Alternatively, you can load the already prepared dataset.

Solution

###load data table
pi <- read.csv("http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/pi_conc.csv")

pi$sex <-     as.factor(pi$sex)
pi$species <- as.factor(pi$species)

(2) Use a simple linear model (lm) to find out if there are sex and/or species effects?

Solution

##simple model
m_sex_species <-  lm(PI~sex*species,data=pi)
m_sex         <-  lm(PI~sex,data=pi)
m_species     <-  lm(PI~species,data=pi)

##model summary
summary(m_sex_species)
summary(m_sex)
summary(m_species)

(3) Have a look at the diagnostic plots. Are there problems with the dataset?

Solution

##diagnostic plots
plot(m_sex)

4. SimRAD¶

There are many R packages for biologists available. A manual sometimes even a tutorial is provided, which need to be sufficient to run the tool on our data. We going to use SimRAD a R-package that you can use to do an in silco restriction enzyme digestion. We will talk more about RAD on Friday.

SimRAD provides a number of functions to simulate restriction enzyme digestion, library construction and fragments size selection to predict the number of loci expected from most of the Restriction site Associated DNA (RAD) and Genotyping By Sequencing (GBS) approaches. SimRAD aims to provide an estimation of the number of loci expected from a given genome depending on protocol type and parameters allowing to assess feasibility, multiplexing capacity and the amount of sequencing required.

(1) Install "Biostrings" and "ShortRead" from bioconductor and the SimRAD package. Load then the package to make sure that it has been installed successfully.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install(c("Biostrings", "ShortRead"))

install.packages("SimRAD",dependencies=T)

(2) From the manual we got the following code. The task is now to modify it as following: We going to simulate 100,000 bp genome, use EcoRI and TaqI and select a range of 330-430. Find the information about the restriction enzymes below. How many fragments do you predict?

#Restriction Enzyme 1
#TaqI
cs_5p1 = "T"; cs_3p1 = "CGA"
#Restriction Enzyme 2
#MseI #
cs_5p2 = "T"; cs_3p2 = "TAA"

#simulate a genome
simseq=sim.DNAseq(size=10000, GCfreq=0.51)

#digest the genome 
simseq.dig=insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE)

#select appropriate fragmnets
simseq.sel=adapt.select(simseq.dig, type="AB+BA", cs_5p1,cs_3p1, cs_5p2, cs_3p2)

# wide size selection (200-270):
wid.simseq=size.select(simseq.sel,min.size = 200,max.size = 270,graph=T,verbose=T)

EcoRI

(3) How many fragments do you predict for a 500 Mbp genome?

R for Biologists

Lecture notes¶

Exercise¶

1. Data manipulation¶

2. Data visualization¶

3. Data analysis¶

4. SimRAD¶

Additional information¶