R for Biologists
Lecture notes¶
Exercise¶
Please work in groups of two. Use a markdown editor and an appropriate R-style. Advanced R-users can use tidyverse. During the exercise fill your comments directly in the GoogleDoc file using the link here.
1. Data manipulation¶
Download the table from the course website here. It is a subset of the sex of life database. For each plant species the sexual system is provided.
The goal is to reformat the table using R from the version you have into the format below.
Genus | Species | Sexual_system |
---|---|---|
Silene | acutifolia | hermaphrodite |
Silene | adenocalyx | hermaphrodite |
Silene | aegaea | hermaphrodite |
Silene | aegyptiaca | hermaphrodite |
Silene | aellenii | hermaphrodite |
Silene | almolae | hermaphrodite |
Silene | alpestris | hermaphrodite |
Silene | ammophila | NA |
Silene | amoena | NA |
Silene | acaulis | non_hermaphrodite |
Silene | dioica | non_hermaphrodite |
Silene | latifolia | non_hermaphrodite |
Silene | vulgaris | non_hermaphrodite |
(1) Use R to reformat your table (read.table
,gsub
,order
). Advanced users can use tidyverse.
Solution
## import data
sex <- read.csv("sex.csv", header = F, na.strings = c("NA", ""), col.names = c("Genus", "Species", "Sexual_System"))
### correct Genus names
sex$Genus <- gsub("silene", "Silene", sex$Genus)
### correct Sex system
sex$Sexual_System <- gsub("gynodioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("trioecy", "non_hermaphrodite", sex$Sexual_System)
sex$Sexual_System <- gsub("dioecy", "non_hermaphrodite", sex$Sexual_System)
## reorder table
sex_sort <- sex[order(sex$Sexual_System, sex$Species), ]
(2) Write an R function to reformat the table.
2. Data visualization¶
Download the tables from the course website here. The zip file contains 4 different tables (SL_female
, SL_male
, SD_female
, SD_male
), which corresponds to 4 conditions (S. latifolia females and males and S. dioica females and males). For each genomic location (Chromosome and Position) the feature nucleotide diversity pi is shown. The goal is to visualize the data.
CHROM | POS | PI |
---|---|---|
SL_stacks5 | 26 | 0.213904 |
SL_stacks5 | 39 | 0.607843 |
(1) Import the tables into R (read.table
), add the condition (SL_female
, SL_male
, SD_female
, SD_male
) to each observation and concatenate the tables (rbind
).
Solution
#### Read in tables SD_female <- read.table("SD_female.sites.pi", header = T) SD_female$stat <- "SD_female" SD_male <- read.table("SD_male.sites.pi", header = T) SD_male$stat <- "SD_male" SL_female <- read.table("SL_female.sites.pi", header = T) SL_female$stat <- "SL_female" SL_male <- read.table("SL_male.sites.pi", header = T) SL_male$stat <- "SL_male" ### combine dataframes pi <- rbind(SD_female, SD_male, SL_female, SL_male)
(2) Generate a boxplot with each stat (boxplot
). Are there differences among the conditions? Advanced users can use ggplot2.
Solution
### boxplots boxplot(pi$PI ~ pi$stat)
(3) Write an R function to generate the boxplots directly.
3. Data analysis¶
We use the dataset from exercise 2 and try to test if there are differences in nucleotide diversity pi among the 4 conditions.
(1) Add to the table a factor sex (male, female) and species (SL, SD) to each observation. Alternatively, you can load the already prepared dataset.
Solution
###load data table pi <- read.csv("http://gdc-docs.ethz.ch/GeneticDiversityAnalysis/GDA19/data/pi_conc.csv") pi$sex <- as.factor(pi$sex) pi$species <- as.factor(pi$species)
(2) Use a simple linear model (lm
) to find out if there are sex and/or species effects?
Solution
##simple model m_sex_species <- lm(PI~sex*species,data=pi) m_sex <- lm(PI~sex,data=pi) m_species <- lm(PI~species,data=pi) ##model summary summary(m_sex_species) summary(m_sex) summary(m_species)
(3) Have a look at the diagnostic plots. Are there problems with the dataset?
Solution
##diagnostic plots plot(m_sex)
4. SimRAD¶
There are many R packages for biologists available. A manual sometimes even a tutorial is provided, which need to be sufficient to run the tool on our data. We going to use SimRAD a R-package that you can use to do an in silco restriction enzyme digestion. We will talk more about RAD on Friday.
SimRAD provides a number of functions to simulate restriction enzyme digestion, library construction and fragments size selection to predict the number of loci expected from most of the Restriction site Associated DNA (RAD) and Genotyping By Sequencing (GBS) approaches. SimRAD aims to provide an estimation of the number of loci expected from a given genome depending on protocol type and parameters allowing to assess feasibility, multiplexing capacity and the amount of sequencing required.
(1) Install "Biostrings" and "ShortRead" from bioconductor and the SimRAD package. Load then the package to make sure that it has been installed successfully.
if (!requireNamespace("BiocManager")) install.packages("BiocManager") BiocManager::install(c("Biostrings", "ShortRead")) install.packages("SimRAD",dependencies=T)
(2) From the manual we got the following code. The task is now to modify it as following: We going to simulate 100,000 bp genome, use EcoRI and TaqI and select a range of 330-430. Find the information about the restriction enzymes below. How many fragments do you predict?
#Restriction Enzyme 1 #TaqI cs_5p1 = "T"; cs_3p1 = "CGA" #Restriction Enzyme 2 #MseI # cs_5p2 = "T"; cs_3p2 = "TAA" #simulate a genome simseq=sim.DNAseq(size=10000, GCfreq=0.51) #digest the genome simseq.dig=insilico.digest(simseq, cs_5p1, cs_3p1, cs_5p2, cs_3p2, verbose=TRUE) #select appropriate fragmnets simseq.sel=adapt.select(simseq.dig, type="AB+BA", cs_5p1,cs_3p1, cs_5p2, cs_3p2) # wide size selection (200-270): wid.simseq=size.select(simseq.sel,min.size = 200,max.size = 270,graph=T,verbose=T)
(3) How many fragments do you predict for a 500 Mbp genome?