Population Genetics - Simulations with R¶
We can also use R to better understand population genetics. There are packages available, but we could also write our own R scripts to simulate evolving populations. We start with a step-by-step guide using our newly developed R skills.
First bi-allelic population¶
We get started with creating a bi-allelic populations.
# Population parameters: Alleles <- c("A", "a") # alleles p <- 0.5 # frequency for allele A or A1 q <- 1-p # frequency for allele a or A2 fpq <- c(p,q) # combine both frequency Nind <- 100 # number of individuals in a population # We create our first random population: pop <- array(sample(Alleles, 2 * Nind, p = fpq, replace = T ), dim = c(Nind, 2)) head(pop)
Q1 Can you wrap the few R code lines from above into an R function?
Solution #1 - R Function
A possible R function to create a bi-allelic population.
make.pop <- function(p = 0.5, pop.size = 100) { ## Create a bi-allelic population # Genotype gt <- c("A1", "A2") # Frequency for second allele q <- 1-p # Combine frequencies for both alleles fpq <- c(p,q) # Create an array with populations in lines and loci in columns # use <<- to creat a global variable pop <<- array(sample(gt, 2 * pop.size, p = fpq, replace = T), dim = c(pop.size, 2)) } # Create a default population with p = q = 0.5 and n(individuals) = 100 make.pop() # Have a look at the first few lines head(pop)
Allele Labels
I prefer to use the labels A1 and A2 for the differnt alleles instead of e.g. A and a. There are good reasons (e.g. can handle more than two alleles) but you are free to choose. Do not use A and B because this would be reserved for alleles at different loci (e.g. A1 and A2 at locus A, and B1 and B2 at locus B).
Evolving Population¶
How can we simulate the allele frequency over time? We could use the base function sample
to take a sample from the collective set of alleles in a population (gene pool). We want to keep the population size stable over time. Meaning the sample size should be of the same size as the initial population. For this reason, we have to replace the sampling and give each sampling the same probability.
We use the previous population to create a new population putting all alleles into one big gene-pool and draw a random sample. We repeat this step 9 more time to create ten generation in total.
## Create different time points pop1 <- array(sample(pop , size = 200, replace = TRUE), dim = c(100, 2)) pop2 <- array(sample(pop1, size = 200, replace = TRUE), dim = c(100, 2)) pop3 <- array(sample(pop2, size = 200, replace = TRUE), dim = c(100, 2)) pop4 <- array(sample(pop3, size = 200, replace = TRUE), dim = c(100, 2)) pop5 <- array(sample(pop4, size = 200, replace = TRUE), dim = c(100, 2)) pop6 <- array(sample(pop5, size = 200, replace = TRUE), dim = c(100, 2)) pop7 <- array(sample(pop6, size = 200, replace = TRUE), dim = c(100, 2)) pop8 <- array(sample(pop7, size = 200, replace = TRUE), dim = c(100, 2)) pop9 <- array(sample(pop8, size = 200, replace = TRUE), dim = c(100, 2)) pop10 <- array(sample(pop9, size = 200, replace = TRUE), dim = c(100, 2))
Now, we calculate the allele frequency of allele A1 (p) at all ten time points:
p1 <- sum(pop1 == "A1") / 200 p2 <- sum(pop2 == "A1") / 200 p3 <- sum(pop3 == "A1") / 200 p4 <- sum(pop4 == "A1") / 200 p5 <- sum(pop5 == "A1") / 200 p6 <- sum(pop6 == "A1") / 200 p7 <- sum(pop7 == "A1") / 200 p8 <- sum(pop8 == "A1") / 200 p9 <- sum(pop9 == "A1") / 200 p10 <- sum(pop10 == "A1") / 200
Next, we plot the allele frequency over time:
## Plot allele frequency over time p.10g <- array(c(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10)) plot( p.10g, type = "l", col = "tomato", lwd = 2, xlim = range(1:10), ylim = range(0, 1), xlab = "Generation", ylab = "frequency(A)" ) abline(h = 0.5, col = "grey")
Q2 It works but it is not very flexible. Can you find a better way to wrap the R lines above into a more flexible solution?
Solution #2 - Possible Improvements
In a fist step, we could make each step depended from the previous step by removing fixed values. We could also use a if-else loop to create time points and calculate allele frequencies.
## Number of Simulations n.sim <- 10 ## Initial A1 Allele Frequency p0 <- 0.5 ## Empty Martix evo.pop <- matrix(, nrow = n.sim, ncol = length(pop)) ## Prepare Canvas for the Plot plot( 0, p0, xlim = range(0:n.sim), ylim = c(0:1), pch = 3, col = "tomato", xlab = "Generation", ylab = "p == f(A1)" ) ## Plot Allele Frequency for(ii in 1:n.sim) { if (ii == 1) { evo.pop[ii,] <- sample(pop, size = length(pop), replace = T) p <- sum(evo.pop[ii,] == "A1") / length(pop) } else { evo.pop[ii,] <- sample(evo.pop[ii-1,], size = length(pop), replace = T) p <- sum(evo.pop[ii,] == "A1") / length(pop) } points(ii, p, pch = 3, col = "tomato") } abline(h = 0.5, col = "grey")
Genotypes¶
Identify Genotypes¶
Besides the allele frequency we are also interested in the genotype frequencies. For this reason we need to be able to count the number of homozygote (A1A1 or A2A2) and heterozygote (A1A2) individuals.
Identify individuals with genotypes A1A1 (homozygote wt):
for(ii in 1:Nind) { cat(pop[ii,], "-", identical(pop[ii,],c("A1","A1")), "\n") # why print: because in a loop, automatic printing is turned off }
Find homozygote (A1A1 and A2A2) individuals:
# Identify individuals with genotypes AA (homozygote wt): for(ii in 1:Nind) { cat(pop[ii,], "-", identical(pop[ii,],c("A1","A1")) || identical(pop[ii,],c("A2","A2")), "\n") }
Identify heterozygote individuals:
for(ii in 1:Nind) { # we have to count both cases Aa and aA separately! cat(pop[ii,], "-", identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")), "\n") }
Q3 Based on these examples, can you write some R code to identify the genotypes in a population?
Solution #3 - Possible Solution for a GenoTyper
require(crayon) for(ii in 1:Nind) { if (identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")) == "TRUE") { cat(pop[ii,], "→", "Heterozygote", "\n") } else { cat(pop[ii,], "→", "Homozygote", "\n") } }
require(crayon) for(ii in 1:Nind) { if (identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")) == "TRUE") { cat(pop[ii,], red("→"), red("Heterozygote"), "\n") } else { cat(pop[ii,], blue("→"), blue("Homozygote"), "\n") } }
Count Genotype¶
It is nice to highlight or identify the genotypes but we would also need to count them.
# Create Empty Vector: pos <- array(dim = c(Nind, 1)) # Query Genotype: Qgt <- c("A1","A1") # Count the Number of Individuals for(ii in 1:Nind){ pos[ii,] <- identical(pop[ii,], Qgt) } cat("Number of ", Qgt, "genotypes =", sum(pos), "\n")
Alternative solution to count the number of individuals with a certain genotype:
Ngt <- 0 Qgt <- c("A1","A1") for(ii in 1:Nind){ if(identical(pop[ii,], Qgt)) Ngt = Ngt + 1 } cat("Number of", Qgt, "genotypes = ", Ngt, "\n")
Q4 Based on these examples, can you write a R solution to count heterozygote individuals in a population?
Solution #4 - Count Heterozygotes
A first solution where we count case by case:
# Count all A1-A2 cases Ngt1 <- 0 Qgt1 <- c("A1", "A2") for(ii in 1:Nind){ if(identical(pop[ii,], Qgt1)) Ngt1 = Ngt1 + 1 } #Count all A2-A1 caese Ngt2 <- 0 Qgt2 <- c("A2", "A1") for(ii in 1:Nind){ if(identical(pop[ii,], Qgt2)) Ngt2 = Ngt2 + 1 } # Combine the results cat("Number of He =", Ngt1, "+", Ngt2, "=", Ngt1 + Ngt2, "\n")
# Reset Counters NHe <- 0 NHo <- 0 NA1A1 <- 0 NA2A2 <- 0 Nx <- 0 # Counting for(ii in 1:Nind){ if (identical(pop[ii,],c("A1", "A2")) || identical(pop[ii,],c("A2", "A1")) == "TRUE") { NHe = NHe + 1 } else if (identical(pop[ii,],c("A1", "A1")) == "TRUE") { NA1A1 = NA1A1 + 1 } else if (identical(pop[ii,],c("A2", "A2")) == "TRUE") { NA2A2 = NA2A2 + 1 } else { Nx = Nx + 1 } } # Results cat("Number of Ho =", NA1A1 + NA2A2, "(",NA1A1,"+",NA2A2,")" , "\n") cat("Number of He =", NHe, "\n")
Genotype Frequency¶
# make sure the counters are reset NHe <- 0 # A1-A2 or A2-A1 NHo.WT <- 0 # A1-A1 NHo.MU <- 0 # A2-A2 for(ii in 1:Nind){ if(identical(pop[ii,], c("A1", "A2")) || identical(pop[ii,], c("A2", "A1"))) NHe = NHe +1 if(identical(pop[ii,], c("A1","A1"))) NHo.WT = NHo.WT +1 if(identical(pop[ii,], c("A2","A2"))) NHo.MU = NHo.MU +1 } # calculate frequency: freq_Aa <- NHe / Nind freq_AA <- NHo.WT / Nind freq_aa <- NHo.MU / Nind # print frequencies: cat('\n','f(Aa):',freq_Aa,'\n','f(AA):',freq_AA,'\n','f(aa):',freq_aa,'\n\n', 'f(He):',freq_Aa,'\n','f(Ho):',freq_AA + freq_aa,'\n\n')
Multiple Populations¶
So far, we restricted the simulation to one population. Now we calculate the genotype frequency of multiple populations for one time point.
# Number of individuals in a population: Nind <- 100 # Number of populations: Npop <- 500 # Create vector: genotype_frequency <- array(dim=c(Npop, 3)) for(jj in 1:Npop) { # Reset Counters: n_A1A1 <- 0 n_A1A2 <- 0 n_A2A2 <- 0 # Create Population (requires function) make.pop(p = 0.5, pop.size = Nind) for(ii in 1:Nind) { # Count homozygotes A1-A1 if(identical(pop[ii,], c("A1","A1"))) n_A1A1 = n_A1A1 + 1 # Count heterozygotes A1-A2 if(identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1"))) n_A1A2 = n_A1A2 + 1 # Count homozygotes A2-A2 if(identical(pop[ii,],c("A2","A2"))) n_A2A2 = n_A2A2 + 1 } # Determine GT frequencies: freq_A1A1 <- n_A1A1 / Nind freq_A1A2 <- n_A1A2 / Nind freq_A2A2 <- n_A2A2 / Nind # Safe GT frequency for population jj: genotype_frequency[jj,] <- c(freq_A1A1, freq_A1A2, freq_A2A2) } # Boxplot to compare distribution boxplot(((genotype_frequency[1:Nind, 1:3])), notch = TRUE, names = c("A1A1", "A1A2", "A2A2"), col = c("blue", "green", "orange"), ylim = range(0, 1), xlab = "Genotype", ylab = "Genotype-Frequency", main = "Genotype Frequency Distribution" ) abline(h = c(0.25, 0.5), col = "grey", lty = 2)
Pop-Size Matters¶
Let us explore how genotype frequency variation depends on population size.
# Number of individuals in a population: Nind_A <- 10 Nind_B <- 100 Nind_C <- 1000 # Number of populations: Npop <- 1000 # Allele frequency fA1 <- 0.5 # Create vector: He_frequency <- array(dim = c(Npop, 3)) # Counter ll <- 0 for(kk in ) { # Population ID ll <- ll + 1 for(jj in 1:Npop) { # Reset He Counter: n_He <- 0 # Create Population make.pop(p = fA1, pop.size = 2 * kk) for(ii in 1:kk) { # Count He if(identical(pop[ii,], c("A1", "A2")) || identical(pop[ii,], c("A2", "A1"))) n_He = n_He + 1 } # determine heterozygote frequency: He_frequency[jj, ll] <- n_He / kk } } boxplot( He_frequency[, 1:3], col = "blue", notch = TRUE, names = c("N = 10", "N = 100", "N = 1000"), xlab = "Population Size", ylab = "freq(A1A2) == He" ) abline(h = c(0.25, 0.5), lty = 5, col = "grey")
Q5 Can you modify the R code to make it work for more than 3 population sizes?
Solution #5 - Flexible Population Sizes
# Number of individuals in populations: Pop_Sizes <- c(10, seq(100,1000, 100)) # Number of populations: Npop <- 1000 # Allele frequency fA1 <- 0.5 # Create vector: He_frequency <- array(dim = c(Npop, length(Pop_Sizes))) # ID Counter ll <- 0 for(kk in Pop_Sizes) { ## Population ID ll <- ll + 1 for(jj in 1:Npop) { # Reset He Counter: n_He <- 0 # Create Population make.pop(p = fA1, pop.size = 2 * kk) for(ii in 1:kk) { # count genotypes if(identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1"))) n_He = n_He + 1 } # determine heterozygote frequency: He_frequency[jj,ll] <- n_He / kk } } boxplot( He_frequency[, 1:length(Pop_Sizes)], col = "blue", notch = TRUE, names = (lapply(as.character(Pop_Sizes), function(x) paste("N", x, sep="="))), xlab = "Population Size", ylab = "freq(A1A2) == He" ) abline(h = c(0.25, 0.5), lty = 5, col = "grey")
Frequency Ranges¶
We adjust our R code to plot genotype frequencies for a range of p values.
# Number of individuals in a population: Nind <- 100 # Number of populations: Npop <- 500 # Create a range of allele frequencies p_range <- seq(0, 1, length.out = Npop) # Create vector: genotype_frequency <- array(dim = c(Npop, 3)) for (jj in 1:Npop) { # reset genotype counters: n_A1A1 <- 0 n_A1A2 <- 0 n_A2A2 <- 0 # create population pop <- make.pop(p = p_range[jj], pop.size = 2 * Nind) # for (ii in 1:Nind) { # count genotypes: if (identical(pop[ii, ], c("A1", "A1"))) n_A1A1 = n_A1A1 + 1 if (identical(pop[ii, ], c("A1", "A2")) || identical(pop[ii, ], c("A2", "A1"))) n_A1A2 = n_A1A2 + 1 if (identical(pop[ii, ], c("A2", "A2"))) n_A2A2 = n_A2A2 + 1 } # determine genotype frequency: freq_A1A1 <- n_A1A1 / Nind freq_A1A2 <- n_A1A2 / Nind freq_A2A2 <- n_A2A2 / Nind # safe genotype frequency for population jj: genotype_frequency[jj, ] <- c(freq_A1A1, freq_A1A2, freq_A2A2) } ## Plot the GT-frequencies # Simulation for f(A1A1) plot( genotype_frequency[1:Npop, 1], type = "p", xaxt = "n", col = "lightblue", pch = 20, ylim = range(0, 1), xlab = "A1 Allele Frequency (p)", ylab = "Genotype-Frequency" ) # Theoretical curve f(A1A1) points(p_range^2, col = "darkblue", typ = "l", lwd = 2) # Simulation for f(A1A2) (heterozygote) points(genotype_frequency[1:Npop, 2], type = "p", col = "lightgreen", pch = 20) # Theoretical curve f(A1A2) points(2 * p_range * (1 - p_range), col = "green", typ = "l", lwd = 2) # Simulation for f(A2A2) (homozygote) points(genotype_frequency[1:Npop, 3], type = "p", col = "orange", pch = 20) # Theoretical curve f(A2A2) points((1 - p_range) ^ 2, col = "red", typ = "l", lwd = 2) ## Add more opjects to plot axis(1, at = c(seq(0, Npop, Npop / 10)), labels = (seq(0, 1, 0.1))) abline(h = c(seq(0, 1, 0.1)), lty = 5, col = "grey") legend( "top", legend = c("A1A1", "A1A2", "A2A2"), col = c("lightblue", "lightgreen", "orange"), pch = 15, cex = 1.5, bty = "n" )