Population Genetics - Simulations with R


We can also use R to better understand population genetics. There are packages available, but we could also write our own R scripts to simulate evolving populations. We start with a step-by-step guide using our newly developed R skills.

First bi-allelic population

We get started with creating a bi-allelic populations.

# Population parameters:
Alleles <- c("A", "a") # alleles
p       <- 0.5         # frequency for allele A or A1
q       <- 1-p         # frequency for allele a or A2
fpq     <- c(p,q)      # combine both frequency 
Nind    <- 100         # number of individuals in a population

# We create our first random population:
pop <- array(sample(Alleles, 2 * Nind, p = fpq, replace = T ), dim = c(Nind, 2))
head(pop)

Q1 Can you wrap the few R code lines from above into an R function?

Solution #1 - R Function

A possible R function to create a bi-allelic population.

make.pop <- function(p = 0.5, pop.size = 100) {
  ## Create a bi-allelic population
  # Genotype
  gt <- c("A1", "A2")
  # Frequency for second allele
  q <- 1-p
  # Combine frequencies for both alleles
  fpq <- c(p,q)
  # Create an array with populations in lines and loci in columns
  # use <<- to creat a global variable
  pop <<- array(sample(gt, 2 * pop.size, p = fpq, replace = T), dim = c(pop.size, 2))
}

# Create a default population with p = q = 0.5 and n(individuals) = 100
make.pop()
# Have a look at the first few lines
head(pop)

Allele Labels

I prefer to use the labels A1 and A2 for the differnt alleles instead of e.g. A and a. There are good reasons (e.g. can handle more than two alleles) but you are free to choose. Do not use A and B because this would be reserved for alleles at different loci (e.g. A1 and A2 at locus A, and B1 and B2 at locus B).

Evolving Population

How can we simulate the allele frequency over time? We could use the base function sample to take a sample from the collective set of alleles in a population (gene pool). We want to keep the population size stable over time. Meaning the sample size should be of the same size as the initial population. For this reason, we have to replace the sampling and give each sampling the same probability.

We use the previous population to create a new population putting all alleles into one big gene-pool and draw a random sample. We repeat this step 9 more time to create ten generation in total.

## Create different time points
pop1  <- array(sample(pop , size = 200,  replace = TRUE), dim = c(100, 2))
pop2  <- array(sample(pop1, size = 200,  replace = TRUE), dim = c(100, 2))
pop3  <- array(sample(pop2, size = 200,  replace = TRUE), dim = c(100, 2))
pop4  <- array(sample(pop3, size = 200,  replace = TRUE), dim = c(100, 2))
pop5  <- array(sample(pop4, size = 200,  replace = TRUE), dim = c(100, 2))
pop6  <- array(sample(pop5, size = 200,  replace = TRUE), dim = c(100, 2))
pop7  <- array(sample(pop6, size = 200,  replace = TRUE), dim = c(100, 2))
pop8  <- array(sample(pop7, size = 200,  replace = TRUE), dim = c(100, 2))
pop9  <- array(sample(pop8, size = 200,  replace = TRUE), dim = c(100, 2))
pop10 <- array(sample(pop9, size = 200,  replace = TRUE), dim = c(100, 2))

Now, we calculate the allele frequency of allele A1 (p) at all ten time points:

p1  <- sum(pop1  == "A1") / 200
p2  <- sum(pop2  == "A1") / 200
p3  <- sum(pop3  == "A1") / 200
p4  <- sum(pop4  == "A1") / 200
p5  <- sum(pop5  == "A1") / 200
p6  <- sum(pop6  == "A1") / 200
p7  <- sum(pop7  == "A1") / 200
p8  <- sum(pop8  == "A1") / 200
p9  <- sum(pop9  == "A1") / 200
p10 <- sum(pop10 == "A1") / 200

Next, we plot the allele frequency over time:

## Plot allele frequency over time
p.10g <- array(c(p1, p2, p3, p4, p5, p6, p7, p8, p9, p10))
plot(
  p.10g,
  type = "l",
  col = "tomato",
  lwd = 2,
  xlim = range(1:10),
  ylim = range(0, 1),
  xlab = "Generation",
  ylab = "frequency(A)"
)
abline(h = 0.5, col = "grey")

Q2 It works but it is not very flexible. Can you find a better way to wrap the R lines above into a more flexible solution?

Solution #2 - Possible Improvements

In a fist step, we could make each step depended from the previous step by removing fixed values. We could also use a if-else loop to create time points and calculate allele frequencies.

## Number of Simulations
n.sim <- 10

## Initial A1 Allele Frequency
p0 <- 0.5

## Empty Martix
evo.pop <- matrix(, nrow = n.sim, ncol = length(pop))

## Prepare Canvas for the Plot
plot(
  0,
  p0,
  xlim = range(0:n.sim),
  ylim = c(0:1),
  pch = 3,
  col = "tomato",
  xlab = "Generation",
  ylab = "p == f(A1)"
)

## Plot Allele Frequency
for(ii in 1:n.sim) {
  if (ii == 1) {
    evo.pop[ii,] <- sample(pop, size = length(pop), replace = T)
    p  <- sum(evo.pop[ii,]  == "A1") / length(pop)
  } else {
    evo.pop[ii,] <- sample(evo.pop[ii-1,], size = length(pop), replace = T)
    p  <- sum(evo.pop[ii,]  == "A1") / length(pop)
  }
  points(ii, p, pch = 3, col = "tomato")
}
abline(h = 0.5, col = "grey")

Genotypes

Identify Genotypes

Besides the allele frequency we are also interested in the genotype frequencies. For this reason we need to be able to count the number of homozygote (A1A1 or A2A2) and heterozygote (A1A2) individuals.

Identify individuals with genotypes A1A1 (homozygote wt):

for(ii in 1:Nind)
{
  cat(pop[ii,], "-", identical(pop[ii,],c("A1","A1")), "\n") # why print: because in a loop, automatic printing is turned off
}

Find homozygote (A1A1 and A2A2) individuals:

# Identify individuals with genotypes AA (homozygote wt):
for(ii in 1:Nind)
{
  cat(pop[ii,], "-", identical(pop[ii,],c("A1","A1")) || identical(pop[ii,],c("A2","A2")), "\n")
}

Identify heterozygote individuals:

for(ii in 1:Nind)
{
  # we have to count both cases Aa and aA separately!
  cat(pop[ii,], "-", identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")), "\n")
}

Q3 Based on these examples, can you write some R code to identify the genotypes in a population?

Solution #3 - Possible Solution for a GenoTyper

require(crayon)
for(ii in 1:Nind)
{
  if (identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")) == "TRUE") {
    cat(pop[ii,], "→", "Heterozygote", "\n")
  } else {
    cat(pop[ii,], "→", "Homozygote", "\n")
  }
}
A more colourful solution but you need the R package **crayon** to make it work.
require(crayon)
for(ii in 1:Nind)
{
  if (identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")) == "TRUE") {
    cat(pop[ii,], red("→"), red("Heterozygote"), "\n")
  } else {
    cat(pop[ii,], blue("→"), blue("Homozygote"), "\n")
  }
}

Count Genotype

It is nice to highlight or identify the genotypes but we would also need to count them.

# Create Empty Vector:
pos <- array(dim = c(Nind, 1))

# Query Genotype:
Qgt <- c("A1","A1")

# Count the Number of Individuals
for(ii in 1:Nind){
  pos[ii,] <- identical(pop[ii,], Qgt)
}

cat("Number of ", Qgt, "genotypes =", sum(pos), "\n")

Alternative solution to count the number of individuals with a certain genotype:

Ngt <- 0
Qgt <- c("A1","A1")

for(ii in 1:Nind){
  if(identical(pop[ii,], Qgt))
  Ngt = Ngt + 1
}

cat("Number of", Qgt, "genotypes = ", Ngt, "\n")

Q4 Based on these examples, can you write a R solution to count heterozygote individuals in a population?

Solution #4 - Count Heterozygotes

A first solution where we count case by case:

# Count all A1-A2 cases
Ngt1  <- 0
Qgt1 <- c("A1", "A2")

for(ii in 1:Nind){
  if(identical(pop[ii,], Qgt1))
  Ngt1 = Ngt1 + 1
}

#Count all A2-A1 caese
Ngt2  <- 0
Qgt2 <- c("A2", "A1")

for(ii in 1:Nind){
  if(identical(pop[ii,], Qgt2))
  Ngt2 = Ngt2 + 1
}

# Combine the results
cat("Number of He =", Ngt1, "+", Ngt2, "=", Ngt1 + Ngt2, "\n")
An alternative solution:
# Reset Counters
NHe   <- 0
NHo   <- 0
NA1A1 <- 0
NA2A2 <- 0
Nx    <- 0

# Counting
for(ii in 1:Nind){
  if (identical(pop[ii,],c("A1", "A2")) || identical(pop[ii,],c("A2", "A1")) == "TRUE") {
    NHe = NHe + 1
  } else if (identical(pop[ii,],c("A1", "A1")) == "TRUE") {
    NA1A1 = NA1A1 + 1
  } else if (identical(pop[ii,],c("A2", "A2")) == "TRUE") {
    NA2A2 = NA2A2 + 1
  } else {
    Nx = Nx + 1
  }
}

# Results
cat("Number of Ho =", NA1A1 + NA2A2, "(",NA1A1,"+",NA2A2,")" , "\n")
cat("Number of He =", NHe, "\n")

Genotype Frequency

# make sure the counters are reset
NHe    <- 0 # A1-A2 or A2-A1
NHo.WT <- 0 # A1-A1
NHo.MU <- 0 # A2-A2

for(ii in 1:Nind){
  if(identical(pop[ii,], c("A1", "A2")) || identical(pop[ii,], c("A2", "A1")))
    NHe = NHe +1
  if(identical(pop[ii,], c("A1","A1")))
    NHo.WT = NHo.WT +1
  if(identical(pop[ii,], c("A2","A2")))
    NHo.MU = NHo.MU +1
}

# calculate frequency:
freq_Aa <- NHe / Nind
freq_AA <- NHo.WT / Nind
freq_aa <- NHo.MU / Nind

# print frequencies:
cat('\n','f(Aa):',freq_Aa,'\n','f(AA):',freq_AA,'\n','f(aa):',freq_aa,'\n\n',
    'f(He):',freq_Aa,'\n','f(Ho):',freq_AA + freq_aa,'\n\n')

Multiple Populations

So far, we restricted the simulation to one population. Now we calculate the genotype frequency of multiple populations for one time point.

# Number of individuals in a population:
Nind <- 100

# Number of populations:
Npop <- 500

# Create vector:
genotype_frequency <- array(dim=c(Npop, 3))

for(jj in 1:Npop)
{
  # Reset Counters:
  n_A1A1 <- 0
  n_A1A2 <- 0
  n_A2A2 <- 0

  # Create Population (requires function)
  make.pop(p = 0.5, pop.size = Nind)

  for(ii in 1:Nind)
  {
    # Count homozygotes A1-A1
    if(identical(pop[ii,], c("A1","A1")))
      n_A1A1 = n_A1A1 + 1
    # Count heterozygotes A1-A2
    if(identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")))
      n_A1A2 = n_A1A2 + 1
    # Count homozygotes A2-A2
    if(identical(pop[ii,],c("A2","A2")))
      n_A2A2 = n_A2A2 + 1
  }

  # Determine GT frequencies:
  freq_A1A1 <- n_A1A1 / Nind
  freq_A1A2 <- n_A1A2 / Nind
  freq_A2A2 <- n_A2A2 / Nind

  # Safe GT frequency for population jj:
  genotype_frequency[jj,] <- c(freq_A1A1, freq_A1A2, freq_A2A2)
}

# Boxplot to compare distribution
boxplot(((genotype_frequency[1:Nind, 1:3])),
        notch = TRUE,
        names = c("A1A1", "A1A2", "A2A2"),
        col = c("blue", "green", "orange"),
        ylim = range(0, 1),
        xlab = "Genotype",
        ylab = "Genotype-Frequency",
        main = "Genotype Frequency Distribution"
)
abline(h = c(0.25, 0.5), col = "grey", lty = 2)

Pop-Size Matters

Let us explore how genotype frequency variation depends on population size.

# Number of individuals in a population:
Nind_A <- 10
Nind_B <- 100
Nind_C <- 1000

# Number of populations:
Npop <- 1000

# Allele frequency
fA1 <- 0.5

# Create vector:
He_frequency <- array(dim = c(Npop, 3))

# Counter
ll <- 0

for(kk in )
{

  # Population ID
  ll <- ll + 1

  for(jj in 1:Npop)
  {
    # Reset He Counter:
    n_He <- 0

    # Create Population
    make.pop(p = fA1, pop.size = 2 * kk)

    for(ii in 1:kk)
    {
      # Count He
      if(identical(pop[ii,], c("A1", "A2")) || identical(pop[ii,], c("A2", "A1")))
        n_He = n_He + 1
    }

    # determine heterozygote frequency:
    He_frequency[jj, ll] <- n_He / kk

  }
}

boxplot(
    He_frequency[, 1:3],
    col = "blue", notch = TRUE,
    names = c("N = 10", "N = 100", "N = 1000"),
    xlab = "Population Size",
    ylab = "freq(A1A2) == He"
)
abline(h = c(0.25, 0.5), lty = 5, col = "grey")

Q5 Can you modify the R code to make it work for more than 3 population sizes?

Solution #5 - Flexible Population Sizes

# Number of individuals in populations:
Pop_Sizes <- c(10, seq(100,1000, 100))

# Number of populations:
Npop <- 1000

# Allele frequency
fA1 <- 0.5

# Create vector:
He_frequency <- array(dim = c(Npop, length(Pop_Sizes)))

# ID Counter
ll <- 0

for(kk in Pop_Sizes)
{

  ## Population ID
  ll <- ll + 1

  for(jj in 1:Npop)
  {
    # Reset He Counter:
    n_He <- 0

    # Create Population
    make.pop(p = fA1, pop.size = 2 * kk)

    for(ii in 1:kk)
    {
      # count genotypes
      if(identical(pop[ii,],c("A1","A2")) || identical(pop[ii,],c("A2","A1")))
        n_He = n_He + 1
    }

    # determine heterozygote frequency:
    He_frequency[jj,ll] <- n_He / kk

  }
}

boxplot(
    He_frequency[, 1:length(Pop_Sizes)],
    col = "blue", notch = TRUE,
    names = (lapply(as.character(Pop_Sizes), function(x) paste("N", x, sep="="))),
    xlab = "Population Size",
    ylab = "freq(A1A2) == He"
)
abline(h = c(0.25, 0.5), lty = 5, col = "grey")

Frequency Ranges

We adjust our R code to plot genotype frequencies for a range of p values.

# Number of individuals in a population:
Nind <- 100

# Number of populations:
Npop <- 500

# Create a range of allele frequencies
p_range <- seq(0, 1, length.out = Npop)

# Create vector:
genotype_frequency <- array(dim = c(Npop, 3))

for (jj in 1:Npop)
{
  # reset genotype counters:
  n_A1A1 <- 0
  n_A1A2 <- 0
  n_A2A2 <- 0

  # create population
  pop <- make.pop(p = p_range[jj], pop.size = 2 * Nind)

  #
  for (ii in 1:Nind)
  {
    # count genotypes:
    if (identical(pop[ii, ], c("A1", "A1")))
      n_A1A1 = n_A1A1 + 1
    if (identical(pop[ii, ], c("A1", "A2")) ||
        identical(pop[ii, ], c("A2", "A1")))
      n_A1A2 = n_A1A2 + 1
    if (identical(pop[ii, ], c("A2", "A2")))
      n_A2A2 = n_A2A2 + 1
  }

  # determine genotype frequency:
  freq_A1A1 <- n_A1A1 / Nind
  freq_A1A2 <- n_A1A2 / Nind
  freq_A2A2 <- n_A2A2 / Nind

  # safe genotype frequency for population jj:
  genotype_frequency[jj, ] <- c(freq_A1A1, freq_A1A2, freq_A2A2)
}

## Plot the GT-frequencies

# Simulation for f(A1A1)
plot(
  genotype_frequency[1:Npop, 1],
  type = "p",
  xaxt = "n",
  col = "lightblue",
  pch = 20,
  ylim = range(0, 1),
  xlab = "A1 Allele Frequency (p)",
  ylab = "Genotype-Frequency"
)

# Theoretical curve f(A1A1)
points(p_range^2, col = "darkblue", typ = "l", lwd = 2)

# Simulation for f(A1A2) (heterozygote)
points(genotype_frequency[1:Npop, 2],
       type = "p",
       col = "lightgreen",
       pch = 20)

# Theoretical curve f(A1A2)
points(2 * p_range * (1 - p_range), col = "green", typ = "l", lwd = 2)

# Simulation for f(A2A2) (homozygote)
points(genotype_frequency[1:Npop, 3],
       type = "p",
       col = "orange",
       pch = 20)

# Theoretical curve f(A2A2)
points((1 - p_range) ^ 2, col = "red", typ = "l", lwd = 2)

## Add more opjects to plot
axis(1, at = c(seq(0, Npop, Npop / 10)), labels = (seq(0, 1, 0.1)))
abline(h = c(seq(0, 1, 0.1)), lty = 5, col = "grey")
legend(
  "top",
  legend = c("A1A1", "A1A2", "A2A2"),
  col = c("lightblue", "lightgreen", "orange"),
  pch = 15,
  cex = 1.5,
  bty = "n"
)