This vignette is a re-analysis of the data set from Chaillou et al. (2015), as suggested in the Homeworks section of the slides. Note that this is only one of many analyses that could be done on the data.
We first setup our environment by loading a few packages
library(tidyverse) ## data manipulation
library(phyloseq) ## analysis of microbiome census data
library(ape) ## for tree manipulation
library(vegan) ## for community ecology analyses
## You may also need to install gridExtra with
## install.packages("gridExtra")
## and load it with
## library(gridExtra)
## if you want to reproduce some of the side-by-side graphs shown below
We then load the data from the Chaillou et al. (2015) dataset. We assume that they are located in the data/chaillou
folder. A quick look at the folder content shows we have access to a tree (in newick format, file extension nwk
) and a biom file (extension biom
). A glance at the biom file content shows that the taxonomy is "k__Bacteria", "p__Tenericutes", "c__Mollicutes", "o__Mycoplasmatales", "f__Mycoplasmataceae", "g__Candidatus Lumbricincola", "s__NA"
. This is the so called greengenes format and the taxonomy should thus be parsed using the parse_taxonomy_greengenes
function.
food <- import_biom("data/chaillou/chaillou.biom",
parseFunction = parse_taxonomy_greengenes)
We can then add the phylogenetic tree to the food
object
phy_tree(food) <- read_tree("data/chaillou/tree.nwk")
food
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 508 taxa and 64 samples ]
## sample_data() Sample Data: [ 64 samples by 3 sample variables ]
## tax_table() Taxonomy Table: [ 508 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 508 tips and 507 internal nodes ]
The data consists of 64 samples of food: 8 replicates for each of 8 food types. We have access to the following descriptors
We can navigate through the metadata
EnvType is coded in French and with categories in no meaningful order. We will translate them and order them to have food type corresponding to meat first and to seafood second.
## We create a "dictionary" for translation and order the categories
## as we want
dictionary = c("BoeufHache" = "Ground_Beef",
"VeauHache" = "Ground_Veal",
"MerguezVolaille" = "Poultry_Sausage",
"DesLardons" = "Bacon_Dice",
"SaumonFume" = "Smoked_Salmon",
"FiletSaumon" = "Salmon_Fillet",
"FiletCabillaud" = "Cod_Fillet",
"Crevette" = "Shrimp")
env_type <- sample_data(food)$EnvType
sample_data(food)$EnvType <- factor(dictionary[env_type], levels = dictionary)
We can also build a custom color palette to remind ourselves which samples correspond to meat and which to seafood.
my_palette <- c('#67001f','#b2182b','#d6604d','#f4a582',
'#92c5de','#4393c3','#2166ac','#053061')
names(my_palette) <- dictionary
Before moving on to elaborate statistics, we’ll look at the taxonomic composition of our samples ## Taxonomic Composition {.tabset}
To use the plot_composition
function, we first need to source it:
## If not installed, installed via
## remotes::install_github("mahendra-mariadassou/phyloseq-extended", branch = "dev")
library(phyloseq.extended)
## Alternative if installation of graphical methods fails:
## source only graphical functions
# download.file(url = "https://raw.githubusercontent.com/mahendra-mariadassou/phyloseq-extended/master/R/graphical_methods.R",
# destfile = "graphical_methods.R")
# source("graphical_methods.R")
We then look at the community composition at the phylum level. In order to highlight the structure of the data, we split the samples according to their food of origin. We show several figures, corresponding to different taxonomic levels.
We can see right away that meat products are whereas Bacteroidetes and Proteobacteria are more abundant in seafood.
plot_composition(food, "Kingdom", "Bacteria", "Phylum", fill = "Phylum") +
facet_grid(~EnvType, scales = "free_x", space = "free_x") +
theme(axis.text.x = element_blank())
We have access to the following taxonimic ranks:
rank_names(food)
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
Given the importance of Firmicutes, we can zoom in within that phylum and investigate the composition at the genus level.
plot_composition(food, "Phylum", "Firmicutes", "Genus", fill = "Genus", numberOfTaxa = 5) +
facet_grid(~EnvType, scales = "free_x", space = "free_x") +
theme(axis.text.x = element_blank())
We see that there is a high diversity of genera across the different types of food and that no single OTU dominates all communities.
The samples have very similar sampling depths
sample_sums(food) %>% range()
## [1] 11718 11857
there is thus no need for rarefaction.
We compare the different types of food in terms of diversity.
plot_richness(food, x = "EnvType", color = "EnvType",
measures = c("Observed", "Shannon", "InvSimpson")) +
geom_boxplot(aes(group = EnvType)) + ## add one boxplot per type of food
scale_color_manual(values = my_palette) ## custom color palette
Different foods have very different diversities with dice of bacon having the highest number (\(\sim\) 250) of OTUs. However, most foods have low effective diversities.
We can quantify the previous claims by performing an ANOVA of the diversity against the covariates of interest. For the sake of brevity, we focus here on both the observed and effective number of species (as measured by the InvSimpson measure). We first build a data.frame with both covariates and diversity indices.
div_data <- cbind(estimate_richness(food), ## diversity indices
sample_data(food) ## covariates
)
Foods differ significantly in terms of number of observed OTUs…
model <- aov(Observed ~ EnvType, data = div_data)
anova(model)
## Analysis of Variance Table
##
## Response: Observed
## Df Sum Sq Mean Sq F value Pr(>F)
## EnvType 7 86922 12417.5 12.488 1.634e-09 ***
## Residuals 56 55686 994.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
… but no in terms of effective number of species
model <- aov(InvSimpson ~ EnvType, data = div_data)
anova(model)
## Analysis of Variance Table
##
## Response: InvSimpson
## Df Sum Sq Mean Sq F value Pr(>F)
## EnvType 7 486.26 69.466 1.2785 0.2777
## Residuals 56 3042.72 54.334
We have access to a phylogenetic tree so we can compute all 4 distances seen during the tutorial.
dist.jac <- distance(food, method = "cc")
dist.bc <- distance(food, method = "bray")
dist.uf <- distance(food, method = "unifrac")
dist.wuf <- distance(food, method = "wunifrac")
p.jac <- plot_ordination(food,
ordinate(food, method = "MDS", distance = dist.jac),
color = "EnvType") +
ggtitle("Jaccard") + scale_color_manual(values = my_palette)
p.bc <- plot_ordination(food,
ordinate(food, method = "MDS", distance = dist.bc),
color = "EnvType") +
ggtitle("Bary-Curtis") + scale_color_manual(values = my_palette)
p.uf <- plot_ordination(food,
ordinate(food, method = "MDS", distance = dist.uf),
color = "EnvType") +
ggtitle("UniFrac") + scale_color_manual(values = my_palette)
p.wuf <- plot_ordination(food,
ordinate(food, method = "MDS", distance = dist.wuf),
color = "EnvType") +
ggtitle("wUniFrac") + scale_color_manual(values = my_palette)
gridExtra::grid.arrange(p.jac, p.bc, p.uf, p.wuf,
ncol = 2)
In this example, Jaccard and UniFrac distances provide a clear separation of meats and seafoods, unlike Bray-Curtis and wUniFrac. This means that food ecosystems share their abundant taxa but not the rare ones.
UniFrac also gives a much better separation of poultry sausages and ground beef / veal than Jaccard. This means that although those foods have taxa in common, their specific taxa are located in different parts of the phylogenetic tree. Since UniFrac appears to the most relevant distance here, we’ll restrict downstream analyses to it.
One can also note that dices of bacon are located between meats and seafoods. We’ll come back latter to that point latter.
In the previous graphs, we use unconstrained ordination to eyeball global structure. We can also try constrained ordination to restrict our attention to the variability that can be explained by some variables (the constraints). Constrained ordination is done with CAP
(Constrained Analysis of Proximities) and you need to specify which constraints you are going to use with the formula
parameter.
Let’s have a look at EnvType
first. We know from previous analyses that this is a strong structuring factor. Limiting ourselves to the structure that is compatible with EnvType
allow us to identify the general patterns seen without constraints (the percent of variations associatied to the axes are very similar to the ones found before).
plot_ordination(food,
ordinate(food, method = "CAP", distance = dist.uf,
formula = ~ EnvType),
color = "EnvType") +
ggtitle("UniFrac (constrained by EnvType)") + scale_color_manual(values = my_palette)
Contrast this with Description
, which is a bookkeeping variable, absolutely not related to the microbial community. At first glance, the structure is preserved but a closer look at the percentage of variability explained by the different cap axes is depressing. Limiting ourselves to the structure that is compatible with Description
makes us lose the general patterns found in the unconstrained analysis.
plot_ordination(food,
ordinate(food, method = "CAP", distance = dist.uf,
formula = ~ Description),
color = "EnvType") +
ggtitle("UniFrac (constrained by Description)") +
scale_color_manual(values = my_palette)
The hierarchical clustering of uniFrac distances using the Ward linkage function (to produce spherical clusters) show a perfect separation according.
par(mar = c(1, 0, 2, 0))
plot_clust(food, dist = "unifrac", method = "ward.D2", color = "EnvType",
palette = my_palette,
title = "Clustering of samples (UniFrac + Ward)\nsamples colored by EnvType")
We use the Unifrac distance to assess the variability in terms of microbial repertoires between foods.
metadata <- sample_data(food) %>% as("data.frame")
model <- adonis(dist.uf ~ EnvType, data = metadata, permutations = 999)
model
##
## Call:
## adonis(formula = dist.uf ~ EnvType, data = metadata, permutations = 999)
##
## Permutation: free
## Number of permutations: 999
##
## Terms added sequentially (first to last)
##
## Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
## EnvType 7 7.6565 1.09379 13.699 0.63132 0.001 ***
## Residuals 56 4.4713 0.07984 0.36868
## Total 63 12.1278 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that food origin explains 63.13% of the total variability observed between samples in terms of species repertoires. This is quite strong! Don’t expect such strong results in general.
We investigate the content of our samples when looking at the raw count table and use a custom color scale to reproduce (kind of) the figures in the paper.
p <- plot_heatmap(food) +
facet_grid(~EnvType, scales = "free_x", space = "free_x") +
scale_fill_gradient2(low = "#1a9850", mid = "#ffffbf", high = "#d73027",
na.value = "white", trans = scales::log_trans(4),
midpoint = log(100, base = 4))
plot(p)
It looks like: - there is some kind of block structure in the data (eg. OTUs that are abundant in seafoods but not meat products or vice-versa) - dices of bacon have a lot of OTU in common with seafoods. This is indeed the case and the result of “contamination”, sea salt is usually added to bacon to add flavor. But, in addition to flavor, the grains of salt also bring sea-specific bacterial taxa (at least their genetic material) with them.
To highlight the structure, we’re going to cluster the OTUs based on their prevalence / specifity patterns across the different food types.
otu_prev_spec <- estimate_prevalence(food, group = "EnvType", format = "wide", rarefy = FALSE)
otu_clust <- hclust(dist(otu_prev_spec), method = "ward.D2")
otu_order <- otu_clust$labels[otu_clust$order] ## order of appearance in the tree
## Take a look at the tree
plot(otu_clust, hang = -1); rect.hclust(otu_clust, k = 7)
## Tree based classification of OTUs with similar profiles
otu_group <- paste("Group", cutree(otu_clust, k = 7))
names(otu_group) <- taxa_names(food)
We can then order the OTUs in the same order as in the tree and split them according to their group.
## Add group information to the OTUs
tax_table(food) <- cbind(tax_table(food)[, 1:7], "Group" = otu_group)
## Plot the heatmap again but use custom OTU order and split the OTU by group
p <- plot_heatmap(food, taxa.order = otu_order) +
facet_grid(Group~EnvType, scales = "free", space = "free") +
scale_fill_gradient2(low = "#1a9850", mid = "#ffffbf", high = "#d73027",
na.value = "white", trans = scales::log_trans(4),
midpoint = log(100, base = 4))
plot(p)
To highlight the structure, we’re going to perform a differential analysis (between meat products and seafoods) and look only at the differentially abundant taxa.
cds <- phyloseq_to_deseq2(food, ~ FoodType)
dds <- DESeq2::DESeq(cds)
## If the previous line fails, you can try
## dds <- DESeq2::DESeq(cds, sfType = "poscounts")
results <- DESeq2::results(dds, tidy = TRUE) %>% rename(OTU = row)
We can explore the full result table
DT::datatable(results, filter = "top") %>%
DT::formatRound(columns = names(results), digits = 4)
Or only select the OTUs with an adjusted p-value lower than \(0.05\) and sort them according to fold-change.
da_otus <- results %>%
filter(padj < 0.05) %>%
arrange(log2FoldChange) %>%
pull(OTU)
length(da_otus)
## [1] 273
We end up with 273 significant OTUs sorted according to their fold-change. We keep only those OTUs in that order in the heat map.
p <- plot_heatmap(prune_taxa(da_otus, food), ## keep only da otus...
taxa.order = da_otus ## ordered according to fold-change
) +
facet_grid(~EnvType, scales = "free_x", space = "free_x") +
scale_fill_gradient2(low = "#1a9850", mid = "#ffffbf", high = "#d73027",
na.value = "white", trans = scales::log_trans(4),
midpoint = log(100, base = 4))
plot(p)
The different elements we’ve seen allow us to draw a few conclusions: