Matthew J. Neave

Bioinformatics - Data Science


Analysing the Microbiome of Corals

22 Nov 2015 | bioinformatics, R

In this post I will detail the statistical and graphical steps (using the R programming language) for reproducing the results in:

Neave, M.J., Rachmawati, R., Xun, L., Michell, C.T., Bourne, D.G., Apprill, A., Voolstra, C.R. (2017). Differential specificity between closely related corals and abundant Endozoicomonas endosymbionts across global scales, published in The ISME Journal. A PDF of this paper and other related code are available under publications. The raw sequence data used in this analysis are available under accession PRJNA280923, and the complete code is in the GitHub repository.

First some background. It is relatively well known that corals have a microbiome containing a large diversity of bacteria. However, it is not clear if certain bacterial species are always present in the microbiome (i.e., core members), or if some bacterial species specifically associate with particular species of coral. These bacterial members may play an important role in keeping corals healthy, or conversely, they may cause disease. Previous work had found that the bacterial genus Endozoicomonas was often present in coral microbiomes, and was considered a potential core microbiome member. These studies, however, often examined corals living on a single reef, or on reefs that were geographically close together.

We wanted to examine the microbiome of corals on a larger geographical scale to determine if corals worldwide associated with core bacterial species. To do this we sampled two coral species, Stylophora pistillata and Pocillopora verrucosa, at 28 reefs across 7 major geographical regions in 6 countries. We extracted the DNA from these samples, and analysed copies of a particular bacterial gene (the 16S rRNA gene), which allowed us to determine the abundance and identity of the bacteria present in each sample. These gene copies were analysed using a procedure called Minimum Entropy Decomposition (MED), which groups the copies into units analogous to species.

The following R code takes the bacterial abundance and diversity of each sample (as output from MED), and determines the microbiome similarity across corals, diversity differences, correlation with environmental variables, and the identity of core bacterial species.

First, I will load the R packages that are required for this analysis.

library("phyloseq"); packageVersion("phyloseq")
## [1] '1.10.0'
library("ggplot2"); packageVersion("ggplot2")
## [1] '1.0.1'
library("plyr"); packageVersion("plyr")
## [1] '1.8.1'
library("vegan"); packageVersion("vegan")
## [1] '2.2.1'
library("grid"); packageVersion("grid")
## [1] '3.1.1'
library("knitr"); packageVersion("knitr")
## [1] '1.11'
library("clustsig"); packageVersion("clustsig")
## [1] '1.1'
library('ape'); packageVersion("ape")
## [1] '3.2'
library('RColorBrewer'); packageVersion("RColorBrewer")
## [1] '1.1.2'
library("dunn.test"); packageVersion("dunn.test")
## [1] '1.3.1'
library("DESeq2"); packageVersion("DESeq2")
## [1] '1.6.3'

Import data

We need to import the matrix percent file and count file generated by the minimum entropy decomposition (MED) pipeline, subsampled to 7974 reads per sample, and the associated taxonomy file.

allShared = read.table("all.7974.matrixPercent.txt", header = T, row.names = 1)
allCounts = read.table("all.7974.matrixCount.txt", header = T, row.names = 1)
allTax = read.table("all.7974.nodeReps.nr_v119.knn.taxonomy", header = T, sep = "\t", 
    row.names = 1)

allTax = allTax[, 2:8]
allTax = as.matrix(allTax)

Now to import the shared and taxonomy files generated in mothur for 3% and 1% pairwise similarity, in order to calculate alpha diversity measures and to compare to the MED procedure. Also import the 3% OTU file without any subsampling for alpha diversity calculations.

all3OTUshared = read.table("all.7974.0.03.pick.shared", header=T, row.names=2)
all3OTUshared = all3OTUshared[,3:length(all3OTUshared)]

alpha3OTUshared = read.table("all.7974.0.03.shared", header=T)
rownames(alpha3OTUshared) = alpha3OTUshared[,2]
alpha3OTUshared = alpha3OTUshared[,4:length(alpha3OTUshared)]

all1OTUshared = read.table("all.7974.0.01.pick.shared", header=T, row.names=2)
all1OTUshared = all1OTUshared[,3:length(all1OTUshared)]

all3OTUtax = read.table('all.7974.0.03.taxonomy', header=T, sep='\t', row.names=1)
all3OTUtax = all3OTUtax[,2:8]
all3OTUtax = as.matrix(all3OTUtax)

all1OTUtax = read.table('all.7974.0.01.taxonomy', header=T, sep='\t', row.names=1)
all1OTUtax = all1OTUtax[,2:8]
all1OTUtax = as.matrix(all1OTUtax)

Import the Endozoicomonas phylogenetic tree (exported from ARB) using the ape package (Fig. 3). Also import a MED percent matrix that is slightly modified to accommodate the tree.

endoTreeFile = read.tree(file='MEDNJ5.tree')
allSharedTree = read.table("all.7974.matrixPercent.tree.txt", header=T, row.names=1)

Import metadata for the samples, including metaData3.txt, which is slightly modified to accommodate heatmap sample ordering, and metaDataChem.txt, which contains additional columns of physicochemical data.

metaFile = read.table('metaData2.MED', header=T, sep='\t', row.names=1)
metaFile3 = read.table('metaData3.txt', header=T, sep='\t', row.names=1)
metaFileChem = read.table('metaDataChem.txt', header=T, sep='\t', row.names=1)

The R package phyloseq will be used to help with analysis of the microbiome data. To use this package, we need to create phyloseq objects from our data.

Create phyloseq objects and add consistent coloring for sites

OTU = otu_table(allShared, taxa_are_rows = FALSE)
OTUcounts = otu_table(allCounts, taxa_are_rows = FALSE)
OTUs3 = otu_table(all3OTUshared, taxa_are_rows = FALSE)
OTUs3alpha = otu_table(alpha3OTUshared, taxa_are_rows = FALSE)
OTUs1 = otu_table(all1OTUshared, taxa_are_rows = FALSE)
OTUtree = otu_table(allSharedTree, taxa_are_rows = FALSE)

TAX = tax_table(allTax)
TAX3 = tax_table(all3OTUtax)
TAX1 = tax_table(all1OTUtax)

META = sample_data(metaFile)
METAchem = sample_data(metaFileChem)
TREE = phy_tree(endoTreeFile)

allPhylo = phyloseq(OTU, TAX, META)
countPhylo = phyloseq(OTUcounts, TAX, META)
all3OTUphylo = phyloseq(OTUs3, TAX3, META)
alpha3OTUphylo = phyloseq(OTUs3alpha, META)
all1OTUphylo = phyloseq(OTUs1, TAX1, META)
allPhyloChem = phyloseq(OTU, TAX, METAchem)
endoTree = phyloseq(OTUtree, META, TREE)

cols <- c(AmericanSamoa = "#D95F02", Indonesia = "#A6761D", MaggieIs = "#666666", 
    Maldives = "#E6AB02", Micronesia = "#66A61E", Ningaloo = "#7570B3", RedSea = "#E7298A", other = "black")

Ordinations to compare MED vs pairwise OTUs

The MED procedure is relatively new, and I would like to compare it with the more traditional method of pairwise OTU generation. A good way to do this is to see how ordinations of the samples change with the different methods. Before we can do ordinations, we need to subset the samples for each of the two corals, remove taxa that are absent from the subset (mean abundance of 0), convert the values to relative abundance, and square-root transform them.

filter_stylo_data <- function(initial_matrix){
  initial_coral <- subset_samples(initial_matrix, species=="Stylophora pistillata")
  coral_filt = filter_taxa(initial_coral, function(x) mean(x) > 0, TRUE)
  coral_filt_rel = transform_sample_counts(coral_filt, function(x) x / sum(x) )
  coral_filt_rel_sqrt = transform_sample_counts(coral_filt_rel, function(x) sqrt(x) ) 
  return(coral_filt_rel_sqrt)
}

filter_pverr_data <- function(initial_matrix){
  initial_coral <- subset_samples(initial_matrix, species=="Pocillopora verrucosa")
  coral_filt = filter_taxa(initial_coral, function(x) mean(x) > 0, TRUE)
  coral_filt_rel = transform_sample_counts(coral_filt, function(x) x / sum(x) )
  coral_filt_rel_sqrt = transform_sample_counts(coral_filt_rel, function(x) sqrt(x) ) 
  return(coral_filt_rel_sqrt)
}

spistPhyloRelSqrt <- filter_stylo_data(allPhylo)
spist3OTUphyloRelSqrt <- filter_stylo_data(all3OTUphylo)
spist1OTUphyloRelSqrt <- filter_stylo_data(all1OTUphylo)

pverrPhyloRelSqrt <- filter_pverr_data(allPhylo)
pverr3OTUphyloRelSqrt <- filter_pverr_data(all3OTUphylo)
pverr1OTUphyloRelSqrt <- filter_pverr_data(all1OTUphylo)
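
To make the three transformation steps concrete, here is a minimal base R sketch of what the filter functions above do to an abundance table. The matrix `counts` and its values are made up for illustration and are not part of the study data; the phyloseq functions operate on phyloseq objects rather than plain matrices.

```r
# Toy abundance matrix: samples in rows, taxa in columns (values are invented)
counts <- matrix(c(8, 0, 2, 0,
                   1, 0, 9, 0,
                   4, 0, 4, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("s1", "s2", "s3"),
                                 c("otuA", "otuB", "otuC", "otuD")))

# Step 1: drop taxa whose mean abundance across the samples is zero
kept <- counts[, colMeans(counts) > 0, drop = FALSE]

# Step 2: convert each sample (row) to relative abundance
rel <- sweep(kept, 1, rowSums(kept), "/")

# Step 3: square-root transform to down-weight the most abundant taxa
rel_sqrt <- sqrt(rel)

print(round(rel_sqrt, 3))
```

Here otuB is removed because it never occurs, and each row of `rel` sums to 1 before the square root is taken.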

Now the data is ready for ordinations comparing the techniques.

compOrdinations <- function(sample_data, sample_name) {
    theme_set(theme_bw())
    sample_dataOrd <- ordinate(sample_data, "NMDS", "bray")
    plot_ordination(sample_data, sample_dataOrd, type = "samples", color = "site", 
        title = sample_name) + geom_point(size = 2) + scale_color_manual(values = cols)
}

compOrdinations(spistPhyloRelSqrt, "S. pistillata MED OTUs")

## stress 0.2243104
## procrustes: rmse 0.08223169  max resid 0.3308432

[Figure: NMDS ordination, S. pistillata MED OTUs]

compOrdinations(spist3OTUphyloRelSqrt, "S. pistillata 3% OTUs")
## stress 0.2267772
## procrustes: rmse 0.003399554  max resid 0.02201426

[Figure: NMDS ordination, S. pistillata 3% OTUs]

compOrdinations(spist1OTUphyloRelSqrt, "S. pistillata 1% OTUs")
## stress 0.2225414
## procrustes: rmse 0.0001117761  max resid 0.0006041839

[Figure: NMDS ordination, S. pistillata 1% OTUs]

compOrdinations(pverrPhyloRelSqrt, "P. verrucosa MED OTUs")
## stress 0.2153855
## procrustes: rmse 0.06094373  max resid 0.3421333

[Figure: NMDS ordination, P. verrucosa MED OTUs]

compOrdinations(pverr3OTUphyloRelSqrt, "P. verrucosa 3% OTUs")
## stress 0.2235731
## procrustes: rmse 0.05175333  max resid 0.3316109

[Figure: NMDS ordination, P. verrucosa 3% OTUs]

compOrdinations(pverr1OTUphyloRelSqrt, "P. verrucosa 1% OTUs")
## stress 0.2186754
## procrustes: rmse 0.0680086  max resid 0.3282367

[Figure: NMDS ordination, P. verrucosa 1% OTUs]

The ordinations show that the MED procedure does a nice job of clearly delineating samples from different sites, which suggests that it is correctly categorising the OTUs. Using 1% pairwise OTU generation also does a good job, while 3% pairwise starts to mix some of the samples.

Overall, the microbiomes of Stylophora pistillata are separated into each sampling region, while the Pocillopora verrucosa microbiomes are more similar across the sites.
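
The ordinations above are built on the "bray" (Bray-Curtis) dissimilarity between samples. As a small illustration of what that number means, the sketch below computes it by hand in base R. The helper `bray_curtis` and the two abundance vectors are hypothetical and only serve as an example; the real analysis relies on the implementation called through phyloseq.

```r
# Bray-Curtis dissimilarity between two abundance vectors:
# summed absolute differences divided by summed abundances
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / sum(x + y)
}

# Two invented samples, already expressed as relative abundances
a <- c(0.5, 0.3, 0.2, 0.0)
b <- c(0.1, 0.3, 0.2, 0.4)

d_raw  <- bray_curtis(a, b)              # on relative abundances
d_sqrt <- bray_curtis(sqrt(a), sqrt(b))  # after the square-root transform

print(c(raw = d_raw, sqrt = d_sqrt))
```

Identical samples give a dissimilarity of 0 and samples that share no taxa give 1, so samples from the same site plotting close together means their bacterial communities overlap strongly.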

Alpha diversity measures

Alpha diversity measures can be used to determine if a particular coral species has higher or lower species richness compared to the other species, or to the surrounding seawater. First we need to subset the coral and seawater samples, then plot the measures using phyloseq and ggplot2.

Note: I’ll use unsubsampled 3% pairwise OTUs for calculation of alpha diversity measures, as this will make them more comparable to other studies. In addition, the MED pipeline does not yet implement alpha diversity measures.

allAlphaTmp <- subset_samples(alpha3OTUphylo, species == "seawater")
allAlphaTmp2 <- subset_samples(alpha3OTUphylo, species == "Stylophora pistillata")
allAlphaTmp3 <- subset_samples(alpha3OTUphylo, species == "Pocillopora verrucosa")
allAlpha2 <- merge_phyloseq(allAlphaTmp, allAlphaTmp2, allAlphaTmp3)

allAlphaPlot2 <- plot_richness(allAlpha2, x = "species", measures = c("Chao1", "Simpson", 
    "Observed"), color = "site", sortby = "Chao1")

ggplot(data = allAlphaPlot2$data) + 
    geom_point(aes(x = species, y = value, color = site), 
    position = position_jitter(width = 0.1, height = 0)) + 
    geom_boxplot(aes(x = species, y = value, color = NULL), alpha = 0.1, outlier.shape = NA) + scale_color_manual(values = cols) + 
    theme(axis.text.x = element_text(angle = 90)) + 
    facet_wrap(~variable, scales = "free_y") + 
    scale_x_discrete(limits = c("Stylophora pistillata", "Pocillopora verrucosa", "seawater"))

[Figure: alpha diversity measures for S. pistillata, P. verrucosa and seawater, coloured by site]
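
For readers who want to see what the three measures actually compute, here is a rough base R sketch for a single sample. The function name `alpha_measures` and the toy counts are invented for illustration. The Chao1 line uses the bias-corrected form, which is an assumption on my part about the variant in use; check the package documentation if the exact formula matters.

```r
# Compute three alpha diversity measures from a vector of OTU counts
alpha_measures <- function(counts) {
  counts <- counts[counts > 0]           # ignore OTUs absent from this sample
  observed <- length(counts)             # observed richness
  f1 <- sum(counts == 1)                 # singletons
  f2 <- sum(counts == 2)                 # doubletons
  # Bias-corrected Chao1 (assumed variant): adds an estimate of unseen OTUs
  chao1 <- observed + f1 * (f1 - 1) / (2 * (f2 + 1))
  p <- counts / sum(counts)
  simpson <- 1 - sum(p^2)                # Simpson index as 1 - sum(p_i^2)
  c(Observed = observed, Chao1 = chao1, Simpson = simpson)
}

# Invented counts for one sample
toy <- c(10, 5, 1, 1, 2, 0)
print(alpha_measures(toy))
```

Because Chao1 depends on the number of singletons and doubletons, it is sensitive to subsampling, which is one reason to calculate it on the unsubsampled OTU table.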

It looks like the seawater samples contained a greater diversity of bacteria than the corals, which were similar to each other. Let’s check whether these differences are statistically significant using a Kruskal-Wallis test, followed by Dunn's post-hoc test to see which specific groups differ.

alphaObserved = estimate_richness(allAlpha2, measures="Observed")
alphaSimpson = estimate_richness(allAlpha2, measures="Simpson")
alphaChao = estimate_richness(allAlpha2, measures="Chao1")

alpha.stats <- cbind(alphaObserved, sample_data(allAlpha2))
alpha.stats2 <- cbind(alpha.stats, alphaSimpson)
alpha.stats3 <- cbind(alpha.stats2, alphaChao)

kruskal.test(Observed~species, data = alpha.stats3)

## 
## 	Kruskal-Wallis rank sum test
## 
## data:  Observed by species
## Kruskal-Wallis chi-squared = 61.8764, df = 2, p-value = 3.662e-14

dunn.test(alpha.stats3$Observed, alpha.stats3$species, method="bonferroni")

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 61.8764, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |   Pocillop   seawater
## ---------+----------------------
## seawater |  -7.510384
##          |     0.0000
##          |
## Stylopho |  -1.357184   6.783011
##          |     0.2621     0.0000

kruskal.test(Simpson~species, data = alpha.stats3)

## 
## 	Kruskal-Wallis rank sum test
## 
## data:  Simpson by species
## Kruskal-Wallis chi-squared = 12.2453, df = 2, p-value = 0.002193

dunn.test(alpha.stats3$Simpson, alpha.stats3$species, method="bonferroni")

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 12.2453, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |   Pocillop   seawater
## ---------+----------------------
## seawater |  -3.397898
##          |     0.0010
##          |
## Stylopho |  -0.811204   2.904738
##          |     0.6259     0.0055

kruskal.test(Chao1~species, data = alpha.stats3)

## 
## 	Kruskal-Wallis rank sum test
## 
## data:  Chao1 by species
## Kruskal-Wallis chi-squared = 64.3067, df = 2, p-value = 1.086e-14

dunn.test(alpha.stats3$Chao1, alpha.stats3$species, method="bonferroni")

##   Kruskal-Wallis rank sum test
## 
## data: x and group
## Kruskal-Wallis chi-squared = 64.3067, df = 2, p-value = 0
## 
## 
##                            Comparison of x by group                            
##                                  (Bonferroni)                                  
## Col Mean-|
## Row Mean |   Pocillop   seawater
## ---------+----------------------
## seawater |  -7.581725
##          |     0.0000
##          |
## Stylopho |  -1.146749   7.033279
##          |     0.3772     0.0000

In each case, the seawater was significantly different from the corals, while the corals were not different from each other. This suggests the corals have a more ‘selective’ community of microbes compared to the surrounding seawater.
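
As a brief aside on what the tests above are doing, the following base R sketch computes the Kruskal-Wallis statistic by hand from ranks and compares it with the built-in `kruskal.test`. The values in `richness` and the group labels are invented toy data with no ties, not the study data. The last line shows the Bonferroni adjustment on some made-up pairwise p-values.

```r
# Invented richness values for three groups (no ties)
richness <- c(1.2, 2.3, 3.1, 4.8, 5.5, 6.9, 7.4, 8.0, 9.6)
group <- factor(rep(c("coralA", "coralB", "seawater"), each = 3))

# Rank all observations together, then sum the ranks within each group
r <- rank(richness)
N <- length(richness)
rank_sums <- tapply(r, group, sum)
n_per_group <- tapply(r, group, length)

# Kruskal-Wallis H statistic (no tie correction needed for this toy data)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_per_group) - 3 * (N + 1)
p_value <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

# The built-in test should give the same statistic and p-value
builtin <- kruskal.test(richness ~ group)
print(c(by_hand = H, builtin = unname(builtin$statistic)))

# Bonferroni: multiply each pairwise p-value by the number of comparisons
adjusted <- p.adjust(c(0.01, 0.04, 0.30), method = "bonferroni")
print(adjusted)
```

The test only looks at ranks, so it does not assume the richness values are normally distributed, which suits skewed diversity estimates.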

Similarity Profile Analysis (SIMPROF)

This will show how the samples cluster without any a priori assumptions regarding sample origin.

We need to extract the abundance table for just the S. pistillata samples, then calculate the SIMPROF clusters based on Bray-Curtis dissimilarity.

spist <- subset_samples(allPhylo, species == "Stylophora pistillata")
spistShared = otu_table(spist)
class(spistShared) <- "numeric"

spistSIMPROF <- simprof(spistShared, num.expected = 1000, num.simulated = 99, method.cluster = "average", 
    method.distance = "braycurtis", method.transform = "squareroot", alpha = 0.05, 
    sample.orientation = "row", silent = TRUE)

simprof.plot(spistSIMPROF, leafcolors = NA, plot = TRUE, fill = TRUE, leaflab = "perpendicular", 
    siglinetype = 1)

[Figure: SIMPROF dendrogram, S. pistillata samples]

## 'dendrogram' with 2 branches and 73 members total, at height 99.58749

pVerr <- subset_samples(allPhylo, species == "Pocillopora verrucosa")
pVerrShared = otu_table(pVerr)
class(pVerrShared) <- "numeric"

pVerrSIMPROF <- simprof(pVerrShared, num.expected = 1000, num.simulated = 99, method.cluster = "average", 
    method.distance = "braycurtis", method.transform = "squareroot", alpha = 0.05, 
    sample.orientation = "row", silent = TRUE)

simprof.plot(pVerrSIMPROF, leafcolors = NA, plot = TRUE, fill = TRUE, leaflab = "perpendicular", 
    siglinetype = 1)

[Figure: SIMPROF dendrogram, P. verrucosa samples]

## 'dendrogram' with 2 branches and 53 members total, at height 99.459

Again, the Stylophora pistillata microbiomes seem to be similar within sites and different across sites. On the other hand, Pocillopora verrucosa microbiomes are more similar across all regions.
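
The dendrograms above come from average-linkage clustering of square-root transformed abundances using Bray-Curtis dissimilarity. The base R sketch below reproduces only that clustering step on an invented four-sample table; it deliberately leaves out the permutation test that SIMPROF uses to decide which branches are significant. All sample names and values are hypothetical.

```r
# Invented abundance table: two samples from each of two reefs
abund <- matrix(c(60, 30, 10,  0,
                  55, 35, 10,  0,
                   5, 10, 25, 60,
                   0, 15, 25, 60),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("reefA_1", "reefA_2", "reefB_1", "reefB_2"),
                                paste0("otu", 1:4)))

# Square-root transform, as in the simprof() call
x <- sqrt(abund)

# Pairwise Bray-Curtis dissimilarity between samples
n <- nrow(x)
d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
  }
}

# Average-linkage hierarchical clustering, then cut into two groups
tree <- hclust(as.dist(d), method = "average")
groups <- cutree(tree, k = 2)
print(groups)
```

In this toy case the samples group by reef, which is the pattern seen for Stylophora pistillata; a dendrogram where samples from different regions interleave would correspond to the Pocillopora verrucosa result.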

Chemical and biological correlations

Now I will use the envfit function from the vegan package to test if any environmental variables are significantly correlated with microbiome differences in the corals.

draw_envfit_ord <- function(coral_chem, env_data) {
    chemNoNA <- na.omit(metaFileChem[sample_names(coral_chem), env_data])
    coralNoNA <- prune_samples(rownames(chemNoNA), coral_chem)
    
    theme_set(theme_bw())
    coralNoNAOrd <- ordinate(coralNoNA, "NMDS", "bray")
    coralNoNAOrdPlot <- plot_ordination(coralNoNA, coralNoNAOrd, type = "samples", 
        color = "site") + geom_point(size = 3) + scale_color_manual(values = c(cols))
    
    # get points for ggplot
    pointsNoNA <- coralNoNAOrd$points[rownames(chemNoNA), ]
    chemFit <- envfit(pointsNoNA, env = chemNoNA, na.rm = TRUE)
    print(chemFit)
    chemFit.scores <- as.data.frame(scores(chemFit, display = "vectors"))
    chemFit.scores <- cbind(chemFit.scores, Species = rownames(chemFit.scores))
    
    # create arrow info
    chemNames <- rownames(chemFit.scores)
    arrowmap <- aes(xend = MDS1, yend = MDS2, x = 0, y = 0, shape = NULL, color = NULL)
    labelmap <- aes(x = MDS1, y = MDS2 + 0.04, shape = NULL, color = NULL, size = 1.5, 
        label = chemNames)
    arrowhead = arrow(length = unit(0.25, "cm"))
    
    # note: had to use aes_string to get ggplot to recognize variables
    coralNoNAOrdPlot + coord_fixed() + geom_segment(arrowmap, size = 0.5, data = chemFit.scores, 
        color = "black", arrow = arrowhead, show_guide = FALSE) + geom_text(aes_string(x = "MDS1", 
        y = "MDS2", shape = NULL, color = NULL, size = 1.5, label = "Species"), size = 3, 
        data = chemFit.scores)
}

waterQual <- c("temp", "salinity", "Domg", "pH")
nutrients <- c("PO4", "N.N", "silicate", "NO2", "NH4")
FCM <- c("prok", "syn", "peuk", "pe.peuk", "Hbact")

spistChem <- subset_samples(allPhyloChem, species == "Stylophora pistillata")
pverrChem <- subset_samples(allPhyloChem, species == "Pocillopora verrucosa")

draw_envfit_ord(spistChem, waterQual)

## Square root transformation
## Wisconsin double standardization
## stress 0.1729033 
## procrustes: rmse 0.04036827  max resid 0.2729433 
## 
## ***VECTORS
## 
##              MDS1     MDS2     r2 Pr(>r)    
## temp      0.75871 -0.65143 0.4813  0.001 ***
## salinity -0.15146  0.98846 0.1597  0.009 ** 
## Domg     -0.91133  0.41168 0.0833  0.121    
## pH       -0.53245  0.84646 0.2194  0.006 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of S. pistillata with fitted water quality vectors]

draw_envfit_ord(spistChem, nutrients)

## Square root transformation
## Wisconsin double standardization
## stress 0.1774209 
## procrustes: rmse 0.04163872  max resid 0.2726075 
## 
## ***VECTORS
## 
##              MDS1     MDS2     r2 Pr(>r)    
## PO4      -0.24248  0.97016 0.2268  0.004 ** 
## N.N       0.85590 -0.51714 0.0672  0.179    
## silicate -0.80233  0.59688 0.4800  0.001 ***
## NO2      -0.53794  0.84298 0.4273  0.001 ***
## NH4       0.78539 -0.61900 0.0203  0.612    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of S. pistillata with fitted nutrient vectors]

draw_envfit_ord(spistChem, FCM)

## Square root transformation
## Wisconsin double standardization
## stress 0.1774208 
## procrustes: rmse 0.04147211  max resid 0.2726479 
## 
## ***VECTORS
## 
##             MDS1     MDS2     r2 Pr(>r)    
## prok    -0.05013 -0.99874 0.2678  0.001 ***
## syn      0.54913  0.83574 0.1100  0.037 *  
## peuk     0.53372  0.84566 0.0852  0.082 .  
## pe.peuk  0.95500  0.29660 0.0543  0.226    
## Hbact   -0.76909  0.63914 0.0420  0.370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of S. pistillata with fitted flow cytometry vectors]

draw_envfit_ord(pverrChem, waterQual)

## Square root transformation
## Wisconsin double standardization
## stress 0.2511791 
## procrustes: rmse 0.161753  max resid 0.4343423 

## 
## ***VECTORS
## 
##              MDS1     MDS2     r2 Pr(>r)   
## temp      0.23382  0.97228 0.4251  0.006 **
## salinity  0.15289 -0.98824 0.3311  0.020 * 
## Domg     -0.99489  0.10092 0.0597  0.523   
## pH       -0.01848 -0.99983 0.3003  0.021 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of P. verrucosa with fitted water quality vectors]

draw_envfit_ord(pverrChem, nutrients)

## Square root transformation
## Wisconsin double standardization
## stress 0.2434124 
## procrustes: rmse 0.08711831  max resid 0.344812 

## 
## ***VECTORS
## 
##              MDS1     MDS2     r2 Pr(>r)  
## PO4       0.04248  0.99910 0.0808  0.375  
## N.N       0.06363  0.99797 0.2207  0.056 .
## silicate  0.83138  0.55571 0.2191  0.063 .
## NO2       0.27636  0.96105 0.0109  0.888  
## NH4      -0.75677  0.65369 0.1645  0.120  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of P. verrucosa with fitted nutrient vectors]

draw_envfit_ord(pverrChem, FCM)

## Square root transformation
## Wisconsin double standardization
## stress 0.243433 
## procrustes: rmse 0.03141653  max resid 0.1071445

## 
## ***VECTORS
## 
##             MDS1     MDS2     r2 Pr(>r)    
## prok    -0.28887 -0.95737 0.4431  0.001 ***
## syn     -0.62071 -0.78404 0.0366  0.675    
## peuk     0.60373 -0.79719 0.0241  0.755    
## pe.peuk -0.66198  0.74952 0.0365  0.639    
## Hbact   -0.22203 -0.97504 0.0908  0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

[Figure: NMDS ordination of P. verrucosa with fitted flow cytometry vectors]

A few statistically significant correlations between the microbiomes and the environmental data were seen (for example, temperature and prokaryote counts were significant for both corals); however, no clear pattern emerged. Note that many of the variables were not significantly correlated with the microbiome ordinations (p > 0.05).
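
To help interpret the MDS1, MDS2, r2 and Pr(>r) columns in the tables above, here is a simplified base R sketch of the idea behind fitting one environmental vector onto an ordination. The function `fit_vector` and the simulated coordinates are hypothetical. It regresses the variable on the ordination axes, normalises the coefficients to get an arrow direction, and uses permutations for the p-value. This is only an approximation of the approach under those assumptions, not a copy of the vegan code.

```r
# Fit one environmental variable onto ordination coordinates
fit_vector <- function(coords, env, nperm = 999) {
  X <- scale(as.matrix(coords), center = TRUE, scale = FALSE)
  # R-squared of a linear regression of y on the ordination axes
  r2_of <- function(y) {
    fit <- lm.fit(cbind(1, X), y)
    1 - sum(fit$residuals^2) / sum((y - mean(y))^2)
  }
  # Arrow direction: regression slopes scaled to unit length
  coefs <- lm.fit(cbind(1, X), env)$coefficients[-1]
  direction <- unname(coefs / sqrt(sum(coefs^2)))
  r2 <- r2_of(env)
  # Permutation p-value: how often shuffled values fit at least as well
  perm_r2 <- replicate(nperm, r2_of(sample(env)))
  p <- (sum(perm_r2 >= r2) + 1) / (nperm + 1)
  list(direction = direction, r2 = r2, p = p)
}

# Simulated example: "temp" changes along both axes, plus noise
set.seed(1)
coords <- cbind(MDS1 = rnorm(30), MDS2 = rnorm(30))
temp <- 3 * coords[, "MDS1"] - 4 * coords[, "MDS2"] + rnorm(30, sd = 0.5)
res <- fit_vector(coords, temp)
print(res)
```

With 999 permutations the smallest p-value this approach can return is 1/1000, which is why the strongest correlations in the tables above are reported as 0.001.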