Use microeco分析扩增子数据
发表于:2021-03-14 | 分类: 生物信息
字数统计: 6.2k | 阅读时长: 33分钟 | 阅读量:

本文阐述使用microeco分析扩增子数据……

1. 安装microeco

1
2
3
4
5
6
7
# If devtools package is not installed, first install it
install.packages("devtools")
# then install microeco
devtools::install_github("ChiLiubio/microeco")

if (!requireNamespace("devtools", quietly = TRUE)){install.packages("devtools")}
devtools::install_github("jbisanz/qiime2R")

2. 准备数据

otu_table

OTU表

Sample 1 Sample 2 Sample 3 Sample 4
OTU1 232.0 209.0 349.0 256.0
OTU2 75.0 35.0 44.0 0.0
OTU3 237.0 224.0 291.0 353.0
OTU4 371.0 80.0 319.0 345.0

taxonomy_table

分类表

Kingdom Phylum Class Order Family Genus Species
OTU1 d__Bacteria p__Desulfobacterota c__Desulfobacteria o__Desulfatiglandales f__Desulfatiglandaceae g__Desulfatiglans
OTU2 d__Bacteria p__Sva0485 c__Sva0485 o__Sva0485 f__Sva0485 g__Sva0485 s__uncultured_hydrocarbon
OTU3 d__Bacteria p__Desulfobacterota c__Syntrophobacteria o__Syntrophobacterales f__uncultured g__uncultured s__uncultured_delta
OTU4 d__Bacteria p__Myxococcota c__Polyangia o__Polyangiales f__Sandaracinaceae g__uncultured

sample_info

样本元数据

SampleID Group Type Saline
Sample 1 G1 T1 Non-saline
Sample 2 G1 T1 Non-saline
Sample 3 G2 T1 Saline
Sample 4 G2 T2 Saline

env_data

环境因子

SampleID Depth Longitude Latitude
Sample 1 0 23.0 20
Sample 2 10 35.0 44.0
Sample 3 20 43.0 70.0
Sample 4 30 60.0 69.0

phylo_tree

进化树

3. 导入数据

  • 加载R包

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    library(microeco)
    library(ape)
    library(qiime2R)
    # use pipe operator in magrittr package
    library(magrittr)
    # set.seed is used to fix the random number generation to make the results repeatable
    set.seed(123)
    # make the plotting background same with the tutorial
    library(ggplot2)
    theme_set(theme_bw())
  • 导入数据

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    # 单独导入环境因子文件
    env_data <- read.delim("E:/Researches/lujia16S/Analysis_20200907/Ordination_analyses/env4.txt", sep = "\t", header = TRUE, row.names = 1)

    # 定义数据导入函数
    qiimed2meco <- function(ASV_data, sample_data, taxonomy_data, phylo_tree = NULL){
    # Read ASV data
    ASV <- as.data.frame(read_qza(ASV_data)$data)
    # Read metadata
    metadata <- read_q2metadata(sample_data)
    rownames(metadata) <- as.character(metadata[, 1])
    # Read taxonomy table
    taxa_table <- read_qza(taxonomy_data)
    taxa_table <- parse_taxonomy(taxa_table$data)
    # Make the taxonomic table clean, this is very important.
    taxa_table %<>% tidy_taxonomy
    # Read phylo tree
    if(!is.null(phylo_tree)){
    phylo_tree <- read_qza(phylo_tree)$data
    }
    dataset <- microtable$new(sample_table = metadata, tax_table = taxa_table, otu_table = ASV, phylo_tree = phylo_tree)
    dataset
    }

    # 导入本地数据,包括OTU表、样本元数据、分类表、tree文件。这几个文件均有QIIME2生成。
    meco_dataset <- qiimed2meco(ASV_data = "E:/Researches/lujia16S/Analysis_20200907/dada2_table.qza", sample_data = "E:/Researches/lujia16S/Analysis_20200907/metadata.tsv", taxonomy_data = "E:/Researches/lujia16S/Analysis_20200907/taxonomy.qza", phylo_tree = "E:/Researches/lujia16S/Analysis_20200907/tree.qza")

    meco_dataset

4. 数据预处理

  • 移除未被分配到 “k__Archaea” 或 “k__Bacteria”的OTUs

    1
    2
    meco_dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
    print(meco_dataset)
  • 移除注释为线粒体和叶绿体的OTUs

    1
    2
    3
    4
    5
    # 移除tax_table中包含taxa名字的行,无论分类等级(taxonomic ranks),不区分大小写字母。简言之,taxa = c("mitochondria", "chloroplast")定义了删除包含mitochondria和chloroplast的行。

    meco_dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))

    print(meco_dataset)
  • 为了使otu_table、tax_table和phylo_tree中的OTUs相同,我们再次使用tidy_dataset() 函数

    1
    2
    meco_dataset$tidy_dataset()
    print(meco_dataset)
  • 调用sample_sums()检查各样本中的序列数量

    1
    meco_dataset$sample_sums() %>% range
  • 有时,为了减少物种数目对多样性度量的影响,我们需要进行重采样,使每个样本中的序列数量相等。函数rarefy_samples可以在稀疏(rarefying)前后自动调用函数tidy_dataset。

    1
    2
    3
    # As an example, we use 20001 sequences in each sample
    meco_dataset$rarefy_samples(sample.size = 20001)
    meco_dataset$sample_sums() %>% range

5. alpha多样性

  • 然后,我们使用cal_abund()计算每个分类等级的分类单元丰度(taxa abundance)。此函数返回一个名为taxa_abund的列表,其中包含每个分类等级上的丰度信息的多个数据框。列表自动存储在microtable object中。

    1
    2
    3
    meco_dataset$cal_abund()
    # return dataset$taxa_abund
    class(meco_dataset$taxa_abund)
  • 通过save_abund()将taxa abundance保存至本地文件

    1
    2
    dir.create("taxa_abund")
    meco_dataset$save_abund(dirpath = "taxa_abund")
  • 计算alpha多样性。结果自动存储在microtable object中。作为示例,此处未计算系统发育多样性(phylogenetic diversity)

    1
    2
    3
    4
    5
    6
    7
    8
    9
    # 若要计算Faith's phylogenetic diversity,设置PD = TRUE,计算速度会较慢
    meco_dataset$cal_alphadiv(PD = FALSE)

    # return dataset$alpha_diversity
    class(meco_dataset$alpha_diversity)

    # save dataset$alpha_diversity to a directory
    dir.create("alpha_diversity")
    meco_dataset$save_alphadiv(dirpath = "alpha_diversity")

6. β多样性

  • 利用函数cal_betadiv()计算beta-多样性的距离矩阵。我们提供了四个最常用的索引:Bray-curtis、Jaccard、加权Unifrac和未加权Unifrac。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    # If you do not want to calculate unifrac metrics, use unifrac = FALSE
    # 需要GUniFrac package
    install.packages("GUniFrac")

    meco_dataset$cal_betadiv(unifrac = TRUE)
    # return dataset$beta_diversity
    class(meco_dataset$beta_diversity)

    # save dataset$beta_diversity to a directory
    dir.create("beta_diversity")
    meco_dataset$save_betadiv(dirpath = "beta_diversity")

7. trans_abund class

  • 绘制Barplot。转换分类丰度数据,以便使用ggplot2包绘制分类单元丰度。

    1
    2
    3
    4
    # create trans_abund object
    # use 12 Phyla with the highest abundance in the dataset.
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Phylum", ntaxa = 12)
    # t1 object now include the transformed abundance data t1$abund_data and other elements for the following plotting
  • 绘制Barplot. We remove the sample names in x axis and add the facet to show abundance according to groups.

    1
    2
    3
    4
    5
    6
    t1$plot_bar(others_color = "grey70", facet = "Site", xtext_keep = FALSE, legend_text_italic = FALSE)
    # return a ggplot2 object

    # 获取组内平均值。The groupmean parameter can be used to obtain the group-mean barplot.
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Phylum", ntaxa = 12, groupmean = "Site")
    t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)
  • Then alluvial plot is implemented in the plot_bar function.

    1
    2
    3
    4
    install.packages("ggalluvial")
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Phylum", ntaxa = 12)
    # use_alluvium = TRUE make the alluvial plot, clustering =TRUE can be used to reorder the samples by clustering
    t1$plot_bar(use_alluvium = TRUE, clustering = TRUE, xtext_type_hor = FALSE, xtext_size = 6)
  • The box plot is an excellent way to intuitionally show data distribution across groups.

    1
    2
    3
    # show 15 taxa at Class level
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Class", ntaxa = 15)
    t1$plot_box(group = "Site")

show 15 taxa at Class level

  • Then we show the heatmap with the high abundant genera.

    1
    2
    3
    # show 40 taxa at Genus level
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Genus", ntaxa = 40)
    t1$plot_heatmap(facet = "Site", xtext_keep = FALSE, withmargin = FALSE)
  • Then, we show the pie chart.

    1
    2
    3
    t1 <- trans_abund$new(dataset = meco_dataset, taxrank = "Phylum", ntaxa = 6, groupmean = "Site")
    # all pie chart in one row
    t1$plot_pie(facet_nrow = 1)

8. trans_venn class

  • The trans_venn class is used for venn analysis. To analyze the unique and shared OTUs of groups, we first merge samples according to the “Group” column of sample_table.

    1
    2
    3
    4
    5
    6
    7
    8
    # merge samples as one community for each group
    dataset1 <- meco_dataset$merge_samples(use_group = "Site")
    # dataset1 is a new microtable object
    # create trans_venn object
    t1 <- trans_venn$new(dataset1, ratio = "seqratio")
    t1$plot_venn()
    # The integer data is OTU number
    # The percentage data is the sequence number/total sequence number
  • When the groups are too many to show with venn plot, we can use petal plot.

    1
    2
    3
    4
    # use "Type" column in sample_table
    dataset1 <- meco_dataset$merge_samples(use_group = "Site")
    t1 <- trans_venn$new(dataset1)
    t1$plot_venn(petal_plot = TRUE)
  • Sometimes, we are interested in the unique and shared species. In another words, the composition of the unique or shared species may account for the different and similar parts of ecological characteristics across groups[10]. For this goal, we first transform the results of venn plot to the traditional species-sample table, that is, another object of microtable class.

    1
    2
    3
    4
    5
    6
    dataset1 <- meco_dataset$merge_samples(use_group = "Site")
    t1 <- trans_venn$new(dataset1)
    # transform venn results to the sample-species table, here do not consider abundance, only use presence/absence information.
    t2 <- t1$trans_venn_com(use_OTUs_frequency = TRUE)
    # t2 is a new microtable class, each part is considered as a sample
    class(t2)
  • We use bar plot to show the composition at the Genus level.

    1
    2
    3
    4
    5
    # calculate taxa abundance, that is, the frequency
    t2$cal_abund()
    # transform and plot
    t3 <- trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 12)
    t3$plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE)
  • We also try to use pie chart to show the compositions at the Phylum level

    1
    2
    t3 <- trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)
    t3$plot_pie(facet_nrow = 3, use_colors = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))

9. trans_alpha class

  • Alpha diversity can be transformed and plotted using trans_alpha class. Creating trans_alpha object can return two data frame: alpha_data and alpha_stat.

    1
    2
    3
    t1 <- trans_alpha$new(dataset = meco_dataset, group = "Site")
    # return t1$alpha_stat
    t1$alpha_stat[1:5, ]
  • Then, we test the differences among groups using the KW rank sum test and anova with multiple comparisons.

    1
    2
    3
    t1$cal_diff(method = "KW")
    # return t1$res_alpha_diff
    t1$res_alpha_diff[1:5, ]
    Groups Measure Test method p.value Significance
    1 T16 vs T18 Observed KW 2.601895e-03 **
    2 T16 vs T20 Observed KW 3.011399e-08 ***
    3 T16 vs T21 Observed KW 2.174162e-04 ***
    4 T16 vs T17 Observed KW 1.234229e-03 **
    5 T18 vs T20 Observed KW 7.319258e-08 ***
    1
    2
    3
    t1$cal_diff(method = "anova")
    # return t1$res_alpha_diff
    t1$res_alpha_diff
    Observed Chao1 ACE Shannon Simpson InvSimpson Fisher Coverage
    T18 a a a a a a a
    T16 b b b a a a b
    T17 c c c b a b c
    T21 d d d c a c d
    T20 e e e d b d e
  • Now, let us plot the mean and se of alpha diversity for each group, and add the duncan.test (agricolae package) result.

    1
    t1$plot_alpha(add_letter = TRUE, measure = "Chao1")
  • We can also use the boxplot to show the paired comparisons directly.

    1
    t1$plot_alpha(pair_compare = TRUE, measure = "Chao1")

10. trans_beta class

  • The distance matrix of beta diversity can be transformed and plotted using trans_beta class. The analysis referred to the beta diversity in this class mainly include ordination, group distance, clustering and manova. We first show the ordination using PCoA.

    1
    2
    3
    4
    5
    6
    # we first create an object and select PCoA for ordination
    t1 <- trans_beta$new(dataset = meco_dataset, group = "Site", measure = "bray", ordination = "PCoA")
    # t1$res_ordination is the ordination result list
    class(t1$res_ordination)
    # plot the PCoA result
    t1$plot_ordination(plot_color = "Site", plot_shape = "Site", plot_group_ellipse = TRUE)
  • Then we plot and compare the group distances.

    1
    2
    3
    4
    5
    6
    7
    8
    # calculate and plot sample distances within groups
    t1$cal_group_distance()
    # return t1$res_group_distance
    t1$plot_group_distance(distance_pair_stat = TRUE)

    # calculate and plot sample distances between groups (报错:错误: Insufficient values in manual scale. 10 needed but only 8 provided.)
    t1$cal_group_distance(within_group = FALSE)
    t1$plot_group_distance(distance_pair_stat = TRUE)
  • Clustering plot is also a frequently used method.

    1
    2
    # use replace_name to set the label name, group parameter used to set the color (报错:找不到对象'dataset')
    t1$plot_clustering(group = "Indexs", replace_name = c("Water-depth(m)", "Indexs"))
  • perMANOVA is often used in the differential test of distances among groups.

    1
    2
    3
    # manova for all groups
    t1$cal_manova(cal_manova_all = TRUE)
    t1$res_manova$aov.tab
    Permutation: free
    Number of permutations: 999
    Terms added sequentially (first to last)
    Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
    Site 4 15.669 3.9173 19.445 0.48077 0.001 ***
    Residuals 84 16.923 0.2015 0.51923
    Total 88 32.592 1.00000

    > Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

1
2
3
# manova for each paired groups
t1$cal_manova(cal_manova_paired = TRUE)
t1$res_manova
Groups measure permutations R2 p.value Significance
1 T16 vs T18 bray 999 0.2748773 0.001
2 T16 vs T20 bray 999 0.4539103 0.001
3 T16 vs T21 bray 999 0.4102009 0.001
4 T16 vs T17 bray 999 0.2243404 0.001
5 T18 vs T20 bray 999 0.3736482 0.001
6 T18 vs T21 bray 999 0.3504104 0.001
7 T18 vs T17 bray 999 0.2147055 0.001
8 T20 vs T21 bray 999 0.3575765 0.001
9 T20 vs T17 bray 999 0.4589248 0.001
10 T21 vs T17 bray 999 0.4395176 0.001
1
2
3
 # manova for specified group set: here "Group + Type"
t1$cal_manova(cal_manova_set = "Site+ Indexs")
t1$res_manova$aov.tab
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(F)
Site 4 15.669 4 0 0.48077 1
Indexs 84 16.923 0 0 0.51923 1
Residuals 0 0.000 Inf 0.00000
Total 88 32.592 1.00000

11. trans_diff class

  • Differential abundance test is a very important part in the microbial community data analysis. It can be used to find the significant taxa in determining the community differences across groups. Currently, trans_diff class have three famous approaches to perform this analysis: metastat[11], LEfSe[12] and random forest. Metastat depends on the permutations and t-test and performs well on the sparse data. It is used for the comparisons between two groups.

    1
    2
    3
    4
    5
    6
    # metastat analysis at Genus level
    t1 <- trans_diff$new(dataset = meco_dataset, method = "metastat", group = "Site", metastat_taxa_level = "Genus")
    # t1$res_metastat is the result
    # t1$res_metastat_group_matrix is the group comparisons order for plotting
    # plot the first paired groups, choose_group = 1
    t1$plot_metastat(use_number = 1:10, qvalue = 0.05, choose_group = 1)
  • LEfSe combines the non-parametric test and linear discriminant analysis. We implement this approach in this package instead of the python version.

    1
    2
    3
    4
    t1 <- trans_diff$new(dataset = meco_dataset, method = "lefse", group = "Site", alpha = 0.01, lefse_subgroup = NULL)
    # t1$res_lefse is the LEfSe result
    # t1$res_abund is the abundance information
    t1$plot_lefse_bar(LDA_score = 4)
  • We can also plot the abundance of taxa detected using LEfSe.

    1
    t1$plot_diff_abund(use_number = 1:30)
  • Then, we show the cladogram of the differential features in the taxonomic tree. There are too many taxa in this dataset. As an example, we only use the highest 200 abundant taxa in the tree and 50 differential features. We only show the full taxonomic label at Phylum level and use letters at other levels to reduce the text overlap.

    1
    2
    3
    # clade_label_level 5 represent phylum level in this analysis
    # require ggtree package
    t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, clade_label_level = 5)
  • There may be a problem related with the taxonomic labels in the plot. When the levels used are too many, the taxonomic labels may have too much overlap. However, if we only indicate the Phylum labels, the taxa in the legend with marked letters are too many. At this time, you can select the taxa that you want to show in the plot manually like the following operation.

    1
    2
    3
    4
    5
    6
    7
    8
    # choose some taxa according to the positions in the previous picture; those taxa labels have minimum overlap
    use_labels <- c("c__Deltaproteobacteria", "c__Actinobacteria", "o__Rhizobiales", "p__Proteobacteria", "p__Bacteroidetes",
    "o__Micrococcales", "p__Acidobacteria", "p__Verrucomicrobia", "p__Firmicutes",
    "p__Chloroflexi", "c__Acidobacteria", "c__Gammaproteobacteria", "c__Betaproteobacteria", "c__KD4-96",
    "c__Bacilli", "o__Gemmatimonadales", "f__Gemmatimonadaceae", "o__Bacillales", "o__Rhodobacterales")
    # then use parameter select_show_labels to show
    t1$plot_lefse_cladogram(use_taxa_num = 200, use_feature_num = 50, select_show_labels = use_labels)
    # Now we can see that more taxa names appear in the tree
  • If you are interested in taxonomic tree, you can also use metacoder package[13] to plot the taxonomic tree based on the selected taxa. We do not show the usage here.

  • The third approach is rf, which depends on the random forest[14, 15] and the non-parametric test. The current method can calculate random forest by bootstrapping like the method in LEfSe and only use the significant features. MeanDecreaseGini is selected as the indicator value in the analysis.

    1
    2
    3
    4
    5
    6
    7
    8
    # use Genus level for parameter rf_taxa_level, if you want to use all taxa, change to "all"
    # nresam = 1 and boots = 1 represent no bootstrapping and use all samples directly
    t1 <- trans_diff$new(dataset = meco_dataset, method = "rf", group = "Site", rf_taxa_level = "Genus")
    # t1$res_rf is the result stored in the object
    # plot the result
    t2 <- t1$plot_diff_abund(use_number = 1:20, only_abund_plot = FALSE)
    gridExtra::grid.arrange(t2$p1, t2$p2, ncol=2, nrow = 1, widths = c(2,2))
    # the middle asterisk represent the significances

12. trans_env class

  • 分析环境因子对微生物群落结构和组装的影响:RDA分析(db-RDA和RDA).

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    # add_data is used to add the environmental data
    t1 <- trans_env$new(dataset = meco_dataset, add_data = env_data[, 1:7])

    # use bray-curtis distance to do dbrda
    t1$cal_rda(use_dbrda = TRUE, use_measure = "bray")

    # t1$res_rda is the result list stored in the object
    t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 10)

    # t1$res_rda_trans is the transformed result for plotting
    t1$plot_rda(plot_color = "Site")

    # use Genus
    t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")
    # As the main results of RDA are related with the projection and angles between different arrows,
    # we adjust the length of the arrow to show them clearly using several parameters.
    t1$trans_rda(show_taxa = 10, adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)
    # t1$res_rda_trans is the transformed result for plotting
    t1$plot_rda(plot_color = "Site")
  • Mantel test 用于检测环境因子和距离矩阵之间是否具有显著的相关性。

    1
    2
    3
    t1$cal_mantel(use_measure = "bray")
    # return t1$res_mantel
    t1$res_mantel
    variable_name cor_method corr_res p_res significance
    1 TN pearson 0.5571885 0.001 ***
    2 TC pearson 0.5712239 0.001 ***
    3 TS pearson 0.2665453 0.001 ***
    4 TOC pearson 0.3540337 0.001 ***
    5 Salinity pearson 0.2782537 0.001 ***
    6 Temperature pearson 0.5856050 0.001 ***
    7 Dissolved.oxygen pearson 0.4358422 0.001 ***
    8 Surface.chlorophyll.concentrations pearson 0.2586823 0.001 ***
    9 pH pearson 0.4498964 0.001 ***
    10 PAR pearson 0.1712861 0.001 ***
    11 Density pearson 0.5682679 0.001 ***
    12 Turbidity pearson 0.2260436 0.001 ***
  • 环境变量与分类群(taxa)的相关性对分析和推断影响群落结构的因素具有重要意义。在本例中,我们首先进行了差异丰度检验(differential abundance test)和随机森林分析(random forest),得到了重要的属(Genus)。然后利用这些分类单元进行相关性分析。

    1
    2
    3
    4
    5
    6
    7
    # first create trans_diff object
    t2 <- trans_diff$new(dataset = meco_dataset, method = "rf", group = "Site", rf_taxa_level = "Genus")
    # then create trans_env object
    t1 <- trans_env$new(dataset = meco_dataset, add_data = env_data[, 1:7])
    # calculate correlation
    t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:60])
    # return t1$res_cor
  • 使用ggplot2或pheatmap进行可视化

    1
    2
    3
    4
    5
    # default ggplot2 method
    t1$plot_corr()

    # clustering heatmap; require pheatmap package
    t1$plot_corr(pheatmap = TRUE)

7个环境变量与分类群ggplot2

7个环境变量与分类群相关性_pheatmap

  • 各组内的环境变量与分类群taxa之间的关系

    1
    2
    3
    4
    # calculate correlations for different groups using parameter by_group
    t1$cal_cor(by_group = "Site", use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:60])
    # return t1$res_cor
    t1$plot_corr()

correlations_between_environmental_variables_and_60_taxa

  • 环境因子与alpha-多样性之间的关系

    1
    2
    3
    4
    t1 <- trans_env$new(dataset = meco_dataset, add_data = env_data[, 1:7])
    # use add_abund_table parameter to add the extra data table
    t1$cal_cor(add_abund_table = meco_dataset$alpha_diversity)
    t1$plot_corr()

relationship_between_7_environmental_factors_and_alpha_diversity

13. trans_nullmodel class

  • 近几十年来,系统发育分析和空模型(null model)的结合,通过增加系统发育维度(phylogeny dimension),更加有力地促进了生态位和中性(niche and neutral)对群落组装影响的推断[16,17]。trans_nullmodel class提供了一个封装,包括对系统发育信号、beta平均成对系统发育距离(beta mean pairwise phylogenetic distance,betaMPD)、beta平均最近分类单元距离(beta mean nearest taxon distance,betaMNTD)、beta最近分类单元指数(beta nearest taxon index,betaNTI)、beta净相关指数(beta net relatedness index,betaNRI)和基于Bray-Curtis的Raup-Crick(Bray-Curtis-based Raup-Crick,RCbray)的计算。系统发育信号分析的方法基于mantel相关图(mantel correlogram)[18],与其他方法相比,系统发育信号的变化是直观而清晰的。betaMNTD和betaMPD的算法已经过优化,比picante包中的算法更快[3]。RCbray和betaNTI(或betaNRI)之间的组合可用于推断在特定假设下支配群落装配(community assembly)的每个生态过程(ecological process)的强度[17]。这可以通过函数cal_process()来解析每个推断进程(ecological process)的百分比来实现。我们首先检查系统发育信号:

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    # generate trans_nullmodel object; use 10000 OTUs as example
    t1 <- trans_nullmodel$new(meco_dataset, taxa_number = 10000, add_data = env_data)

    # use TOC as the test variable (__报错:Error in cor(as.vector(xdis), ydis, method = method, use = use) :
    cov/cor中有遗漏值__)
    t1$cal_mantel_corr(use_env = "TOC")

    # return t1$res_mantel_corr
    # plot the mantel correlogram(__报错:Error in names(x) <- value : 'names'属性的长度[4]必需和矢量的长度[0]一样__)
    t1$plot_mantel_corr()
  • betaNRI(ses.betampd)用于显示“basal”系统发育转换(phylogenetic turnover)[18]。与betaNTI相比,它能捕获更多与深层系统发育(deep phylogeny)相关的转换信息(turnover information)。值得注意的是,经过几十年的发展,出现了许多空模型(null models)。在trans-nullmodel class中,我们随机化了物种的系统发育相关性。这种洗牌方法(shuffling approach)固定了观察到的物种α-多样性和β-多样性的水平,以探讨观察到的系统发育转换是否与空模型(物种间的系统发育关系是随机的)显著不同。

    1
    2
    3
    # 运行500次null model
    t1$cal_ses_betampd(runs=500, abundance.weighted = TRUE)
    # 返回t1$res_ses_betampd
  • 可以使用trans_beta class中的plot_group_distance function绘制betaNRI图。结果表明T20和T21的平均betaNRI 显著高于其它三者,表明T20和T21中的basal phylogenetic turnover是高的。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    # 将betaNRI矩阵加入到beta_diversity列表中
    meco_dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd

    # 使用measure "betaNRI"创建trans_beta class
    t2 <- trans_beta$new(dataset = meco_dataset, group = "Site", measure = "betaNRI")

    # transform the distance for each group
    t2$cal_group_distance()

    # 结果可视化
    library(ggplot2)
    g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
    g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)

betaNRI all

  • 若要单独的对每个组进行 null model analysis,例如每个组作为一个物种池(species pool),我们可以分别为每个组计算结果。 我们发现,当分别对每个组进行betaNRI 分析时,CW和TW间的mean betaNRI没有显著差异,且二者均显著高于IW ,揭示了在将每个区域视为特定物种库的条件下,CW和TW中变量选择的强度(strength of variable selection)可能相似。

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    # 创建一个列表用于存放trans_nullmodel的结果
    sesbeta_each <- list()
    group_col <- "Site"
    all_groups <- unique(meco_dataset$sample_table[, group_col])

    # 对每个组分别进行计算
    for(i in all_groups){
    # like the above operation, but need provide 'group' and 'select_group'
    test <- trans_nullmodel$new(meco_dataset, group = group_col, select_group = i, taxa_number = 1000, add_data = env_data)
    test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
    sesbeta_each[[i]] <- test$res_ses_betampd
    }

    # 合并结果并重塑(reshape),得到一个对称矩阵(symmetrical matrix)
    library(reshape2)
    test <- lapply(sesbeta_each, melt) %>% do.call(rbind, .) %>% reshape2::dcast(., Var1~Var2, value.var = "value") %>% `row.names<-`(.[,1]) %>% .[, -1, drop = FALSE]

    # 如同上述操作
    meco_dataset$beta_diversity[["betaNRI"]] <- test
    t2 <- trans_beta$new(dataset = meco_dataset, group = "Site", measure = "betaNRI")
    t2$cal_group_distance()
    g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
    g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)

betaNRI individual

  • BetaNTI(ses.betamntd) 可用于指示系统发育的末端转换( phylogenetic terminal turnover) [17]

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    # 运行500次null model
    t1$cal_ses_betamntd(runs=500, abundance.weighted = TRUE)
    # 返回t1$res_ses_betamntd

    # 将betaNTI矩阵加入到beta_diversity列表中
    meco_dataset$beta_diversity[["betaNTI"]] <- t1$res_ses_betamntd

    # 使用measure "betaNRI"创建trans_beta class
    t2 <- trans_beta$new(dataset = meco_dataset, group = "Site", measure = "betaNTI")

    # transform the distance for each group
    t2$cal_group_distance()

    # 结果可视化
    library(ggplot2)
    g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
    g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)
  • cal_rcbray()功能用于计算RCbray (Bray-Curtis-based Raup-Crick) ,以评估成分转换(compositional turnover)是否主要受漂移控制[19]。我们应用空模型(null model)通过从每个物种池中随机采样个体来模拟物种分布,同时保留物种发生频率(species occurrence frequency)和样本物种丰富度(sample species richness)[18]。

    1
    2
    3
    # result stored in t1$res_rcbray
    t1$cal_rcbray(runs = 1000)
    # return t1$res_rcbray
  • 作为一个例子,我们还计算了引用文献[17,18]中所示的在群落组装(community assembly)上推断过程( inferred processes )所占的比例。在此示例中,具有显着betaNTI值(|βNTI|> 2)的成对比较部分是估计的选择(Selection)造成影响; βNTI> 2代表异构选择(heterogeneous ); βNTI<-2表示同质选择(homogeneous )。 RCbray值表征了随机分配(randomization)下观察到的Bray-Curtis和Bray-Curtis期望值之间的偏差大小(magnitude of deviation)。 | RCbray | > 0.95被认为是显着的。 |βNTI| < 2和RCbray > +0.95被视为受散播限制(Dispersal Limitation)与漂移(Drift)相结合的影响。 |βNTI| < 2和RCbray < -0.95被视为均质分散(Homogenizing Dispersal)影响的估计值。 |βNTI| < 2和|RCbray| < 0.95估算了漂移单独作用的影响。

    1
    2
    3
    4
    5
    # use betaNTI and rcbray to evaluate processes
    t1$cal_process(use_betamntd = TRUE)

    # return t1$res_process
    t1$res_process
    process percentage
    1 variable selection 0.4341164
    2 homogeneous selection 63.6874362
    3 dispersal limitation 0.0000000
    4 homogeneous dispersal 14.8365679
    5 drift 21.0418795

14. trans_network class

correlation-based network

  • 准备R包并进行计算相关性
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    install.packages("WGCNA")
    library(WGCNA)
    # 以下3选1
    # 1. Use R base cor.test, slow
    t1 <- trans_network$new(dataset = meco_dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")
    # return t1$res_cor_p list; one table: correlation; another: p value

    # 2. SparCC method, require SpiecEasi package
    # SparCC is very slow, so consider filtering more species with low abundance
    t1 <- trans_network$new(dataset = meco_dataset, cal_cor = "SparCC", taxa_level = "OTU", filter_thres = 0.001, SparCC_simu_num = 100)

    # 3. When the OTU number is large, use R WGCNA package to replace R base to calculate correlations
    # require WGCNA package
    t1 <- trans_network$new(dataset = dataset, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.0001, cor_method = "spearman")

构建网络

  • COR_optimization = TRUE represent using RMT theory to find the optimized correlation threshold instead of the COR_cut.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    # construct network; require igraph package
    t1$cal_network(p_thres = 0.01, COR_optimization = TRUE)
    # return t1$res_network

    # (可选) use arbitrary coefficient threshold to contruct network
    t1$cal_network(p_thres = 0.01, COR_cut = 0.73)

    # save network
    # open the gexf file using Gephi(https://gephi.org/)
    # require rgexf package
    install.packages("rgexf")
    t1$save_network(filepath = "network.gexf")
  • Now, we show the node colors with the Phylum information and the edges colors with the positive and negative correlations. All the data used has been stored in the network.gexf file, including modules classifications, Phylum information and edges classifications.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    # calculate network attributes
    t1$cal_network_attr()
    # return t1$res_network_attr

    # classify the node; return t1$res_node_type
    t1$cal_node_type()
    # return t1$res_node_type
    # we retain the file for the following example in trans_func part
    network_node_type <- t1$res_node_type

    # plot node roles in terms of the within-module connectivity and among-module connectivity
    t1$plot_taxa_roles(use_type = 1)

    # plot node roles with phylum information
    t1$plot_taxa_roles(use_type = 2)
  • Now, we show the eigengene analysis of modules. The eigengene of a module, i.e. the first principal component of PCA, represents the main variance of the abundance in the species of the module.

    1
    2
    t1$cal_eigen()
    # return t1$res_eigen
  • 出错了!Then we perform correlation heatmap to show the relationships between eigengenes and environmental factors.

    1
    2
    3
    4
    5
    6
    # create trans_env object like the above operation
    t2 <- trans_env$new(dataset = dataset, add_data = env_data_16S[, 4:11])
    # calculate correlations
    t2$cal_cor(add_abund_table = t1$res_eigen)
    # plot the correlation heatmap
    t2$plot_corr()
  • The function cal_sum_links() is used to sum the links (edge) number from one taxa to another or in the same taxa. The function plot_sum_links() is used to show the result from the function cal_sum_links(). This is very useful to fast see how many nodes are connected between different taxa or within one taxa. In terms of “Phylum” level in the tutorial, the function cal_sum_links() sum the linkages number from one Phylum to another Phylum or the linkages in the same Phylum. So the numbers along the outside of the circular plot represent how many edges or linkages are related with the Phylum. For example, in terms of Proteobacteria, there are roughly total 900 edges associated with the OTUs in Proteobacteria, in which roughly 200 edges connect both OTUs in Proteobacteria and roughly 150 edges connect the OTUs from Proteobacteria with the OTUs from Chloroflexi.

    1
    2
    3
    4
    5
    6
    devtools::install_github("mattflor/chorddiag")
    # calculate the links between or within taxonomic ranks (报错:Error in ecount(network) : 没有"ecount"这个函数)
    t1$cal_sum_links(taxa_level = "Phylum")
    # return t1$res_sum_links_pos and t1$res_sum_links_neg
    # require chorddiag package
    t1$plot_sum_links(plot_pos = TRUE, plot_num = 10)

15. trans_func class

1
2
3
4
5
6
7
# Identify microbial traits
# create object of trans_func
t2 <- trans_func$new(meco_dataset)
# mapping the taxonomy to the database
# the function can recognize prokaryotes or fungi automatically
t2$cal_spe_func()
# return t2$res_spe_func, 1 represent function exists, 0 represent no or cannot confirmed.
  • The percentages of the OTUs having the same trait can reflect the functional redundancy of this function in the community or the module in the network.

    1
    2
    3
    4
    5
    6
    7
    # calculate the percentages of OTUs for each trait in each module of network
    # use_community = FALSE represent calculating module, not community, node_type_table provide the module information
    t2$cal_spe_func_perc(use_community = FALSE, node_type_table = network_node_type)
    # return t2$res_spe_func_perc
    # we only plot some important traits, so we use the default group list to filter and show the traits.
    t2$plot_spe_func_perc(select_samples = paste0("M", 1:10))
    # M represents module, ordered by the nodes number from high to low
  • Tax4Fun requires a strict input file demand on the taxonomic information. To analyze the trimmed or changed OTU data in R with Tax4Fun, we provide a link to the Tax4Fun functional prediction.

    1
    2
    3
    4
    5
    6
    7
    8
    t1 <- trans_func$new(meco_dataset)
    # install Tax4Fun package and download SILVA123 ref data from http://tax4fun.gobics.de/
    wget https://github.com/bwemheu/Tax4Fun2/releases/download/1.1.5/Tax4Fun2_1.1.5.tar.gz
    install.packages(pkgs = "Tax4Fun2_1.1.5.tar.gz", repos = NULL, source = TRUE)
    # decompress SILVA123; provide path in folderReferenceData as you put
    t1$cal_tax4fun(folderReferenceData = "./SILVA123")
    # return two files: t1$tax4fun_KO: KO file; t1$tax4fun_path: pathway file.
    # t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]

16. 知识点

确定性过程和随机过程

  • The term “deterministic process” refers to two types of selective forces, namely, those that lead to either more (i.e. homogeneous selection) or less (i.e. heterogeneous selection) similar structures among communities due to homogeneous and heterogeneous environmental pressures, respectively (Zhou & Ning, 2017). 确定性过程包括两种选择力,即分别导致更加相似(即同质选择)或更少相似(即异质选择)的群落间的结构的同质和异质环境压力。

  • The term “stochastic process” refers to homogenizing dispersal, dispersal limitation (combined with drift) and pure drift, which can obscure the turnover among microbial communities due to high dispersal; low dispersal; and random changes in birth, death and reproduction, respectively (Zhou & Ning, 2017). 随机过程是指均匀分散、分散限制(结合漂变)和纯漂变,它们可以通过高度分散、低分散和出生、死亡和繁殖的随机变化来掩盖/减弱微生物群落之间的更替。

Community assembly processes by Stegen *et al.* https://www.nature.com/articles/ismej201393

  • An NTI of >+2 indicates that the ASVs in a local community are more closely related than expected by chance, suggesting the role of selective pressures (e.g. environmental conditions) in phylogenetic clustering. An NTI of <−2 represents phylogenetic overdispersion, indicating two possible biotic interactions: competition and facilitation. In contrast, a mean NTI across multiple communities that is significantly greater or less than zero indicates phylogenetic clustering or overdispersion, respectively (Zhou & Ning, 2017). NTI >2表明,在一个地方群落中,ASVs的亲缘关系比预期的更为密切,表明选择压力(如环境条件)在系统发育聚类中的作用。NTI <-2表示系统发育过度分散,表明两种可能的生物相互作用:竞争和促进。相反,多个群落间的平均NTI显著大于或小于零,分别表明系统发育聚类或过度分散(Zhou和Ning,2017)。
  • βNTI>+2 or <−2 signified heterogeneous selection or homogeneous selection, respectively. βNTI>+2 or <−2 分别指示异质选择和同质选择。

17. 参考

上一篇:
扩增子分析--计算随机过程和决定性过程比例
下一篇:
宏基因组分析流程及代码