|
|
||||||||
1 MATFORSK Norwegian Food Research Institute, Osloveien 1, 1430 Ås, Norway
2 Hedmark University College, 2306 Hamar, Norway
Correspondence
Knut Rudi
knut.rudi{at}matforsk.no
| ABSTRACT |
|---|
|
|
|---|
Details of the computer program used for AIBIMM and a step-by-step explanation of the method of PCA modelling are available as supplementary material in IJSEM Online.
| INTRODUCTION |
|---|
|
|
|---|
We present a novel two-step phylogenetic approach based on alignment-independent bilinear multivariate modelling (AIBIMM) for global analysis of 16S rRNA gene phylogeny. The first step in AIBIMM is to transform DNA sequence data into DNA n-mer frequencies. The n-mer frequency data are obtained by sliding a window of size n (i.e. n nucleotides) and identifying the base combination present for each step. The number of contributions for each of the 4n combinations is then counted. The second step is based on using these data as the input in principal component analysis (PCA) (Mardia et al., 1979
). PCA is a multivariate statistical method that visualizes the main structures in data with redundancies, and it is particularly well suited to building models that can be used for classification. We used a classification based on individual PCA decompositions of subsets of the data (Wold, 1976
). An outline of the AIBIMM approach, as exemplified by a multimer window of n=5, is shown in Fig. 1
.
|
The aim of our work was to evaluate the application of AIBIMM by comparison with traditional phylogenetic methods and by global analyses of 16S rRNA gene sequences for taxa covering the bacterial and archaeal domains (Ludwig & Schleifer, 1999
; Acinas et al., 2004
; Stackebrandt & Goebel, 1994
; Suau et al., 1999
; Weisburg et al., 1991
). Our conclusion from the AIBIMM evaluation is that we obtained a good, statistically based separation of the taxa analysed and that AIBIMM has an advantage over traditional phylogenetic methods (distance, maximum-parsimony and maximum-likelihood) for large datasets and for sequences that are difficult to align. AIBIMM is thus promising for future global 16S rRNA gene analyses (Curtis & Sloan, 2004
).
| METHODS |
|---|
|
|
|---|
The sequences were transformed into multimer frequencies (n=1 to 6) by the computer program PhyloMode (described in Appendix 1 available as supplementary material in IJSEM Online; http://www.matforsk.no/web/sampro.nsf/downloadE/Microbial_community), developed in-house. The sizes of the multimer windows were chosen as a trade-off between detecting phylogenetic signals (homologous multimer equalities), avoiding base composition biases due to non-homologous multimer equalities and the requirements for computer operation time.
A given pair of multimers can either be equal due to a common ancestor (the same evolutionary origin) or equal due to mutational events (equal multimers with different evolutionary origin). The probability (Pequal) for homologous multimers equalities (equal multimers with the same evolutionary origin) was calculated by the formula Pequal = (1 x)n, where x is the frequency of nucleotide substitutions and n is the multimer window size. Basically, this formula describes the relationship between sequence and multimer divergence. The probability of non-homologous multimer equalities (equal multimers with different evolutionary origin) for the variable positions (Pnonhom) was calculated by Pnonhom = 1 [(4n 1)/4n]xl, where l is the length of the sequences compared, x is the frequency of nucleotide substitutions and n is the multimer window size. This formula describes the probability that a pair of different multimers becomes equal due to mutational events. Given a multimer of size n, there are (4n 1) alternative multimers to the given multimer. The probability that a random mutation would not generate a multimer equal to the given multimer is thus (4n 1)/4n. Given a series of random mutations r, this probability is [(4n 1)/4n]r. In our case, the number of random mutations is given by multiplying the length of the DNA sequence l by the frequency of nucleotide substitutions x. Finally, the probability that any of the random mutations r would lead to a multimer equal to the given multimer is one minus the probability that they would not.
Bilinear multivariate modelling.
Bilinear multivariate modelling methods (see e.g. Martens & Næs, 1989
) are designed for situations with many variables and possibly strong collinearities (correlations) among them, which is the case for the nucleotide multimer data. Redundant information in the original variables (multimers) is used to identify new latent variables (principal components) that reflect the most important underlying (latent) structure of the data. The principal components are projections of the original variables onto the space of most important variability. In our data, the variability is due to differences among homologous or non-homologous multimer equalities.
We used PCA with the NIPALS algorithm (Fisher & MacKenzie, 1923
) for bilinear multivariate modelling (Mardia et al., 1979
). We used implementations from The Unscrambler software (CAMO Inc.) and the program PhyloMode developed in-house (see Appendix 1, available as supplementary material in IJSEM Online).
PCA models were tested using jackknife cross-validation (Unscrambler implementation) in order to determine the stability of the models and the dimensionality of the data (number of principal components needed to describe the redundant variance). This procedure is based on successively deleting one sample or a certain percentage of the observations from the data. The rest of the data are used to build the model. The model is then tested on the observations kept out of the computations, and the predicted residual variance is computed. This procedure is repeated until all samples have been deleted once. Finally, the total residual variance is determined by averaging the individual contributions from each segment. The square root of the residual predictive variance is the root mean square error of prediction (RMSEP). The maximum number of principal components needed to describe the redundant variance is determined from the RMSEP. The criterion is that the RMSEP should decrease for the principal components included. We used 281 segments with 10 samples per segment for the global analyses of all 2818 taxa, while full cross-validation was used for the rest of the analyses.
Principal component scores were plotted for visual inspection of the relationships between samples in the datasets. The so-called loadings, representing the contributions of the different original variables (multimers) on the final scores, were also plotted to determine which multimers are important for the principal components. A detailed description of the steps in the PCA analysis is given in Appendix 2, available as supplementary material in IJSEM Online.
Classification by distance to PCA models.
In order to validate the group structures revealed by the PCA, a separate centred PCA model was built for each group identified to describe the main group structures. We used all taxa within the respective groups in constructing the PCA model. The reason for this was to obtain the best possible description of the group. Since the data have already been filtered for low-quality sequences, there will be no danger of overfitting the model to errors rather than true phylogenetic signals. The optimal number of principal components was selected separately for each model (according to the cross-validation results). The sample-to-model distances for the taxa to be classified (the so-called Si value) were computed. For each model, Si cut-off values were set so that they included all the taxa for which the model was built. The accuracy of the classification was then evaluated based on how many additional taxa had Si values lower than the cut-off value. Note that this strategy has strong similarities to the SIMCA method (Wold, 1976
) for classification.
AIBIMM tree construction.
Trees were constructed to visualize the hierarchical structure in the AIBIMM data. The PCA coordinates were used as input for the cluster analyses. We used Euclidean distances since these represent the true distance between objects in orthogonal space.
We used centroid, single, complete and median linkage in the tree (dendrogram) construction. Statistical evaluation of tree stability was done using each of the cross-validation PCA coordinate data as the input for construction of separate trees. The branch frequencies (i.e. the number of times a given branch occurs among the jackknife trees) were subsequently used as stability indicators.
Traditional phylogenetic reconstruction.
DNA sequences were aligned by the alignment software CLUSTAL W (Thompson et al., 1997
). The alignments were then corrected manually using the program BioEdit (Hall, 1999
). Prealigned sequences (Maidak et al., 1997
) were used in addition to our own alignments.
Phylogenetic distance trees were constructed using TamuraNei distances and the minimum-evolution algorithm provided in the MEGA 3 software package (Kumar et al., 2001
). The interior branch test with 100 replicates was used for significance testing. Maximum-parsimony trees were also constructed using MEGA 3 with close-neighbour interchange (CNI) and bootstrap statistical testing (100 replicates). Maximum-likelihood analyses were run on the server http://bioweb.pasteur.fr using quartet puzzling and maximum-likelihood analyses for tree construction (Strimmer & von Haeseler, 1996
).
| RESULTS |
|---|
|
|
|---|
We used prealigned sequences from the RDP-II database in our traditional phylogenetic reconstruction (Maidak et al., 1997
). Our own alignment with CLUSTAL W (Thompson et al., 1997
) was clearly not correct (results not shown). The difficulties in aligning the sequences are probably due to a high degree of divergence in the variable regions of 16S rRNA for the taxa analysed.
The AIBIMM tree (Fig. 2a
) showed tight clustering, within a distinct subgroup of the Actinobacteria, for taxa belonging to Rubrobacter (Rubrobacteridae) and Acidimicrobium (Acidimicrobidae). The distance (Fig. 2b
), maximum-parsimony (Fig. 2c
) and maximum-likelihood (Fig. 2d
) methods, on the other hand, showed an unresolved deep branching of the Actinobacteria. The Actinobacteria appeared more closely related to the Archaea than to the Proteobacteria in these trees. In particular, the placements of Rubrobacter and Acidimicrobium were inconsistent.
|
|
Global AIBIMM model
The global data are presented as PCA plots because of the difficulties of representing large sets of data as trees. We determined a global AIBIMM model based on nucleotide pentamers for all 2818 taxa included in this work (Fig. 4a
). The first three principal components explained 23 % of the variance in the data. The variance explained by cross-validation was almost identical to the modelled variance, strongly supporting the validity of the AIBIMM model. The model separated the Archaea and Bacteria as two major groups. The Bacteria were further separated into Firmicutes, Actinobacteria, Alpha-/Delta-/Epsilonproteobacteria, Beta-/Gammaproteobacteria and a miscellaneous unseparated group, as indicated by the ellipsoids.
|
There was low or no structure (different clusters) of the loading for the pentameter variables in the global AIBIMM model (Fig. 4b
). This is in stark contrast to the high structure found in the score plot (Fig. 4a
). There are therefore no particular structures for the pentamers that are more important than others for explaining the variance in the score plot. This suggests that the structures are mainly explained by homologous multimer equalities, since non-homologous multimer equalities would result in structures in the loading plot such as, for example, gradients of G+C content.
AIBIMM models for the major subgroups
The support for the six subgroups presented in Fig. 3
was also evaluated by Si distance classification (Fig. 5
). The classification based on Si showed that all the models, except the model for the miscellaneous group (Fig. 5b
), gave a good separation of the target and non-target taxa.
|
The AIBIMM models used for classification based on Si distances also showed structuring of the taxa within each subgroup (Fig. 6
). The variance explained by cross-validation, compared with the modelled variance, was significant to at least the eighth principal component for all the subgroups, except in the miscellaneous group (this group was significant until the fifth principal component). This further supports the conclusion that the data are highly structured. The Euryarchaeota/Crenarchaeota were separated into three tightly clustered groups with x, y centroid coordinates of (14, 12), (17, 14) and (20, 0), respectively (Fig. 6a
). The first group contained methanogenic taxa, the second halophilic taxa and the third thermophilic taxa. There was no clear separation of Euryarchaeota and Crenarchaeota for the thermophilic taxa. These groups seemed continuous with the highest score for the first principal component for the Crenarchaeota. The loadings, however, showed that there has been a selection within Crenarchaeota for structures contained in the pentamer 5'-GGGGG-3'. The Aeropyrum pernix strains are the most extreme, with a mean frequency of 20 for these G-tract pentamers, while the mean value for the entire dataset of bacteria and archaea is 3.
|
There are relatively few tight clusters of taxa within the Firmicutes (Fig. 6c
). The cross-validation results, however, were stable, with low errors, showing that redundant data gave the same pattern. It is therefore likely that this pattern is due to the evolution of the Firmicutes and not instability or uncertainty of the model. The Staphylococcaceae, however, formed a tight cluster with centroid coordinates (15, 2.5). The Mycoplasmataceae also formed a relatively distinct cluster of taxa (centroid coordinates 2.3, 14). There was also an apparent structuring of aerotolerant taxa (negative score for first principal component) and anaerobic taxa (positive score for first principal component).
The Alpha-/Delta-/Epsilonproteobacteria are clearly divided into three separate clusters (Fig. 6d
). The Alphaproteobacteria consists of the largest group, with 227 taxa, with centroid coordinates of (3.5, 0). The Epsilonproteobacteria gave the most negative score for both principal components (centroid coordinates 22, 12.5). The Deltaproteobacteria gave a negative score for the first principal component and a positive score for the second (centroid coordinates 9.5, 12).
The Beta-/Gammaproteobacteria were divided in several clusters (Fig. 6e
). The cluster of the Betaproteobacteria gave the highest negative score for both principal components (centroid coordinates 12, 10). The Gammaproteobacteria, on the other hand, was separated into several different clusters. This class is relatively diverse, with centroid coordinates ranging from Pseudomonas (9, 12.5) to the Enterobacteriacae (12, 6).
The structures within the Actinobacteria resemble those of the Firmicutes, with a few tight clusters (Fig. 6f
). The only relatively distinct clusters identified were corynebacteria and mycobacteria, with respective centroid coordinates of (10, 5) and (11.5, 6). There was also a separation of two major clusters along the first principal component. Both corynebacteria and mycobacteria gave negative scores, while there was a wide diversity of taxa giving positive scores.
Nested AIBIMM approach
We analysed the 227 taxa belonging to the Alphaproteobacteria (outlined in Fig. 6d
) as an example of the nested AIBIMM approach. This analysis revealed four main clusters (Fig. 7a
). The tight cluster with centroid coordinates (2, 9) consisting of 63 taxa belonging to the Rhizobiales was further analysed by the nested approach. When analysing this group alone, new cluster patterns were revealed (Fig. 7c
). A cluster of 17 taxa belonging to the genus Bartonella (centroid coordinates 11.7, 2.1) was further analysed using the nested approach. Analyses of this genus alone gave three new clusters (Fig. 7e
). Two of these clusters consisted of single species, Bartonella bacilliformis (centroid coordinates 10, 1.5) and Bartonella quintana (centroid coordinates 0, 10). The last group, with centroid coordinates (2.5, 1.5), was more diverse and consisted of several species. However, as evaluated by the cross-validation results, further separation of the taxa was not possible (results not shown). Finally, the Si distance classification revealed that the different PCA models gave excellent separation of the taxa within the nested models from those outside. There were no misclassifications in any of the models for the 2818 taxa evaluated (Fig. 7b, d, f
).
|
| DISCUSSION |
|---|
|
|
|---|
A challenge with alignments is the analysis of positions containing noise, such as alignment errors, or positions that have lost phylogenetic signal due to mutation saturation. Such positions lower the resolution of the deep branches in phylogenetic trees. Evolutionary models (Kolaczkowski & Thornton, 2004
; Felsenstein, 1981
; Hillis & Huelsenbeck, 1992
; Holder & Lewis, 2003
; Huelsenbeck et al., 2001
; Brochier & Philippe, 2002
) or separate analyses of conserved regions (Ludwig et al., 1998
; Brochier & Philippe, 2002
) have been used to increase the resolution.
The distinct clustering of the Actinobacteria detected by AIBIMM is an interesting example with respect to analyses of deep phylogenetic branches. The AIBIMM results showed that the Actinobacteria is a relatively tightly clustered group separated along the first principal component in our global model. All other bacteria were separated along the second principal component (see Fig. 3
). Thus, the Actinobacteria are distinct among the bacteria according to our AIBIMM model. In contrast, a recent state-of-the-art, alignment-based phylogenetic reconstruction placed the Actinobacteria in a divergent unresolved group between Beta-/Gammaproteobacteria and Alpha-/Delta-/Epsilonproteobacteria (Acinas et al., 2004
), which is not in accordance with standard taxonomy (Garrity et al., 2003
). The reason is probably the lack of resolution in the phylogenetic methods applied (Kolaczkowski & Thornton, 2004
).
AIBIMM and traditional phylogenetic methods have different fields of application. Given a correct alignment, relatively few taxa and a known evolutionary model, alignment-based methods represent the most direct phylogenetic reconstruction. These methods try to reconstruct the actual evolutionary steps.
The advantage of AIBIMM, however, is the possibility to analyse large global sets of data. The global analysis of the Actinobacteria, for instance, indicated that this bacterial group is distinct from other bacteria. Such global structures are difficult to reveal using alignment-based phylogenetic methods. As we see it, the basic problem with current phylogenetic approaches is that phylogenetic signals can not be separated from signals due to incorrect alignments or mutation saturation (noise) in global sets of data. These problems, however, are addressed in AIBIMM by using cross-validated PCA analyses. The importance of using PCA here lies primarily in its ability to compress information onto the most important dimensions of variability in the dataset and, in this way, to focus only on valid common structures in the taxa considered. The more subtle parts of the genetic information are filtered by the jackknife cross-validation, such as unstructured data due to non-homologous multimer equalities. As a consequence, the combination of cross-validated PCA with an optimal multimer window size selects parts of the data that contain phylogenetic structure.
Size effect of nucleotide multimer window
The optimal size of the nucleotide multimers applied to the AIBIMM transformation is dependent on the variance (length and divergence) of the sequences. Theoretical considerations of equalities due to homologous or non-homologous multimers can be used in selecting an optimal window size (see Fig. 3
).
For instance, for window size n=4, the crossing-point at approximately 15 % divergence between the probability of homologous and non-homologous multimer equalities (Fig. 3d
) indicates that non-homologous multimers can introduce structure in the n=4 data. Empirically, this can be seen as structures such as G+C content become predominant in the variance among the taxa. The effect is even more pronounced in the empirical data for the shorter multimers (n=1 to 3). Theoretically, hexamer analyses give the best separation between homologous and non-homologous multimer equalities (see Fig. 3f
). There are, however, 46 = 4096 combinatorial possibilities of hexamers, making these analyses less efficient with respect to computer operation time when analysing large sets of data.
The multimer window of n=5 is quite efficient with respect to computer operation time, and the probability of homologous multimer equalities is higher than for non-homologous equalities for sequences with less than approximately 25 % divergence. Comparison between the empirical AIBIMM data and prior knowledge about the relatedness between the taxa analysed indicates that homologous multimer equalities dominate the structure for the multimer window of n=5. These are the reasons why n=5 was chosen for the global analyses in this work.
Non-homologous multimer structures
It is not possible to remove all effects of non-homologous multimers in the AIBIMM analyses. However, if the non-homologous structure is known, such as G+C content, it can be avoided in the AIBIMM analyses. It is possible to evaluate whether the loading for a given principal component follows for example a multimer G+C gradient. If the loading follows a gradient, the given principal component can be excluded from the AIBIMM analysis in order to obtain a more accurate phylogenetic description.
Codon usage biases in protein-encoding genes are another example of non-homologous multimer structure that could represent a particular problem with respect to AIBIMM. This problem, however, could probably partly be avoided by omitting the third, wobble position in the AIBIMM analyses. The resulting dinucleotides should then be used as the unit in the multimer analyses (analysing windows of 2n). The principal component loadings can also be evaluated with respect to codon-usage gradients as for the G+C content biases described above.
A novel coordinate-based AIBIMM classification system
Our view is that it would better to classify bacteria according to an AIBIMM coordinate system rather than by traditional alignment- and/or tree-based systems (Garrity et al., 2003
). Coordinate-based classification systems are better suited for storage and retrieval from databases than are tree-based systems. AIBIMM also enables a more objective and statistically more soundly based classification, since alignments are avoided. Furthermore, the classification system should be adaptable to very large sets of data. There are, for instance, no technical barriers to implementing such classification systems that accommodate the 1 to 2 million bacterial species assumed to exist on Earth (Curtis & Sloan, 2004
).
| ACKNOWLEDGEMENTS |
|---|
| REFERENCES |
|---|
|
|
|---|
Brochier, C. & Philippe, H. (2002). Phylogeny: a non-hyperthermophilic ancestor for bacteria. Nature 417, 244.[CrossRef][Medline]
Curtis, T. P. & Sloan, W. T. (2004). Prokaryotic diversity and its limits: microbial community structure in nature and implications for microbial ecology. Curr Opin Microbiol 7, 221226.[CrossRef][Medline]
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17, 368376.[CrossRef][Medline]
Fisher, R. A. & MacKenzie, W. A. (1923). Studies in crop variation. II. The manurial response of different potato varieties. J Agric Sci 13, 311320.
Garrity, G. M., Bell, J. A. & Lilburn, T. G. (2003). Taxonomic outline of the prokaryotes, release 4. In Bergey's Manual of Systematic Bacteriology, 2nd edn. New York: Springer. doi:10.1007/bergeysoutline
Hall, T. A. (1999). BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser 41, 9598.
Hillis, D. M. & Huelsenbeck, J. P. (1992). Signal, noise, and reliability in molecular phylogenetic analyses. J Hered 83, 189195.
Holder, M. & Lewis, P. O. (2003). Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet 4, 275284.[CrossRef][Medline]
Huelsenbeck, J. P., Ronquist, F., Nielsen, R. & Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science 294, 23102314.
Kolaczkowski, B. & Thornton, J. W. (2004). Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431, 980984.[CrossRef][Medline]
Kumar, S., Tamura, K., Jakobsen, I.-B. & Nei, M. (2001). MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17, 12441245.
Kvaal, K., Wold, J. P., Indahl, U. G., Baardseth, P. & Næs, T. (1998). Multivariate feature extraction from textural images of bread. Chemometrics Intell Lab Syst 42, 141158.
Ludwig, W. & Schleifer, K.-H. (1999). Phylogeny of bacteria beyond the 16S rRNA standard. ASM News 65, 752757.
Ludwig, W., Strunk, O., Klugbauer, S., Klugbauer, N., Weizenegger, M., Neumaier, J., Bachleitner, M. & Schleifer, K. H. (1998). Bacterial phylogeny based on comparative sequence analysis. Electrophoresis 19, 554568.[CrossRef][Medline]
Maidak, B. L., Olsen, G. J., Larsen, N., Overbeek, R., McCaughey, M. J. & Woese, C. R. (1997). The RDP (Ribosomal Database Project). Nucleic Acids Res 25, 109111.
Mardia, K. V., Kent, J. T. & Bibby, J. M. (1979). Multivariate Analysis. London: Academic Press.
Martens, H. & Næs, T. (1989). Multivariate Calibration. Chichester: Wiley.
Stackebrandt, E. & Goebel, B. (1994). Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 44, 846849.
Strimmer, K. & von Haeseler, A. (1996). Quartet puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol Biol Evol 13, 964969.
Suau, A., Bonnet, R., Sutren, M., Godon, J. J., Gibson, G. R., Collins, M. D. & Dore, J. (1999). Direct analysis of genes encoding 16S rRNA from complex communities reveals many novel molecular species within the human gut. Appl Environ Microbiol 65, 47994807.
Thompson, J. D., Gibson, T. J., Plewniak, F., Jeanmougin, F. & Higgins, D. G. (1997). The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25, 48764882.
Vinga, S. & Almeida, J. (2002). Alignment-free sequence comparison a review. Bioinformatics 19, 513523.
Weisburg, W. G., Barns, S. M., Pelletier, D. A. & Lane, D. J. (1991). 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol 173, 697703.
Woese, C. R. (1987). Bacterial evolution. Microbiol Rev 51, 221271.
Wold, S. (1976). Pattern recognition by means of disjoint principal component models. Pattern Recognit 8, 127139.[CrossRef]
This article has been cited by other articles:
![]() |
K. Rudi, M. Zimonja, B. Kvenshagen, J. Rugtveit, T. Midtvedt, and M. Eggesbo Alignment-Independent Comparisons of Human Gastrointestinal Tract Microbial Communities in a Multidimensional 16S rRNA Gene Evolutionary Space Appl. Envir. Microbiol., April 15, 2007; 73(8): 2727 - 2734. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| INT J SYST EVOL MICROBIOL | MICROBIOLOGY | J GEN VIROL |
| J MED MICROBIOL | ALL SGM JOURNALS | |