1 Bioinformatics Group, American Type Culture Collection, Manassas, VA 20110, USA
2 Bergey's Manual Trust and Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI 48824, USA
Correspondence
George Garrity
garrity{at}msu.edu
| ABSTRACT |
|---|
Published online ahead of print on 9 May 2003 as DOI 10.1099/ijs.0.02749-0.
A PDF file of the comprehensive taxonomy can be found at http://dx.doi.org/10.1007/bergeysoutline
| INTRODUCTION |
|---|
The legacy of 15 years of SSU rRNA sequencing is tens of thousands of sequences, some nearly full length, some short, most of high quality, and some of more dubious value. This collection is an invaluable resource for the establishment of a comprehensive prokaryotic taxonomy. There is a consensus that SSU rRNA-based phylogenies are largely consistent with the evolutionary history of the organisms, since the groups formed using this approach are often confirmed by other data, i.e. by phenotypic properties. Furthermore, while attempts to build a universal tree have revealed that different molecules produce different trees, it would appear that such trees are generally, though not absolutely, consistent with the SSU rRNA tree (Brown & Koretke, 2000). Whole-proteome comparisons have also produced trees similar to the SSU rRNA tree (Tekaia et al., 1999). Therefore, we expect that a taxonomy based on the wealth of available SSU rRNA sequences should have predictive as well as organizational value.
Until recently, very few attempts have been made to produce a comprehensive taxonomy of the prokaryotes that (i) takes advantage of the large numbers of SSU rRNA sequences available, (ii) is reconcilable with our knowledge of other genotypic and phenotypic information, and (iii) provides a link between the phylogenetic models and the nomenclatural record. We recently published our first attempts in this direction (Garrity & Lilburn, 2002), which drew heavily on techniques from exploratory data analysis (Tukey, 1977). A principal components analysis (PCA) of a matrix of evolutionary distances of >9000 full-length sequences to a set of 223 benchmark sequences revealed the structure of the higher-level relationships amongst the prokaryotes. Despite the fact that the approach reliably reduced the dimensionality of the data from 223 dimensions to 3 and allowed the generation of meaningful two-dimensional plots, the underlying cause for the placement of a given species was not immediately apparent, as PCA involves a transformation of the original data. Moreover, some distortion of visual perception was likely to occur.
In this paper, we extend our exploratory data analysis approach to include techniques that have also found useful application in the field of microarray analysis (Eisen et al., 1998), i.e. supervised clustering and visualization via heat maps. Our results demonstrate that these graphs (which are a recent adaptation of shaded distance matrices; see, for example, Sneath & Sokal, 1973) provide an informative overview of the current taxonomy and make errors in classification obvious. The heat maps will provide a simple way of placing sequences from novel organisms in the taxonomy, thus allowing simultaneous identification and classification of organisms. We also introduce a comprehensive prokaryotic taxonomy.
| METHODS |
|---|
Sequence data.
Only relatively long prokaryotic sequences were used in the analyses in order to maximize the information content and to ensure that the sequences contained as many homologous positions as possible. The 9206 sequences used were more than 1399 bases long and had less than 4 % ambiguity. If sequences contained no data (N's) in more than 10 consecutive alignment positions, they were eliminated from the dataset.
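The three inclusion criteria above can be sketched as a simple filter. This is a minimal illustration in Python, not the procedure actually used: the function name, the gap characters (`-` and `.`) and the definition of ambiguity as the fraction of non-ACGTU symbols are all assumptions made for the example.

```python
import re

# Thresholds taken from the text; the names are hypothetical.
MIN_LENGTH = 1400       # "more than 1399 bases"
MAX_AMBIGUITY = 0.04    # "less than 4 % ambiguity"
MAX_N_RUN = 10          # more than 10 consecutive N's disqualifies a sequence

UNAMBIGUOUS = set("ACGTU")


def passes_filters(aligned_seq: str) -> bool:
    """Return True if an aligned sequence meets all three criteria."""
    seq = aligned_seq.upper()
    # Reject any run of more than MAX_N_RUN consecutive N's in the alignment.
    if re.search("N{%d,}" % (MAX_N_RUN + 1), seq):
        return False
    # Measure length on bases only; gap symbols are assumed to be '-' or '.'.
    bases = [c for c in seq if c not in "-."]
    if len(bases) < MIN_LENGTH:
        return False
    # Assumed definition: ambiguity = share of symbols other than A, C, G, T, U.
    ambiguous = sum(1 for c in bases if c not in UNAMBIGUOUS)
    return ambiguous / len(bases) < MAX_AMBIGUITY
```

For example, `passes_filters("ACGT" * 350)` accepts a 1400-base sequence, while the same sequence followed by eleven N's is rejected.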
The data were grouped and 223 benchmark sequences incorporated as discussed previously (Garrity & Lilburn, 2002). In the benchmark set of sequences, each sequence represented, where possible, a type species and type genus on which the families are based (Garrity et al., 2002). All 25 phyla in the Bergey's Taxonomic Outline are represented.
Estimation of evolutionary distances.
Prior to estimation of evolutionary distance, subsets of sequences were created, ranging from 750 to 900 sequences in total. Each subset contained the benchmark sequences as the first 223 sequences. Matrices of evolutionary distances were calculated in PAUP* (version 4.08) (Swofford, 2000) using the Jukes–Cantor model (Jukes & Cantor, 1969). Following computation, each matrix was exported as a tab-delimited file, using a short identifier to tag each sequence.
Data structures.
Matrices of evolutionary distances were imported into the statistical package S-Plus 6.1 (Insightful), edited and joined in a single data frame and finally linked to a data frame containing taxonomic and physiological information, as described previously (Garrity & Lilburn, 2002). By invoking functions that are part of S-Plus, we were able to arrange the sequence order on the axes of the matrix according to the current version of the taxonomy, based on the hierarchy of names, which were treated as ordered factor variables. Sequences without names were moved to the ends of the lists. The matrix was then colour-coded to allow data patterns to be seen. This allows the identification of any potential misplacements that arise because of incorrect annotation of sequences or because of a failure to identify synonyms.
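The ordering step can be illustrated with a short sketch. The original work used S-Plus ordered factors; the Python below is only a stand-in for that idea, and every name in it (`taxonomic_order`, `reorder_rows`, the `levels` structure) is hypothetical. Each rank has a list of names in outline order, playing the role of the factor levels, and sequences without a lineage are sorted to the end.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Lineage = Optional[Tuple[str, ...]]  # e.g. (phylum, class); None if unnamed


def taxonomic_order(lineages: Dict[str, Lineage],
                    levels: Sequence[Sequence[str]]) -> List[str]:
    """Sort sequence ids by the outline order of their names, rank by rank.

    levels[r] lists the names at rank r in outline order. A name missing
    from levels raises ValueError. Unnamed sequences go last, keeping
    their input order because sorted() is stable.
    """
    def key(seq_id: str):
        lineage = lineages[seq_id]
        if lineage is None:
            return (1, ())
        ranks = tuple(levels[r].index(name) for r, name in enumerate(lineage))
        return (0, ranks)

    return sorted(lineages, key=key)


def reorder_rows(matrix: List[List[float]], row_ids: List[str],
                 new_order: List[str]) -> List[List[float]]:
    """Return the rows of matrix rearranged to follow new_order."""
    position = {seq_id: i for i, seq_id in enumerate(row_ids)}
    return [matrix[position[seq_id]] for seq_id in new_order]
```

Once the rows follow the taxonomy, a sequence whose colours match a distant block rather than its neighbours stands out as a candidate misplacement.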
In the next cycle, the misidentified sequences were extracted and arranged in a new matrix according to their similarities to the benchmark sequences. The unnamed benchmark sequences were also reordered.
Similar routines were carried out on two subsets of the data: sequences from the class Betaproteobacteria and from the family Comamonadaceae.
| RESULTS |
|---|
| DISCUSSION |
|---|
The application of exploratory data analysis techniques to the problem of a comprehensive prokaryotic taxonomy has proved to be quite fruitful. The two techniques adopted, PCA and shaded distance matrices, have their own advantages for constructing the taxonomy. PCA provides a three-dimensional map of the sequence space or evolutionary space defined by the dataset (Garrity & Lilburn, 2002). This structure was consistent with extant large-scale phylogenetic trees, and the ability to visualize the position of individual sequences within this three-dimensional higher-level structure proved invaluable. The map-like qualities of the PCA plots enabled us to spot misclassified organisms, poorly curated sequences and other problems or anomalies. The disadvantages of the three-dimensional view include occlusion of points (one can't see the entire dataset at once) and a lack of resolution between taxonomic groups below phylum level unless further processing is done. Nevertheless, a three-dimensional point-cloud view of sequence space is quite provocative, encouraging questions about the evolutionary forces that drove the sequences to adopt the positions we see.
Heat maps can present the data as they are clustered in both extant and proposed taxonomies. Since the heat maps give a two-dimensional view, there is, in principle, no occlusion of data, and one can review the positions of all the sequences in the taxonomic hierarchy. Thus, in our matrix showing all of the sequence distances (Fig. 1), we can first see that our initial taxonomy is generally consistent. Problems are visible, such as the misplaced sequence in the 'Clones' benchmarks, but solutions immediately offer themselves. We can see where this misplaced sequence might belong (in the Archaea), and the correct placement of many of the unnamed sequences is suggested by the colour and position of the matrix elements. The heat maps make assignment of new sequences to the appropriate taxon simple: the sequence is aligned, distances to the benchmark sequences are obtained, and the new sequence is placed in the complete distance matrix next to the sequence that has the most similar set of distances with respect to the benchmarks. We note that not all the information in the heat map is immediately visible because screen and printer resolutions lead to the presentation of multiple distances from the x-axis sequences to a single benchmark sequence as a single block of colour. If we zoom in on the matrix or create a subset of cells from the matrix, as in Fig. 2, this problem disappears.
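The placement procedure described above (obtain the distances to the benchmarks, then put the new sequence next to the sequence with the most similar set of distances) can be sketched as a nearest-neighbour search over distance profiles. The text does not say how similarity between two profiles was measured, so the Euclidean distance used here is an assumption, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def profile_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two benchmark-distance profiles (assumed metric)."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same benchmarks")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_profile(new_profile: Sequence[float],
                    profiles: Dict[str, Sequence[float]]) -> str:
    """Return the id of the sequence whose profile is closest to new_profile."""
    return min(profiles,
               key=lambda seq_id: profile_distance(new_profile, profiles[seq_id]))


def insert_next_to(order: List[str], neighbour_id: str, new_id: str) -> List[str]:
    """Return a copy of the axis order with new_id placed right after neighbour_id."""
    i = order.index(neighbour_id)
    return order[:i + 1] + [new_id] + order[i + 1:]
```

In this sketch each profile would hold 223 values, one per benchmark sequence, listed in the same benchmark order for every sequence.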
Fig. 2 shows the distribution of the 1743 unnamed sequences amongst the 25 phyla. This is an example of how heat maps can be used to generate speculative classifications, as none of the organisms associated with the unnamed sequences have validly published names. Fig. 2 makes the broad diversity of the benchmarks evident: dark lines representing one taxon in the Gammaproteobacteria and four taxa from the Actinobacteria stand out. These dark lines indicate a relatively low level of similarity to all the other taxa represented in the matrix. This could mean that there is a problem with these sequences, or it may imply that these sequences are from rare, extremely under-represented lineages, or from lineages yet to be described.
Fig. 3 relates to the exploration of the taxonomy of a second subset of the sequences, those representing the Betaproteobacteria. It is easy to see in this figure that eight of the 392 sequences from organisms classified as members of the Betaproteobacteria are probably misclassified. Two of the first three sequences along the x-axis are probably from the Alphaproteobacteria, whereas the last five sequences appear to be most closely related to the Gammaproteobacteria, but it is possible that they are not members of the Proteobacteria at all. The correct placement of these last five sequences can only be achieved if the set of benchmarks includes sequences that are related to the unknowns. This result illustrates the importance of a comprehensive selection of benchmark sequences.
The heat maps are more flexible than the PCA plots in that they can be used to visualize relationships down to the taxonomic level at which tree displays are useful, as when we seek resolution at or below the family level (which is the level at which the benchmarks were set). For example, in the visualization of the family Comamonadaceae shown in Fig. 4, we can see the genera within the family quite clearly, especially genera that are represented by several sequences, like Hydrogenophaga. It is also readily apparent that the family contains two major groups and, indeed, we note that, in the latest edition of the Bergey's Manual of Systematic Bacteriology (Willems & Gillis, 2004), the genera in the smaller group (Rubrivivax, Sphaerotilus, Ideonella and Aquabacterium), which were once included in the Comamonadaceae, have been removed from this family. At the same time, Leptothrix was moved into the Comamonadaceae, but it is apparent from this analysis that it should be grouped with the removed genera. It would also appear that Polaromonas and Brachymonas are probably not members of the Comamonadaceae, but this family contains the closest sequenced relatives to these two genera. Fig. 4 also contains five species of the genus Aquaspirillum and Alcaligenes latus, which are currently not classified as members of the Comamonadaceae. These species more closely resemble members of the Comamonadaceae than the type strains of the type species of their respective genera, and the placements shown in Fig. 4 illustrate the use of heat maps for the revision of an extant taxonomy. Note also that, at the taxonomic level presented in Fig. 4, the heat maps can be used to enhance (or complement) a tree: the distances from a given taxon to all the other taxa in a tree are given by the colour of the matrix elements in that row.
The overviews provided by heat-map visualizations also invite observations on prokaryotic diversity. For example, by comparing the x-axes in Figs 1 and 2, we can compare the distribution of the named and unnamed sequences within the 25 phyla. The proportions of Proteobacteria and Firmicutes are the same in each distribution. In the unnamed sequence set, the proportions of Spirochaetes and Actinobacteria are two to four times lower than in the named sequence set, while the proportions of Bacteroidetes and Cyanobacteria are twofold greater. As mentioned above, rows or columns that are globally dark may be indicative of unusually diverse sequences and, hence, organisms. The dark lines seen within the Proteobacteria, Actinobacteria and Dictyoglomus in Fig. 2 are associated with sequences that are less similar than usual to all the other sequences in the dataset, and hint at the existence of novel phyla. Similar dark lines are seen within the Firmicutes in Fig. 1, and are concentrated in the region of several families that historically have been hard to place. These include Haemobartonella and Eperythrozoon, two genera that appear well separated from all others in the PCA maps.
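The comparison of named and unnamed sequence distributions amounts to computing each phylum's share of each set and taking the ratio. The Python sketch below shows that arithmetic on a list of phylum labels; the function names are hypothetical and no values from the paper's dataset are reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable


def phylum_proportions(phyla: Iterable[str]) -> Dict[str, float]:
    """Share of sequences assigned to each phylum (shares sum to 1)."""
    counts = Counter(phyla)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def fold_changes(named: Iterable[str], unnamed: Iterable[str]) -> Dict[str, float]:
    """Ratio of the unnamed-set share to the named-set share, per phylum.

    Only phyla present in both sets are reported, so no division by zero occurs.
    """
    p_named = phylum_proportions(named)
    p_unnamed = phylum_proportions(unnamed)
    return {name: p_unnamed[name] / p_named[name]
            for name in p_named if name in p_unnamed}
```

A ratio near 1 corresponds to the "same proportion" case described for the Proteobacteria and Firmicutes, a ratio near 2 to a twofold greater share in the unnamed set.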
A description of the supervised clustering algorithm used to build and refine the heat maps is outside the scope of this journal and has been submitted for publication elsewhere. The software will be available as part of a web site that will allow researchers to explore prokaryotic taxonomy and classify their own organisms on the basis of the SSU rRNA sequences. A prototype of this site will be available before the end of 2003 (http://www.msu.edu/garrity/taxoweb/index.html).
The taxonomy developed using, in part, the approaches discussed in this paper is available from the Bergey's Manual Trust via the World Wide Web (Garrity et al., 2002; http://dx.doi.org/10.1007/bergeysoutline). It is revised twice a year as new data and analyses become available and includes information regarding emendations of the classification along with commentary on taxa in dispute. The outline also includes information about when an organism was first described, strain designation, culture deposit information, synonymies, SSU rRNA sequence deposit information and the Ribosomal Database Project II short identifier. The latter two items are not always available, as high-quality, full-length SSU rRNA sequences are not yet available for all of the type strains. This taxonomy is perhaps the first comprehensive taxonomy of the prokaryotes, and is presented as a work in progress. Comments and suggestions are welcomed.
| ACKNOWLEDGEMENTS |
|---|
| REFERENCES |
|---|
Cannone, J. J., Subramanian, S., Schnare, M. N. & 11 other authors (2002). The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs. BMC Bioinformatics 3, 2.
Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95, 14863–14868.
Fox, G. E., Stackebrandt, E., Hespell, R. B. & 16 other authors (1980). The phylogeny of prokaryotes. Science 209, 457–463.
Garrity, G. M. & Lilburn, T. G. (2002). Mapping taxonomic space: an overview of the road map to the second edition of Bergey's Manual of Systematic Bacteriology. WFCC Newsl 35, 5–15.
Garrity, G. M., Johnson, K. L., Bell, J. & Searles, D. B. (2002). Taxonomic outline of the procaryotes. Release 3.0, July 2002. http://dx.doi.org/10.1007/bergeysoutline
Hebert, P. D. N., Cywinska, A., Ball, S. L. & deWaard, J. R. (2003). Biological identifications through DNA barcodes. Proc R Soc Lond B Biol Sci 270, 313–321.
Jukes, T. H. & Cantor, C. R. (1969). Evolution of protein molecules. In Mammalian Protein Metabolism, pp. 21–132. Edited by H. N. Munro. New York: Academic Press.
Maidak, B. L., Cole, J. R., Lilburn, T. G. & 7 other authors (2001). The RDP-II (Ribosomal Database Project). Nucleic Acids Res 29, 173–174.
Sneath, P. H. A. & Sokal, R. R. (1973). Numerical Taxonomy. The Principles and Practice of Numerical Classification. San Francisco: W. H. Freeman.
Swofford, D. (2000). PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sunderland, MA: Sinauer.
Tekaia, F., Lazcano, A. & Dujon, B. (1999). The genomic tree as revealed from whole proteome comparisons. Genome Res 9, 550–557.
Tukey, J. W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Willems, A. & Gillis, M. (2004). Family Comamonadaceae Willems, De Ley, Gillis and Kersters 1991, 447VP. In Bergey's Manual of Systematic Bacteriology, 2nd edn, vol. 2, The Proteobacteria, part C, The Betaproteobacteria, the Deltaproteobacteria and the Epsilonproteobacteria. Edited by D. J. Brenner, N. R. Krieg, J. T. Staley & G. M. Garrity. New York: Springer (in press).
Wuyts, J., Van de Peer, Y. & De Wachter, R. (2001). Distribution of substitution rates and location of insertion sites in the tertiary structure of ribosomal RNA. Nucleic Acids Res 29, 5017–5028.