The Sorcerer II Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific

Douglas B Rusch et al. PLoS Biol. 2007 Mar.

Abstract

The world's oceans contain a complex mixture of micro-organisms that are, for the most part, uncharacterized both genetically and biochemically. We report here a metagenomic study of the marine planktonic microbiota in which surface (mostly marine) water samples were analyzed as part of the Sorcerer II Global Ocean Sampling expedition. These samples, collected across a transect of several thousand kilometers from the North Atlantic through the Panama Canal and ending in the South Pacific, yielded an extensive dataset consisting of 7.7 million sequencing reads (6.3 billion bp). Though a few major microbial clades dominate the planktonic marine niche, the dataset contains great diversity, with 85% of the assembled sequence and 57% of the unassembled data being unique at a 98% sequence identity cutoff. Using the metadata associated with each sample and sequencing library, we developed new comparative genomic and assembly methods. One comparative genomic method, termed "fragment recruitment," addressed questions of genome structure, evolution, and taxonomic or phylogenetic diversity, as well as the biochemical diversity of genes and gene families. A second method, termed "extreme assembly," made possible the assembly and reconstruction of large segments of abundant but clearly nonclonal organisms. Within all abundant populations analyzed, we found extensive intra-ribotype diversity in several forms: (1) extensive sequence variation within orthologous regions throughout a given genome; despite coverage of individual ribotypes approaching 500-fold, most individual sequencing reads are unique; (2) numerous changes in gene content, some with direct adaptive implications; and (3) hypervariable genomic islands that are too variable to assemble. The intra-ribotype diversity is organized into genetically isolated populations that have overlapping but independent distributions, implying distinct environmental preference. 
We present novel methods for measuring the genomic similarity between metagenomic samples and show how they may be grouped into several community types. Specific functional adaptations can be identified both within individual ribotypes and across the entire community, including proteorhodopsin spectral tuning and the presence or absence of the phosphate-binding gene PstS.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. Sampling Sites
Microbial populations were sampled from locations in the order shown. Samples were collected at intervals of approximately 200 miles (320 km) along the eastern North American coast through the Gulf of Mexico into the equatorial Pacific. Samples 00 and 01 identify sets of sites sampled as part of the Sargasso Sea pilot study [19]. Samples 27 through 36 were sampled off the Galapagos Islands (see inset). Sites shown in gray were not analyzed as part of this study.
Figure 2. Fragment Recruitment Plots
The horizontal axis of each panel corresponds to a 100-kb segment of genomic sequence from the indicated reference microbial genome. The vertical axis indicates the sequence identity of an alignment between a GOS sequence and the reference genomic sequence. The identity ranges from 100% (top) to 50% (bottom). Individual GOS sequencing reads were colored to reflect the sample from which they were isolated. Geographically nearby samples have similar colors (see Poster S1 for key). Each organism shows a distinct pattern of recruitment reflecting its origin and relationship to the environmental data collected during the course of this study. (A) P. ubique HTCC1062 recruits the greatest density of GOS sequences of any genome examined to date. The GOS sequences show geographic stratification into bands, with sequences from temperate water samples off the North American coast having the highest identity (yellow to yellow-green colors). At lower identity, sequences from all the marine environments could be aligned to HTCC1062. (B) P. marinus MIT9312 recruits a large number of GOS sequences into a single band that zigzags between 85%–95% identity on average. These sequences are largely derived from warm water samples in the Gulf of Mexico and eastern Pacific (green to greenish-blue reads). (C) P. marinus MED4 recruits largely the same set of reads as MIT9312 (B) though the sequences that form the zigzag recruit at a substantially lower identity. A small number of sequences from the Sargasso Sea samples (red) are found at high identity. (D) P. marinus NATL2A recruits far fewer sequences than any of the preceding panels. Like MED4, a small number of high-identity sequences were recruited from the Sargasso samples. (E) P. marinus MIT9313 is a deep-water low-light–adapted strain of Prochlorococcus. GOS sequences were recruited almost exclusively at low identity in vertical stacks that correspond to the locations of conserved genes. 
On the left side of this panel is a very distinctive pattern of recruitment that corresponds to the highly conserved 16S and 23S rRNA gene operon. (F) P. marinus CCMP1375, another deep-water low-light–adapted strain, does not recruit GOS sequences at high identity. Only stacks of sequences are seen corresponding to the location of conserved genes. (G) Synechococcus WH8102 recruits a modest number of high-identity sequences primarily from the Sargasso Sea samples. A large number of moderate identity matches from the Pacific and hypersaline lagoon (GS33) samples are also visible. (H) Synechococcus CC9605 recruits largely the same sequences as does Synechococcus WH8102, but was isolated from Pacific waters. GOS sequences from some of the Pacific samples recruit at high identity, while sequences from the Sargasso and hypersaline lagoon (bluish-purple) were recruited at moderate identities. (I) Synechococcus CC9902 is only distantly related to either of the preceding Synechococcus strains. While this strain also recruits largely the same sequences as the WH8102 and CC9605 strains, they recruit at significantly lower identity. (J–O) Fragment recruitment plots to extreme assemblies seeded with phylogenetically informative sequences. Using this approach it is possible not only to assemble contigs with strong similarities to known genomes but also to identify contigs from previously uncultured genomes. In each case a 100-kb segment from an extreme assembly is shown. Each plot shows a distinct pattern of recruitment that distinguishes the panels from each other. (J) Seeded from a Prochlorococcus marinus-related sequence, this contig recruits a broad swath of GOS sequences that correspond to the GOS sequences that form the zigzag on P. marinus MIT9312 recruitment plots (see [B] or Poster S1 for comparison). (K–L) Seeded from SAR11 clones, these contigs show significant synteny to the known P. ubique HTCC1062 genome. 
(K) is strikingly similar to previous recruitment plots to the HTCC1062 genome (see [A] or Poster S1). In contrast, (L) identifies a different strain that recruits high-identity GOS sequences primarily from the Sargasso Sea samples (red). (M–O) These three panels show recruitment plots to contigs belonging to the uncultured Actinobacter, Roseobacter, and SAR86 lineages.
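As a minimal sketch of how a single point on such a plot might be derived, the Python below computes percent identity from a gapped pairwise alignment and pairs it with the read's position on the reference. The identity definition (matching columns divided by alignment length), the function names, the 50% floor as a parameter, and the example sequences are assumptions made for illustration; they are not taken from the authors' pipeline.

```python
def percent_identity(aligned_read: str, aligned_ref: str) -> float:
    """Percent of alignment columns where read and reference carry the same base.

    Both strings are rows of one pairwise alignment, with '-' marking gaps.
    Assumption: identity = matching columns / total alignment columns.
    """
    if not aligned_read or len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must be non-empty and equal in length")
    matches = sum(
        1 for r, g in zip(aligned_read, aligned_ref) if r == g and r != "-"
    )
    return 100.0 * matches / len(aligned_read)


def recruitment_point(ref_start, aligned_read, aligned_ref, sample,
                      min_identity=50.0):
    """Return (x, y, sample) for one recruited read, or None if below the floor.

    x is the midpoint of the aligned span on the reference; y is percent
    identity. `sample` is carried along so the point can be colored by origin.
    """
    identity = percent_identity(aligned_read, aligned_ref)
    if identity < min_identity:
        return None
    # Gap columns in the reference row do not consume reference bases.
    ref_span = sum(1 for g in aligned_ref if g != "-")
    return (ref_start + ref_span / 2, identity, sample)
```

Plotting every such point for one reference genome, colored by sample, would give a scatter with the same axes as the panels described here.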
Figure 3. Population Structure and Variation as Revealed by Phylogeny
Phylogenies were produced using neighbor-joining. There is significant within-clade variation as well as an absence of strong geographic structure to variants of SAR11 (P. ubique HTCC1062) and P. marinus MIT9312. Similar reads are not necessarily from similar locations, and reads from similar locations are not necessarily similar. (A) Geographic distribution of SAR11 proteorhodopsin variants. Keys to coloration: blue, Pacific; pink, Atlantic. (B) Geographic distribution of Prochlorococcus variants. Keys to coloration: blue, Pacific; pink, Atlantic. (C) Origins of spectral tuning of SAR11 proteorhodopsins. Reads are colored according to whether they contain the L (green) or Q (blue) variant at the spectral tuning residue described in the text. The selection of tuning residue is lineage restricted, but each variant must have arisen on two separate occasions.
Figure 4. Categories of Recruitment Metadata
The recruitment metadata distinguishes eight different general categories based on the relative placement of paired-end sequencing reads (mated reads) when recruited to a reference sequence in comparison to their known orientation and separation on the clone from which they were derived. Assuming orientation is correct, two mated reads can be recruited closer together, farther apart, or within expected distances given the size of the clone from which the sequences were derived. These sequences are categorized as “short,” “long,” or “good,” respectively. Alternatively, the mated reads may be recruited in a mis-oriented fashion, which trumps issues of separation. These reads can be categorized as “normal,” “anti-normal,” or “outie.” In addition, there are two other categories. “No mate” indicates that no mated read was available for recruitment, possibly due to sequencing error. Perhaps most useful of any of the recruitment categories, “missing” mates indicate that while a mated sequence was available, it was not recruited to the reference. “Missing” mates identify breaks in synteny between the environmental data and the reference sequence.
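The eight categories can be sketched as a small decision procedure. The Python below is illustrative only: the `Hit` type, the `classify_mates` function, and the separation bounds are hypothetical. The caption also does not define the three mis-oriented labels, so the code assumes the common assembler convention in which "normal" means both reads on the forward strand, "anti-normal" means both on the reverse strand, and "outie" means the reads point away from each other.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Placement of one recruited read on the reference (hypothetical type)."""
    start: int   # leftmost reference coordinate
    end: int     # rightmost reference coordinate
    strand: str  # '+' or '-'


def classify_mates(read: Hit, mate: Optional[Hit], has_mate: bool,
                   min_sep: int, max_sep: int) -> str:
    """Assign one of the eight recruitment categories to a recruited read."""
    if not has_mate:
        return "no mate"      # no mated read was available at all
    if mate is None:
        return "missing"      # mate exists but was not recruited
    left, right = sorted((read, mate), key=lambda h: h.start)
    # Orientation is checked first because it overrides separation.
    if left.strand == right.strand:
        return "normal" if left.strand == "+" else "anti-normal"
    if left.strand == "-":
        return "outie"        # reads point away from each other
    # Correct orientation: compare the outer span with the expected clone size.
    span = right.end - left.start
    if span < min_sep:
        return "short"
    if span > max_sep:
        return "long"
    return "good"
```

For example, `classify_mates(Hit(100, 900, "+"), None, True, 2000, 4000)` returns `"missing"`, the category the caption singles out as marking breaks in synteny.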
Figure 5. Fragment Recruitment at Sites of Rearrangements
Environmental sequences recruited near breaks in synteny have characteristic patterns of recruitment metadata. Indeed, each of five basic rearrangements (i.e., insertion, deletion, translocation, inversion, and inverted translocation) produced a distinct pattern when examining the recruitment metadata. Here, example recruitment plots for each type of rearrangement have been artificially generated. The “good” and “no mate” categories have been suppressed. In each case, breaks in synteny are marked by the presence of stacks of “missing” mate reads. The presence or absence of other categories distinguishes each type of rearrangement from the others.
Figure 6. Examples of Chimeric Extreme Assemblies
(A) Fragment recruitment to an extreme assembly contig indicates the assembly is chimeric between two organisms, based on dramatic shifts in density of recruitment, level of conservation, and sample distribution. (B) Fragment recruitment to a SAR11-related extreme assembly. Changes in color, density, and vertical location toward the top of the figure indicate transitions among multiple subtypes of SAR11.
Figure 7. Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be Separated by Extreme Assembly
Each segment is constructed of a unique set of GOS sequencing reads (i.e., no read was used in more than one segment). Segments are arbitrarily labeled (A–X) for reference in Figure 8.
Figure 8. Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Figure 7 Correspond to Monophyletic Subtypes
Coloring of branches indicates that the corresponding reads align at >90% identity to the extreme assembly segments shown in Figure 7; colored labels (A–X) correspond to the labels in Figure 7, indicating the segment or segments to which reads aligned.
Figure 9. Presence and Abundance of Dominant Ribotypes
The relative abundance of various ribotypes (rows) in each filter (columns) is represented by the area of the corresponding spot (if any). The listed ribotypes each satisfied the following criteria in at least one filter: the ribotype was among the five most abundant ribotypes detected in the shotgun data, and was represented by at least three sequencing reads. Relative abundance is based on the total number of 16S sequences in a given filter. Order and grouping of filters are based on the clustering of genomic similarity shown in Figure 11. Ribotype order was determined based on similarity of sample distribution. A marked contrast between temperate and tropical groups is visible. Estuarine samples GS11 and GS12 contained a mix of ribotypes seen in freshwater and temperate marine samples, while samples from nonmarine habitats or larger filter sizes were pronounced outliers. The presence of large amounts of Burkholderia and Shewanella in one Sargasso Sea sample (GS00a) makes this sample look much less like other Sargasso and tropical marine samples than it otherwise would. Note that 16S is not a measure of cell abundance, since 16S genes can be multicopy.
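The selection rule and the abundance measure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names and the input layout (a mapping from filter name to per-ribotype 16S read counts) are hypothetical, and ties at the fifth rank are broken arbitrarily by `Counter.most_common`, which the caption does not address.

```python
from collections import Counter


def dominant_ribotypes(counts_by_filter, top_n=5, min_reads=3):
    """Ribotypes that, in at least one filter, rank in the top `top_n`
    by 16S read count and have at least `min_reads` reads."""
    selected = set()
    for counts in counts_by_filter.values():
        for ribotype, n in Counter(counts).most_common(top_n):
            if n >= min_reads:
                selected.add(ribotype)
    return selected


def relative_abundance(counts_by_filter, ribotypes):
    """Fraction of each filter's total 16S reads assigned to each ribotype."""
    table = {}
    for name, counts in counts_by_filter.items():
        total = sum(counts.values())
        table[name] = {
            r: (counts.get(r, 0) / total if total else 0.0)
            for r in sorted(ribotypes)
        }
    return table
```

In a plot like this one, each fraction would set the area of the spot for that ribotype and filter.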
Figure 10. Similarity between Samples in Terms of Shared Genomic Content
Genomic similarity, as described in the text, is an estimate of the amount of the genetic material in two filters that is “the same” at a given percent identity cutoff—not the amount of sequence in common in a finite dataset, but rather in the total set of organisms present on each filter. Similarities are shown for 98% identity. (A) Hierarchical clustering of samples based on pairwise similarities. (B) Pairwise similarities between samples, represented as a symmetric matrix of grayscale intensities; a darker cell in the matrix indicates greater similarity between the samples corresponding to the row and column, with row and column ordering as in (A). Groupings of similar filters appear as subtrees in (A) and as squares consisting of two or more adjacent rows and columns with darker shading. Colored bars highlight groups of samples described in the text; labels are approximate characterizations rather than being strictly true of every sample in a group.
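To illustrate how a tree like the one in (A) can be built from pairwise similarities, here is a minimal sketch of agglomerative clustering in Python. The caption does not state which linkage rule was used, so average linkage is an assumption, as are the function name and the input layout (a dict keyed by unordered sample pairs). The genomic similarity estimate itself is described in the text and is not reproduced here.

```python
def average_linkage(names, sim):
    """Cluster samples by repeatedly merging the two most similar clusters.

    names: list of sample names.
    sim:   dict mapping frozenset({a, b}) to a similarity for every pair.
    Returns the merge tree as nested 2-tuples of sample names.
    """
    def link(a, b):
        # Average similarity over all cross-cluster sample pairs.
        total = sum(sim[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    # Each cluster is (tuple of member names, subtree built so far).
    clusters = [((n,), n) for n in names]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: link(clusters[p[0]][0],
                                             clusters[p[1]][0]))
        merged = (clusters[i][0] + clusters[j][0],
                  (clusters[i][1], clusters[j][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]
```

Ordering the rows and columns of the similarity matrix by the leaves of the returned tree would place similar samples next to each other, which is what makes groups appear as dark squares in a matrix view such as (B).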
Figure 11. Sample Similarity at 90% Identity
Similarity between samples in terms of shared genomic content, as in Figure 10, except that the plots use a 90% identity cutoff, which has proven reasonable for separating some moderately diverged subtypes.
Figure 12. Distribution of Common Proteorhodopsin Variants across GOS Samples
The leucine (L) and methionine (M) variants absorb maximally in the green spectrum (Oded Beja, personal communication) while the glutamine (Q) variant absorbs maximally in the blue spectrum. The relative abundance of each variant is shown as a percentage (x-axis) per sample (y-axis). Total abundance for all variants, in read equivalents normalized by the abundance of recA protein, is shown on the right side of the y-axis. The L and Q variants show a nonrandom distribution. The L variant is abundant in temperate Atlantic waters close to the U.S. and Canadian coast. The Q variant is abundant in warmer waters farther from land. The M variant is moderately abundant in a wide range of samples with no obvious geographic/environmental association.
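The two quantities plotted per sample can be sketched in a few lines of Python. The function names and the input layout (read-equivalent counts per variant, plus a recA abundance for the same sample) are assumptions for illustration, not the authors' code.

```python
def variant_percentages(counts):
    """Share of each proteorhodopsin variant within one sample, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {variant: 0.0 for variant in counts}
    return {variant: 100.0 * n / total for variant, n in counts.items()}


def normalized_total(counts, reca_abundance):
    """Total variant abundance divided by recA abundance for the same sample."""
    if reca_abundance <= 0:
        raise ValueError("recA abundance must be positive")
    return sum(counts.values()) / reca_abundance
```

Normalizing by recA in this way is meant to make totals comparable between samples sequenced to different depths.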

References

    1. Whitman WB, Coleman DC, Wiebe WJ. Prokaryotes: The unseen majority. Proc Natl Acad Sci U S A. 1998;95:6578–6583.
    2. Beja O, Koonin EV, Aravind L, Taylor LT, Seitz H, et al. Comparative genomic analysis of archaeal genotypic variants in a single population and in two different oceanic provinces. Appl Environ Microbiol. 2002;68:335–345.
    3. DeLong EF, Pace NR. Environmental diversity of bacteria and archaea. Systematic Biol. 2001;50:1–9.
    4. Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, et al. Microbial diversity in the deep sea and the underexplored “rare biosphere.” Proc Natl Acad Sci U S A. 2006;103:12115–12120.
    5. Garrity GM. Bergey's manual of systematic bacteriology. New York: Springer-Verlag; 2001.