Geosciences LibreTexts

17.19: Metagenomics


    Essential to Know

    • Within oceans, there are diverse microbial biomes, which are ecological niches where microbes interact with one another and their surrounding environment. 
    • Metagenomics can tell us what organisms are there and what they are capable of doing, but can’t tell us if they are actually doing it. Other related techniques are necessary to identify which species and genes are active and how this information impacts our understanding of marine ecosystems and ocean chemistry. 
    • Bioinformatics is used to assemble and analyze metagenomic data using powerful computers.  
    • Metagenomic studies have revealed a vast array of microbes in marine environments, many of which have never been identified, let alone cultured in the lab.
    • Viruses are present in significantly higher numbers than any other organism in the ocean, and recent metagenomic and related studies have shed light on their impact. 

    Understanding the Concept

    Within the ocean, a wide array of microorganisms actively engage in biogeochemical cycling while also interacting with one another. These complex interactions take place within smaller ecological niches known as microbial biomes. These biomes comprise a diverse group of organisms, including bacteria, algae, viruses, and archaea (Fig. CC19-1). It is important not only to understand the role these biomes play in marine ecosystems but also to identify the players involved.

    Environmental influences on microbes and microbial influences on the environment
    Figure CC19-1. Microbial biomes consist of thousands of microorganisms, including viruses, bacteria, and archaea. The community structure is shaped by the chemical and physical elements (sunlight, oxygen saturation, metal availability, temperature, salinity) within the environment. These diverse communities play a critical role in biogeochemical cycling and nutrient availability and can respond rapidly to environmental changes. 

    Investigating how microbial niches exist within marine environments has historically been challenging, primarily because many of the microbes found there cannot be cultured (grown in a lab). It was not until the development of DNA sequencing that we truly began to understand the diversity of microbes that exist in ocean biomes. DNA, which is the genetic code of an organism and contains all the information needed to perform the life processes of that species, can be sequenced so that every nucleotide that makes up that code is identified.

    Thousands of novel organisms have been identified since DNA sequencing technologies, and metagenomics in particular, were developed. Despite extensive efforts to better characterize marine microbes using DNA sequencing, a large portion of the microbes in ocean ecosystems have yet to be identified.

    Metagenomics is the analysis of all the genomic DNA in an environmental sample that contains multiple species (Fig. CC19-2). Metagenomic data provides information on what organisms are present in the sample and what metabolic capabilities they have. This allows scientists to infer what the organisms may be doing in the microbial biome, including their possible involvement in biogeochemical cycling. Time-series or comparative studies can be used to determine how communities change over time and space, as well as how they react to environmental changes. A key limitation of metagenomics is that while it can tell us what genes are present, it can’t tell us if or how often those genes are in use in the cell’s life processes. Essentially, metagenomics provides a map of what each organism could do, much like a map of a town, but we can’t know for sure that every business on the map is open.

    [Figure CC19-2 panels: (a) microbes in a flask; (b) DNA icons in a test tube; (c) DNA icons cut into color-coded pieces; (d) computer and screen of identified pieces; (e) lengths of sequence reads being overlapped and re-assembled; (f) examples of taxonomic, functional, and comparative data analysis]
    Figure CC19-2. The metagenomics process. (a) Marine samples are collected either via water collection or sediment coring. (b) All of the DNA in the sample is extracted. (c) The DNA is sheared, or broken into smaller pieces, for sequencing. (d) Samples are sequenced: millions of copies of the fragmented DNA are made, and the nucleotide bases of those copies are identified. Sequences are returned as reads that report the nucleotide sequence and quality (Phred score) of each DNA strand. (e) Reads are aligned based on overlapping nucleotides, and similar sequences are binned (grouped) together to form metagenome-assembled genomes (MAGs). (f) The MAGs are then analyzed to identify the organisms present (taxonomic data), their metabolic pathways and dominant functions (functional data), and the overall interactions of the community members with one another and the environment (comparative data).

    Samples can be taken from any type of environment (marine environments, soil, feces, gut) without any prior knowledge or understanding of the microorganisms present. Within marine environments, samples are collected either by seafloor coring or by water collection. A sediment push core can be used to sample the ocean floor and is typically separated into sections of about 2 cm before total DNA extraction. Water sampling involves collecting a large volume of water, at least 2 to 3 L, which can be taken at varying depths. Once collected, the water is processed in the lab using filtration methods to concentrate the microbes in the sample. The filter with the concentrated microbes is then processed for total DNA extraction. Once DNA has been extracted and confirmed to be of sufficient concentration and quality, the samples are ready for sequencing.

    Metagenomics involves sequencing all of the DNA present in a sample, that is, reading the nucleotide bases (A, G, C, T) that make up the DNA in the correct order. To do this, the DNA is broken, typically through shearing, into pieces of approximately the same length. The target fragment size depends on the sequencing technology; for short-read sequencing it is typically around 400 nucleotides. There are two primary ways of shearing DNA: mechanical and enzymatic. Mechanical shearing is often done by sonication, which transmits high-frequency, short-wavelength acoustic energy that breaks the DNA into smaller pieces at random positions. Enzymatic shearing uses specific and non-specific enzymes to nick and cut the DNA. Shearing results in millions of fragments that need to be read, then reassembled and analyzed (Fig. CC19-2). The sequences are typically recorded in text-based files (commonly in FASTQ format) that store DNA sequences with associated quality scores.
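
    Random shearing can be pictured with a toy simulation. The sketch below is only an illustration under stated assumptions: the function name `shear`, the `mean_len` and `jitter` parameters, and the randomly generated "genome" are all hypothetical, and real shearing acts on many copies of each genome at once, so that fragments from different copies overlap one another.

```python
import random

def shear(dna, mean_len=400, jitter=50, seed=0):
    """Cut dna into consecutive fragments of roughly mean_len bases.

    Fragment sizes are drawn at random from mean_len +/- jitter, a
    hypothetical stand-in for the random break points made by sonication.
    """
    rng = random.Random(seed)
    fragments, pos = [], 0
    while pos < len(dna):
        size = rng.randint(mean_len - jitter, mean_len + jitter)
        fragments.append(dna[pos:pos + size])
        pos += size
    return fragments

if __name__ == "__main__":
    # Made-up 5,000-base sequence used purely for demonstration.
    base_rng = random.Random(1)
    genome = "".join(base_rng.choice("ACGT") for _ in range(5000))
    # Shearing two copies with different seeds gives different cut points.
    for seed in (1, 2):
        pieces = shear(genome, seed=seed)
        print(seed, len(pieces), [len(p) for p in pieces[:3]])
```

    Because each copy is cut at different positions, fragments from different copies share overlapping stretches, which is what later makes reassembly possible.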

    Sequencing approaches generally fall into two categories: short-read and long-read. The difference lies in the length of DNA that is read as a single sequence. Short reads are typically about 400 nucleotides in length, while long reads range from 1500 to >10,000 nucleotides. Longer reads help span long stretches of repeats in the DNA and cover regions that short reads miss. Short reads tend to be more accurate (i.e., of higher quality), making them more reliable than some long-read technologies. Other long-read technologies are extremely accurate (99.99%), but their cost can be prohibitive. Often, to maximize coverage, a mix of short- and long-read sequencing is used, balancing high throughput and affordability. Different sequencing technologies vary in how the sequencing is performed, but the overall output is similar.

    Before sequences can be reassembled, it is important to check their quality and discard any low-quality sequences. This is a crucial step because low-quality reads have a high error rate and can result in the incorrect identification of a gene or even an organism. Quality is measured using Phred quality scores, which reflect the confidence that a nucleotide base read is correct. Phred scores (Q-scores) typically range from 10 to 50 on a logarithmic scale, where a lower Phred score indicates a higher likelihood of error. For example, a Phred score of 20 (Q20) indicates that there is a 1 in 100 (1%) chance that a nucleotide is incorrect, and a Phred score of 40 (Q40) indicates that there is a 1 in 10,000 (0.01%) chance that a nucleotide is incorrect. The generally accepted benchmark for quality is a Phred score of 30 (Q30). Phred scores are reported in the final sequencing files.
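
    The link between a Phred score and the chance of error can be made concrete with a short sketch. It assumes the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong, and the common FASTQ convention in which each quality character encodes Q as an ASCII code offset by 33. The read names, the sequences, and the choice to filter on a read's mean score are illustrative assumptions rather than a description of any particular pipeline.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in qual]

def parse_fastq(text):
    """Yield (read_id, sequence, quality_string); FASTQ uses 4 lines per read."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, qual  # drop the leading "@" from the header

def filter_reads(text, min_mean_q=30):
    """Return IDs of reads whose mean Phred score is at least min_mean_q."""
    kept = []
    for read_id, _seq, qual in parse_fastq(text):
        scores = decode_quality(qual)
        if sum(scores) / len(scores) >= min_mean_q:
            kept.append(read_id)
    return kept

# Two made-up reads: "I" encodes Q40 and "#" encodes Q2 under Phred+33.
EXAMPLE_FASTQ = """\
@read_good
ACGTACGT
+
IIIIIIII
@read_poor
ACGTACGT
+
########
"""

if __name__ == "__main__":
    for q in (20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(filter_reads(EXAMPLE_FASTQ))
```

    Filtering on the mean score is the simplest possible rule; quality-control tools commonly also trim low-quality bases from the ends of reads instead of discarding whole reads.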

    Metagenomic data is assembled and analyzed using bioinformatics. Because the organisms in the original sample are unknown, the sequences must be reassembled without a reference template; this is known as de novo assembly. To do this, the reads are aligned by looking for regions where they overlap, and overlapping reads are merged. This process is repeated until no more overlaps are available. The assembled sequences are referred to as contigs. The contigs then undergo a process known as binning, which groups contigs that likely come from the same or closely related genomes. Each group (bin) of contigs forms a metagenome-assembled genome (MAG). The MAGs are then analyzed to identify the organisms in the sample, novel metabolic pathways, and the specific genes present. This allows assessment of the potential role each microbe may have in the system.
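
    The idea of merging reads wherever they overlap can be sketched as a toy greedy assembler. This is a simplified illustration, not how production assemblers work: they typically rely on graph-based algorithms and must cope with sequencing errors, repeats, and reads from both DNA strands, all of which are ignored here. The function names, the `min_len` threshold, and the short example reads are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b.

    Returns 0 if the best match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair repeatedly until no overlaps remain.

    The sequences left at the end play the role of contigs.
    """
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return seqs
        # Join the two sequences, writing the shared overlap only once.
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)

if __name__ == "__main__":
    # Three made-up reads that chain together, plus one unrelated read.
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA", "CCCCGGGG"]
    print(greedy_assemble(reads))
```

    In this example the three overlapping reads collapse into a single contig, while the unrelated read stays separate, mirroring how reads from different genomes end up in different contigs.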

    Due to the sheer number of reads, processing them requires substantial computational power and is most often done on supercomputers.

    Different types of analysis of the assembled MAGs can be performed in order to answer many different questions. The data analyzed can be organized into three types: 

    • Taxonomic data provides insight into the identities of the organisms present in a sample. This involves matching the sequences to a database of known sequences. From there, the approximate relative abundance of specific species can be determined, and the taxonomic diversity can be examined. Additionally, novel species can be identified.
    • Functional data includes annotating all known genes and identifying the metabolic pathways present for each organism. Additionally, the most abundant genes (functional abundance) and the dominant functions in the system can be determined.
    • Comparative data looks at how the organisms represented by the MAGs interact within the biome, both with one another and with the environment as a whole. The types of analysis performed include correlation plots of interactions with environmental factors, such as depth, oxygen saturation, metal availability, temperature, and salinity, as well as interactions between species (co-occurrence networks).
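
    As a small illustration of the taxonomic analysis described above, relative abundance can be computed from the number of reads assigned to each taxon. The sketch below uses invented taxon names and counts. It also assumes the Shannon index as the diversity measure, which is one common choice and is not something this section specifies.

```python
import math

def relative_abundance(counts):
    """Convert per-taxon read counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any taxon")
    return {taxon: n / total for taxon, n in counts.items()}

def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero abundance."""
    fractions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in fractions if p > 0)

if __name__ == "__main__":
    # Hypothetical read counts per taxon in one water sample.
    sample = {"taxon_A": 50, "taxon_B": 30, "taxon_C": 20}
    print(relative_abundance(sample))
    print(shannon_diversity(sample))
```

    A sample dominated by a single taxon gives a Shannon index near zero, while a sample with many evenly represented taxa gives a higher value, so the index offers a quick way to compare diversity between samples.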

    To organize and share sequencing data from the marine biome, the KMAP Global Ocean Gene Catalog 1.0, the largest open-source database of ocean microbes, was developed. At the time of its introduction in 2024, the catalog comprised 308.6 million gene groups from over 2100 ocean samples.    

    As mentioned above, metagenomics can be used to identify which microorganisms make up the sampled biome and which genes these microorganisms possess, but it cannot tell us whether these genes are active. Many genes are inactive or are active only under certain environmental conditions. DNA is the repository for all the information needed to perform the life processes of that species. Active genes in this DNA library are transcribed into RNA, which acts as the instructions for running the cell’s life processes, including the construction of essential compounds such as amino acids, other organic acids, sugars, lipids, and proteins. One particular type of RNA, called mRNA (messenger RNA), contains the information needed for the cell to construct proteins.

    Determining which genes are active in an environmental sample is difficult, but can be addressed by using one or more of three techniques: metatranscriptomics, proteomics, and metabolomics. Metatranscriptomics is performed in a similar way to metagenomics, except that it extracts and analyzes the mRNA in the environmental sample, so it provides information on which genes are active in protein construction. Proteomics and metabolomics analyze the molecules present in environmental samples: proteomics targets proteins, and metabolomics targets various other compounds, including amino acids, other organic acids, sugars, and lipids. Each of these three techniques involves careful sample preparation and analysis and produces a large amount of data that must be processed. Further, the results need to be carefully interpreted to obtain accurate information about which genes identified by metagenomic analysis are actually in use under the environmental conditions present when each sample was collected.


    17.19: Metagenomics is shared under a CC BY-NC-ND 4.0 license and was authored, remixed, and/or curated by LibreTexts.