1. Sequence data for gene catalogs
Integrated non-redundant gene catalog (IGC, nucleotide sequences, fasta)
Integrated non-redundant gene catalog (IGC, amino acid sequences, fasta)
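Both catalog files are distributed as FASTA, so a small reader is enough to iterate over the genes. The following is a minimal sketch, not the project's own tooling; the function name `read_fasta` and the two example records are hypothetical, and real IGC headers may look different.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input used only for illustration.
example = io.StringIO(">gene_1\nATGGCT\nAAATAA\n>gene_2\nATGTGA\n")
lengths = {name: len(seq) for name, seq in read_fasta(example)}
```

For the real files, the same generator can be fed an opened file handle in place of the `io.StringIO` example; because it yields one record at a time, the whole catalog never has to be held in memory.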
2. Gene profile, genus profile, KO profile
3. Gene annotation summary
Gene ID: Unique ID
Gene Name: Unique name
Gene Length: Length of the nucleotide sequence
Gene Completeness Status: Whether the gene is complete or partial according to the gene predictor
Cohort Origin: The cohort contributing the representative gene
Taxonomic Annotation (Phylum Level): Annotated phylum for a gene
Taxonomic Annotation (Genus Level): Annotated genus for a gene
KEGG Annotation: Annotated KO(s) for a gene
eggNOG Annotation: Annotated eggNOG(s) for a gene
Sample Occurrence Frequency: Occurrence frequency in samples based on the gene profile
Individual Occurrence Frequency: Occurrence frequency in individuals based on the gene profile
KEGG Functional Categories: KEGG functional category(ies) of the annotated KO(s)
eggNOG Functional Categories: eggNOG functional category(ies) of the annotated eggNOG(s)
Cohort Assembled: The metagenomic sequencing cohort(s) contributing the representative gene or a redundant gene belonging to it
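The field list above can be turned into a small record parser. This is a sketch under stated assumptions: the page does not give the delimiter or say whether there is a header row, so the code assumes a tab-separated file with no header and the columns in the order listed. The key names in `FIELDS` and the sample line are hypothetical.

```python
import io
from typing import Dict, Iterator, TextIO

# Assumed column order, taken from the field list; key names are hypothetical.
FIELDS = [
    "gene_id", "gene_name", "gene_length", "completeness", "cohort_origin",
    "phylum", "genus", "kegg", "eggnog", "sample_freq", "individual_freq",
    "kegg_categories", "eggnog_categories", "cohort_assembled",
]


def parse_annotation(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per line of an (assumed) tab-separated summary file."""
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"line {line_no}: expected {len(FIELDS)} columns, got {len(values)}"
            )
        record: Dict[str, object] = dict(zip(FIELDS, values))
        # Convert the numeric columns; the rest stay as strings.
        record["gene_length"] = int(values[2])
        record["sample_freq"] = float(values[9])
        record["individual_freq"] = float(values[10])
        yield record


# Hypothetical line used only for illustration; values are made up.
sample = io.StringIO(
    "1\tgene_A\t300\tComplete\tCHN\tFirmicutes\tFaecalibacterium\t"
    "K00001\tCOG0001\t0.5\t0.6\tMetabolism\tEnergy production\tCHN;EUR\n"
)
records = list(parse_annotation(sample))
```

If the real file turns out to use a different delimiter or to include a header row, only the `split` call and the first iteration need to change.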

MetaHIT 2010 KEGG annotation
File format specification:
Gene ID: Unique ID
Gene Name: Unique name
KEGG Annotation: Annotated KO(s) for a gene

4. Assemblies and predicted ORFs of the 1267 samples
J. Li et al., Supporting data for the paper: "An integrated catalog of reference genes in the human gut microbiome," GigaScience Database (2014), doi:10.5524/100064.
5. Public data used

The public gut microbial metagenomes used in this Integrated Gene Catalog include: (1) 139 HMP samples from the stool body site [1], which were downloaded from http://www.hmpdacc.org/HMASM/; (2) 368 Chinese fecal samples [2], which were downloaded from NCBI (accession codes SRA045646 and SRA050230); (3) 511 European fecal samples from the MetaHIT project, 331 of which were downloaded from EBI under accessions ERA000116 and ERP003612 [3,4], and 180 of which were used in an unpublished study [5] and shared within the MetaHIT consortium. All of these public metagenomic sequencing samples were processed with the MOCAT pipeline to extract high-quality reads [6].

Other gut metagenomic data used to validate the representativeness of IGC include: (1) data from US individuals [7], downloaded from NCBI under accession SRA002775; (2) data from Japanese individuals [8], downloaded from EBI under accession PRJNA28117; (3) data from European individuals [9], downloaded from NCBI under accession ERP002469.

Two previously published gene catalogs for the human gut microbiome used in this project include: (1) a gene catalog established from 124 Europeans by MetaHIT [3], which was downloaded from http://gutmeta.genomics.org.cn/; (2) a gene catalog established by HMP [1], which was downloaded from http://www.hmpdacc.org/HMGC/ in August 2013. Gut metatranscriptomic data from 59 samples were downloaded from the Gene Expression Omnibus under accession GSE46761 [10]. All of these public metatranscriptomic sequencing samples were processed with the MOCAT pipeline to extract high-quality reads [6].

1. The Human Microbiome Project Consortium. A framework for human microbiome research. Nature 486, 215–221 (2012).

2. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature (2012). doi:10.1038/nature11450.

3. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).

4. Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).

5. Nielsen, H. B. A method for identifying metagenomic species and variable genetic elements by exhaustive co-abundance binning (in the press). Nat. Biotechnol. (2014).

6. Kultima, J. R. et al. MOCAT: A Metagenomics Assembly and Gene Prediction Toolkit. PLoS ONE 7, 1–6 (2012).

7. Turnbaugh, P. J. et al. A core gut microbiome in obese and lean twins. Nature 457, 480–484 (2009).

8. Kurokawa, K. et al. Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res. 14, 169–181 (2007).

9. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature (2013). doi:10.1038/nature12198.

10. David, L. A. et al. Diet rapidly and reproducibly alters the human gut microbiome. Nature (2013). doi:10.1038/nature12820.

1. Gene catalog construction pipeline
MOCAT is a package for analyzing metagenomics datasets. Currently MOCAT supports Illumina single- and paired-end reads in raw FastQ format. Using MOCAT you can, for example, generate taxonomic profiles of, and assemble, metagenomes.
The configuration file we used is MOCAT_BGI.cfg.
CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. It is fast and can handle extremely large databases. CD-HIT significantly reduces the computational and manual effort in many sequence analysis tasks, and it aids in understanding the data structure and correcting bias within a dataset.
2. Annotation database
Collection of 3449 prokaryotic genomes (D. R. Mende, S. Sunagawa, G. Zeller, P. Bork, Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013)).
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations) and with functional categories (i.e., derived from the original COG/KOG categories).
3. Aligner against gene catalog
SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment. The new program features very fast and accurate alignment of the huge numbers of short reads generated by the Illumina/Solexa Genome Analyzer. It is one order of magnitude faster than soap v1, requiring only 2 minutes to align one million single-end reads to the human reference genome. Another notable improvement is that SOAPaligner now supports a wide range of read lengths.
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.