Microbial Genes on Viral Genomes





Introduction

The goal of this project is to study genes whose origin is known to be microbial that were found on viral sequences in the GOS dataset. For this purpose we have developed a strategy that generalizes the methods from (Sharon et al., Nature, 2009). A brief description of the algorithm is given below; the full details as well as the results are described in Sharon et al., ISME J., 2011.
The microbial gene clusters page contains a list of manually verified microbial gene clusters that were found on viral sequences. The vast majority of these genes have never been observed before on viral genomes. Each cluster is accompanied by a cluster page containing a complete description of the cluster and its supporting data. Refer to the cluster help page for more information.

What are the gene clusters?

A gene cluster is a collection of genes that represent the same function, possibly from several organisms. Our data consist of a set of GOS scaffolds that were identified as viral based on the criteria listed below. Genes were identified on each scaffold based on sequence similarity to proteins from either refseq-viral or refseq-microbial (see description of the refseq database here). For each identified gene, we use its best hit from refseq-viral/microbial (termed its RefSeq hit) for annotation, for assigning viral or microbial origin (tagging), and for clustering with other genes.
Gene clustering was done in two levels:
  • Sequence similarity-based clustering of RefSeq hits - metagenomic data consists of short sequences and is likely to contain mostly fragmented genes. As a result, it is usually impossible to cluster several copies of the same gene directly, because the fragments often do not overlap. To overcome this problem we performed sequence similarity-based clustering of the genes' RefSeq hits rather than direct clustering of the genes themselves. All genes whose RefSeq hits were clustered together were grouped into the same core cluster.
  • Semantic clustering based on core cluster annotations - annotation of core clusters was done based on majority voting of all their RefSeq hit annotations. Core clusters sharing the same annotation were further clustered into final clusters or simply clusters.
The two-level clustering procedure is automatic for the most part, with conservative clustering criteria: core clusters were determined only for strongly connected sets of RefSeq hits in which each member of the cluster is connected to at least 70% of all other cluster members, and two core clusters were grouped only if their annotations matched perfectly. While this procedure was meant to generate high-quality clusters, it also resulted in redundant clusters, namely a few clusters representing the same gene. These cases were treated manually.
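The core-cluster criterion and the majority-vote annotation described above can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names, the representation of similarity links as a set of unordered pairs, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def is_core_cluster(members, edges, min_percent=70):
    """Check the connectivity criterion for a candidate core cluster.

    members: iterable of RefSeq hit identifiers (hypothetical IDs).
    edges: set of frozenset({a, b}) pairs marking hits judged similar
           (how similarity is decided is outside this sketch).
    Returns True if every member is linked to at least min_percent
    of the other members.
    """
    members = set(members)
    if len(members) < 2:
        return True
    for m in members:
        others = members - {m}
        linked = sum(1 for o in others if frozenset((m, o)) in edges)
        # Integer comparison avoids floating-point rounding at the threshold.
        if linked * 100 < min_percent * len(others):
            return False
    return True


def majority_annotation(annotations):
    """Return the most frequent annotation among a core cluster's RefSeq hits.

    Ties are broken by first occurrence; the source does not say how
    ties were handled, so this is an assumption.
    """
    if not annotations:
        raise ValueError("a core cluster needs at least one annotation")
    return Counter(annotations).most_common(1)[0][0]
```

With four hits in which one pair is not linked, two members are connected to only 2 of their 3 neighbours (about 67%), so the set would not qualify as a core cluster under the 70% rule.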

Cluster tagging

Each core cluster was assigned one of three tags:
  • Microbial (M) - represents clusters with RefSeq hits originating only from refseq-microbial with no homologs in refseq-viral. For example: T603 (NADH dehydrogenase subunit I) is present only in refseq-microbial and therefore was tagged as microbial (M).
  • Viral (V) - represents clusters with RefSeq hits originating only from refseq-viral with no homologs in refseq-microbial. Example: T1769 (T4-like endonuclease).
  • Viral-Microbial (VM) - represents clusters whose RefSeq hits had homologs in both refseq-viral and refseq-microbial. Example: T2642 (photosystem II D2 protein).

Core clusters may be clustered together only if they have the same tag. Core clusters with the same annotation but different tags were grouped into separate clusters.
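The tagging rule and the constraint that only core clusters with the same tag and the same annotation are merged can be illustrated with a short sketch. The function names, the boolean inputs and the core cluster IDs below are hypothetical and chosen only for the example; they are not taken from the actual pipeline.

```python
from collections import defaultdict


def tag_core_cluster(has_viral_homolog, has_microbial_homolog):
    """Assign M, V or VM from the presence of homologs in each database."""
    if has_viral_homolog and has_microbial_homolog:
        return "VM"
    if has_microbial_homolog:
        return "M"
    if has_viral_homolog:
        return "V"
    raise ValueError("a core cluster must have hits in at least one database")


def group_core_clusters(core_clusters):
    """Group core clusters into final clusters.

    core_clusters: iterable of (core_id, annotation, tag) tuples.
    Two core clusters end up together only if both the annotation and
    the tag match exactly.
    """
    groups = defaultdict(list)
    for core_id, annotation, tag in core_clusters:
        groups[(annotation, tag)].append(core_id)
    return dict(groups)


example = [
    ("core1", "photosystem II D2 protein", "VM"),
    ("core2", "photosystem II D2 protein", "VM"),
    ("core3", "photosystem II D2 protein", "M"),
]
final_clusters = group_core_clusters(example)
```

Here core1 and core2 share a final cluster, while core3 stays separate despite having the same annotation, because its tag differs.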

The interesting cases: microbial gene clusters on viral scaffolds

A scaffold is considered to be of likely viral origin if it meets one of the following two gene-content criteria:
  • It contains 3 genes, and the clusters of the outermost two are viral, or
  • It contains 4 genes or more, and the clusters of at least 20% of them are viral

We have also performed two other types of analysis for all scaffolds that passed one of the above criteria and also contained a cluster tagged as Microbial:
  • Line Islands recruitment - sequence recruitment is defined as the fraction of the sequence for which significant hits were found in a reference dataset. All scaffolds that passed one of the above-mentioned gene-content criteria served as recruiters against two datasets, composed of the Northern Line Islands viral-fraction and microbial-fraction datasets, excluding the datasets from Christmas Island, which may be contaminated. To assess the recruitment of each cluster, we summarized the number of scaffolds containing cluster members that recruited at least one read from each dataset, and the average recruitment of these scaffolds. Results of this analysis can be found in the cluster pages; a summary of recruitment for all scaffolds can be found here (X axis: recruitment against the viral fraction, Y axis: recruitment against the microbial fraction. Viral scaffolds containing members of microbial gene clusters are represented by blue dots; control GOS scaffolds containing 16S rDNA genes are represented by red dots).
  • Similarity to RefSeq hits - it is assumed that true cases of microbial genes on viral genomes involve modification of the genes to fit the phage's needs. We computed the average percent identity in the alignments of each cluster's members to their RefSeq hits and compared it to the average percent identity in the alignments of cluster members to their closest hits among the VirMic genes (at the protein level). Once again, this information is provided in the cluster pages.
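As a rough illustration of the gene-content criteria for viral scaffolds and of the recruitment measure defined above, consider the sketch below. It rests on several assumptions that the text does not settle: only genes whose cluster is tagged V are counted as viral, gene tags are listed in their order along the scaffold, and hits are given as 0-based half-open (start, end) intervals. All names are hypothetical.

```python
def is_likely_viral(gene_tags, min_percent=20):
    """Apply the two gene-content criteria to one scaffold.

    gene_tags: cluster tags ("V", "M" or "VM") of the scaffold's genes,
    in positional order. Only "V" counts as viral here (an assumption).
    """
    n = len(gene_tags)
    if n == 3:
        # The two outermost genes must belong to viral clusters.
        return gene_tags[0] == "V" and gene_tags[2] == "V"
    if n >= 4:
        viral = sum(1 for t in gene_tags if t == "V")
        # Integer comparison avoids rounding problems at exactly 20%.
        return viral * 100 >= min_percent * n
    return False


def recruitment_fraction(seq_length, hit_intervals):
    """Fraction of the sequence covered by at least one significant hit.

    hit_intervals: iterable of (start, end) pairs, 0-based and half-open.
    Overlapping hits are counted once.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(hit_intervals):
        start = max(start, current_end)
        end = min(end, seq_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / seq_length
```

For instance, a 100 bp scaffold with hits at (0, 30), (20, 50) and (80, 120) has 70 covered positions, giving a recruitment of 0.7 against that dataset.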

Last updated: 25/Dec/2010