The goal of this project is to study genes of known microbial origin that were found on viral sequences in the GOS dataset. For this purpose we developed a strategy that generalizes the methods of Sharon et al. (Nature, 2009). A brief description of the algorithm is given below; the full details and the results are described in Sharon et al. (ISME J., 2011).
The microbial gene clusters page contains a list of manually verified microbial gene clusters that were found on viral sequences. The vast majority of these genes have never before been observed on viral genomes. Each cluster is accompanied by a cluster page containing a complete description of the cluster and its supporting data. Refer to the cluster help page for more information.
What are the gene clusters?
A gene cluster is a collection of genes that represent the same function, possibly from several organisms. Our data consist of a set of GOS scaffolds that were identified as viral based on the criteria listed below. Genes were identified on each scaffold based on sequence similarity to proteins from either RefSeq viral or RefSeq microbial (see description of the RefSeq database here). For each identified gene, we use its best hit from RefSeq viral/microbial (termed the RefSeq hit) for annotation, for assigning viral or microbial origin (tagging), and for clustering with other genes.
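The best-hit selection described above can be sketched as follows. This is only an illustration: the tuple layout, the function name `best_refseq_hits`, and the use of a single similarity score where higher means better are assumptions, since the page does not state how the best hit is ranked.

```python
def best_refseq_hits(hits):
    """Return {gene: (hit_id, database, score)}, keeping the top-scoring hit per gene.

    `hits` is an iterable of (gene, hit_id, database, score) tuples.
    Assumption: a higher score means a better hit.
    """
    best = {}
    for gene, hit_id, database, score in hits:
        current = best.get(gene)
        if current is None or score > current[2]:
            best[gene] = (hit_id, database, score)
    return best


# Hypothetical hits for two genes on one scaffold.
example_hits = [
    ("gene1", "hitA", "viral", 120.0),
    ("gene1", "hitB", "microbial", 310.5),
    ("gene2", "hitC", "viral", 88.0),
]
best = best_refseq_hits(example_hits)
```

In this sketch the database of the winning hit (`"viral"` or `"microbial"`) is carried along with it, because the same hit is later used for annotation, tagging and clustering.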
Gene clustering was done in two levels:
The two-level clustering procedure is mostly automatic and uses conservative clustering criteria: core clusters were formed only from strongly connected sets of RefSeq hits in which each member is connected to at least 70% of all other cluster members, and two core clusters were grouped only if their annotations matched perfectly. While this procedure was meant to generate high-quality clusters, it also produced redundant clusters, that is, several clusters representing the same gene. These cases were resolved manually.
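The 70% connectivity criterion for core clusters can be checked with a few lines of code. The sketch below assumes that connections between RefSeq hits are undirected and stored as a set of pairs; the names `connected_fraction` and `is_core_cluster` are hypothetical, and the code only tests a candidate set rather than reproducing the actual clustering procedure.

```python
from itertools import combinations


def connected_fraction(member, members, links):
    """Fraction of the other cluster members that `member` is linked to."""
    others = [m for m in members if m != member]
    if not others:
        return 1.0  # a singleton trivially satisfies the criterion
    linked = sum(1 for other in others if frozenset((member, other)) in links)
    return linked / len(others)


def is_core_cluster(members, links, min_fraction=0.7):
    """True if every member is linked to at least `min_fraction` of the others."""
    return all(connected_fraction(m, members, links) >= min_fraction for m in members)


# Hypothetical candidate set of five RefSeq hits.
members = ["a", "b", "c", "d", "e"]
all_pairs = {frozenset(pair) for pair in combinations(members, 2)}

# Two missing links: each affected hit still reaches 3 of 4 others (75%).
dense_links = all_pairs - {frozenset(("a", "b")), frozenset(("c", "d"))}
# One more missing link drops hit "a" to 2 of 4 others (50%).
sparse_links = dense_links - {frozenset(("a", "c"))}

dense_ok = is_core_cluster(members, dense_links)
sparse_ok = is_core_cluster(members, sparse_links)
```

Here `dense_ok` is `True` and `sparse_ok` is `False`, which shows how a single weakly connected member is enough to reject a candidate core cluster.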
Each core cluster was assigned one of three tags:
Core clusters may be clustered together only if they have the same tag. Core clusters with different tags and the same annotation were grouped into separate clusters.
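The grouping rule above amounts to keying core clusters on the pair (annotation, tag). The following sketch illustrates this under stated assumptions: the dictionary fields, the function name `group_core_clusters`, the example annotations, and the tag value "Viral" are placeholders for illustration, and a perfect annotation match is taken to mean exact string equality.

```python
from collections import defaultdict


def group_core_clusters(core_clusters):
    """Group core cluster ids by (annotation, tag).

    Two core clusters end up in the same group only if both their
    annotation strings and their tags are identical.
    """
    groups = defaultdict(list)
    for cluster in core_clusters:
        key = (cluster["annotation"], cluster["tag"])
        groups[key].append(cluster["id"])
    return dict(groups)


# Hypothetical core clusters; annotations and tags are placeholders.
core_clusters = [
    {"id": "core1", "annotation": "annotation X", "tag": "Microbial"},
    {"id": "core2", "annotation": "annotation X", "tag": "Microbial"},
    {"id": "core3", "annotation": "annotation X", "tag": "Viral"},
    {"id": "core4", "annotation": "annotation Y", "tag": "Microbial"},
]
groups = group_core_clusters(core_clusters)
```

In this example `core1` and `core2` are grouped, while `core3` stays separate despite sharing their annotation, because its tag differs.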
The interesting cases: microbial gene clusters on viral scaffolds
A scaffold is considered to be of likely viral origin if it meets one of the following two gene content criteria:
We also performed two other types of analysis on all scaffolds that passed one of the above criteria and also contained a cluster tagged as Microbial:
Last updates: 25/Dec/2010