Interpreting the cluster pages




This page explains the different parts of the cluster pages. Please refer to the VirMic introduction page for an overview of the project and its goals.

Overview

The VirMic clusters represent groups of genes that are likely to be of viral origin and to share the same function. The clusters are generated mostly automatically and may suffer from redundancy (several clusters representing the same function) or errors (wrong annotations, different functions within the same cluster, etc.). The VirMic cluster pages are intended to provide information that helps you determine whether a cluster is authentic and find its relations to other clusters. Specifically, you will find information on the following types of analysis:
  1. Cluster annotation: what confidence do we have in the annotation? (Sequence similarity statistics, RefSeq hits, COGs)
  2. Cluster members tagging: how likely is it that the scaffolds carrying genes that belong to the cluster are indeed of viral origin? (Line Islands recruitment, Sequence similarity statistics)
  3. Context analysis: what other genes can be found in the vicinity of the cluster? (Neighboring clusters on the same scaffold)
In addition, cluster pages contain information that may help you perform your own analysis, such as FASTA files of all members of the cluster and their predicted proteins.

Line Islands recruitment

Each scaffold was BLASTed against combined datasets of three viral and three microbial samples from three of the four Northern Line Islands (one viral and one microbial sample from each location). Samples from Christmas Island were excluded because its viral samples contained a large proportion of microbial sequences and are therefore probably not authentic. Recruitment is defined as the percentage of positions on the scaffold that are covered by at least one recruited Line Islands read, and therefore ranges from 0 to 100%.
High recruitment from the viral samples together with low recruitment from the microbial ones probably indicates a scaffold of viral origin. Zero or near-zero recruitment from both sample types probably means that the sequence is not abundant in the Line Islands, so no conclusion can be drawn about its origin.
The figure below summarizes recruitment for all VirMic scaffolds (blue dots) against control GOS scaffolds carrying some fragment of the 16S rDNA genes (red dots). As can be seen, VirMic scaffolds usually recruit more from the viral samples (X axis), whereas the 16S scaffolds recruit far more from the microbial samples than from the viral ones.
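The recruitment definition above can be illustrated with a minimal Python sketch. It assumes that the recruited reads for one scaffold are already available as (start, end) alignment coordinates on that scaffold; the function name and the input format are hypothetical and are not taken from the VirMic pipeline.

```python
def recruitment_percent(scaffold_length, alignments):
    """Percentage of scaffold positions covered by at least one read.

    alignments: iterable of (start, end) pairs in 1-based inclusive
    scaffold coordinates (hypothetical input format). Pairs given in
    reverse orientation (start > end) are normalized first.
    """
    if scaffold_length <= 0:
        raise ValueError("scaffold_length must be positive")
    intervals = sorted((min(s, e), max(s, e)) for s, e in alignments)
    covered = 0
    current_end = 0  # last position already counted as covered
    for start, end in intervals:
        # Clip the alignment to the scaffold boundaries.
        start = max(start, 1)
        end = min(end, scaffold_length)
        if end < start:
            continue
        if start > current_end:
            # Disjoint from everything counted so far.
            covered += end - start + 1
            current_end = end
        elif end > current_end:
            # Overlaps the covered region: count only the new part.
            covered += end - current_end
            current_end = end
    return 100.0 * covered / scaffold_length


# Three reads, two of them overlapping, on a 100 bp scaffold.
print(recruitment_percent(100, [(1, 10), (5, 20), (50, 59)]))
```

Running the function once with the viral reads and once with the microbial reads gives the two values that are compared in the interpretation above.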


Sequence similarity statistics

Sequence similarity analysis provides statistics on the best hits obtained when cluster members are BLASTed against:
  • VirMic: using tblastx, against members of each cluster's own set (e.g. a cluster that belongs to VirMic:Viral was BLASTed against VirMic:Viral).
  • RefSeq hits: the set of all best hits from the RefSeq protein databases for all genes in the project, using blastx.
For VirMic:Microbial clusters, the results may be interpreted as follows:
  • Very low average % identity against the RefSeq hits probably means that the annotation is questionable and should be validated by other types of analysis (e.g. consistency of best hits or COGs).
  • Very high % identity against the RefSeq hits probably means that the cluster is either not of viral origin (that is, it actually originated from microbial genomes) or has not changed much with respect to its microbial homolog(s).
  • A small fraction of cluster members with hits against VirMic genes, together with relatively low % identity in the alignments of those that do have hits, may hint at inaccurate clustering or at sequences coming from diverse species.
  • High % identity against VirMic:Microbial together with medium % identity against the RefSeq hits probably indicates a true VirMic:Microbial cluster whose members have diverged from their microbial origins.
  • Note that for small clusters, the number of alignments against VirMic is expected to be low.
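The two quantities used in the interpretation above (the fraction of cluster members with a hit, and the average % identity of those hits) could be computed roughly as in the sketch below. It assumes the best hit per member has already been extracted from the BLAST output; the function name and the dictionary layout are hypothetical.

```python
def similarity_summary(member_ids, best_hit_identity):
    """Summarize best-hit statistics for one cluster.

    member_ids: list of cluster member identifiers.
    best_hit_identity: dict mapping a member id to the % identity of
        its best hit; members without any hit are simply absent
        (hypothetical input format).
    Returns (fraction_with_hits, mean_identity); mean_identity is None
    when no member has a hit.
    """
    identities = [best_hit_identity[m] for m in member_ids
                  if m in best_hit_identity]
    fraction = len(identities) / len(member_ids) if member_ids else 0.0
    mean_identity = (sum(identities) / len(identities)
                     if identities else None)
    return fraction, mean_identity


members = ["g1", "g2", "g3", "g4"]
hits = {"g1": 80.0, "g3": 60.0}
print(similarity_summary(members, hits))
```

No thresholds for "low", "medium" or "high" identity are implied here; the source gives none, so those judgments are left to the reader.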

    RefSeq hits

    This table lists all RefSeq hits associated with this cluster. "Occurences" is the number of cluster members that each RefSeq hit represents (also given as a percentage of all cluster members), and "Description" is the protein annotation as it appears in refseq-microbial or refseq-viral.
    When the descriptions of the majority of RefSeq hits are consistent, the cluster is probably a true one. Otherwise (many inconsistent RefSeq hits), the cluster may be wrong and should be further validated using other methods.
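As an illustration of how such an occurrence table can be tallied, the sketch below counts how many cluster members share each best-hit description and expresses the count as a percentage of the cluster. The function name and the input mapping are hypothetical, and keying on the description text is a simplification chosen here to make consistency easy to inspect.

```python
from collections import Counter


def refseq_occurrences(member_to_description):
    """Tally best RefSeq hit descriptions over the cluster members.

    member_to_description: dict mapping a cluster member id to the
        description of its best RefSeq hit (hypothetical input format).
    Returns a list of (description, count, percent_of_members),
    most frequent first.
    """
    counts = Counter(member_to_description.values())
    total = len(member_to_description)
    return [(desc, n, 100.0 * n / total)
            for desc, n in counts.most_common()]


hits = {
    "g1": "terminase",
    "g2": "terminase",
    "g3": "terminase",
    "g4": "hypothetical protein",
}
for desc, n, pct in refseq_occurrences(hits):
    print(f"{desc}\t{n}\t{pct:.1f}%")
```

A single description that dominates the table is the consistent case described above; many descriptions with similar shares is the inconsistent one.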

    COGs

    This list contains the best-hit statistics obtained by BLASTing all cluster members against the COG database. As with the RefSeq hits, our confidence in the cluster annotation and makeup increases when its members are associated with one or a few consistent COGs.

    Neighboring clusters on the same scaffolds

    This list contains the other clusters found on the same scaffolds as this cluster. It can be used to further validate the tagging of cluster members and to search for related clusters that occur in the context of this one.
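A neighbor list of this kind can be derived from a scaffold-to-clusters mapping, as in the sketch below. The mapping, the function name and the choice to count scaffolds per neighbor are assumptions made for illustration and do not describe how the VirMic pages are actually built.

```python
from collections import Counter


def neighboring_clusters(cluster_id, scaffold_to_clusters):
    """Count, for every other cluster, the scaffolds it shares with cluster_id.

    scaffold_to_clusters: dict mapping a scaffold id to the cluster ids
        of the genes it carries (hypothetical input format).
    """
    neighbors = Counter()
    for clusters in scaffold_to_clusters.values():
        present = set(clusters)  # count each cluster once per scaffold
        if cluster_id in present:
            neighbors.update(present - {cluster_id})
    return neighbors


scaffolds = {
    "s1": ["A", "B"],
    "s2": ["A", "B", "C"],
    "s3": ["C", "D"],
}
print(neighboring_clusters("A", scaffolds).most_common())
```

Neighbors that recur on many scaffolds are the more informative candidates when looking for related clusters.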