OC Clone Classification
All ORFeome entry clones are assigned confidence levels during annotation to ensure consistent quality standards across the entire collection. Confidence levels reflect the evidence that an individual clone covers a bona fide gene, based on how the clone sequence matches CCDS, RefSeq, and Ensembl genes and transcripts (see the end of this page for a description of RefSeq, CCDS, and Ensembl). The confidence levels are provided for every OC clone as part of its annotation information.
NOTE: In the database, more than one confidence level may be associated with a particular clone. This occurs when its sequence has more than one hit in the CCDS, RefSeq, and Ensembl databases.
Key to confidence levels of OC Clones*
- Reviewed by RefSeq and in CCDS
- Validated by RefSeq and in CCDS
- Reviewed by RefSeq
- Validated by RefSeq
- Provisional by RefSeq and in CCDS
- Provisional by RefSeq
- Predicted by RefSeq and in CCDS
- Predicted by RefSeq
- In Ensembl but not RefSeq
* These nine categories were developed by the ORFeome Collaboration to provide a measure of confidence in the protein-coding sequence of OC clones. They are listed from highest to lowest level of confidence.
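Because the nine categories are ordered, a clone that carries several confidence levels can be summarized by its highest one. The following is a minimal sketch of that idea, not part of the OC database itself; the names `CONFIDENCE_LEVELS`, `RANK`, and `best_confidence` are hypothetical and chosen only for illustration.

```python
# Confidence levels in the order given in the key above
# (highest confidence first).
CONFIDENCE_LEVELS = [
    "Reviewed by RefSeq and in CCDS",
    "Validated by RefSeq and in CCDS",
    "Reviewed by RefSeq",
    "Validated by RefSeq",
    "Provisional by RefSeq and in CCDS",
    "Provisional by RefSeq",
    "Predicted by RefSeq and in CCDS",
    "Predicted by RefSeq",
    "In Ensembl but not RefSeq",
]

# Map each label to its rank: 1 = highest confidence, 9 = lowest.
RANK = {label: i for i, label in enumerate(CONFIDENCE_LEVELS, start=1)}


def best_confidence(labels):
    """Return the highest-confidence label among those given for one clone.

    Assumes every label is spelled exactly as in the key; an unknown
    label raises KeyError.
    """
    return min(labels, key=RANK.__getitem__)


if __name__ == "__main__":
    hits = ["Predicted by RefSeq", "Validated by RefSeq and in CCDS"]
    print(best_confidence(hits))
```

Sorting a clone list by `RANK[best_confidence(labels)]` would then order clones from most to least confident under the same assumption.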
Clone annotation provides links to the best-matching RefSeq genes and proteins as well as Ensembl transcripts and proteins. Based on a BLAST alignment with the best RefSeq hit, the annotation gives the number of single-nucleotide variants relative to that entry, distinguishing between silent (same protein sequence) and non-synonymous (altered amino acid) variants.
Clone annotation also reports the level of identity between each entry-clone sequence and its best RefSeq hit:
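The distinction between silent and non-synonymous variants can be illustrated with a short sketch. This is not the OC annotation pipeline: it assumes two coding sequences of equal length that are already aligned in frame without gaps, uses the standard genetic code, and classifies each differing nucleotide by whether its whole codon still encodes the same amino acid. The function name `count_variants` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def count_variants(ref_cds, clone_cds):
    """Count silent and non-synonymous single-nucleotide differences.

    Assumes both sequences are in-frame, ungapped, of equal length,
    and that the length is a multiple of three.
    Returns a tuple (silent, non_synonymous).
    """
    ref_cds, clone_cds = ref_cds.upper(), clone_cds.upper()
    if len(ref_cds) != len(clone_cds) or len(ref_cds) % 3 != 0:
        raise ValueError("sequences must be equal-length and in frame")

    silent = non_synonymous = 0
    for i in range(0, len(ref_cds), 3):
        ref_codon, clone_codon = ref_cds[i:i + 3], clone_cds[i:i + 3]
        # Number of nucleotide positions that differ within this codon.
        diffs = sum(a != b for a, b in zip(ref_codon, clone_codon))
        if diffs == 0:
            continue
        if CODON_TABLE[ref_codon] == CODON_TABLE[clone_codon]:
            silent += diffs
        else:
            non_synonymous += diffs
    return silent, non_synonymous


if __name__ == "__main__":
    # GCT -> GCC keeps alanine (silent); GCT -> GTT gives valine.
    print(count_variants("ATGGCTTAA", "ATGGCCTAA"))
    print(count_variants("ATGGCTTAA", "ATGGTTTAA"))
```

A real comparison would start from the BLAST alignment and would need to handle gaps and frame shifts, which this sketch deliberately leaves out.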
- EXACT: 100% identity (no alterations), 100% overlap
- SNPs: >95% identity (>=1 alteration), 100% overlap
- PART: 100% identity, >90% overlap
- PartWithSNPs: >90% identity, >90% overlap
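The four identity categories can be read as a simple decision rule. The sketch below applies the thresholds in the order listed, with the first match winning; that ordering, the handling of matches that fit no category (returned as `None`), and the function name `classify_match` are assumptions made for illustration, not a description of how the OC annotation is actually computed.

```python
def classify_match(identity_pct, overlap_pct):
    """Assign an identity category from percent identity and percent overlap.

    Thresholds follow the list above. Categories are tested in the
    listed order; None is returned when no category applies.
    """
    full_identity = identity_pct >= 100.0
    full_overlap = overlap_pct >= 100.0

    if full_identity and full_overlap:
        return "EXACT"
    if full_overlap and identity_pct > 95.0:
        return "SNPs"
    if full_identity and overlap_pct > 90.0:
        return "PART"
    if identity_pct > 90.0 and overlap_pct > 90.0:
        return "PartWithSNPs"
    return None


if __name__ == "__main__":
    for identity, overlap in [(100, 100), (99.5, 100), (100, 95), (97, 95)]:
        print(identity, overlap, classify_match(identity, overlap))
```

Note that under this reading a full-length match with identity between 90% and 95% falls through to PartWithSNPs; whether the database treats such cases that way is not stated in the key.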
Databases used in the OC clone annotation process
CCDS transcripts reflect a collaborative curation effort between the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), and the University of California, Santa Cruz (UCSC) "to identify a core set of protein coding regions that are consistently annotated and of high quality." (For more information, see: CCDS)
RefSeq is a comprehensive, integrated, non-redundant, well-annotated set of reference sequences, including genomic, transcript, and protein sequences. RefSeq Status Definitions (For more information, see: RefSeq homepage and RefSeq Status):
- Validated: Curated. The RefSeq record has undergone an initial review to provide the preferred sequence standard.
- Reviewed: Curated. The RefSeq record has been reviewed to provide the preferred sequence standard and to add additional functional descriptive information and feature annotation, as relevant.
- Provisional: Not curated. Automatically provided based on GenBank sequence data; there is support for the transcript and protein. This is the default status code applied to some genomes for which there is no clear information about the method used to define the sequence.
- Predicted: Not curated. Automatically provided based on GenBank sequence data; limited or partial support for the transcript or protein. A portion of the transcript or protein may reflect an ab initio annotation prediction that was submitted to GenBank.
Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes, including that of Homo sapiens. (For more information, see: human Ensembl)