OC Clone Classifications
The Classification File includes all available ORFeome entry clones. Each clone is listed with its GenBank accession and its annotated category. One column indicates how closely the clone matches its best RefSeq hit. Please note that the comparison is based on the amino acid sequence, so "SNPs" and "PartWithSNPs" indicate at least one amino acid substitution.
Classification File Description:
- Column 1: GenBank ID of the entry-clone sequence.
- Column 2: RefSeq ID of the best BLAST hit. "No hit found" is shown when no RefSeq hit met at least the conditions for PartWithSNPs (see column 3). Such clones either correspond to provisional genes (which would fall in category 9) or are severely truncated compared to a known gene (e.g. DQ892041).
- Column 3: Similarity, the level of identity between the entry-clone sequence and the RefSeq hit in column 2:
  - EXACT: 100% identity (no alterations), 100% overlap
  - SNPs: >95% identity (>=1 alteration), 100% overlap
  - PART: 100% identity, >90% overlap
  - PartWithSNPs: >90% identity, >90% overlap
- Column 4: Category. Note: Category 9 (hits found only in Ensembl) has not been implemented yet; it is planned for the next version.
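The similarity labels in column 3 can be read as a small decision rule over two percentages. The following is a minimal sketch of that rule, not the ORFeome Collaboration's own code: the function name `classify_similarity` and its two parameters are hypothetical, and it assumes that identity and overlap have already been computed as percentages from the amino acid alignment. The rules are checked from strictest to loosest, so a hit that satisfies several of them receives the strictest label.

```python
from typing import Optional


def classify_similarity(identity_pct: float, overlap_pct: float) -> Optional[str]:
    """Return the column-3 label for a hit, or None for "No hit found".

    Hypothetical helper: both arguments are percentages (0-100) assumed to
    come from the amino acid comparison against the best RefSeq hit.
    """
    full_identity = identity_pct >= 100.0
    full_overlap = overlap_pct >= 100.0

    if full_identity and full_overlap:
        return "EXACT"          # 100% identity, 100% overlap
    if identity_pct > 95.0 and full_overlap:
        return "SNPs"           # >95% identity, at least one alteration
    if full_identity and overlap_pct > 90.0:
        return "PART"           # 100% identity, >90% overlap
    if identity_pct > 90.0 and overlap_pct > 90.0:
        return "PartWithSNPs"   # >90% identity, >90% overlap
    return None                 # below the PartWithSNPs conditions


if __name__ == "__main__":
    for identity, overlap in [(100, 100), (98, 100), (100, 95), (93, 95), (80, 100)]:
        print(identity, overlap, classify_similarity(identity, overlap))
```

Under this reading, a hit with full overlap but only 92% identity falls through the SNPs rule and is labeled PartWithSNPs, since it still satisfies both >90% conditions.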
For a complete listing of OC Classification files, click here.
For a complete listing of sequence-verified ORFeome clones (with plate and well information) click here.
Key to Categories of OC Clones*
- Category 1: Reviewed by RefSeq and in CCDS
- Category 2: Validated by RefSeq and in CCDS
- Category 3: Reviewed by RefSeq
- Category 4: Validated by RefSeq
- Category 5: Provisional by RefSeq and in CCDS
- Category 6: Provisional by RefSeq
- Category 7: Predicted by RefSeq and in CCDS
- Category 8: Predicted by RefSeq
- Category 9: In Ensembl but not RefSeq
* These nine categories were developed by the ORFeome Collaboration to provide a measure of confidence in the protein-coding sequence of OC clones. They are listed from highest to lowest level of confidence.
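Because the nine categories are listed from highest to lowest confidence, their position in the key gives the category number (1 to 9). As an illustration, the key can be expressed as a lookup from RefSeq status and CCDS membership to that number. This is a sketch under stated assumptions: the names `CATEGORY_BY_STATUS` and `oc_category` and the boolean flags are hypothetical, and how a clone's RefSeq status or CCDS membership is obtained is outside its scope.

```python
from typing import Optional

# Category number keyed by (RefSeq status, in CCDS), in the order of the key.
CATEGORY_BY_STATUS = {
    ("Reviewed", True): 1,
    ("Validated", True): 2,
    ("Reviewed", False): 3,
    ("Validated", False): 4,
    ("Provisional", True): 5,
    ("Provisional", False): 6,
    ("Predicted", True): 7,
    ("Predicted", False): 8,
}


def oc_category(
    refseq_status: Optional[str],
    in_ccds: bool = False,
    in_ensembl: bool = False,
) -> Optional[int]:
    """Return the OC category (1-9), or None if no category applies.

    Hypothetical helper: refseq_status is None when the clone has no RefSeq
    hit; such a clone is category 9 only if it is found in Ensembl.
    """
    if refseq_status is None:
        return 9 if in_ensembl else None
    # Unknown status strings yield None rather than a guessed category.
    return CATEGORY_BY_STATUS.get((refseq_status, in_ccds))


if __name__ == "__main__":
    print(oc_category("Reviewed", in_ccds=True))
    print(oc_category("Predicted"))
    print(oc_category(None, in_ensembl=True))
```

A lower number means higher confidence in the protein-coding sequence, so clones can be ranked by sorting on the returned value.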
CCDS transcripts reflect a collaborative curation effort between the European Bioinformatics Institute (EBI), the National Center for Biotechnology Information (NCBI), the Wellcome Trust Sanger Institute (WTSI), and the University of California, Santa Cruz (UCSC) "to identify a core set of protein coding regions that are consistently annotated and of high quality." (For more information, see: CCDS)
RefSeq Status Definitions (For more information, see: RefSeq Status):
- Validated: Curated. The RefSeq record has undergone an initial review to provide the preferred sequence standard.
- Reviewed: Curated. The RefSeq record has been reviewed to provide the preferred sequence standard and to add additional functional descriptive information and feature annotation, as relevant.
- Provisional: Not curated. Automatically provided based on GenBank sequence data; there is support for the transcript and protein. This is the default status code applied to some genomes for which there is no clear information about the method used to define the sequence.
- Predicted: Not curated. Automatically provided based on GenBank sequence data; limited or partial support for the transcript or protein. A portion of the transcript or protein may reflect an ab initio annotation prediction that was submitted to GenBank.