GenBank, the public repository for nucleotide and protein sequences, is a critical resource for molecular biology, evolutionary biology, and ecology. Although some attention has been drawn to sequence errors (1), common annotation errors also reduce the value of this database. In fact, for organisms such as fungi, which are notoriously difficult to identify, up to 20% of DNA sequence records may have erroneous lineage designations in GenBank (2). Gene function annotation in protein sequence databases is similarly error-prone (3, 4). Because the identity and function of new sequences are often inferred by bioinformatic comparison with existing records, both types of errors are propagated into new accessions, leading to long-term degradation of the quality of the database.
Currently, primary sequence data are annotated by the authors of those data, and can only be reannotated by the same authors. This restriction is inefficient and unsustainable over the long term, as authors eventually leave the field. Although it is possible to link third-party databases to GenBank records, this is a short-term solution that has little guarantee of permanence. Similarly, the current third-party annotation option in GenBank (TPA) complicates rather than solves the problem by creating an identical record with a new annotation, while leaving the original record unflagged and unlinked to the new record.
Since the origin of public zoological and botanical specimen collections, an open system of cumulative annotation has evolved, whereby the original name is retained, but additional opinion is directly appended and used for filing and retrieval. This was needed as new specimens and analyses allowed for reevaluation of older specimens and the original depositors became unavailable. The time has come for the public sequence database to incorporate a community-curated, cumulative annotation process that allows third parties to improve the annotations of sequences when warranted by published peer-reviewed analyses (5).
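The cumulative model borrowed from specimen collections can be illustrated with a minimal sketch. It is not a description of any existing GenBank feature or schema: the class names, fields, accession string, and organism names are hypothetical, and the rule that the most recent appended opinion is the one used for retrieval is an assumption made for illustration. The sketch shows only the two properties the proposal depends on: the original annotation is never overwritten, and later opinions are appended to the same record rather than stored in a separate, unlinked one.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    """One opinion about a record; immutable once created."""
    organism: str    # proposed name or lineage
    annotator: str   # who holds this opinion
    reference: str   # supporting publication (placeholder text here)


@dataclass
class SequenceRecord:
    """Hypothetical record that accumulates annotations over time."""
    accession: str          # illustrative identifier, not a real accession
    original: Annotation    # the depositor's annotation, always retained
    appended: List[Annotation] = field(default_factory=list)

    def annotate(self, annotation: Annotation) -> None:
        """Append a third-party opinion without touching the original."""
        self.appended.append(annotation)

    def current(self) -> Annotation:
        """Annotation used for retrieval (assumption: the latest opinion wins)."""
        return self.appended[-1] if self.appended else self.original

    def history(self) -> List[Annotation]:
        """Full annotation trail, original first."""
        return [self.original, *self.appended]


# Usage with made-up values.
record = SequenceRecord(
    accession="EXAMPLE0001",
    original=Annotation("Genus alpha", "original depositor", "original submission"),
)
record.annotate(
    Annotation("Genus beta", "third-party curator", "hypothetical peer-reviewed reanalysis")
)

print(record.current().organism)   # Genus beta
print(record.original.organism)    # Genus alpha
print(len(record.history()))       # 2
```

Keeping the trail on a single record, instead of creating a duplicate record with a new annotation, is what makes the original and the revision mutually discoverable.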