Abstract
Data-driven computational biology relies on the large quantities
of genomic data stored in international sequence data banks.
However, the possibilities are drastically impaired if the stored
data is unreliable. During a project aiming to predict splice
sites in the dicot Arabidopsis thaliana, we extracted a data set
from the A. thaliana entries in GenBank. A number of simple
'sanity' checks, based on the nature of the data, revealed an
alarmingly high error rate. More than 15% of the most important
entries extracted contained erroneous information. In addition,
a number of entries had directly conflicting assignments of exons
and introns, not stemming from alternative splicing. In a few
cases the errors are due to mere typographical misprints, which
may be corrected by comparison to the original papers, but errors
caused by wrong assignments of splice sites from experimental data
are the most common. It is proposed that the level of error
correction should be increased and that gene structure sanity
checks should be incorporated, also at the submitter level, to
avoid or reduce the problem in the future. A non-redundant and
error-corrected subset of the data for A. thaliana is made
available through anonymous FTP.
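
The abstract does not list the individual sanity checks, so the following is only a minimal sketch of what a gene-structure check of this kind might look like. It assumes exons are given as 1-based inclusive coordinates on the forward strand, and it tests three things: that exons lie inside the sequence, that they do not overlap, and that each implied intron begins with GT and ends with AG. The function name `check_gene_structure` and the toy sequence are hypothetical and are not taken from the paper. Because a minority of real introns use non-canonical boundaries, a reported problem is a flag for manual inspection rather than proof of an annotation error.

```python
from typing import List, Tuple

# Assumption: (start, end) pairs are 1-based and inclusive, forward strand only.
Exon = Tuple[int, int]


def check_gene_structure(seq: str, exons: List[Exon]) -> List[str]:
    """Return human-readable problems; an empty list means no check failed."""
    seq = seq.upper()
    problems = []

    # Check 1: every exon must lie inside the sequence and have start <= end.
    for start, end in exons:
        if not (1 <= start <= end <= len(seq)):
            problems.append(f"exon {start}..{end} lies outside the sequence")
    if problems:
        return problems

    ordered = sorted(exons)
    for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
        # Check 2: consecutive exons must not overlap.
        if s2 <= e1:
            problems.append(f"exons {s1}..{e1} and {s2}..{e2} overlap")
            continue
        # Check 3: the implied intron (positions e1+1 .. s2-1) should
        # start with GT and end with AG (canonical splice sites).
        intron = seq[e1:s2 - 1]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {e1 + 1}..{s2 - 1} lacks GT...AG boundaries")
    return problems


# Toy example (hypothetical sequence, not real A. thaliana data).
toy_seq = "ATGAAAGTCCCCAGTTTTAA"
print(check_gene_structure(toy_seq, [(1, 6), (15, 20)]))
print(check_gene_structure(toy_seq, [(1, 5), (15, 20)]))
```

Shifting an exon boundary by a single base, as in the second call, is enough to break the GT...AG pattern, which is why such a simple test can expose wrongly assigned splice sites.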
Original language | English |
---|---|
Journal | Nucleic Acids Research |
Volume | 24 |
Issue number | 2 |
Pages (from-to) | 316-320 |
ISSN | 0305-1048 |
Publication status | Published - 1996 |