Computational Molecular Biology
To satisfy the requirements of molecular biologists,
geneticists, and pathologists, we developed a
manually curated database of complex
disease-related human haplotypes (CDRH,
http://bioinfo.hrbmu.edu.cn/cdrh) by integrating
information on haplotypes and diseases scattered in a
large number of articles. CDRH is a comprehensive
and well-annotated database, and is a useful resource
for researchers to understand complex diseases at the
haplotype level.
1 Results
1.1 Data collection and database content
Text mining was used to collect complex
disease-related haplotypes and other detailed
information for database construction. We searched the
PubMed database (http://www.ncbi.nlm.nih.gov/pubmed)
with a series of keywords, such as ‘complex disease
haplotype’, ‘cancer haplotype’, ‘diabetes haplotype’,
limiting the results to publications before May 2010
for the current version of CDRH. To make data collection
systematic and reliable, we checked the key
information manually and applied the following inclusion
criteria: (i) the article must propose and elaborate a
relationship between a complex disease and
susceptible (or protective) haplotypes; and (ii) the
association of the susceptible (or protective) haplotypes
must be supported by a reported statistical threshold or
p-value.
Ultimately, a total of 1125 haplotypes associated with
114 complex diseases were deposited and maintained
in the current CDRH database. Most of the archived
haplotypes are SNP haplotypes, and the rest are
composed of microsatellites.
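The keyword search described above can be sketched as a set of PubMed query URLs. This is a minimal illustration, not the procedure actually used for CDRH: the text states only that PubMed was searched with keywords and a date limit, so the use of the NCBI E-utilities ESearch endpoint, the function name build_search_url, the lower date bound, and the reading of "before May 2010" as 30 April 2010 are all assumptions.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint. Using this API is an assumption;
# the paper only says that the PubMed database was searched.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Example keywords quoted in the text; the full keyword list is not given.
KEYWORDS = [
    "complex disease haplotype",
    "cancer haplotype",
    "diabetes haplotype",
]


def build_search_url(keyword, max_date="2010/04/30", retmax=100):
    """Return an ESearch URL for one keyword, limited by publication date."""
    params = {
        "db": "pubmed",
        "term": keyword,
        "datetype": "pdat",       # filter on publication date
        "mindate": "1900/01/01",  # hypothetical lower bound (mindate and maxdate go together)
        "maxdate": max_date,      # "before May 2010" read as up to 30 April 2010
        "retmax": retmax,         # maximum number of PubMed IDs to return
    }
    return ESEARCH_URL + "?" + urlencode(params)


# One query URL per keyword; fetching and parsing the results is left out.
urls = [build_search_url(keyword) for keyword in KEYWORDS]
```

The returned PubMed IDs would still need the manual check against criteria (i) and (ii); the sketch covers only the retrieval step.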
In the CDRH database, each entry contains detailed
information regarding haplotypes and diseases. The
information collected includes the disease name,
haplotypes associated with the disease, haplotype
frequencies, the risk status of the haplotypes, the
p-value of the statistical tests, the chromosome upon
which the haplotypes are located, the gene symbol
with which the haplotypes are associated, SNPs (or
microsatellites) that make up the haplotype, and the
bibliographical information from the cited literature.
We not only collected a wide range of risk haplotypes,
but also considered protective haplotypes, both of
which provide valuable information for future genetic
studies of complex diseases.
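The fields listed above suggest a simple record structure. The following sketch shows one way such an entry could be represented; the class name HaplotypeEntry, the field names, the 0.05 cutoff, and the placeholder values are hypothetical and are not taken from the CDRH implementation or its data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HaplotypeEntry:
    """One database entry, mirroring the fields named in the text."""
    disease: str
    haplotype: str                 # allele string, e.g. "A-G-T"
    frequency: Optional[float]     # haplotype frequency, if reported
    risk_status: str               # "risk" or "protective"
    p_value: Optional[float]       # p-value of the statistical test
    chromosome: str
    gene_symbol: str
    markers: List[str] = field(default_factory=list)  # SNP IDs or microsatellites
    pubmed_id: Optional[str] = None                   # bibliographic reference

    def is_significant(self, alpha: float = 0.05) -> bool:
        """Check the p-value against a cutoff (0.05 is an assumed default)."""
        return self.p_value is not None and self.p_value < alpha


# Placeholder values for illustration only; this is not a real CDRH record.
example = HaplotypeEntry(
    disease="example disease",
    haplotype="A-G-T",
    frequency=0.12,
    risk_status="risk",
    p_value=0.01,
    chromosome="1",
    gene_symbol="GENE1",
    markers=["rs0000001", "rs0000002", "rs0000003"],
)
```

Keeping risk_status as an explicit field lets risk and protective haplotypes share one structure, in line with the decision to collect both.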
We also integrated certain biological annotations from
external databases to complement and extend the
literature information. Basic information on the genes
associated with the haplotypes was
retrieved from NCBI, including Entrez Gene ID,
UniGene ID, full gene name, chromosome location of
the gene, and a brief description of the gene function.
Most of the haplotypes in CDRH comprise a series
of SNPs; therefore, we collected information on
haplotype-related SNPs from dbSNP, including SNP
ID, physical position, and alleles for each SNP. In
addition, links are provided to
external databases, such as dbSNP, PubMed,
D-HaploDB, and HapMap, to facilitate
future investigation of complex disease-related
haplotypes. Table 1 summarizes the data in
the CDRH database.
Table 1 Summary of the data in CDRH
Resources        Number
Haplotype        1125
Disease          114
SNP              954
Microsatellite   288
Population       66
Chromosome       22 autosomes, X, Y, and mitochondrion
Gene             218
1.2 Database implementation and web interface
The CDRH database uses MySQL 5.0 to store and
manage the data, and its web interface is implemented
as PHP scripts running under Apache.
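To illustrate how relational storage of this kind supports a lookup by disease name, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The table layout, column names, the helper haplotypes_for_disease, and the inserted row are all hypothetical; the actual CDRH schema is not described in the text.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL 5.0 server.
conn = sqlite3.connect(":memory:")

# Hypothetical two-table layout: diseases and their associated haplotypes.
conn.executescript("""
CREATE TABLE disease (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE haplotype (
    id          INTEGER PRIMARY KEY,
    disease_id  INTEGER NOT NULL REFERENCES disease(id),
    gene_symbol TEXT,
    chromosome  TEXT,
    alleles     TEXT,
    risk_status TEXT,
    p_value     REAL
);
""")

# Placeholder row for illustration only; not a real CDRH record.
conn.execute("INSERT INTO disease (id, name) VALUES (1, 'example disease')")
conn.execute(
    "INSERT INTO haplotype "
    "(disease_id, gene_symbol, chromosome, alleles, risk_status, p_value) "
    "VALUES (1, 'GENE1', '1', 'A-G-T', 'risk', 0.01)"
)


def haplotypes_for_disease(connection, disease_name):
    """Return all haplotype rows linked to the given disease name."""
    query = (
        "SELECT h.gene_symbol, h.chromosome, h.alleles, h.risk_status, h.p_value "
        "FROM haplotype AS h JOIN disease AS d ON d.id = h.disease_id "
        "WHERE d.name = ? ORDER BY h.gene_symbol"
    )
    # The placeholder (?) keeps user input out of the SQL string.
    return connection.execute(query, (disease_name,)).fetchall()


rows = haplotypes_for_disease(conn, "example disease")
```

Lookups by gene symbol, chromosome, or SNP ID would follow the same pattern with a different WHERE clause.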
1.3 Search page
The CDRH database is accessible online and allows
users to retrieve detailed information pertaining to
complex disease-related human haplotypes by disease
name, gene name, chromosome number, or SNP ID
(rs#). We first introduce the search by disease name;
the disease names are sorted alphabetically in a drop-down list box.
For example, if a user selects ‘colorectal cancers’ as a
query term (Figure 1a), search and browse results will