Lecture 2: SEQUENCE INFORMATION RESOURCES

(BIOINFORMATICS)

LECTURE 2- BT-501

With the launch of the Human Genome Project, an enormous amount of sequence data became available, far more than could practically be stored in books. This created the need for databases, also known as sequence information resources.

Definition:

A database is like an e-book (a book on the internet) in which we can store, search, and retrieve (find) any type of data. The data can be DNA sequences, RNA sequences, protein sequences, or genome, proteome, and transcriptome information, etc.

For example, if “acgtcaaga” is the gene sequence encoding a protein “X”, it can be looked up in a genomic database on the internet. We can then find out how many organisms (discovered so far) have this sequence, or a related sequence, for protein X; what the amino acid composition and 3D structure of this protein are in different organisms; what the coding regions of this gene are; which chromosome it is located on; and so on.
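The store, search, and retrieve idea can be illustrated with a minimal Python sketch. This is only a toy in-memory model, not how any real sequence database is implemented: the accessions, organism names, and sequences are invented, and the hypothetical `search` function matches exact substrings only, whereas real databases find related sequences with alignment tools such as BLAST.

```python
# Toy in-memory "database": accession -> record.
# All accessions, organisms, and sequences are invented for illustration.
RECORDS = {
    "TOY0001": {"organism": "Organism A", "protein": "X", "sequence": "ttacgtcaagagg"},
    "TOY0002": {"organism": "Organism B", "protein": "X", "sequence": "ccacgtcaagatt"},
    "TOY0003": {"organism": "Organism C", "protein": "Y", "sequence": "ggggttttcccc"},
}


def search(records, query):
    """Return sorted accessions whose sequence contains the query (case-insensitive)."""
    query = query.lower()
    return sorted(
        acc for acc, rec in records.items() if query in rec["sequence"].lower()
    )


def retrieve(records, accession):
    """Return the full record for an accession, or None if it is not stored."""
    return records.get(accession)


# Search for the example sequence, then retrieve each matching record.
hits = search(RECORDS, "acgtcaaga")
for acc in hits:
    print(acc, retrieve(RECORDS, acc)["organism"])
```

Here the query is found in two of the three toy records, so the same sequence is reported for two different organisms.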

Types of databases

Genome and peptide sequencing of various organisms produced a vast amount of data, which could not be stored in a single database. Therefore, different types of databases were constructed. Presently, three types of databases are recognized:

1. Primary databases:

These are the primary storehouses of sequence information. If a researcher works on a new organism, plant, or microbe and obtains sequence information that is new to the world, he/she should submit the data to a primary database. Every year, a large number of researchers, sequencing labs, and industries contribute large amounts of data to primary data repositories on the internet. All these databases are freely available on the internet.

Primary databases can be of two types:

Nucleic acid databases – These are the primary depots for nucleic acid sequences. Examples include GenBank (maintained by NCBI), EMBL, and DDBJ (DNA Data Bank of Japan).
Protein databases – These are the primary depots for protein sequences. Examples include SWISS-PROT, PIR, and PDB (which stores 3D structures).

2. Secondary databases- These databases combine data from primary databases, add further relevant information, and then re-publish the result on the internet. TrEMBL, Pfam, PROSITE, and CATH are some good examples.

3. Composite databases- In these databases, sequences from different databases are gathered together in a single database. By using a composite database, the user is freed from the tedious task of gathering information from multiple sources. For example, an HIV database contains collective information about HIV from different databases. It is important to note that each composite database has its own format, which is defined by the creators of that particular database. OWL, MIPSX, and NRDB are examples of composite databases.

Redundant database – When the same information is found in two different places in a database, it is called a redundant database. Why would the same information be present multiple times? Because the same data is submitted by different researchers around the globe working on the same organism.
Non-redundant database – Only a single entry for each piece of data is found; the data is not repeated anywhere. OWL is an example of a non-redundant database. Non-redundant databases are usually built from other databases: when a particular sequence has been submitted more than once by different users, the duplicates are merged into a single entry. Note that primary databases accept independent submissions and can therefore contain redundant entries.
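The merging of duplicate submissions into one entry can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used by OWL or any other real database: the hypothetical `merge_redundant` function treats two entries as redundant only if their sequences are identical after normalizing case and whitespace, and the lab names and identifiers are invented.

```python
def merge_redundant(entries):
    """Collapse entries with identical sequences into one record each.

    entries: iterable of (source, identifier, sequence) tuples.
    Returns a dict mapping each normalized sequence to the list of
    "source:identifier" labels that reported it.
    """
    merged = {}
    for source, identifier, sequence in entries:
        # Normalize so that "acgt" and "ACGT " count as the same sequence.
        key = sequence.strip().upper()
        merged.setdefault(key, []).append(f"{source}:{identifier}")
    return merged


# Invented submissions: two labs report the same sequence independently.
entries = [
    ("LabA", "seq1", "ACGTCAAGA"),
    ("LabB", "s-17", "acgtcaaga"),
    ("LabB", "s-18", "GGGTTTCCC"),
]

nr = merge_redundant(entries)
for seq, labels in nr.items():
    print(seq, labels)
```

The three redundant submissions reduce to two non-redundant entries, and each entry keeps a list of where it came from so that no submission is lost.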