In 1976, the complete genome of a living organism, bacteriophage MS2, was determined for the first time. That is, the sequence of 3569 nucleotide bases that constitute its genetic "recipe" was discovered. This sequence instructs the ribosomes of an infected bacterium to produce the humanly comprehensible protein machine that we call a bacteriophage. Bacteriophage MS2 has an RNA genome, but subsequent advances in the tools used for determining nucleotide sequences have enabled the analysis of many DNA sequences that are vastly larger than this one, such as the 3-billion-base-pair human genome, which is nearly a million times larger. Such large quantities of data cannot be interpreted or understood without computer assistance. Bioinformatics is the use of computerized statistical and modeling tools to make sense of such data.
Scientists hope to use bioinformatics to ultimately produce accurate models of the way complex organisms function, develop, and evolve. With regard to evolution, the most relevant bioinformatics techniques are comparative genomics and sequence alignment. The latter allows modern gene sequencing methods, so-called ‘shotgun’ sequencing, to function. In shotgun sequencing, computers search through huge databases full of nucleotide sequences to identify where the same long sequence is found in multiple locations in the database. Sequence alignment is a set of statistical techniques for joining sequences together at their regions of overlap. In this manner, continuous sequences are generated that match the total nucleotide sequence of an organism. Sequence alignment is algorithmically simple but computationally demanding.
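The idea of joining fragments at their regions of overlap can be illustrated with a minimal sketch. It assumes error-free reads taken from a single strand and uses a simple greedy strategy; the function names, the `min_overlap` parameter, and the example reads are hypothetical and chosen only for illustration. Real assemblers must also cope with sequencing errors, repeated regions, and reverse-complement reads, none of which are handled here.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap of at least `min_overlap` exists."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the largest overlap.

    Illustrative assumption: reads are error-free and from one strand.
    Returns the fragments that remain when no further merge is possible.
    """
    fragments = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(fragments) > 1:
        best_size, best_i, best_j = 0, -1, -1
        # Compare every ordered pair; this quadratic scan is why the
        # approach is simple to state but costly on large data sets.
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining pair overlaps enough to be joined
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments)
                     if k not in (best_i, best_j)]
        fragments.append(merged)
    return fragments
```

For example, `greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])` returns `["ATGGCGTACCTGA"]`: the first two reads share `GCGT`, and the merged result shares `TACC` with the third. The all-against-all comparison in each round gives a sense of why the computation becomes expensive as the number of reads grows.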
Comparative genomics is more mathematically and conceptually demanding. Rather than looking for identical sequences in genetic material taken from a single organism, scientists use heuristics and Bayesian analysis to identify homologous genetic regions in related species. By counting the sequence differences in those parts of the genome that don’t influence an organism’s phenotype, which are expected to accumulate changes at a roughly steady rate, scientists are able to estimate the length of time since two lineages diverged. Comparing these sequences with the sequences of other parts of the genome enables the detection of evolutionary pressures that either change or stabilize a particular gene.
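The divergence-time estimate described above can be sketched as a short calculation. The sketch assumes the two sequences are already aligned and come from regions with no effect on phenotype, that substitutions accumulate at a constant rate, and that the Jukes-Cantor model (one common correction for multiple changes at the same site, swapped in here for illustration) is adequate. The function names and the `rate_per_site_per_year` parameter are hypothetical, and any rate passed in is a placeholder rather than a measured value.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned sites at which two sequences differ.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site, correcting the observed
    difference `p` for repeated changes at the same position."""
    if p >= 0.75:
        raise ValueError("sequences are too divergent for this correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def divergence_time(seq_a: str, seq_b: str,
                    rate_per_site_per_year: float) -> float:
    """Years since the two lineages split, under a constant-rate assumption.

    The distance is divided by twice the rate because both lineages
    have been accumulating changes since they separated.
    """
    distance = jukes_cantor_distance(p_distance(seq_a, seq_b))
    return distance / (2.0 * rate_per_site_per_year)
```

A call such as `divergence_time(seq_a, seq_b, 1e-9)` uses an illustrative rate of one substitution per billion sites per year. In practice the rate has to be calibrated independently, for instance against dated fossils, and the estimate is only as reliable as that calibration.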
Functional genomics examines protein structure and function and the manner in which gene expression and protein synthesis vary during an organism’s life. Understanding protein function requires the development of computationally efficient molecular modeling programs, as computerized analysis of protein folding can be almost arbitrarily demanding. A sophisticated understanding of protein folding would be very useful, however, as it would enable us to predict the affinities of a protein’s active sites and to design new molecules that alter their activity. It could also enable us to build molecules that interact only with the proteins we intend them to, and possibly to build new proteins or improve existing ones, increasing or reducing their activity as needed.
To discover how protein synthesis varies within an organism during its life, protein and RNA expression levels have to be monitored in all of an organism’s tissues and at every life stage. Protein microarrays and high-throughput mass spectrometry are proteomics tools that can be used to gather data for bioinformatics. RNA microarrays and other tools gather equivalent information on gene regulation and expression. Once the data are fully analyzed, it is hoped that we will understand organismal function well enough to intervene in biology in much more effective and less destructive ways, enabling much safer and more effective medicine. For instance, an understanding of which regulatory mechanisms become distorted in cancer could enable the development of new cancer treatments that simply correct the regulatory dysfunctions causing the cancer rather than trying to kill the cancerous cells.