Introduction Ana Teresa Freitas Bioinformática – 55 Eng. Biomédica 2006/2007
Why Bioinformatics? • Bioinformatics is the field of science in which biology, computer science and statistics merge into a single discipline. • The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be derived.
Why Bioinformatics? • DNA sequencing technologies have created massive amounts of information that can only be efficiently analyzed with computers. • So far 70 species sequenced – Human, rat, chimpanzee, chicken, and many others.
• As the information becomes ever larger and more complex, more computational tools are needed to sort through the data.
A short history of Bioinformatics • To understand bioinformatics in any meaningful way, a computer scientist needs to understand some basic biology, just as a biologist needs to understand some basic computer science. • Since bioinformatics encompasses tools and techniques from molecular biology and computer science, this brief history of the field incorporates events from both areas.
A short history of Bioinformatics 1800 - 1870 • 1865 Gregor Mendel discovers the basic rules of heredity in the garden pea. – An individual organism has two alternative heredity units for a given trait (dominant trait vs. recessive trait)
• 1869 Johann Friedrich Miescher discovered DNA and named it nuclein.
Mendel: The Father of Genetics
A short history of Bioinformatics 1880 - 1900 • 1881 Edward Zacharias showed chromosomes are composed of nuclein. • 1899 Richard Altmann renamed nuclein to nucleic acid. • By 1900, chemical structures of all 20 amino acids had been identified
A short history of Bioinformatics 1900 - 1940 • 1902 Emil Hermann Fischer wins Nobel prize: showed amino acids are linked and form proteins – Postulated: protein properties are defined by amino acid composition and arrangement, which we nowadays know as fact
• 1911 Thomas Hunt Morgan discovers that genes on chromosomes are the discrete units of heredity • 1911 Phoebus Aaron Theodore Levene discovers RNA • 1933 A new technique, electrophoresis, is introduced by Tiselius for separating proteins in solution
A short history of Bioinformatics 1940 - 1950 • 1941 George Beadle and Edward Tatum show that genes make proteins • 1943 The first true general-purpose electronic computer (ENIAC) is constructed at the University of Pennsylvania between 1943 and 1946 • 1950 Erwin Chargaff finds that Cytosine complements Guanine and Adenine complements Thymine
A short history of Bioinformatics 1950 - 1960 • 1950s Mahlon Bush Hoagland is the first to isolate tRNA • 1951 Pauling and Corey propose the structure for the alpha-helix and beta-sheet (Proc. Natl. Acad. Sci. USA, 37:205-211, 1951) • 1952 Alfred Hershey and Martha Chase show that genes are made of DNA
A short history of Bioinformatics 1953 • 1953 Watson and Crick propose the double helix model for DNA based on X-ray data obtained by Franklin and Wilkins (Nature, 171:737-738, 1953) • 1 biologist, 1 physics Ph.D. student, 900 words, a Nobel Prize • Their 1953 Nature paper: “It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”
James Watson and Francis Crick
A short history of Bioinformatics 1950 - 1960 • 1954 Perutz’s group develops heavy atom methods to solve the phase problem in protein crystallography • 1955 The sequence of the first protein to be analyzed, bovine insulin, is announced by F. Sanger • 1958 The first integrated circuit is constructed by Jack Kilby at Texas Instruments
A short history of Bioinformatics 1960 - 1970 • 1969 The ARPANET is created by linking computers at Stanford, UCSB, The University of Utah and UCLA • 1970 The details of the Needleman-Wunsch algorithm for sequence comparison are published • 1970 Howard Temin and David Baltimore independently discover reverse transcriptase
A short history of Bioinformatics 1970 - 1980 • 1971 Ray Tomlinson invents the email program • 1972 The first recombinant DNA molecule is created by Paul Berg and his group • 1973 The Brookhaven Protein Data Bank is announced (Acta. Cryst. B, 1973, 29:1746) • 1973 Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes Ethernet • 1974 Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an “internet” and develop the Transmission Control Protocol (TCP) • 1974 Charles Goldfarb invents SGML (Standard Generalized Markup Language)
A short history of Bioinformatics 1970 – 1980 • 1975 Microsoft Corporation is founded by Bill Gates and Paul Allen. • 1975 Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide gel is combined with separation according to isoelectric points, is announced by P. H. O'Farrell (J. Biol. Chem., 250: 4007-4021, 1975). • 1975 E. M. Southern publishes the experimental details of the Southern Blot technique for detecting specific sequences of DNA (J. Mol. Biol., 98: 503-517, 1975). • 1977 The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is published (Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M.J.; J. Mol. Biol., 1977, 112:535). • 1977 Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) report methods for sequencing DNA.
A short history of Bioinformatics 1970 – 1980 • 1977 Phillip Sharp and Richard Roberts demonstrated that pre-mRNA is processed by excising the introns and splicing the exons together. Phillip Sharp
• 1977 Joan Steitz determined that the 5’ end of snRNA is partially complementary to the consensus sequence of 5’ splice junctions.
A short history of Bioinformatics 1980 – 1990 • 1980 The first complete genome sequence for an organism (ΦX174) is published. The genome consists of 5,386 base pairs which code for nine proteins. • 1980 Wüthrich et al. publish a paper detailing the use of multi-dimensional NMR for protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys. Res. Comm., 1980, 95:1). • 1980 IntelliGenetics, Inc. is founded in California. Their primary product is the IntelliGenetics Suite of programs for DNA and protein sequence analysis
A short history of Bioinformatics 1980 – 1990 • 1981 The Smith-Waterman algorithm for sequence alignment is published. • 1981 IBM introduces its Personal Computer to the market. • 1983 The Compact Disc (CD) is launched. • 1983 Name servers are developed at the University of Wisconsin. • 1984 The Macintosh is announced by Apple Computer. • 1985 The FASTP algorithm is published. • 1985 The polymerase chain reaction (PCR) is described by Kary Mullis and co-workers.
A short history of Bioinformatics 1980 – 1990 • 1986 Leroy Hood develops an automated sequencing mechanism • 1986 The term "Genomics" appears for the first time to describe the scientific discipline of mapping, sequencing, and analyzing genes. The term was coined by Thomas Roderick as a name for the new journal. • 1986 The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL). • 1987 The use of yeast artificial chromosomes (YAC) is described (David T. Burke, et al., Science, 236: 806-812). • 1987 The physical map of E. coli is published (Y. Kohara, et al., Cell 51: 319-337). • 1987 Perl (Practical Extraction and Report Language) is released by Larry Wall.
A short history of Bioinformatics 1980 – 1990 • 1988 The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine. • 1988 The Human Genome Initiative is started (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988). • 1988 The FASTA algorithm for sequence comparison is published by Pearson and Lipman. • 1988 A new program, an Internet computer virus designed by a student, infects 6,000 military computers in the US.
A short history of Bioinformatics 1990 • The 15-year Human Genome Project is launched by Congress • The BLAST program (Altschul, et al.) is implemented • The HTTP 1.0 specification is published. Tim Berners-Lee publishes the first HTML document
A short history of Bioinformatics 1991 - 1994 • 1991 The research institute in Geneva (CERN) announces the creation of the protocols which make up the World Wide Web • 1991 Linus Torvalds announces a Unix-like operating system which later becomes Linux • 1991 The creation and use of expressed sequence tags (ESTs) is described (J. Craig Venter, et al., Science, 252: 1651-1656) • 1992 The Institute for Genomic Research (TIGR) is established by Craig Venter • 1993 Affymetrix begins independent operations in Santa Clara, California • 1994 Netscape Communications Corporation is founded and releases Navigator, the commercial version of NCSA's Mosaic • 1994 Gene Logic is formed in Maryland. • 1994 The PRINTS database of protein motifs is published by Attwood and Beck
A short history of Bioinformatics 1996 - 1998 • 1996 The first eukaryotic genome, that of Saccharomyces cerevisiae (baker's yeast, 12.1 Mb), is sequenced • 1996 The working draft for XML is released by W3C • 1996 Oxford Molecular Group acquires the MacVector product from Eastman Kodak • 1996 The Prosite database is reported by Bairoch, et al. • 1996 Affymetrix produces the first commercial DNA chips • 1997 The genome for E. coli (4.7 Mbp) is published • 1998 Craig Venter forms Celera in Rockville, Maryland • 1998 PerkinElmer, Inc. develops a 96-capillary sequencer • 1998 The complete sequence of the Caenorhabditis elegans genome is published
A short history of Bioinformatics 1999 - 2000 • 1999 First human chromosome (number 22) sequenced • 2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is published. • 2000 The A. thaliana genome (100 Mb) is sequenced. • 2000 The D. melanogaster genome (180 Mb) is sequenced.
A short history of Bioinformatics 2001 • 2001 International Human Genome Sequencing: first draft of the sequence of the human genome published (3,000 Mbp)
A short history of Bioinformatics 2003 - Present • 2003 The Human Genome Project is completed. • 2003 An international sequencing consortium publishes the full genome sequence of the common house mouse (2.5 Gb) • 2004 The draft genome sequence of the brown Norway laboratory rat, Rattus norvegicus, is completed by the Rat Genome Sequencing Project Consortium (April 1 edition of Nature)
Genomics • Origin – Fragment assembly of the DNA sequence – Gene finding – Sequence databases
The Central Dogma
Genes TSS Promoter region UTR
Pre-mRNA Splicing mRNA Translation Protein
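The flow named on this slide (gene, pre-mRNA, splicing, mRNA, translation, protein) can be sketched in a few lines of Python. This is a minimal illustration, not a gene model: the toy gene, the intron coordinates, and the function names are assumptions made up for the example. Only the standard genetic code table is factual.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G base order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_strand):
    """Pre-mRNA carries the coding strand's sequence with U in place of T."""
    return coding_strand.replace("T", "U")

def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open intervals (hypothetical input format)."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[pos:start])
        pos = end
    pieces.append(pre_mrna[pos:])
    return "".join(pieces)

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Made-up toy gene: exon 1 (ATGGCC), intron (GT...AG), exon 2 (AAATGA).
gene = "ATGGCC" + "GTAAGTCCAG" + "AAATGA"
pre_mrna = transcribe(gene)
mrna = splice(pre_mrna, [(6, 16)])
protein = translate(mrna)
print(pre_mrna, mrna, protein)
```

The intron in the toy gene starts with GT and ends with AG, the same dinucleotides that appear as splice-site labels in the annotation diagram later in these slides.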
Analyzing a Genome • How to analyze a genome in four easy steps. – Cut it • Use enzymes to cut the DNA into small fragments.
– Copy it • Copy it many times to make it easier to see and detect.
– Read it • Use special chemical techniques to read the small fragments.
– Assemble it • Take all the fragments and put them back together. This is hard!!!
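The first three steps can be imitated in silico to get a feel for what the assembly step receives as input. A minimal sketch, assuming uniformly random fragment positions and error-free reads; the function name `shotgun_reads`, its parameters, and the toy genome are hypothetical and chosen only for illustration.

```python
import random

def shotgun_reads(genome, read_len, coverage, seed=0):
    """Sample random fixed-length fragments ("reads") from a genome string.

    "Copy it" is modelled by sampling with replacement, so the same region
    may be read many times. coverage = total read bases / genome length.
    """
    rng = random.Random(seed)
    n_reads = int(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

genome = "ATGGCGTGCAATGCCTTAGGACCATTGA"  # made-up toy sequence, 28 bases
reads = shotgun_reads(genome, read_len=8, coverage=5)
print(len(reads), "reads, for example", reads[:3])
```

The reads come back without their positions, which is exactly why the fourth step is the hard one.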
Sequencing Clone-by-clone shotgun sequencing Human Genome Project
Whole-genome shotgun sequencing Celera Genomics
Assembling Genomes • Must take the fragments and put them back together – Not as easy as it sounds. • SCS Problem (Shortest Common Superstring) – Fit overlapping sequences together to get the shortest possible sequence that includes all fragment sequences
• DNA fragments contain sequencing errors • DNA has two complementary strands – Need to take into account both directions of DNA
• Repeat problem – 50% of human DNA is just repeats – If you have repeating DNA, how do you know where it goes?
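The shortest common superstring formulation can be made concrete with the classic greedy heuristic: repeatedly merge the two fragments with the longest suffix-prefix overlap. This is a sketch under simplifying assumptions (error-free fragments, all read from the same strand, brute-force overlap search), not a description of how the assemblers used in the genome projects work, and greedy merging does not guarantee the shortest superstring. All function names are hypothetical. `reverse_complement` shows how a fragment read from the opposite strand could be brought into the same orientation first.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_scs(fragments):
    """Greedy approximation to the shortest common superstring."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge the best pair, keeping the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""

fragments = ["ATGGCGT", "GCGTGCA", "TGCAATG", "AATGCCT"]  # made-up toy reads
assembly = greedy_scs(fragments)
print(assembly)
```

When the sequence contains repeats longer than the overlaps, several merges tie for the best score and the greedy choice can join fragments that were never adjacent, which is the repeat problem in miniature.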
Annotation [Figure: state diagram over two sequences x and y; only the state labels survive] • Start • End • match intergenic x:y, emit x:-, emit -:y • start:start • match exon x1x2x3:y1y2y3, emit x1x2x3:-, emit -:y1y2y3 • -:GT, -:y1GT, -:y1y2GT • match intron x:y, emit -:y • emit x intron x:- • emit y intron -:y • -:AG, -:AGy2y3, -:AGy3 • (…) (…)
Annotation Precision [Figure: chart of annotation precision; axis ticks from 75 to 100]
Biological Databases Available data (Feb. 2005) • Nucleotides: 44,575,745,176 • Nucleotides registrations: 49,127,925 • Protein sequences: 5,785,962 • PDB, 3D structures: 28,905
Sequencing: it’s just the beginning
What’s next? • State-of-the-art – Motif finding – Global gene expression analysis – Regulatory networks inference – Structure analysis – Genetic and metabolic networks modeling – Information systems
• For the future – Systems biology – Synthetic biology
Challenges for the future