Nucleic acids: sequencing
Two ways of sequencing:
1. DNA molecules (radioactively labeled at their 5’ termini) are subjected to 4 separate chemical treatments that break them preferentially at Gs, Cs, Ts, or As. (Maxam and Gilbert chemical method, not widely used)
2. Chain-termination method (Sanger’s method, widely used)
Sanger’s enzymic method Maxam and Gilbert
ddNTPs are chain-terminating nucleotides: the synthesis of a DNA strand stops when a ddNTP is added to the 3’ end
Without a 3’-hydroxyl group, the terminated strand cannot carry out the nucleophilic attack on the next incoming nucleotide, so the chain cannot be extended.
The position of each G can be read from the gel.
If one ddGTP is added per 100 dGTP, DNA synthesis aborts with a frequency of about 1/100 each time the polymerase has to incorporate a G.
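The 1-in-100 ratio can be turned into a small calculation. The following is a minimal sketch, assuming that each G incorporation independently picks ddGTP with probability 0.01 (the figure used above). The function name and the independence assumption are illustrative, not part of the lecture.

```python
def termination_probabilities(num_g_positions, p_stop=0.01):
    """Probability that a growing chain terminates exactly at the k-th G.

    Assumes each G incorporation independently picks ddGTP with
    probability p_stop (an illustrative simplification).
    """
    probs = []
    still_growing = 1.0  # fraction of chains that have not yet terminated
    for _ in range(num_g_positions):
        probs.append(still_growing * p_stop)
        still_growing *= (1.0 - p_stop)
    return probs, still_growing


probs, unterminated = termination_probabilities(50)
print(f"stop at 1st G:  {probs[0]:.4f}")
print(f"stop at 50th G: {probs[-1]:.4f}")
print(f"chains extending past 50 Gs: {unterminated:.4f}")
```

Under these assumptions roughly 60% of chains are still growing after 50 Gs, which is why a low ddGTP:dGTP ratio yields a ladder of fragments ending at every G rather than only at the first few.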
1. Fluorescently labeled ddNTPs
2. Polymerase catalyzed
Shotgun sequencing of a bacterial genome
1. The bacterial genome was randomly sheared into many fragments with an average size of 1 kb, and these were cloned into a vector. (Prepare what you are going to shoot)
2. DNA was prepared from individual recombinant DNA clones and separately sequenced on an automated sequencer. (Shoot) This is called shotgun sequencing.
3. To obtain the complete DNA sequence of the bacterium Haemophilus influenzae, 10x sequence coverage was used.
10x coverage example: if the H. influenzae genome is 1.8 Mb and each read produces 600 bp of sequence, then 600 bp x 33,000 different colonies = about 20 Mb, roughly 10 times the genome size.
That is to say, 33,000 colonies are picked to prepare plasmids for sequencing.
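The coverage arithmetic can be checked directly. This is a minimal sketch using a genome size of about 1.8 Mb for H. influenzae, 600 bp per read and 33,000 reads. The uncovered-fraction estimate is an added assumption (a Poisson model in which reads land uniformly at random), not something stated in the lecture.

```python
import math


def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth: total bases read divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


genome_size = 1_800_000   # H. influenzae, about 1.8 Mb
read_length = 600         # bp of sequence per read
num_reads = 33_000        # one read per picked colony

c = coverage(num_reads, read_length, genome_size)
total_mb = num_reads * read_length / 1e6

# Poisson estimate of the fraction of bases never sampled at depth c.
# Assumes uniformly random read placement, which is an idealization.
uncovered = math.exp(-c)

print(f"total sequence: {total_mb:.1f} Mb, coverage: {c:.1f}x")
print(f"expected uncovered fraction: {uncovered:.2e}")
```

With these numbers the total is 19.8 Mb, or about 11x depth, which the notes round to 10x.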
The shotgun strategy permits a partial assembly of large genome sequences
The key technical insights that facilitated the sequencing of the human genome were the reliance on (1) automated shotgun sequencing (obtaining the sequence) and (2) the subsequent use of computers to assemble the different sequences (analyzing the sequence, which is the rate-limiting step).
1. Recombinant plasmid library
2. Shotgun sequencing
3. Sequence assembly
Assembly Step 1: form contigs
(A single contig is about 50,000 to 200,000 bp. ) Sophisticated computer programs have been developed that assemble the short sequences from random shotgun DNAs into larger contiguous sequences called contigs.
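The idea of merging overlapping reads into contigs can be illustrated with a toy greedy assembler. This is only a sketch: real assemblers handle sequencing errors, repeats, reverse-complement reads and millions of sequences, none of which is attempted here. The function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return seqs  # no overlaps left: what remains are the contigs
        merged = seqs[i] + seqs[j][n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (i, j)] + [merged]


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))
```

For these three error-free reads the result is a single contig, ATGGCGTACCTGA. Reads that share no overlap of at least `min_len` bases are left as separate contigs, which mirrors why gaps remain between contigs in a real assembly.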
Assembly Step 2: The paired-end strategy permits the assembly of larger scaffolds (1-2 Mb)
Fig 20-17. Contigs are linked by sequencing the ends of large DNA fragments (plasmid library containing larger DNA fragments).
1. Assemble the contigs from the 1 kb plasmid shotgun sequences. (50 kb to 200 kb)
2. Assemble the contigs into large scaffolds by sequencing both ends of 5 kb plasmids. (<500 kb)
3. Assemble the larger scaffolds (>1 Mb) by sequencing the ends of the BAC library clones.
The purpose of this analysis is to predict the protein-coding genes and other functional sequences in the genome.
For the genomes of bacteria and simple eukaryotes:
Finding protein-coding genes = identification of ORFs (open reading frames). (1) straightforward; (2) fairly effective; (3) but not every ORF is a real protein-coding gene; (4) the key challenge is identifying the functions of these genes
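ORF identification in a compact genome can be sketched as a scan of all six reading frames for an ATG followed in frame by a stop codon. The version below is a simplified illustration, not a real gene finder: it assumes ATG as the only start codon, the standard stop codons, and a hypothetical `min_codons` length threshold (counting the start and stop codons) to filter out short chance ORFs.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end, frame) for ATG...stop stretches in the 3 frames of seq."""
    for frame in range(3):
        start = None  # index of the first ATG since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons, including start and stop.
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3, frame
                start = None


def find_orfs(seq, min_codons=100):
    """Scan all six reading frames; coordinates are relative to the scanned strand."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end, frame in orfs_in_strand(s, min_codons):
            hits.append((strand, frame, start, end, s[start:end]))
    return hits


for hit in find_orfs("CCATGAAATTTTAGGG", min_codons=4):
    print(hit)
```

A scan like this reports every sufficiently long ORF, which is why point (3) matters: some hits will not correspond to real genes, and the threshold trades missed short genes against false positives.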
For animal genomes with complex exon-intron structures, the challenge is far greater:
1. A variety of bioinformatics tools are required to identify genes and the genetic composition of complex genomes.
2. The computer programs that identify potential protein-coding genes are based on many sequence criteria, including the occurrence of extended ORFs that are flanked by appropriate 5’ and 3’ splice sites.
Limitations of the computer methods:
1) About one-fourth of genes cannot be identified this way.
2) Promoters are not identified, because the core promoter elements are highly degenerate.
The most important method for validating predicted protein-coding genes and identifying those missed by current gene-finder programs is the use of cDNA sequence data.
Although the transcription complex is smart enough to identify these promoter elements in the cell, we are not yet smart enough to write programs to identify them in silico (by computer).
cDNA library generation, sequencing and application:
1. The mRNAs are first reverse transcribed into cDNAs, and these cDNAs, both full-length and partial, are cloned to make the cDNA library.
2. The cDNAs are sequenced using the shotgun method to generate an EST (expressed sequence tag) database.
3. These ESTs are aligned onto genomic scaffolds to help identify genes and to assemble larger scaffolds.
Fig 20-18 Gene finder method: analysis of protein-coding regions in Ciona intestinalis (a sea squirt)
A 20-kb genome sequence (scaffold) Predicted by a gene finder program
DNA library
A DNA library is a collection of cloned DNA fragments.
Two types of DNA library:
1. Genomic library: contains DNA fragments representing the entire genome of an organism.
2. cDNA library: contains only complementary DNA molecules synthesized from mRNA molecules in a cell.
Genomic DNA library construction
How large a library would you need?
N = ln(1 - P) / ln(1 - f)
P is the desired probability
f is the fractional proportion of the genome in a single recombinant
N is the necessary number of recombinants
e.g. P = 99%; digestion with a 6-cutter enzyme (average fragment size 4^6 = 4096 bp); human genome size = 3 x 10^9 bp
N = ln(1 - 0.99) / ln(1 - (4096 / (3 x 10^9)))
N = 3.37 x 10^6 clones
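The worked example can be reproduced in a few lines. This is a minimal sketch of the formula N = ln(1 - P) / ln(1 - f). The function name is illustrative, and `math.log1p` is used only to keep the logarithm accurate when f is very small.

```python
import math


def clones_needed(p, insert_bp, genome_bp):
    """Number of recombinants N so that a given sequence is present with probability p."""
    f = insert_bp / genome_bp           # fraction of the genome in one recombinant
    return math.log(1 - p) / math.log1p(-f)   # log1p(-f) == ln(1 - f)


# 6-cutter: a 6-bp site occurs on average once every 4**6 = 4096 bp.
n = clones_needed(p=0.99, insert_bp=4**6, genome_bp=3e9)
print(f"N = {n:.3g} clones")
```

Raising P or shrinking the insert size increases N, which is one reason vectors that accept larger inserts reduce the library size needed for a large genome.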
cDNA (complementary DNA) library construction
1. Make cDNA
2. Nick the RNA strand
3. Synthesize the second strand of cDNA
4. Insert the cDNA into a plasmid
Clone cDNA into a plasmid
1. Treat the cDNA with S1 nuclease (to remove any 5' cap mRNA fragment remaining in the cDNA duplex).
2. Convert potentially "ragged" ends to blunt ends by treatment with Pol I (which fills in 5' overhangs and chews back 3' overhangs).
3. Methylate the cDNA at potential internal EcoRI sites by treatment with EcoRI methylase (plus S-adenosyl methionine).
4. Ligate linkers to the blunt, methylated cDNA using T4 DNA ligase.
5. Cut the linkers with EcoRI restriction endonuclease.
6. Remove linker fragments from the cDNA fragments by agarose gel electrophoresis.
7. Ligate the cDNA to the vector DNA fragment (opened up by EcoRI restriction endonuclease).