When a gene is sequenced for the first time, the sequence is submitted to a databank such as the DNA Data Bank of Japan (DDBJ), GenBank or the European Molecular Biology Laboratory (EMBL) databank. Submission to the nucleotide sequence databanks is a simple, automated process carried out electronically through web-based procedures. Because these databanks exchange their data on a day-to-day basis, they can be regarded as one and the same, so it is only necessary to submit the sequence to one of them.
EMBL and GenBank contain the same data and both are computer-readable, but their entry formats differ, so some difficulty may be encountered when switching from one to the other. In these databanks the accession number (AC) is important: it is the most useful way of retrieving a specific sequence from the databank, and it is the identifier quoted in publications. The accession number is unique to a particular sequence rather than to a specific gene or locus. Hence, any given sequence has the same accession number in each of the databanks. However, the sequence of the same gene from different variants or strains of one species may have different accession numbers.
Different versions of a particular gene can be found in the databank either by entering the gene name or by using a known sequence to search for similar sequences with a tool such as BLAST. A substantial amount of annotation is listed for each entry to make the sequence information more useful. The annotation includes information about the origin of the sequenced DNA, the identification of open reading frames and, to some extent, the computer-predicted protein sequence, intron and exon boundaries, and information about expression signals, motifs and structural elements. The annotation can help to define the function of the gene. It is held in the feature table (FT), which can be read by Artemis to produce a visual display of the features of the sequence.
UniProt is a databank of protein sequences that also provides information about the 3D structure of the protein and cross-references to several other databases such as Prosite and Pfam. In the protein sequence databanks, the data can be divided into two types: physical and predicted. Physical data are derived from direct protein sequencing. Predicted data are derived by computer translation of DNA sequences, and there may be no direct evidence that the protein actually exists.
b) Sequence Analysis
1. Identifying Coding Regions
The coding region of a gene is identified so that it can be translated into a protein sequence. Introns are not translated; only exons are. Hence, introns can be ignored when the cloned fragment is of bacterial origin or is a eukaryotic cDNA.
An open reading frame (ORF) is a region of a sequence that could code for a protein and therefore contains no stop codons. A stop codon marks the end of the protein. A reading frame with many stop codons is unlikely to be translated. A computer can identify ORFs and translate the DNA sequence directly into a protein sequence. All three reading frames are searched because a sequence could in principle be translated in any of them.
An ORF can also be defined as the stretch between a start codon and the first stop codon in the same reading frame. Although ATG is the usual start codon in DNA, codons such as GTG, TTG or CTG are also used as start codons, mainly in bacteria. Because the real start codon is uncertain, ORF searches look for the region between two stop codons rather than the region between a start codon and the first stop codon.
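The stop-to-stop search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name find_orfs and the min_codons threshold are hypothetical, the standard stop codons TAA, TAG and TGA are assumed, and only the three forward frames are scanned (the reverse strand would need the reverse complement).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def find_orfs(dna, min_codons=3):
    """Return (frame, start, end) for stop-free stretches in the three forward frames.

    start and end are 0-based with end exclusive. Each stretch is bounded by
    stop codons (or by the ends of the sequence), not by a start codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        run_start = frame  # first base of the current stop-free stretch
        last = frame       # position just past the last complete codon read
        for pos in range(frame, len(dna) - 2, 3):
            last = pos + 3
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
        # A stretch may also run to the end of the sequence without a stop.
        if (last - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, last))
    return orfs


example = "TAAATGGCAGCAGCATAGCC"
for frame, start, end in find_orfs(example):
    print(frame, start, end, example[start:end])
```

In practice the threshold would be set much higher than three codons, since short stop-free stretches occur frequently by chance.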
2. Expression Signals
An open reading frame (ORF) is of little use if that region of the DNA is not transcribed in the correct direction. Hence, it is important to identify potential transcription start and stop sites. In prokaryotes, especially bacteria, a promoter typically has two conserved consensus regions, the -35 region and the -10 region, located upstream of the transcription start site. RNA polymerase recognizes these promoter sites, whose consensus sequences are TTGACA at the -35 region and TATAAT at the -10 region. The distance separating the two regions can vary by a few bases, and a putative promoter is identified when sequences close to the consensus are found in the right place. However, a putative promoter cannot always be found, because the gene may be transcribed as part of an operon or because the specificity of RNA polymerase may be changed by substitution of a different sigma factor. Moreover, putative promoter sites must be regarded as computer predictions: they are only the beginning of an investigation and provide clues that need to be confirmed by more direct, experimental evidence.
In eukaryotes, by contrast, ‘boxes’ or ‘response elements’ such as the TATA box, GC box and CAAT box mediate the binding of RNA polymerase. The distances between these elements are not regular, which makes finding the transcription start site more complicated than in bacteria. Websites are available that predict the transcription start sites of an organism using consensus sequences identified from known promoters of the same organism. As well as identifying RNA polymerase binding sites, it is possible to search for the sites at which regulatory proteins attach to the DNA. These proteins, which activate or repress transcription, have well-characterized, conserved recognition sequences.
In both prokaryotes and eukaryotes, different regulatory proteins bind to these ‘boxes’. The DNA sequence can be screened for each of these boxes, which provides evidence about the regulation of the relevant genes and their possible function.
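A search for putative bacterial promoters of the kind described above can be sketched as follows. The consensus hexamers TTGACA and TATAAT come from the text; the allowed spacing of 16 to 18 bases between them, the tolerance of one mismatch per hexamer and the function names are assumptions made for this illustration, and any hit remains a prediction that needs experimental confirmation.

```python
MINUS_35 = "TTGACA"  # consensus of the -35 region
MINUS_10 = "TATAAT"  # consensus of the -10 region


def mismatches(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_putative_promoters(dna, max_mismatch=1, spacers=range(16, 19)):
    """Return (pos35, pos10, spacer, total_mismatches) for each candidate.

    max_mismatch applies to each hexamer separately; spacers lists the
    allowed numbers of bases between the two hexamers (an assumed range).
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        m35 = mismatches(dna[i:i + 6], MINUS_35)
        if m35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap  # start of the candidate -10 hexamer
            if j + 6 > len(dna):
                break
            m10 = mismatches(dna[j:j + 6], MINUS_10)
            if m10 <= max_mismatch:
                hits.append((i, j, gap, m35 + m10))
    return hits


test_dna = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GGGG"
print(find_putative_promoters(test_dna))
```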
c) Sequence Comparisons
1. DNA Sequences
To find out whether a sequence is identical or similar to one that has already been determined, the query sequence is aligned with a second sequence. The computer counts the number of bases that match in order to measure how similar the two sequences are. Because the two sequences may differ in length or may not start at the same site, the query sequence is slid along the sequence it is being compared with to find the best match. However, the query sequence may contain one or more gaps relative to the second sequence. This can arise from a genuine difference between the sequences, for example when comparing genes from different species, or because one of the sequences is incorrect. The addition or removal of a single base is enough to make the simplistic sliding approach unworkable: with even a very small difference between the query and the databank sequence, part of the sequence may line up perfectly while the rest does not match at all. Therefore, the computer introduces gaps into one or both sequences in order to arrive at the best possible match between the two.
The procedure used to line up the two sequences incorporates a gap penalty: every time a gap is introduced into either sequence, the score is decreased. This type of algorithm needs more computer power as the sequences get longer, because a score has to be calculated for each position with all likely combinations of gaps. Moreover, the query sequence is to be compared with every sequence in the databank, not just with a single other sequence, which requires even more computational power. FASTA is a program commonly used to speed up the comparison. The query sequence is cut into small fragments known as ‘words’. These ‘words’ are used to select the best candidate alignments, and a score is then computed for the optimal alignment.
BLAST, in turn, works by finding very short matches and then extending each match outwards until the score drops below a fixed value. Matched pairs of segments above a certain length are known as High-scoring Segment Pairs (HSPs), and they are reported starting with those with the highest score. In the output of a BLASTN search of a databank with a query sequence, the first column identifies each matching sequence by its accession number (AC), and the second column shows the annotation of that sequence. The most important column is the one labelled ‘E value’, which estimates how likely each match is to have occurred by chance (strictly, the number of matches of that quality expected by chance). A low E value, that is, one with a large negative exponent, indicates a strong match.
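The simple sliding comparison, and the way a single extra base defeats it, can be illustrated with a short Python sketch. The function name best_ungapped_offset and the example sequences are made up for this illustration.

```python
def best_ungapped_offset(query, target):
    """Slide query along target without gaps.

    Returns (offset, matches) for the placement with the most matching bases,
    where offset is the position in target aligned with query[0].
    """
    best = (0, -1)
    for offset in range(-(len(query) - 1), len(target)):
        matches = 0
        for i, base in enumerate(query):
            j = offset + i
            if 0 <= j < len(target) and target[j] == base:
                matches += 1
        if matches > best[1]:
            best = (offset, matches)
    return best


target = "CCGATTACACC"
print(best_ungapped_offset("GATTACA", target))   # the whole query matches
print(best_ungapped_offset("GATTTACA", target))  # one inserted T: only part can match
```

With the inserted base, no single placement lines up both halves of the query at once, which is why gaps have to be allowed.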
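The effect of a gap penalty can be illustrated with a standard dynamic-programming global alignment score (Needleman-Wunsch style). The text does not name a specific algorithm, so this choice, the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions made for the sketch. It is not how FASTA works internally; it only shows why every position must be scored against every other.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty.

    Every cell of a len(a) x len(b) table is filled in, which is why the
    cost grows quickly as the sequences get longer.
    """
    # Row 0: aligning an empty prefix of a with j bases of b costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATTTACA"))
```

With these values, one inserted base costs a single gap penalty instead of destroying the match downstream of it.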
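The idea of finding short matches and extending them can be sketched as below. This is a simplified illustration, not the BLAST implementation: the word length, the scores, the drop-off threshold and the function name seed_and_extend are assumptions, extension is ungapped, and no E value is calculated.

```python
def seed_and_extend(query, target, word=4, match=1, mismatch=-2, drop=3):
    """Return the best ungapped segment pair as (score, q_start, t_start, length).

    Returns None when query and target share no exact word of the given length.
    """
    # Index every word of the target so that seeds can be looked up quickly.
    index = {}
    for t in range(len(target) - word + 1):
        index.setdefault(target[t:t + word], []).append(t)

    def extend(q, t, step):
        """Walk away from the seed; stop once the score falls `drop` below its best."""
        score = best = best_len = 0
        n = 0
        while 0 <= q < len(query) and 0 <= t < len(target):
            score += match if query[q] == target[t] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            elif best - score >= drop:
                break
            q += step
            t += step
        return best, best_len

    best_hsp = None
    for q in range(len(query) - word + 1):
        for t in index.get(query[q:q + word], []):
            right, r_len = extend(q + word, t + word, 1)
            left, l_len = extend(q - 1, t - 1, -1)
            hsp = (word * match + right + left, q - l_len, t - l_len,
                   word + l_len + r_len)
            if best_hsp is None or hsp[0] > best_hsp[0]:
                best_hsp = hsp
    return best_hsp


print(seed_and_extend("AAGATTACAGG", "CCCGATTACATTT"))
```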
2. Protein Sequences
When searching databanks with an unknown sequence, it is often better to use the protein sequence rather than the nucleotide sequence. Because of the degeneracy of the genetic code, related proteins are usually more conserved than the DNA sequences that encode them. The coding sequences in the DNA databanks are translated and included in the protein databanks. With BLASTX, a DNA query sequence is translated in all six reading frames and the translations are compared with a protein databank. Conversely, TBLASTN compares a protein query with the translated DNA sequences of a databank. BLASTP is the usual choice for comparing a protein query with a protein databank.
For protein comparisons, the computer can use a more complex scoring system based on a matrix of scores for all the possible amino acid pairings. The scores in the matrix are used to calculate the overall score of the alignment, even though the displayed alignment marks only those matches above a selected cut-off score.
Dayhoff Point Accepted Mutation (PAM) matrices are derived from comparisons of related proteins from different sources. Each score is calculated from the number of times a particular amino acid change occurs. The scores therefore reflect the similarity between amino acids, but they are also affected by the nature of the genetic code. The Dayhoff PAM250 matrix is suitable for distantly related sequences: a value of 0 indicates a neutral change, and increasingly positive or negative values represent increasingly acceptable or unacceptable mutations, respectively. A mismatch between phenylalanine (F) and tyrosine (Y) scores 7 in this matrix, which is higher than a perfect match for some amino acids, because F and Y are quite rare and such a pairing is less likely to occur by chance.
A substitution of one amino acid by another within the same group has only a small influence on the structure and function of the protein, whereas a change between groups has a greater effect on the protein.
BLASTP uses by default an alternative family of matrices known as BLOcks SUbstitution Matrix (BLOSUM). The numbering in the matrix names works in the opposite direction from that of the PAM matrices: a higher BLOSUM number is suited to more closely related sequences, whereas a higher PAM number is suited to more distantly related ones. BLOSUM matrices are more appropriate for procedures based on local alignments.
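The six-frame translation that precedes a BLASTX-style comparison can be sketched as follows. The sketch assumes the standard genetic code and uses made-up function names and frame labels; it shows only the translation step, not the databank search itself.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```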
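Matrix-based scoring can be illustrated with a small lookup table. The sketch below uses only four PAM250 entries, quoted from the commonly published table (the F-Y value of 7 is the one given in the text); it is an excerpt for illustration, and the values should be checked against a reference copy of the matrix before any real use. The function names are hypothetical.

```python
# Excerpt only: four PAM250 entries. A real search uses the full 20 x 20 matrix.
PAM250_EXCERPT = {
    ("A", "A"): 2,
    ("F", "F"): 9,
    ("F", "Y"): 7,
    ("Y", "Y"): 10,
}


def pair_score(a, b, matrix=PAM250_EXCERPT):
    """Look up a symmetric pair score; raises KeyError outside the excerpt."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix=PAM250_EXCERPT):
    """Sum the pair scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length for an ungapped comparison")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("FAY", "YAF"))  # two F/Y mismatches plus one A/A match
print(ungapped_score("AAA", "AAA"))  # three identical A/A pairs
```

The comparison shows how a conservative mismatch can contribute more to the total than an identical pair of a common amino acid.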
3. Sequence Alignments
FASTA and BLAST show only pairwise alignments, and these are not necessarily the ideal alignment between the search sequence and the target. Both run rapidly and need comparatively little computer power, because they take shortcuts to speed up database searches.
The Clustal programs, which include ClustalW2 and Clustal X2, are widely used to produce optimized multiple alignments. They first compare the chosen sequences with one another to yield a matrix of pairwise alignment scores. The two most similar sequences are then lined up to generate a consensus, and further sequences are added in turn until all the proteins are included in the multiple alignment. Like BLAST alignments, Clustal alignments depend on gap penalties and amino acid substitution matrices.
Furthermore, even substitutions between similar amino acids may influence the biological function of the protein. An alignment should not be taken as absolute: it can sometimes be improved by introducing an extra gap into one of the sequences, which helps in establishing evolutionary relationships. By locating the aligned amino acids in the 3D structure of the protein, we can determine whether the alignment falls in a region that is vital for the function of the protein.
Clustal can also produce a phylogenetic tree from the aligned protein sequences, using the Neighbour Joining method to construct the tree. A tree based on a Clustal alignment is expected to resemble the taxonomic relationships of the organisms.
J. Dale, M. von Schantz and N. Plant (2012). Sequencing a Cloned Gene. In: From Genes to Genomes: Concepts and Applications of DNA Technology. Wiley-Blackwell.