Applications of Data Banks to Virology

Web sites should be chosen depending on the objective of the study (Table 17.1). Applications of nucleotide sequence data sets for one virus group or several groups for comparative purposes include calculation of genetic distances and establishment of phylogenetic relationships using methodology that has been well established in evolutionary biology, identification of functional domains in viral nucleic acids and proteins, taxonomic classification of viruses, and information for defining viral disease emergence and reemergence.

Essential to any phylogenetic analysis is an accurate sequence alignment. One commonly used method is progressive alignment, in which the most closely related sequences are aligned first and progressively more divergent ones are then added, with gaps introduced at positions not present in all sequences [15]. A useful program for the alignment of viral genome sequences is CLUSTAL W [16], available in the main databases, for instance DDBJ (DNA Data Bank of Japan, where an expansion of CLUSTAL W has recently been created). If the number of sequences and their size are appropriate, a sequence alignment can be obtained online. To use CLUSTAL W, the sequences must be written in a suitable format, such as FASTA. There are programs that visualize sequences obtained from an automated sequencer and write them in FASTA format directly from the chromatograms (for instance, CHROMAS: http://www.technelysium.com.au/chromas.htlm). Since different programs use different formats, it is important to have programs that convert sequences from one format into another; examples are SEQ-CONVERT (http://hcv.lanl.gov/content/hcv-db/SEQCONVERT/seqconvert.htlm) and the EMBOSS sequence conversion site (http://ngfnblast.gbf.de/). Phylogenetic reconstructions will generally not be possible for distantly related viral sequences.
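As a minimal illustration of the format requirement, the following Python sketch reads and writes FASTA records. The function names `parse_fasta` and `to_fasta` and the 60-column line width are assumptions made for this example; they are not part of CLUSTAL W or of the conversion tools named above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text, wrapping at `width`."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"
```

Reading a file into `parse_fasta` and passing the result to `to_fasta` normalizes line wrapping, which is often all that is needed before submitting sequences to an alignment program.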

When the sequences show a minimum degree of genetic relatedness, the three main groups of methods used to derive evolutionary trees are maximum parsimony, distance, and maximum likelihood (reviews in Refs. [7, 17]). Maximum parsimony predicts the minimum number of mutational steps required to produce the observed variation from the ancestral sequences. Most programs assume the existence of a molecular clock, which is especially questionable for viruses, in particular for RNA viruses subjected to variations in the levels of population equilibrium (with unpredictable effects on the variations in consensus sequences, which are often the ones entered in the phylogenetic analyses). The method is most appropriate for sequences that have a high degree of similarity. It is time-consuming because often all possible trees are examined before a consensus tree is produced. The main programs for maximum parsimony analysis in the PHYLIP package (J. Felsenstein's server: http://evolution.genetics.washington.edu/phylip.htlm) are DNAPARS, DNAPENNY (which limits the number of trees searched), DNAMOVE, DNACOMP (useful when the rate of evolution varies among sites), and other programs, some of which consider only transversion mutations. PROTPARS analyzes protein sequences and does not score silent (synonymous) mutations. Other available programs include PAUP (phylogenetic analysis using parsimony) (D. Swofford's server: http://paup.csit.fsu.edu), MacClade, MESQUITE, and EMBOSS, among others.
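To make the parsimony criterion concrete, the sketch below uses Fitch's small-parsimony algorithm to count the minimum number of substitutions needed to explain an alignment on one fixed, rooted binary tree. The nested-tuple tree encoding and the function names are assumptions for illustration. This is not the implementation used by DNAPARS or PAUP, and it does not search over tree topologies, which is the expensive part of a real analysis.

```python
def fitch_site(tree, states):
    """Return (candidate_state_set, cost) for one alignment column.

    tree:   a leaf name (str) or a tuple (left_subtree, right_subtree)
    states: dict mapping leaf name -> character observed at this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all columns; alignment maps name -> aligned sequence."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total
```

Scoring several candidate trees with `parsimony_score` and keeping the one with the lowest total is the essence of the method; the number of candidate trees grows very rapidly with the number of sequences, which is why the search is time-consuming.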

Distance methods are based on the degree of difference (genetic distance) among pairs of sequences of a multiple sequence alignment, which is converted into a distance matrix. They are commonly used in molecular biology and can handle a large number of sequences. As the genetic distance increases, a correction for multiple-step mutations should be applied, and different correction methods are currently available (e.g., Kimura two-parameter distance [18]). Distance methods are suitable when branch lengths vary, and in most distance methods the results are not significantly altered when a molecular clock does not operate. Several distance methods are included in different software packages, for instance PHYLIP and MEGA 2 (or its recent upgrade, MEGA 3 [19]). Among them, the neighbor-joining (NJ) method does not assume a molecular clock and produces an unrooted tree, while the unweighted pair group method with arithmetic mean (UPGMA) assumes a molecular clock and produces a rooted tree. The least squares method can be used either without assuming a molecular clock (FITCH program) or with the assumption of an evolutionary clock (KITCH program). Bayesian statistics (based on conditional probabilities derived by Bayes' rule) have been used to provide best estimates of evolutionary distances between two nucleotide sequences [20, 21]. Computationally, Bayesian methods are more practical than maximum likelihood methods.
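The correction for multiple substitutions can be illustrated with the Kimura two-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the observed proportions of transitions and transversions. The Python sketch below is a minimal version under stated assumptions: the function name `k2p_distance` is hypothetical, the sequences are assumed to be already aligned, and positions with gaps or ambiguous bases are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Treat RNA as DNA so that U is handled like T.
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    compared = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # (saturated) for the correction to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

For two sequences differing by one transition in ten sites, the corrected distance is slightly larger than the raw proportion of differences (0.1), reflecting the allowance for unseen multiple changes at the same site.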

Maximum likelihood methods use probability calculations to derive a branching pattern from the mutations found at different positions of the nucleic acids under comparison. They can be used to estimate both distances and the best mutational pathway between sequences, and they are appropriate when sequences are diverse. They have also been used to analyze mutations in overlapping reading frames of viral genomes. They are computationally very complex because all possible trees are considered and, thus, the methods are limited to small numbers of sequences unless supercomputers are used. Maximum likelihood methods are included in the program PAML and in several programs of the PHYLIP package. Recently, the program TREE-PUZZLE [22] has been developed; it allows maximum likelihood analyses to be performed on personal computers without requiring long computing times.

A number of problems must be carefully evaluated prior to the application of phylogenetic methods to address any biological problem, and assumptions must be recognized in relation to the aims of the study [23]. As mentioned above, one problem is the treatment of gaps due to insertions or deletions in sequence alignments. Some programs ignore gaps while others treat them as substitutions. This difficulty arises with divergent viral sequences but not with closely related genomes such as those in viral quasispecies (Section 17.4). Some phylogenetic analyses assume that the rates of evolution at different tree branches are the same (that is, that the molecular clock operates). As indicated above, this assumption is questionable for viruses, but it allows prediction of the root of a tree (for different points of view on the operation of a molecular clock during virus evolution, see Refs. [24-26]). An unrooted tree depicts the relationships among sequences but does not provide information on a possible common ancestor of the group.

In considering the choice of a phylogenetic method, when sequences show strong similarity (as in viral quasispecies), parsimony or maximum likelihood methods are preferred. With limited sequence similarity, distance methods are appropriate, although maximum likelihood methods can be used to analyze regions of localized similarity. When sequence variation is high, alternative multiple alignment methods can be tried (for example, global, progressive or iterative programs, or local alignment of protein motifs). It is always advisable to use two different phylogenetic methods to analyze the same set of sequences, and to examine whether the tree topologies obtained are equivalent. In doing this, it must be taken into consideration whether the phylogenetic analysis assumes the operation of the molecular clock. In addition, a bootstrap resampling of data is recommended to assess the statistical robustness of the trees obtained by any phylogenetic method. It usually involves 100 to 1000 resampled data sets and gives a confidence value to each branching point, those with bootstrap values higher than 0.8 (expressed as a fraction of 1) being regarded as statistically supported [27].
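The bootstrap procedure can be sketched as follows: alignment columns are resampled with replacement to build pseudo-replicate alignments, a tree is inferred from each, and the support for a grouping is the fraction of replicates in which it appears. In this Python sketch the function names are assumptions, and `infer_clades` is a hypothetical user-supplied callable that stands in for whichever tree-building method is used and returns the set of groupings (as frozensets of sequence names) found in one replicate.

```python
import random


def bootstrap_replicates(alignment, n_replicates=100, seed=None):
    """Yield pseudo-replicate alignments built by resampling columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths)
    """
    rng = random.Random(seed)
    names = list(alignment)
    length = len(alignment[names[0]])
    for _ in range(n_replicates):
        # Draw as many column indices as the alignment has, with replacement.
        cols = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(alignment[name][c] for c in cols) for name in names}


def bootstrap_support(alignment, infer_clades, n_replicates=100, seed=None):
    """Return {clade: fraction of replicates in which the clade was recovered}.

    infer_clades is a hypothetical callable: alignment -> set of frozensets.
    """
    counts = {}
    for replicate in bootstrap_replicates(alignment, n_replicates, seed):
        for clade in infer_clades(replicate):
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / n_replicates for clade, n in counts.items()}
```

A grouping with a returned value above 0.8 would correspond to the support threshold mentioned in the text.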

M. Eigen, A. Dress and colleagues developed a method based on statistical geometry as a means of obtaining information (at the level of nucleic acids or proteins) on the common ancestor of a set of sequences [28-30]. A common branching point of a set of sequences is confined within a portion of sequence space (a concept described in Section 17.4), and distances for the different known sequences are calculated. This method has been applied to very different analyses, such as the dating of ancestral tRNA sequences, the divergence of eukaryotes through comparison of cytochrome c sequences, and the definition of relationships within highly variable viruses such as influenza virus or human immunodeficiency virus (overview in Ref. [31]).

Some computer programs have been developed to identify natural groupings of closely related sequences, and they have found an application to the analysis of viral quasispecies. These programs are described in Section 17.4.

An important application of virogenomics is the identification of functional domains in the viral nucleic acids and encoded proteins (for example, see Ref. [32]). These include a large number of cis-acting and trans-acting functional motifs. For brevity only a few are mentioned here: origins of replication, promoters, transcription termination signals, intergenic regions, splice sites, polyadenylation signals, ribosome-binding sites, protein-binding domains, enzymatic active sites (polymerases, proteases, kinases, etc.), types of protein domains (coiled coils, etc.), nuclear localization or nuclear export signals, and many others under study in molecular virology because they play key roles in the life cycle of viruses. These searches should take into account the multifunctionality of many viral proteins as well as the fact that many viral protein precursors can have functions other than (or additional to) the functions of the corresponding mature, processed proteins. Once identified, these specific domains may guide the identification of complete regulatory regions and entire proteins or protein precursors with a predicted function. The sequences can then be compared with those of related and unrelated taxonomic groups of viruses included in the data banks.

Many viruses still remain unclassified for various reasons, and new groups are continuously being approved (see successive editions of the report of the International Committee on Taxonomy of Viruses, e.g., seventh report [12]; at the time of this writing, the eighth report is in press). Therefore, virus sequence data banks may help in defining new taxonomic groups and in assigning unclassified viruses to existing groups. Incoherent grouping of regulatory regions and functional proteins may indicate ancient or recent recombination events. Even with the limited sequence data sets that were available more than a decade ago, sequence comparisons revealed with reasonable certainty that some animal viruses originated by recombination between two different parental viruses. For example, western equine encephalitis virus was probably generated as a result of a recombination event involving a virus related to eastern equine encephalitis virus and some New World relative of Sindbis virus [33, 34]. Recent recombination events are playing a crucial role in HIV-1 diversification in the human population [35]. Recombination may produce high-fitness genomes from two debilitated parents or may generate new combinations of genomic sequences with potential evolutionary novelty. As additional sequences enter the virus data sets, statistical procedures may unveil additional historical and recent recombination events, thereby contributing to clarification of the natural history of viral pathogens. Programs to identify putative recombinant sequences include SimPlot (http://sray.med.som.jhmi.edu/RaySoft/SimPlot/; examples of application in Refs. [36, 37]) and LARD [38, 39].
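One simple way to look for putative recombinants is to scan an alignment with a sliding window and compare the similarity of a query sequence to each candidate parent along the genome; a switch in which parent is most similar suggests a possible breakpoint. The Python sketch below illustrates only this general idea. It is not the SimPlot or LARD implementation, and the function name, window size, and step size are assumptions chosen for the example.

```python
def window_identity(query, reference, window=200, step=20):
    """Return a list of (window_start, fraction_identical) along two aligned sequences.

    Positions where either sequence has a gap ('-') are not counted.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        matches = compared = 0
        for a, b in zip(query[start:start + window],
                        reference[start:start + window]):
            if a == "-" or b == "-":
                continue
            compared += 1
            if a == b:
                matches += 1
        if compared:
            results.append((start, matches / compared))
    return results
```

Running `window_identity` once per candidate parent and plotting the resulting profiles against window position gives a similarity plot; any apparent crossover should then be tested with a statistical method before a recombination event is claimed.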

Information on the types of evolutionary forces that have acted in the diversification of viral genomes can be obtained by analyzing the types of mutations that distinguish a set of homologous sequences (number of transitions versus transversions; presence of insertions and deletions; number of nonsynonymous versus synonymous or silent substitutions). Many viral polymerases tend to introduce transitions at a much higher frequency than transversions during template copying [40, 41]. Thus, the proportion of transversion mutations may increase with the evolutionary distance between the sequences under comparison. The ratio of nonsynonymous mutations (per nonsynonymous site) (dn) to synonymous mutations (per synonymous site) (ds) may indicate positive selection in the diversification of the sequences under comparison (dn > ds). However, this assumes that synonymous mutations have a higher probability than nonsynonymous mutations of being selectively neutral. Yet multiple lines of evidence indicate that coding regions in RNA viruses may serve functions other than protein coding (cis-acting regulatory elements, structural roles affected by third base residues in codons, etc.). An additional cautionary note came from Wain-Hobson and colleagues, who documented evidence of positive selection (dn > ds) in an experimental setting in which positive selection was impossible [42]. With these limitations in mind, the reader can quantitate mutation types and calculate dn and ds values with programs such as SNAP (http://hcv.lanl.gov/content/hcv-db/SNAP/SNAP.html; examples in Refs. [39, 43]) and K-estimator [44] (http://www.biology.viowa.edu/comeron/page3.html; examples in Refs. [37, 45]).
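The first step of such an analysis, tallying the types of differences between two aligned sequences, can be sketched in a few lines of Python. The function name and the returned dictionary keys are assumptions for illustration; calculating dn and ds additionally requires a codon-based model and is best left to the dedicated programs named above.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_differences(seq1, seq2):
    """Count transitions, transversions, and gap positions between aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Treat RNA as DNA so that U is handled like T.
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    counts = {"transitions": 0, "transversions": 0, "gap_positions": 0}
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            if a != b:
                counts["gap_positions"] += 1  # insertion or deletion
            continue
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical, or ambiguous base: not classified
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            counts["transitions"] += 1
        else:
            counts["transversions"] += 1
    return counts
```

Dividing the transition count by the transversion count for pairs of increasing divergence shows the trend described in the text: the ratio tends to fall as the evolutionary distance grows.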

Comparison of regulatory regions and proteins that belong to viral genomes and are also found in cellular genomes can provide information on the possible sharing of regulatory elements and functional modules. Similarities and differences between functional elements of viruses and different phyla of cellular organisms are relevant to the origin of viruses, to coevolutionary processes between cells and viruses, and to defining lateral (also called horizontal) gene transfer events in general evolution. Different theories on the origin of viruses have been proposed (reviewed in Ref. [46]). They include (a) viruses as descendants of primitive RNA or RNA-like replicons that preceded formation of the first cells; (b) viruses as the result of regressive evolution of complex, cell-like microbial forms; (c) viruses as descendants of cellular DNA or RNA, or of subcellular organelles; and (d) viruses as ancient autonomous, cell-dependent genetic elements that originated simultaneously with cellular organizations and have coevolved with them. The comparative genomics of cells from organisms belonging to different kingdoms and of viruses may eventually permit the defining of coevolutionary pathways, and further support or modify current evidence that viruses have survived as mediators of lateral gene transfer and have served as a selective force to promote cellular evolution [46-48].

Defining viral disease emergence and reemergence is not a simple task since several interwoven factors participate in a highly unpredictable fashion (reviews in Refs. [49-52]). In addition to multiple sociological and ecological factors, the adaptive potential and dynamics of viruses play an essential role in viral disease emergence. Surveillance of human and animal viruses with zoonotic potential [53] is regarded as essential for early detection of viral pathogens. As soon as viral nucleotide sequences associated with a disease emergence are determined, comparisons with sequences from the data banks may point to the reemergence of a known viral pathogen, or identify the emergence of a virus that previously was known to infect unrelated hosts, or even suggest the presence of an entirely new (previously unidentified) viral entity. Thus, data banks for viral genomic sequences will find broad and important applications.
