## Dna Fragment Assembly Using The Sequential Ga

The GA was invented in the mid-1970s by Holland . It is based on Darwin's evolution theory. GA uses the concept of survival of the fittest and natural selection to evolve a population of individuals over many generations using different operators: selection, crossover, and mutation. As the generations are passed along, the average fitness of the population is likely to improve. GA can be used for optimization problems with multiple parameters and multiple objectives. It is commonly used to tackle NP-hard problems such as the DNA fragment assembly and the traveling salesman problem. NP-hard problems require tremendous computational resources to solve exactly. GAs help to find good solutions in a reasonable amount of time. Next, we present the sequential GA for the fragment assembly problem. More details about the inner workings of the algorithm can be found in Ref. .

1. Randomly generate the initial population of fragment orderings.

2. Evaluate the population by computing fitness.

3. While (NOT termination condition)

(a) Select fragment orderings for the next generation through ranking selection.

(b) Alter population by i. Applying the crossover operator.

ii. Applying the mutation operator.

iii. Re-evaluate the population.

12.3.1 Implementation Details

Let us give some details about the most important issues of our implementation.

12.3.1.1 Population Representation We use the permutation representation with integer number encoding. A permutation of integers represents a sequence of fragment numbers, where successive fragments overlap. The population in this representation requires a list of fragments assigned with a unique integer ID. For example, eight fragments will need eight identifiers: 0, 1,2, 3, 4, 5, 6, and 7. The permutation representation requires special operators to make sure that we always get legal (feasible) solutions. To maintain a legal solution, the two conditions that must be satisfied are (a) all fragments must be presented in the ordering and (b) no duplicate fragments are allowed in the ordering. For example, one possible ordering for four fragments is 3 0 2 1. It means that fragment 3 is at the first position, fragment 0 is at the second position, and so on.

12.3.1.2 Population Size We use a fixed size population to initialize random permutations.

12.3.1.3 Program Termination The program can be terminated in one of two ways. We can specify the maximum number of generations to stop the algorithm or we can also stop the algorithm when the solution is no longer improving.

12.3.1.4 Fitness Function A fitness function is used to evaluate how good a particular solution is. It is applied to each individual in the population and it should guide the genetic algorithm toward the optimal solution. In the DNA fragment assembly problem, the fitness function measures the multiple sequences alignment quality and finds the best scoring alignment. Parsons et al.  mentioned two different fitness functions.

Fitness function F1 sums the overlap score for adjacent fragments in a given solution. When this fitness function is used, the objective is to maximize such a score. It means that the best individual will have the highest score.

Fitness function F2 sums not only the overlap score for adjacent fragments but also the overlap score for all other possible pairs.

This fitness function penalizes solutions in which strong overlaps occur between nonadjacent fragments in the layouts. When this fitness function is used, the objective is to minimize the overlap score. It means that the best individual will have the lowest score.

The overlap score in both F1 and F2 is computed using the semiglobal alignment algorithm.

12.3.1.5 Recombination Operator Two or more parents are recombined to produce one or more offspring. The purpose of this operator is to allow partial solutions to evolve in different individuals and then combine them to produce a better solution. It is implemented by running through the population, and for each individual, deciding whether it should be selected for crossover using a parameter called crossover rate (Pc). A crossover rate of 1.0 indicates that all the selected individuals are used in the crossover. Thus, there are no survivors. However, empirical studies have shown that better results are achieved by a crossover rate between 0.65 and 0.85, which implies that the probability of an individual moving unchanged to the next generation ranges from 0.15 to 0.35.

For our experimental runs, we use the order-based crossover (OX) and the edge-recombination crossover (ER). These operators were specifically designed for tackling problems with permutation representations.

The order-based crossover operator first copies the fragment ID between two random positions in Parent 1 into the offspring's corresponding positions. We then copy the rest of the fragments from Parent 2 into the offspring in the relative order presented in Parent 2. If the fragment ID is already present in the offspring, then we skip that fragment. The method preserves the feasibility of every string in the population.

Edge recombination preserves the adjacencies that are common to both parents. This operator is appropriate because a good fragment ordering consists of fragments that are related to each other by a similarity metric and should therefore be adjacent to one another. Parsons  uses edge-recombination operator as follows:

2. Select the first position from one of the parents, call it s.

3. Select s' in the following order until no fragments remain:

(a) s' adjacent to s is selected if it is shared by both parents.

(b) s' that has more remaining adjacencies is selected.

(c) s' is randomly selected if it has an equal number of remaining adjacencies.

12.3.1.6 Mutation Operator This operator is used for the modification of single individuals. The reason we need a mutation operator is for the purpose of maintaining diversity in the population. Mutation is implemented by running through the whole population, and for each individual, deciding whether to select it for mutation, based on a parameter called mutation rate (Pm). For our experimental runs, we use the swap mutation operator. This operator randomly selects two positions from a permutation and then swaps the two fragment positions. As this operator does not introduce any duplicate number in the permutation, the solution it produces is always feasible. Swap mutation operator is suitable for permutation problems such as ordering fragments.

12.3.1.7 Selection Operator The purpose of the selection is to weed out the bad solutions. It requires a population as a parameter, processes the population using the fitness function, and returns a new population. The level of the selection pressure is very important. If the pressure is too low, convergence becomes very slow. If the pressure is too high, convergence will be premature to a local optimum.

In this work, we use ranking selection mechanism  in which the GA first sorts the individuals based on the fitness and then selects the individuals with the best fitness score until the specified population size is reached. Note that the population size will grow whenever a new offspring is produced by crossover or mutation. The use of ranking selection is preferred over other selections such as fitness proportional selection .