Image processing

Raw signals are background-subtracted, normalized and corrected. The first step in processing data is to perform background subtraction for each acquired image at the pixel level, using an "erosion-dilation" algorithm that automatically determines the local background for each pixel. Then, for each nucleotide flow, the light intensities collected over the entire duration of the flow by the pixels covering a particular well are summed to generate a signal for that well at that flow. The acquired images are corrected to eliminate cross talk between wells, which arises from optical bleed (the fiber optic cladding is not completely opaque and transmits a small fraction of the light generated within a well into an adjacent well) and from diffusion of ATP or PPi (generated during synthesis) from one well to another further downstream. To perform this correction, the extent of cross talk under low-occupancy conditions was empirically determined, and de-convolution matrices were derived to remove from each well's signal the contribution coming from neighboring wells. To account for variability in the number of enzyme-carrying beads in each well and in the number of template copies bound to each bead, two types of normalization are carried out: (i) raw signals are first normalized by reference to the pre- and post-sequencing run PPi standard flows, and (ii) these signals are further normalized by reference to the signals measured during incorporation of the first three bases of the known "key" sequence included in each template.
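The background subtraction, per-well summation and key normalization described above can be illustrated with a minimal Python sketch. It assumes that "erosion-dilation" refers to a grey-scale morphological opening (a local minimum filter followed by a local maximum filter) with a square window, and that each of the three key flows used for normalization corresponds to a single-base incorporation. The function names, the window radius `r` and the toy image are hypothetical and are not taken from the instrument software; the PPi-standard normalization and the cross-talk de-convolution are not shown.

```python
def _morph(img, r, pick):
    """Apply `pick` (min or max) over the (2r+1) x (2r+1) window around each
    pixel; windows are clipped at the image border."""
    h, w = len(img), len(img[0])
    return [[pick(img[y][x]
                  for y in range(max(0, i - r), min(h, i + r + 1))
                  for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)] for i in range(h)]


def estimate_background(img, r):
    """Erosion (local minimum) followed by dilation (local maximum).
    Features narrower than the window are removed, leaving the local background."""
    return _morph(_morph(img, r, min), r, max)


def subtract_background(img, r):
    """Subtract the estimated local background from every pixel."""
    bg = estimate_background(img, r)
    return [[p - b for p, b in zip(row, bg_row)] for row, bg_row in zip(img, bg)]


def well_signal(frames, well_pixels):
    """Sum intensities over all frames acquired during one flow and over
    all (row, column) pixels covering one well."""
    return sum(frame[y][x] for frame in frames for (y, x) in well_pixels)


def normalize_to_key(flow_signals, key_flows):
    """Rescale so that the mean signal of the key flows equals 1.
    Assumption: every flow listed in key_flows is a single-base incorporation."""
    unit = sum(flow_signals[f] for f in key_flows) / len(key_flows)
    return [s / unit for s in flow_signals]


# Toy 5 x 5 image: flat background of 10 with one bright pixel (a "well").
image = [[10] * 5 for _ in range(5)]
image[2][2] = 50
clean = subtract_background(image, r=1)

# Two identical frames stand in for the images acquired during one flow.
flow_total = well_signal([clean, clean], well_pixels=[(2, 2)])

# Hypothetical per-flow signals; flows 0, 2 and 3 play the role of key bases.
normalized = normalize_to_key([200.0, 0.0, 210.0, 190.0, 400.0], key_flows=[0, 2, 3])
```

The window must be wider than the feature of interest; otherwise the opening treats the well itself as background and removes its signal.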

The normalized and corrected signal intensity at each nucleotide flow, for a particular well, indicates the number of nucleotides, if any, that were incorporated, since intensity scales linearly with that number. This linearity is observed to hold for homopolymers of length up to at least eight (Figure 6). However, in sequencing by synthesis a very small number of templates on each bead lose synchronism, i.e. they either get ahead of or fall behind all other templates in sequence (Ronaghi, 2001). The effect is mostly due to undegraded or leftover nucleotides in a well (creating "carry forward") or to impaired polymerase activity (creating "incomplete extension"). Typically, carry forward rates of 1-2% and incomplete extension rates of 0.1-0.3% are seen. It is important to correct signals for these effects, because the loss of synchronism is a cumulative error that degrades the quality of sequencing at longer read lengths.

As a result, the impact of carry forward and incomplete extension is felt particularly toward the end of reads, as illustrated in Figure 7, which shows the average read accuracy, at the single-read level, as a function of base position.
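The cumulative nature of the loss of synchronism can be illustrated with a small simulation. The sketch below is a deliberately simplified model and not the correction procedure used here: it assumes that a fixed fraction of the template copies able to extend in a flow fails to do so (incomplete extension), and that a fixed fraction of the copies that do extend also incorporates the following run of bases when those bases match the nucleotide of the previous flow (carry forward). The function name, the template sequence and the flow order are hypothetical; the rates are chosen within the typical ranges quoted above.

```python
def simulate_flows(template, flow_order, n_flows,
                   carry_forward=0.01, incomplete_ext=0.002):
    """Return (signals, in_sync), one value per flow.

    signals[f] : mean number of bases incorporated per template copy in flow f
    in_sync[f] : fraction of copies at the position an error-free copy would occupy
    """
    def run_end(pos):
        # Index just past the homopolymer run that starts at pos.
        end = pos
        while end < len(template) and template[end] == template[pos]:
            end += 1
        return end

    state = {0: 1.0}   # position along the template -> fraction of copies there
    ideal = 0          # position of a copy that never loses synchronism
    signals, in_sync = [], []
    for f in range(n_flows):
        nuc = flow_order[f % len(flow_order)]
        prev = flow_order[(f - 1) % len(flow_order)] if f > 0 else None
        if ideal < len(template) and template[ideal] == nuc:
            ideal = run_end(ideal)
        new_state, signal = {}, 0.0
        for pos, frac in state.items():
            if pos < len(template) and template[pos] == nuc:
                end = run_end(pos)
                stalled = frac * incomplete_ext      # copies that fall behind
                extended = frac - stalled
                signal += extended * (end - pos)
                ahead = 0.0
                if end < len(template) and template[end] == prev:
                    # Leftover nucleotide from the previous flow is incorporated.
                    ahead = extended * carry_forward
                    end2 = run_end(end)
                    signal += ahead * (end2 - end)
                    new_state[end2] = new_state.get(end2, 0.0) + ahead
                new_state[pos] = new_state.get(pos, 0.0) + stalled
                new_state[end] = new_state.get(end, 0.0) + extended - ahead
            else:
                new_state[pos] = new_state.get(pos, 0.0) + frac
        state = new_state
        signals.append(signal)
        in_sync.append(state.get(ideal, 0.0))
    return signals, in_sync


# Error-free reference: signals equal the homopolymer lengths.
clean_signals, clean_sync = simulate_flows("TCAGGTTA", "TACG", 10, 0.0, 0.0)

# Hypothetical longer template, read with typical error rates.
long_template = "TCAG" + "GATCCTAGGCATTCGA" * 5
noisy_signals, noisy_sync = simulate_flows(long_template, "TACG", 40)
print(noisy_sync[0], noisy_sync[-1])
```

In this model the in-sync fraction falls as flows accumulate, so the signals from late flows are increasingly contaminated by copies that are ahead or behind, which is consistent with accuracy dropping toward the end of reads.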
