Allergens are proteins that induce allergic responses. More specifically, they elicit IgE antibodies and cause the symptoms of allergy, which is a major health problem in developed countries. With many transgenic proteins introduced into the food chain, the need to predict their potential allergenicity has become a crucial issue. Bioinformatics, and sequence analysis methods in particular, has an important role in the identification of allergenicity [25, 29].
One approach to allergenicity prediction is to automatically determine motifs from the sequences in an allergen database and then search for the identified motifs in the query sequences. Li et al. described an approach in which protein sequence motifs are identified using wavelet analysis. The example considered here uses an allergen database of 817 sequences. A 10-fold cross-validation test is conducted in which 90% of the sequences are used for motif identification and the remaining 10% are used as query sequences for validation. This procedure is carried out a number of times to obtain averaged values for recall and precision. The workflow is shown in Figure 23.9 and is briefly described below.
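The 90%/10% partitioning can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `k_fold_splits`, the fixed random seed, and the placeholder identifiers `seq0` to `seq816` are assumptions made only for illustration.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; every item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at offset i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# Placeholder identifiers standing in for the 817 database sequences.
sequence_ids = [f"seq{n}" for n in range(817)]

splits = list(k_fold_splits(sequence_ids, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)  # seven folds of 82 and three folds of 81
```

In each split, the `train` part would feed motif identification and the `test` part would supply the query sequences. Because 817 is not divisible by 10, the split is only approximately 90%/10%.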
ClustalW is initially used to generate the pairwise global alignment distances among the randomly selected protein sequences. The pairwise distances so obtained are then used to cluster the protein sequences by partitioning around medoids using the statistics tool R. Each cluster of protein sequences is subsequently realigned using ClustalW. The wavelet analysis technique developed by Krishnan et al. is then used on each aligned cluster to identify motifs in the protein sequences.
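To make the clustering step concrete, the following is a small pure-Python sketch of partitioning around medoids (a greedy BUILD phase followed by a SWAP phase) operating directly on a pairwise distance matrix. It is an illustration only and is not the R routine used in the workflow. The function names and the toy distance matrix, which stands in for ClustalW alignment distances, are assumptions.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, k):
    """Partition around medoids, given a symmetric pairwise distance matrix."""
    n = len(dist)
    # BUILD: greedily add the point that gives the lowest total cost.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)
    # SWAP: apply the best medoid/non-medoid exchange until none lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        best_swap = None
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    current, best_swap = cost, trial
        if best_swap is not None:
            medoids, improved = best_swap, True
    # Label each point with the index of its nearest medoid.
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return sorted(medoids), labels

# Toy distances: six "sequences" forming two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]

medoids, labels = pam(dist, k=2)
print(medoids)  # [1, 4]
print(labels)   # [1, 1, 1, 4, 4, 4]
```

Working from a distance matrix rather than from coordinates is what makes medoid-based clustering suitable here, since alignment distances between sequences do not come with a vector representation.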
HMM profiles [18, 20] are then generated for each identified motif using hmmbuild. These profiles are used to search for the motifs in each query sequence using hmmprofile, and thus to predict whether the query is an allergen. The accuracy of the predictions is computed to assess the effectiveness of this approach.
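The prediction and scoring step can be sketched as follows. The decision rule (a query is called an allergen if at least one motif profile matches it) is an assumption, since the text does not state the exact criterion. The query names, motif names, and labels are hypothetical data, not results from the study.

```python
def predict_allergen(motif_hits):
    """Assumed rule: call a query an allergen if any motif profile matches it."""
    return len(motif_hits) > 0

def precision_recall(predicted, actual):
    """Compute precision and recall from parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical search output: motifs found in each query sequence.
hits_per_query = {
    "q1": ["motif3"],
    "q2": [],
    "q3": ["motif1", "motif7"],
    "q4": ["motif2"],
    "q5": [],
}
# Hypothetical ground truth: True means the query is a known allergen.
true_labels = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": False}

queries = sorted(hits_per_query)
predicted = [predict_allergen(hits_per_query[q]) for q in queries]
actual = [true_labels[q] for q in queries]
precision, recall = precision_recall(predicted, actual)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```

In a cross-validation setting, these two values would be computed once per fold and then averaged.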
Figure 23.9a shows the workflow (on the left) as well as its inherent parallelism (on the right). The two main areas of parallelism are the identification of motifs using the wavelet analysis technique (one task for each cluster) and the building of the HMM-based profiles and searching of the query sequences (one task for each motif identified).
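The two levels of parallelism can be sketched with Python's standard `concurrent.futures` module. The functions `find_motifs` and `build_and_search` are hypothetical placeholders for the wavelet analysis and the HMM build/search steps. They return dummy values so that the structure of the two stages is visible without the real tools.

```python
from concurrent.futures import ThreadPoolExecutor

def find_motifs(cluster):
    """Placeholder for the per-cluster wavelet analysis; returns dummy motif names."""
    name, sequences = cluster
    return [f"{name}-motif{i}" for i in range(len(sequences) // 2)]

def build_and_search(motif):
    """Placeholder for building one HMM profile and searching the queries with it."""
    return motif, f"searched with {motif}"

# Hypothetical clusters of sequence identifiers.
clusters = [
    ("c1", ["s1", "s2", "s3", "s4"]),
    ("c2", ["s5", "s6"]),
    ("c3", ["s7"]),
]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Stage 1: one independent task per cluster.
    motifs_per_cluster = list(pool.map(find_motifs, clusters))
    motifs = [m for group in motifs_per_cluster for m in group]
    # Stage 2: one independent task per identified motif.
    results = dict(pool.map(build_and_search, motifs))

print(motifs)  # ['c1-motif0', 'c1-motif1', 'c2-motif0']
```

The second stage cannot start until the first has finished, because the number of tasks in it depends on how many motifs were found. Within each stage, however, the tasks share no state and can run concurrently.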