Reverseengineering and modeling a genetic network using limited data

Presently, microarray technology cannot yet measure the expression levels of all genes of an arbitrary biological system. Furthermore, there is the measurement technology-independent complication noted earlier such that it is generally impractical to obtain the full set of parameters that completely determine the state transitions of the system. Here, in essence, the reverse-engineering of a typical genetic network would mean uncovering dependencies and causal relationships between genes using a limited data set or knowledge base. Data in this context include empirical measurements of the respective components, in particular the mRNA levels, as well as a priori biological facts about the system under investigation. The information is limited in the sense that we do not have corresponding data from every gene present in the system and, furthermore, the data we do have inevitably contain measurement or experimental errors. An example: Suppose that a researcher wants to study the interaction between two mouse genes, gene A1 and gene A2, in liver from the day of birth until 1 week later. The researcher will use a robotically spotted microarray with probes for genes A1 and A2. As samples, she or he harvests liver cells at N uniformly spaced time points from postnatal day 1 to day 7. Let us consider the possibilities of interaction or regulation between A1 and A2, as discussed in [164]:

1. A1 regulates A2, or vice versa. This interaction may be indirect, propagating through an intermediate cascade of other genes which are not being assayed.

2. A1 and A2 are co-regulated by another gene or a subset of genes.

3. A1 and A2 do not interact with one another nor do they share a common coregulator.

The term regulate here carries causal implications. One way to translate the biological statements above into mathematical formalism is to say that (1) and (2) are equivalent to saying that A1 and A2 are not stochastically independent, whereas (3) says that A1 and A2 are stochastically independent. An immediate question that comes to mind is whether it is technically possible to infer causal relationships between genes by their expression level data alone. For such a scenario, the most typical mathematical tool that our researcher could apply is the linear or rank correlation coefficient, r—provided that the number of time point measurements at which a gene is assayed, N, is large enough,[14], for if N = 2 then any two genes are perfectly (and vacuously) correlated. As noted in [164], the observation of correlation between pairs of variables is commonly used in biology to predict causal relationships which have to be validated biologically. Indeed, that observation drives much of the data-mining machinery discussed in this chapter.

From a purely mathematical standpoint, correlation cannot be used to rigorously prove or disprove the existence of such relationships. Recall from our earlier review section on correlation (section 3.6.1) the basic rule of thumb that while independence implies zero correlation, the converse is not generally true. In spite of this, a high enough correlation (say |r| > 0.9, by no means a formal threshold) between genes A1 and A2 is a reasonably strong argument for further laboratory investigation of the pair, after taking into account the size of N, measurement errors in the primary data, and established facts about the pair's biological background.

Another approach to analyzing a genetic network is to view the whole as a dynamical system, not just consisting of pairwise interactions, and to take advantage of the traditional systems tools that already exist, (see [80] and [58]). Suppose that our researcher measures the expression level for genes Gj(j = 1,..., M) at times tj(i = 1, ..., N), denoted by Xj(tj). We also suppose that the expression level of a gene Ak at time tq (q > 0) is a linear combination of the expression levels of all other genes Gj from the previous time step tq [or more generally from all previous time steps tr (0 < tr < tq)], or rewriting in a matrix format,

Wl.l ttfa.i

where wkj is a weight term encoding the influence of the expression of gene Gj on the expression of Gk, and 2k is a parameter which may be stochastically distributed which could encapsulate a priori knowledge about regulation. Note that Eq. (4.13.1) is linear in the expression quantities and is first-order Markov, i.e., the state of the system at time tq is entirely determined by the time tq■ state.

We could also write Eq. (4.13.1) as a difference equation in keeping with our systems viewpoint,

kJ I

if k = j, otherwise, where "xk(tq ) P xk(tq) xk(tq ). Unlike the classic system of difference equations, w,j here are the unknowns and Xj(tq)are the known measured quantities. In general, wkj" Wj,k. Recall from linear algebra that in order to solve for the M2 unknowns w,j uniquely, one needs to have M2 different equations Eq. (4.13.2); in other words, there should be enough data time points. If the number of unknowns exceeds the number of equations, then the system is underdetermined and may have infinitely many solutions for Wj. Conversely, if there are more unknowns than there are equations, then the system is overdetermined and it may be algebraically insoluble. In this situation, one can attempt to find a weight matrix w,j that "solves" the system in the sense of minimizing an ad hoc energy functional. A typical example of this approach is the least squares fitting of data to a line where the energy functional is the sum of distances from the fitting line. Once the weights wk,j in Eq. (4.13.2) have been computed and we suppose as modelling assumption that wk,j are time-independent, then Eq. (4.13.2) is a well-defined difference system and one may apply standard dynamical systems or Markov chain techniques to explore interpolation and prediction-type questions such as:

1. As a reality check: Does the computational result or prediction agree with the outcomes from the original physical experiment?

2. How would the model Eq. (4.13.2) evolve under a different set of initial states—simulating a different experimental setting? Are there particular initial conditions which lead to dramatically different subsequent behavior in the presence of a small perturbation? Continuous dependence on initial state or bifurcations?

3. If we removed the time-series data for a gene Gk and recalculate the weights w,, how different is the resulting model from the one with Gk present?

4. What is the asymptotic or steady-state behavior of Eq. (4.13.2)? Attracting sets, hysteresis cycles, recurrent states?

The state of a system at any time tq refers to the collective expression profiles xk(tq) for k = 1, ..., M. In a developing hippocampi system modeled with Eq. (4.13.2), Somogyi et al. found that the weight matrices were sparse i.e., most of the entries were 0, in agreement with the generally accepted intuition that genes are not regulated equally by every other gene [164]. From a computation-theoretic point of view, it may also be interesting to study the time-continuous analog of the difference equation (4.13.2), which would be a system of linear differential equations, or stochastic differential equations, if we allow 2k to be a specific probability distribution as in [109]. When one has a deterministic system of linear ordinary differential equations, Fourier or wavelet methods could be used to transform into a system of algebraic equations.

An attendant complication with these methods is the aliasing phenomenon. One could potentially be undersampling the respective RNA levels below the fundamental frequency (probably cell cycle-related) of the biological system. As is well-known in signal processing, it is not possible to reconstruct the originating signal from samples that were obtained less than twice the threshold rate dxk(t)

dt known as the Nyquist frequency. The model can easily be generalized to become a nonlinear one by letting Wj,k be both a function of time and expression, or to a higher-order Markov by assuming that the expression level of Ak at tq depends upon the expression levels of Ay's for other previous times tr, 0 < tr < tq. Weaver et al. [186] and Wahde and Hertz [185] studied a modified version of Eq. (4.13.1),

where g(y) = (1 + e»y) is a modified step function where » > 0 is the rate that the value of g changes from 0 to 1 at y = 0. When » ' , g(y) = 0 (resp. 1) for y < 0 (resp. y > 0).

There are several biologically based reasons to motivate this step function-like form of g (e.g., upon binding of a transcriptional repressor or initiation of a kinase cascade). Note that g has a well-defined inverse function, g . The weights of the system above can be solved via linear algebra after applying g on both sides of the equation. Operating within the systems paradigm, Hanke and Reich [83]and Somagyi and Sneigoski [164] have also modeled Eq. (4.13.2) as neural networks which have the advantage that one could train the parameters—in this case, the weights Wjj—to match the data set. Potential issues here include the selection of the type of neural network that is optimally suited to the questions being studied, the length of time required for training, and the characteristics of the training set.

Thus far we have considered the expression levels xk to be essentially continuous variables interacting in a linear combinatorial manner. One may as validly let xk be a discrete-valued variable taking values 0 and 1, where 0 indicates that gene Ak is off and 1 on, to produce a Boolean network as in [119]. State transitions are captured by a truth table which we have to discover using ad hoc rules optimizing mutual information.

Recently, there has been active work in the area of analyzing genetic networks within a probabilistic framework; of particular note is the application and development of bayesian models of gene expression and interaction which will be detailed in the next section.

It probably bears repeating here that no mathematical treatment of observational data will be suffcient to be convincing about genetic causal relationships. For that to occur, the tried-and-true methods of scientific hypothesis testing (see figure 2.2) and the concomitant "wet" laboratory work will remain necessary. Therefore it should not be too disappointing to note that we have still not adequately addressed the issue of how one may infer causal relationships from expression data, in particular limited expression data. From the purely computational point of view,it is not possible to do this without the aid of established biological facts and experimental intuition. This in turn requires a productive functional genomic pipeline of the sort diagrammed in figure 1.3.1.

Was this article helpful?

0 0

Post a comment