Relevance Networks

Relevance networks offer a method for constructing networks of similarity, with the principal advantages being the ability to (1) include features of more than one data type, (2) represent multiple connections between features, and (3) capture negative as well as positive correlations. Like dendrogram construction, this algorithm begins by evaluating the similarity of features by comprehensively comparing all features with each other in a pairwise manner over the same cases. Several dissimilarity measures have been used in this methodology, including mutual information and the correlation coefficient. In this section, we describe the construction of relevance networks using . This is equivalent to r2, except the sign (positive or negative) from the r, or Pearson's correlation coefficient, is applied. This is crucial in maintaining information about negative correlations. After the measure-triangle is completed, a threshold measure is chosen. The entire measure-triangle is scanned and any measures below the threshold are set to zero.

define threshold as the strongest link we are going to rule out loop through all the measures in the measure-triangle if the measure is under threshold set the measure to zero in the measure-triangle else leave the measure alone end if end loop

The measure-triangle must then be scanned to determine the networks. First, for the connections that remained because of measures that were greater than the threshold, each of the two nodes participating in the connection is listed as being used. An array stores the number of associations in which each gene is participating.

make a new empty array called node-connections, with size = number_genes loop through all the measures in the measure-triangle if the measure is at or over threshold find the two genes A and B that were compared to make this measure add one to gene A's entry in the array node-connections add one to gene B's entry in the array node-connections end if end loop set number-nodes_used to zero loop through the entries in the array node-connections if the entry is greater than zero, increment number-nodes-used end loop

Next, each of the networks needs to be found and assigned a number. This is done using a combination linear-recursive methodology. A counter for the network number to be assigned is first set to 1. The list of features is iterated on, until a feature is found that is used in the relevance networks. That feature is assigned to the first relevance network. The algorithm then recursively descends to each of the connected features, and assigns them the same network number. This continues until all the features have either been assigned to a network, or have been eliminated.

define a new procedure called assign-network, that takes a current-gene-index and a current-network-number set G to be the gene indexed by current-gene-index in the array all-genes if G's entry in the array node-connections is greater than zero

(i.e., that node is being used in a connection) if G has not been assigned a network number assign it the current-network-number else just in case we find a gene that was already assigned a number, if G's current network number is larger than current-network-number change G's network number to current-network-number end if end if if we have assigned or changed G's network number, we need to do the same to G's connected genes loop through all of G's active connections, using the measure-triangle, using index H

if gene H's network number does not equal G's network number call assign-network recursively on gene H using current-network-number end if end if end procedure set the current-network-number to 1

to start the assignment, loop through all the genes, using index G

if G's entry in the array node-connections is greater than zero (i.e., that node is being used in a connection) and G has not been assigned a network number call assign-network on G using current-network-number increment current-network-number end loop

Finally, the results need to be displayed graphically. There are several standard formats for graph descriptions, and the specifics of these are beyond the scope of this book. We have successfully used both the GMF format used by Tom Sawyer Software, Inc. (Oakland, CA) in their Graph Editor Toolkit, as well as the DOT format used by AT&T Research Labs in their Graphviz package. For the purposes of this chapter, we will demonstrate how one of these networks may look in DOT format, primarily because software to format the graphs is freely available for academic use along with the source code.[7] One can output this type of file by first looping through all the nodes and genes in use and printing out their characteristics, then looping through and printing all the connections. A sample output file in this format is shown here.

graph RN {

node [ fontname = "Helvetica", fontsize = 8 ];

node001 [ shape=box, height=0.2, width=1.6, label="1318_at", style="setlinewidth(1.5)"]; node002 [ shape=box, height=0.2, width=0.8, label="3 62 62_at", style="setlinewidth(0.1)"]; node003 [ shape=box, height=0.2, width=0.8, label="422_s_at", style="setlinewidth(3.0)"]; node001 — node002 [weight=1,style="setlinewidth(0.1)"]; node001 — node003 [weight=1,style="setlinewidth(3)"];

After installing the Graphviz package, this sample graph can be stored in a file called "test.dot." Executing the following commands on a standard Linux computer will first translate the test.dot into a PostScript file called test.ps, which can then be processed to make an Adobe Acrobat PDF file called test.pdf.

dot -Tps -o test.ps test.dot ps2pdf test.ps

The output PDF file from this example will contain three nodes connected with two lines. Alternatively, a commercial software package can be used to lay out more sophisticated graphs, such as the Graph Layout Toolkit. Using this package, we have constructed relevance networks such as those in figure 4.10.

Figure 4.10: Sample relevance network. One of 78 networks formed by aggregating mouse samples from six different experiments. Threshold was set to .99, meaning that the "thinnest" line seen represented an of .99. The legend for this type of diagram is shown in figure 4.11. The hypothesis generated here is that RBBP4 expression correlates with GNS and MAX gene expression. The miniature histograms for each gene show no "spiking" in the expression measurements (see section 4.5.2). Each of these genes appears at a different chromosomal location. By including information from LocusLink (see section 5.5.1) and Gene Ontology (see section 5.1.1), we see that RBBP4 is known to control cell proliferation, and MAX is involved in oncogenesis.

Typically, several graphical conventions are used to highlight the findings in relevance networks, as shown in the figure 4.11. Since each line represents a strong connection or putative association, the thickness of each line is proportional to the strength of the dissimilarity measure, so thicker connections are stronger (and thus easier to find on the diagram). The width of each box representing a feature is proportional to the number of features to which it is connected, and the edge thickness of each box is proportional to the average strength of all the connections entering that box. The length of the connection lines is not significant, and is left variable so that the graph layout algorithms can more easily place the networks in a visually appealing manner (e.g., minimizing edge crossings).

Thickness of node edge is proportional to average link strength

Figure 4.11: Graphical convention for relevance networks. To optimize the hypothesis generation process, relevance network display software contains ties from microarray accession number to GenBank, UniGene, LocusLink, Gene Ontology, and other databases (see section 5.5 for details of this process).

A question immediately arises: Where does one set the threshold? One way to approach this question is to plot the distribution of dissimilarity measure scores, then recompute the scores using repeatedly randomly permuted data sets (see section 4.12.1). For instance, if one randomly permutes the data set 100 times and finds the highest dissimilarity measure score is x, then one could set the threshold to be greater than or equal to x. This approach is shown in figure 4.15. However, an important point to raise is that the threshold does not have to be a fixed measurement, but instead can be "dialed" up and down. In other words, one can set the threshold to be arbitrarily high to start with and the resulting relevance networks can be viewed. If the associations are strong but are already known and no longer worthy of pursuit, the threshold can be "dialed down" slightly until new genes and connections are added, creating potentially novel hypotheses. Of course, one must be careful not to drop the threshold so low as to allow potentially noisy and inaccurate associations. The permutation testing can help with this.

Published examples

Butte and Kohane described the methodology and first applied it to clinical laboratory data (i.e., phenotypic measurements) in [37], then applied it to yeast expression measurements using mutual information as a dissimilarity measure in [38].

Butte et al. [39] linked expression measurements in cancer cell lines to phenotypic measurements on those cell lines (i.e., susceptibility to pharmaceutical treatment).

Reis et al. [150] showed how dynamics (i.e., the slope or change in expression levels over a time series) could be used as a dissimilarity measure in creating relevance networks.

Klus et al. [110] constructed relevance networks from a breast cancer expression data set and linked the results to chromosomal position.

Advantages

• Captures negative correlations: Relevance networks can simultaneously display both positive and negative correlations. Negative correlations are seen commonly in biology, where greater expression of one gene (e.g., a tumor suppressor gene) is associated with decreased expression of other genes.

• Genes can be connected in more than one way (genes can be connected to more than one other gene). This gives rise to three advantages: First, in biology, certain genes (e.g., transcription factors) are responsible for the expression of several other genes. This type of multiple connection can be modeled using relevance networks. Second, certain features may display weak similarity to two different groups of features, and using relevance networks, these features can be seen to bridge the two groups. For example, certain anticancer drugs can perform as both topoisomerase I and topoisomerase II inhibitors, and the bridging action of these drugs between the two classes can be modeled using relevance networks.

• Multiple structures created: Certainly in biology, some gene regulatory networks are larger or smaller than others, and this variation in size can be modeled using this technique.

• Multiple measures allowable simultaneously: In theory, a relevance network could be drawn with the highest associations from multiple dissimilarity measures. For example, two genes could be linked by , while another two genes could be linked by mutual information.

• Multiple data types allowable simultaneously: As shown in [39], gene expression measurements can be combined with phenotypic measurements in the same data set and the resultant hypotheses easily viewed. For example, a "systolic blood pressure" measurement may appear linked to the specific genes whose expression measurement correlate with that measurement.

• Parallelizability: The most time-consuming aspect of the construction of relevance networks is the initial comprehensive precomputation of the dissimilarity measure across the set of features. This computation can be parallelized.

Disadvantages

• Excessive complexity at lower thresholds: Although it is tempting to drop the thresholds to reach a potentially novel hypothesis, empirically the number of links between the nodes already present in the graph quickly grows to an unmanageable number. Depending on the graph layout software used, these typically end up looking like large circular networks with hundreds of connections between them, such as figure 4.12. In these cases, it is advantageous that dendrograms have each gene attached to the overall structure by only a single link. One can display relevance networks in a similar manner by taking each network separately and calculating the minimum spanning tree by iterating through all the links from weakest to strongest and deleting those links that do not cause the removal of a node in the network. Fundamentally, however, relevance networks are most useful for identifying the subgroup interactions that are linked the strongest.

Figure 4.12: Relevance network laid out as a circle. Relevance networks with too many genes may be rendered in circles that minimize overlapping nodes and edge crossings. However, these may be impossible to visualize.

• Obscures large-scale partitional structure. Because all links that are subthreshold are discarded, some large-scale functional groupings will not be detected.

• Larger cliques cannot be found efficiently: Those portions of the relevance networks that are completely connected (i.e., all the nodes in the graph subset are fully connected to each other) are known as cliques in graph theory, and are significant in that they represent the most "believable" part of the relevance networks. In effect, if gene A and gene B are in a clique together, it means that not only is there a strong direct association between gene A and gene B but there are also indirect associations, where gene A and gene B both are associated with gene C. These cliques are useful structures, but finding them in an automated manner is an "NP-hard" problem (i.e., the size of the graph quickly overwhelms the ability of any processor to find them using brute force).[8]

• Entire vector is used: Like other analytic methods, this means that for each gene, the entire set of samples are used in calculating the dissimilarity measurements. However, there may be very little variation in most of the samples, and great variation in only a few of the samples. Two genes might be scored as being similar, when they may actually differ greatly in one of the many samples.

• This is not a globally optimal network: Early bad decisions cannot be corrected later as the threshold drops. Like dendrograms, relevance networks fall under the category of "greedy algorithms" in computer science, as the strongest links are always used at every step, regardless of the later consequences this may have for forming the networks.

• Computational requirements: This technique requires the comprehensive precomputation of the dissimilarity measure for all pairs of genes, which grows on the order of nowhere n is the number of genes being studied. This can be a time-consuming computation. However, compared to dendrogram formation, after the precomputation, no further dissimilarity measures need to be computed, and thus with proper programming, the network formation can be performed without much memory.

• Ternary and higher comparisons: As described here, this technique performs pairwise comparisons. However, there are several examples in biology where a combination of events is required for downstream action to take place. For example, several transcription factors may need to combine to form a larger complex before transcription is initiated. Performing pairwise comparisons is an expensive operation; performing three-way, four-way and higher comparisons comprehensively quickly becomes intractable.

• Display requirements: Optimally displaying relevance networks requires a large two-dimensional area (compared to a smaller columnar area for dendrograms) and requires the use of sophisticated graph layout algorithms for optimal node placement. (See figure 4.12 for an example of a particularly difficult network to visualize.)

[7]http://www.research.att.com/sw/tools/graphviz/.

[8]See sections 1.4.1 and 5.2 for more thoughts on "NP-hard" problems.

Was this article helpful?

0 0

Post a comment