## Linear correlation

Given data points $(x_i, y_i)$ for $i = 1, \dots, n$, it is common to ask whether a linear relationship exists between the quantities $x_i$ and $y_i$ and, if so, to what degree the data are linear, at least statistically. The graphical equivalent is to ask whether the scatter plot of $x_i$ versus $y_i$ looks like (a discretization of) a line. The linear, or Pearson's, correlation coefficient $r$ is the mathematical device most often used for this purpose:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},$$

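where $\bar{x}$ and $\bar{y}$ denote the means of the $x_i$ and $y_i$, respectively. As an aside, observe that

$$r^2 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2} \cdot \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (y_i - \bar{y})^2}.$$

The two factors are the least-squares slopes obtained by regressing each variable on the other. In other words, up to sign, $r$ is the geometric mean of the slope of the linear regression of $x$ versus $y$ and the slope of the linear regression of $y$ versus $x$, denoted $\beta(x, y)$ and $\beta(y, x)$, respectively; both slopes carry the same sign as $r$.

By the Cauchy-Schwarz inequality,

$$|\langle x, y \rangle| \le \|x\|\,\|y\|, \qquad \langle x, y \rangle = \|x\|\,\|y\| \cos\theta,$$

where $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$ is the standard dot product on $\mathbb{R}^n$, $\|x\| = \langle x, x \rangle^{1/2}$, and $\theta$ is the angle between the vectors $x$ and $y$. Applied to the centered vectors $x - \bar{x}$ and $y - \bar{y}$, this gives $r = \cos\theta$, so that $-1 \le r \le 1$. It can further be shown that $r = 1$ if and only if $y - \bar{y} = c\,(x - \bar{x})$ for some $c > 0$ (in particular when $x = y$), in which case $x$ and $y$ are perfectly positively correlated, and that $r = -1$ if and only if the same relation holds with $c < 0$ (in particular when $x = -y$), in which case $x$ and $y$ are perfectly negatively correlated.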
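The relations above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names `pearson_r` and `slope` and the five data points are made up for illustration. It computes $r$ from the definition, confirms that $r^2$ equals the product of the two regression slopes, and confirms that exactly linear data give $r = \pm 1$.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx = sum(x) / n
    my = sum(y) / n
    # Centered data: deviations from the respective means.
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y is constant")
    # Equivalently, the cosine of the angle between the centered vectors.
    return sxy / math.sqrt(sxx * syy)


def slope(u, v):
    """Least-squares slope of the regression line of v on u."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    return suv / suu


# Made-up illustrative data (an assumption, not taken from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
slope_y_on_x = slope(x, y)
slope_x_on_y = slope(y, x)

print("r                 =", r)
print("r squared         =", r * r)
print("product of slopes =", slope_y_on_x * slope_x_on_y)

# Exactly linear data: positive slope gives r = 1, negative slope gives r = -1.
r_pos = pearson_r(x, [2.0 * xi + 1.0 for xi in x])
r_neg = pearson_r(x, [-3.0 * xi for xi in x])
print("r for y = 2x + 1  =", r_pos)
print("r for y = -3x     =", r_neg)
```

Because `pearson_r` divides by the norms of the centered vectors, it raises an error rather than returning a value when either variable is constant, where $r$ is undefined.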

Suppose that one regards $x$ and $y$ as random variables with samples $x_i$, $y_i$ indexed by $i = 1, \dots, n$. If $x$ and $y$ are stochastically independent, then $r = 0$ (for the underlying distribution; the sample value is only approximately zero). The converse is not generally true: take, e.g., $t_i = 2\pi i / n$, $x_i = \cos t_i$, and $y_i = \sin t_i$ for any integer $n \ge 3$, so that $r = 0$ even though $x_i^2 + y_i^2 = 1$ for every $i$. Exercise: Verify this. When $r$ is near 0, $x$ and $y$ are said to be uncorrelated. A review of the concept of stochastic independence may be found in Shiryaev.

Note that $r$ has a nonrigorous transitive property: if $x$ and $y$, and $y$ and $z$, are strongly correlated, with correlation strengths $r_{xy}$ and $r_{yz}$, then $x$ and $z$ tend to be correlated as well, with some correlation strength $r_{xz}$. As noted by Press et al., while $r$ provides a summary of the strength of the association if the correlation is found to be significant, $r$ is not generally useful for deciding whether an observed correlation is statistically significant. This weakness arises from the fact that the computation (3.6.4) ignores the distributions of $x$ and $y$. One solution to this problem is to use nonparametric or rank correlation, details of which are also found in .

It can also be shown that $r$ is more sensitive to small perturbations of data points that are farther away from the centroid of the system than to perturbations near the centroid. This may result in a bias. Linear correlation has been used to find pairwise associated genes across different experimental conditions, as in .
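Two of the points above lend themselves to a quick numerical check. The sketch below, in plain Python, first evaluates $r$ for points equally spaced on the unit circle, which are fully dependent yet uncorrelated. It then illustrates rank correlation in one common form, Spearman's coefficient, computed as Pearson's $r$ applied to the ranks with ties given their average rank. The helper names, the chosen values of $n$, and the exponential test data are assumptions made for illustration, and this is not necessarily the exact procedure of the cited reference.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def circle_points(n):
    """Points (cos t_i, sin t_i) with t_i = 2*pi*i/n for i = 1..n."""
    t = [2.0 * math.pi * i / n for i in range(1, n + 1)]
    return [math.cos(ti) for ti in t], [math.sin(ti) for ti in t]


def average_ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Part 1: dependent but uncorrelated data on the unit circle.
circle_r = {n: pearson_r(*circle_points(n)) for n in (3, 4, 10, 101)}
for n, r_n in circle_r.items():
    print(f"n = {n:3d}: r = {r_n:+.2e}")  # zero up to rounding error

# Part 2: a monotone but nonlinear relationship (made-up data).
u = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
v = [math.exp(ui) for ui in u]
r_uv = pearson_r(u, v)
rho_uv = spearman_rho(u, v)
print("Pearson r    =", r_uv)    # below 1: the relation is not linear
print("Spearman rho =", rho_uv)  # equals 1: the relation is monotone
```

The values $n = 1$ and $n = 2$ are left out because at least one of the two coordinate sequences is then constant and $r$ is undefined.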