Introduction
In this article, we provide an intuitive, geometric interpretation of the covariance matrix by exploring the relation between linear transformations and the resulting data covariance. Most textbooks explain the shape of data based on the concept of covariance matrices. Instead, we take a backwards approach and explain the concept of covariance matrices based on the shape of data.

In a previous article, we discussed the concept of variance and provided a derivation and proof of the well-known formula to estimate the sample variance. Figure 1 was used in that article to show that the standard deviation, as the square root of the variance, provides a measure of how much the data is spread across the feature space.
Figure 1. Gaussian density function. For normally distributed data, 68% of the samples fall within the interval defined by the mean plus and minus the standard deviation.
(1) \[\sigma_x^2 = \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu)^2 = \mathbb{E}[(x - \mathbb{E}(x))(x - \mathbb{E}(x))] = \sigma(x,x)\]
However, variance can only be used to explain the spread of the data in the directions parallel to the axes of the feature space. Consider the 2D feature space shown by figure 2:
For this data, we could calculate the variance $\sigma(x,x)$ in the x-direction and the variance $\sigma(y,y)$ in the y-direction. However, these horizontal and vertical spreads alone do not explain the clear diagonal correlation in the data. To capture it, we also need the covariance, which indicates how much x and y vary together:
(2) \[\sigma(x,y) = \mathbb{E}[(x - \mathbb{E}(x))(y - \mathbb{E}(y))]\]
For 2D data, we thus obtain $\sigma(x,x)$, $\sigma(y,y)$, $\sigma(x,y)$ and $\sigma(y,x)$. These four values can be summarized in a matrix, called the covariance matrix:
(3) \[\Sigma = \begin{bmatrix} \sigma(x,x) & \sigma(x,y) \\ \sigma(y,x) & \sigma(y,y) \end{bmatrix}\]
If x is positively correlated with y, then y is also positively correlated with x. In other words, $\sigma(x,y) = \sigma(y,x)$, so the covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal. Figure 3 illustrates how the overall shape of the data defines the covariance matrix:
Figure 3. The covariance matrix defines the shape of the data. Diagonal spread is captured by the covariance, while axis-aligned spread is captured by the variance.
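Concretely, the covariance matrix of a 2D data set can be estimated directly from the data. The following is a minimal NumPy sketch; the generated data and variable names are our own, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2D data: y depends partly on x, producing diagonal spread.
x = rng.normal(size=1000)
y = 0.8 * x + rng.normal(scale=0.5, size=1000)
D = np.stack([x, y])  # shape (2, N): one row per feature

# Sample covariance matrix (rows are variables, N-1 normalization).
Sigma = np.cov(D)
print(Sigma)

# The matrix is symmetric: sigma(x, y) == sigma(y, x),
# and the diagonal covariance captures the correlation between x and y.
assert np.allclose(Sigma, Sigma.T)
```

The diagonal entries of `Sigma` are the variances $\sigma(x,x)$ and $\sigma(y,y)$; the off-diagonal entries are the covariance $\sigma(x,y) = \sigma(y,x)$.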
Eigendecomposition of a covariance matrix
In the next section, we will discuss how the covariance matrix can be interpreted as a linear operator that transforms white data into the data we observed. However, before diving into the technical details, it is important to gain an intuitive understanding of how eigenvectors and eigenvalues uniquely define the covariance matrix, and therefore the shape of our data.

As we saw in figure 3, the covariance matrix defines both the spread (variance) and the orientation (covariance) of our data. So, if we would like to represent the covariance matrix with a vector and its magnitude, we should simply try to find the vector that points into the direction of the largest spread of the data, and whose magnitude equals the spread (variance) in this direction.
If we define this vector as $\vec{v}$, then the projection of our data $D$ onto this vector is obtained as $\vec{v}^{\top} D$, and the variance of the projected data is $\vec{v}^{\top} \Sigma \vec{v}$. Since we are looking for the vector that points into the direction of the largest variance, we should choose its components such that $\vec{v}^{\top} \Sigma \vec{v}$ is as large as possible. Maximizing an expression of this form with respect to a normalized unit vector $\vec{v}$ is a Rayleigh quotient problem, and the maximum is attained by setting $\vec{v}$ equal to the largest eigenvector of the matrix $\Sigma$.
In other words, the largest eigenvector of the covariance matrix always points into the direction of the largest variance of the data, and the magnitude of this vector equals the corresponding eigenvalue. The second largest eigenvector is always orthogonal to the largest eigenvector, and points into the direction of the second largest spread of the data.
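This property is easy to verify numerically. The sketch below uses assumed, illustrative data whose largest spread lies along the diagonal direction $(1,1)/\sqrt{2}$; `numpy.linalg.eigh` is appropriate here because covariance matrices are symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)

# Build data with large spread (std 3) along (1, 1)/sqrt(2)
# and small spread (std 1) along the orthogonal direction.
u = rng.normal(scale=3.0, size=2000)
w = rng.normal(scale=1.0, size=2000)
D = np.stack([(u - w) / np.sqrt(2), (u + w) / np.sqrt(2)])

Sigma = np.cov(D)
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order

largest = eigvecs[:, -1]  # eigenvector of the largest eigenvalue
# It points (up to sign) along (1, 1)/sqrt(2), the direction of largest
# spread, and its eigenvalue approximates the variance in that direction (~9).
print(eigvals[-1], largest)
```

Note that eigenvectors are only defined up to sign, so the "direction" of the largest spread is really a line, not an arrow.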
Now let’s have a look at some examples. In an earlier article we saw that a linear transformation matrix is completely defined by its eigenvectors and eigenvalues. Applied to the covariance matrix, this means that:

(4) \[\Sigma \vec{v} = \lambda \vec{v}\]

where $\vec{v}$ is an eigenvector of $\Sigma$ and $\lambda$ is the corresponding eigenvalue. If the covariance matrix of our data is a diagonal matrix, such that the covariances are zero, then the variances must be equal to the eigenvalues, as shown by figure 4.
However, if the covariance matrix is not diagonal, such that the covariances are not zero, then the situation is a little more complicated. The eigenvalues still represent the variance magnitude in the direction of the largest spread of the data, and the variance components of the covariance matrix still represent the variance magnitude along the x-axis and y-axis. But since the data is not axis-aligned, these values are no longer the same, as shown by figure 5.
By comparing figure 5 with figure 4, it becomes clear that the eigenvalues represent the variance of the data along the eigenvector directions, whereas the variance components of the covariance matrix represent the spread along the axes. If there are no covariances, then both values are equal.
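This distinction can be checked with a small numerical example; the scales and rotation angle below are assumed for illustration:

```python
import numpy as np

# Axis-aligned covariance: eigenvalues equal the diagonal variances.
Sigma_diag = np.diag([4.0, 1.0])
print(np.linalg.eigvalsh(Sigma_diag))  # [1. 4.]

# Rotate the same shape by 45 degrees: Sigma' = R Sigma R^T.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Sigma_rot = R @ Sigma_diag @ R.T

# Eigenvalues are unchanged: still the spread along the eigenvectors ...
print(np.linalg.eigvalsh(Sigma_rot))   # [1. 4.]
# ... but the diagonal entries (spread along the x- and y-axes) are not.
print(np.diag(Sigma_rot))              # [2.5 2.5]
```

Rotation preserves the eigenvalues because it only changes the orientation of the data, not its spread along the principal directions.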
Covariance matrix as a linear transformation
Now let’s forget about covariance matrices for a moment. Each of the examples in figure 3 can simply be considered to be a linearly transformed instance of the white data shown in figure 6. Let the data shown by figure 6 be $D$; then each of the examples in figure 3 can be obtained by linearly transforming $D$:

(5) \[D' = T \, D\]

where $T$ is a transformation matrix consisting of a rotation matrix $R$ and a scaling matrix $S$:

(6) \[T = R \, S\]
These matrices are defined as:
(7) \[R = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}\]
where $\theta$ is the rotation angle, and
(8) \[S = \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}\]
where $s_x$ and $s_y$ are the scaling factors in the x-direction and the y-direction respectively.

In the following paragraphs, we will discuss the relation between the covariance matrix $\Sigma$ and the linear transformation matrix $T = R \, S$.
Let’s start with unscaled (scale equals 1) and unrotated data. In statistics this is often referred to as ‘white data’ because its samples are drawn from a standard normal distribution and therefore correspond to white (uncorrelated) noise:
The covariance matrix of this ‘white’ data equals the identity matrix, such that the variances and standard deviations equal 1 and the covariance equals zero:
(9) \[\Sigma = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\]
Now let’s scale the data in the x-direction with a factor 4:
(10) \[D' = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix} \, D\]
The data $D'$ is now scaled in the x-direction. The covariance matrix $\Sigma'$ of $D'$ is:
(11) \[\Sigma' = \begin{bmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{bmatrix} = \begin{bmatrix} 16 & 0 \\ 0 & 1 \end{bmatrix}\]
Thus, the covariance matrix $\Sigma'$ of the resulting data $D'$ is related to the linear transformation $T$ that was applied to the original data: $D' = T \, D$, where
(12) \[T = \sqrt{\Sigma'} = \begin{bmatrix} 4 & 0 \\ 0 & 1 \end{bmatrix}\]
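This scaling example can be verified numerically. The sketch below is illustrative; for a finite sample, the estimated covariance only approximates equation (11):

```python
import numpy as np

rng = np.random.default_rng(2)

# White data: uncorrelated, unit variance in both directions.
D = rng.normal(size=(2, 5000))

# Scale by a factor 4 in the x-direction, as in equation (10).
T = np.array([[4.0, 0.0],
              [0.0, 1.0]])
D_prime = T @ D

Sigma_prime = np.cov(D_prime)
print(Sigma_prime)  # approximately [[16, 0], [0, 1]]

# T is recovered as the square root of Sigma', as in equation (12).
print(np.sqrt(np.diag(Sigma_prime)))  # approximately [4, 1]
```

Scaling the data by a factor 4 multiplies the variance in that direction by 16, which is why the square root of the covariance matrix recovers the transformation.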
However, although equation (12) holds when the data is scaled in the x and y direction, the question arises whether it also holds when a rotation is applied. To investigate the relation between the linear transformation matrix $T$ and the covariance matrix $\Sigma'$ in the general case, we will try to decompose the covariance matrix into the product of rotation and scaling matrices. As we saw earlier, we can represent the covariance matrix by its eigenvectors and eigenvalues:
(13) \[\Sigma \vec{v} = \lambda \vec{v}\]

where $\vec{v}$ is an eigenvector of $\Sigma$ and $\lambda$ is the corresponding eigenvalue. Equation (13) holds for each eigenvector-eigenvalue pair of matrix $\Sigma$. In the 2D case, we obtain two eigenvectors and two eigenvalues. The system of two equations defined by equation (13) can be represented efficiently using matrix notation:
(14) \[\Sigma \, V = V \, L\]

where $V$ is the matrix whose columns are the eigenvectors of $\Sigma$, and $L$ is the diagonal matrix whose non-zero elements are the corresponding eigenvalues. This means that we can represent the covariance matrix as a function of its eigenvectors and eigenvalues:
(15) \[\Sigma = V \, L \, V^{-1}\]
Equation (15) is called the eigendecomposition of the covariance matrix and can be obtained using a Singular Value Decomposition algorithm. Whereas the eigenvectors represent the directions of the largest variance of the data, the eigenvalues represent the magnitude of this variance in those directions. In other words, $V$ represents a rotation matrix, while $\sqrt{L}$ represents a scaling matrix. The covariance matrix can thus be decomposed further as:
(16) \[\Sigma = R \, S \, S \, R^{-1}\]

where $R = V$ is a rotation matrix and $S = \sqrt{L}$ is a scaling matrix. In equation (6) we defined a linear transformation $T = R \, S$. Since $S$ is a diagonal scaling matrix, $S = S^{\top}$. Furthermore, since $R$ is an orthogonal matrix, $R^{-1} = R^{\top}$. Therefore:
(17) \[\Sigma = R \, S \, S \, R^{-1} = R \, S \, S^{\top} \, R^{\top} = (R \, S)(R \, S)^{\top} = T \, T^{\top}\]
In other words, if we apply the linear transformation defined by $T = R \, S = V \sqrt{L}$ to the original white data, we obtain the rotated and scaled data $D'$ with covariance matrix $T \, T^{\top} = \Sigma'$. The colored arrows in figure 10 represent the eigenvectors. The largest eigenvector, i.e. the eigenvector with the largest corresponding eigenvalue, always points in the direction of the largest variance of the data and thereby defines its orientation. Subsequent eigenvectors are always orthogonal to the largest eigenvector due to the orthogonality of rotation matrices.
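The whole decomposition can be sketched end-to-end; the example covariance matrix below is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# A target covariance matrix describing a rotated, scaled shape.
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition: Sigma = V L V^{-1} (equation (15)).
L, V = np.linalg.eigh(Sigma)   # L: eigenvalues, V: eigenvectors as columns

# Linear transformation T = V sqrt(L): a rotation times a scaling.
T = V @ np.diag(np.sqrt(L))

# Sigma = T T^T, as in equation (17).
print(np.allclose(T @ T.T, Sigma))  # True

# Applying T to white data produces data with covariance ~ Sigma.
D = rng.normal(size=(2, 10000))
D_prime = T @ D
print(np.cov(D_prime))              # approximately Sigma
```

This is also a standard recipe for sampling from a multivariate normal distribution with a given covariance: whiten, then apply $T$.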