Frequently Asked Questions

1) How to extract dominant modes of variability from a multivariate dataset using EOF analysis?
2) How to reconstruct the EOF/PC combinations to represent the original data?

 


1) How to extract dominant modes of variability from a multivariate dataset using EOF analysis?

I am not going to provide loads of mathematical equations to explain how this is all done. I would rather provide a description, which I hope would help you to understand the general concepts.

You do not have to be extremely good at mathematics or statistics, as you do not have to worry about solving a tri-diagonal system of equations or finding the eigenvalues of a large matrix.

Nowadays sophisticated software is available to do all of this for you. However, you should be able to understand the fundamental statistical concepts behind EOF analysis. Here we go:

Suppose we are presented with a multivariate dataset of length N. At each time point the data form a (mxn) matrix. For example, it could be the winter mean rainfall over the United States for 100 years. The (mxn) matrix represents the spatial arrangement of the rainfall station observations in the US, and N is the length of the observational period. In this case N=100.
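The data layout described above can be sketched in a few lines. This is only an illustration, assuming Python with NumPy and synthetic random numbers in place of real rainfall observations; the array names and the grid size (m=4, n=6) are made up for the example. Removing the time mean at every point is a common preprocessing step that is assumed here, and your package may or may not do it for you.

```python
import numpy as np

N, m, n = 100, 4, 6                      # 100 winters on a hypothetical 4 x 6 grid
rng = np.random.default_rng(0)

# Synthetic stand-in for the rainfall maps: one (m x n) matrix per winter.
rain = rng.gamma(shape=2.0, scale=50.0, size=(N, m, n))

# Flatten each (m x n) map into one row, giving an N x (m*n) data matrix.
X = rain.reshape(N, m * n)

# Assumed preprocessing: subtract the time mean at every spatial point,
# so that the analysis works on anomalies rather than on raw values.
time_mean = X.mean(axis=0)
anomalies = X - time_mean

print(X.shape)
```

Each row of X is one winter and each column is one spatial point, which is the form most EOF routines expect.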

Any standard statistical package can take this multivariate dataset as input and provide you with two sets of output. The first set contains the loadings or spatial patterns, also known as empirical orthogonal functions (EOFs). Each pattern is a matrix of size (mxn), the same as the original data matrix for one season. You will be given up to N such patterns (fewer if the number of spatial points is smaller than N). For each pattern you will have a corresponding timeseries called a principal component (PC). Each PC will be of length N.

Most packages order these EOF patterns and the corresponding PC timeseries by decreasing variance explained. For example, the first (leading) EOF pattern explains the largest share of the variance contained in the original dataset and shows the spatial pattern of this variability. The corresponding PC timeseries indicates how the amplitude and sign of this spatial pattern change over the period of N years.

Similarly, the second EOF pattern shows the second most important spatial variation in the seasonal rainfall, and the corresponding second PC indicates how that pattern varies in time. So the last (100th) EOF/PC combination represents the least important mode of variability.

Usually the first few EOF/PC combinations are enough to explain most of the variance in the dataset. In this case the first five EOF/PC combinations might explain about 80% of the total variance. The remaining ones each explain very little and are often dominated by noise. So the researcher might decide to work with the first five leading EOF/PC combinations and skip the rest.
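For readers who want to see what a package does internally, here is a minimal sketch, assuming Python with NumPy and using a singular value decomposition (SVD) of the anomaly matrix. This is one common way to compute EOFs, not necessarily the method your software uses. The function name eof_analysis, the synthetic data, and the 80% threshold are illustrative assumptions. Note that the sign and scaling of EOFs and PCs differ between packages.

```python
import numpy as np

def eof_analysis(data):
    """EOFs, PCs and explained-variance fractions for data of shape (N, m, n)."""
    N, m, n = data.shape
    X = data.reshape(N, m * n)            # one row per time point
    X = X - X.mean(axis=0)                # assumed: work on anomalies
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    eofs = Vt.reshape(-1, m, n)           # each row of Vt is one spatial pattern
    pcs = U * s                           # column k is the PC timeseries of mode k
    variance_fraction = s**2 / np.sum(s**2)
    return eofs, pcs, variance_fraction

# Synthetic stand-in data: 100 time points on a hypothetical 4 x 6 grid.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4, 6))

eofs, pcs, frac = eof_analysis(data)

# Smallest number of leading modes whose cumulative variance reaches 80%.
cumulative = np.cumsum(frac)
k = int(np.searchsorted(cumulative, 0.80) + 1)
print(eofs.shape, pcs.shape, k)
```

Because the SVD returns singular values in decreasing order, the modes come out already sorted by variance explained. In this small example there are only 24 spatial points, so there are 24 modes rather than 100.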


2) How to reconstruct the EOF/PC combinations to represent the original data?

Please read the topic on EOF above before continuing:

1) Take the first pattern (mxn matrix) and multiply this pattern by the first value in the corresponding PC timeseries. Remember you will have N values in this time series. That is 100 in this example.

2) Now repeat the same procedure using the 2nd, 3rd, 4th, etc. values in the PC timeseries to multiply the same first loading pattern. Now you should have 100 scaled patterns of size mxn, one for each time point.

3) Repeat steps 1 and 2 for the second loading pattern using the second PC timeseries. Now you will have a second set of 100 scaled patterns of size mxn.

4) Repeat steps 1 and 2 for the remaining three EOF/PC combinations.

5) Now you will have five sets of 100 scaled patterns.

6) All you have to do is add the five sets together, element by element, to get one set of 100 matrices of size mxn. This is the reconstructed multivariate dataset. Its dimensions are the same as those of the original data, except that the reconstructed data retain only a fraction of the variance contained in the original dataset.
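The six steps above can be sketched as follows. This is a minimal illustration, assuming Python with NumPy, synthetic data, and EOFs/PCs obtained from an SVD of the anomalies; the function name reconstruct and the grid size are hypothetical. If your package removed the time mean before the analysis (assumed here), the reconstruction gives anomalies, and the mean has to be added back to recover rainfall values.

```python
import numpy as np

# Synthetic stand-in data: 100 time points on a hypothetical 4 x 6 grid.
rng = np.random.default_rng(1)
N, m, n = 100, 4, 6
data = rng.normal(size=(N, m, n))

# EOFs and PCs from an SVD of the anomalies (one common approach).
X = data.reshape(N, m * n)
time_mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - time_mean, full_matrices=False)
eofs = Vt.reshape(-1, m, n)               # spatial patterns, each (m x n)
pcs = U * s                               # column k is the PC of mode k

def reconstruct(eofs, pcs, k):
    """Sum the first k EOF/PC products into N maps of size (m x n)."""
    recon = np.zeros((pcs.shape[0],) + eofs.shape[1:])
    for mode in range(k):
        # Steps 1-2: scale one pattern by each of the N PC values.
        # Step 6: accumulate the resulting set of N maps.
        recon += pcs[:, mode][:, None, None] * eofs[mode]
    return recon

anomalies = (X - time_mean).reshape(N, m, n)
recon5 = reconstruct(eofs, pcs, 5)                    # first five modes only
recon_all = reconstruct(eofs, pcs, eofs.shape[0])     # every mode

# Fraction of the anomaly variance retained by the five-mode reconstruction.
retained = np.sum(recon5**2) / np.sum(anomalies**2)
print(recon5.shape, retained)
```

Using all modes returns the anomalies exactly, while using only the leading five keeps the share of variance that those five modes explain.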