Principal component analysis (PCA) is a great tool for exploring high-dimensional data sets. Let's play with some data from NOAA and see if PCA can help visualize trends or unearth any interesting patterns.
This is not an explanation of PCA nor a code walkthrough of how to use it. But, if you're generally familiar with PCA, this might be interesting.
I pulled temperature data for about 250 weather stations in the continental US. Each station has an average temperature for every hour of every day except Feb 29. Here's the data for the station at Boston's Logan Airport:
Tables are good for understanding the format, but not the structure of the data. It's line graph time.
This looks like what we'd expect. Boston, like all places in the northern hemisphere, is warmer during the middle of the year.
But why is the line so thick? Actually, the line is pretty thin, but it wiggles up and down once per day. Let's zoom in to a single week so we can see the daily variation.
Ok, so we've looked at the data for one station. I want to understand how these hourly averages vary across the stations in this dataset.
Each station has 8,760 features describing it (365 days x 24 hours). That's a lot to visualize.
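To make the 8,760 concrete, here's a minimal sketch of how one station's year could be flattened into a single row of features. The names `flatten_station` and `feature_index` are my own for illustration, and the day-major ordering (all 24 hours of day 0, then day 1, and so on) is an assumption; the values are placeholders, not NOAA data.

```python
import numpy as np

HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # Feb 29 is left out, matching the data set described above


def flatten_station(hourly_by_day):
    """Turn a (365, 24) array of hourly averages into one 8,760-long row."""
    hourly_by_day = np.asarray(hourly_by_day, dtype=float)
    if hourly_by_day.shape != (DAYS_PER_YEAR, HOURS_PER_DAY):
        raise ValueError(f"expected (365, 24), got {hourly_by_day.shape}")
    # Row-major reshape: hours of day 0 first, then day 1, and so on.
    return hourly_by_day.reshape(-1)


def feature_index(day, hour):
    """Column of the flattened row that holds `hour` (0-23) on `day` (0-364)."""
    return day * HOURS_PER_DAY + hour


# Placeholder values only; a real station would supply measured averages.
station = np.arange(DAYS_PER_YEAR * HOURS_PER_DAY).reshape(DAYS_PER_YEAR, HOURS_PER_DAY)
row = flatten_station(station)
```

Stacking one such row per station would give a matrix of roughly 250 rows by 8,760 columns, which is the shape PCA needs.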
We can use PCA to find a new, smaller set of dimensions. Since these data have very obvious structure, I expect that even just a few of the principal components will capture most of the variation.
|Principal Component||Percent Variance Explained|
Wow! Even if we collapse each station down to one number, we can explain almost 90% of the variation in all 8,760 features.
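For the curious, here's a rough sketch of where a "percent variance explained" number comes from. It uses a plain NumPy SVD of the mean-centered matrix, which is one standard way to compute PCA but not necessarily what I ran above. The data are synthetic (one shared seasonal shape scaled per station, plus noise), so the numbers it produces say nothing about the real stations.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered features
    # Squared singular values of the centered matrix are proportional to
    # the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic stand-in for the station matrix (NOT the NOAA data).
rng = np.random.default_rng(0)
hours = np.arange(8760)
seasonal = -np.cos(2 * np.pi * hours / 8760)     # cold at the ends, warm mid-year
amplitude = rng.normal(10.0, 3.0, size=(250, 1))  # made-up per-station scale
X = amplitude * seasonal + rng.normal(0.0, 1.0, size=(250, 8760))

ratios = explained_variance_ratio(X)
```

The ratios come out sorted from largest to smallest and sum to 1, so `ratios[0]` plays the role of the first row in a table like the one above.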
Previously, it would have been hard to visualize all the cities and get a sense of which ones are similar. Now we can do that just by plotting the first two principal components:
Remember that PC1 explains much more of the variance than PC2. This means that horizontal distance between two dots is more meaningful than the same distance in the vertical dimension.
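Here's a small sketch of how the coordinates for a plot like this can be computed, again with a NumPy SVD as a stand-in for whatever PCA routine you prefer. The function name `pca_scores` and the random input are assumptions for illustration only.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Coordinates of each row of X along the first few principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # centered = U @ diag(S) @ Vt, so U * S gives the projection onto the
    # principal axes (the rows of Vt).
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


# Made-up data: 50 rows, 200 features with steadily shrinking spread.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200)) * np.linspace(5.0, 0.1, 200)

scores = pca_scores(X)  # column 0 is PC1 (x-axis), column 1 is PC2 (y-axis)
```

The two columns of `scores` can be handed to any scatter plot function. The spread along the first column is always at least as large as along the second, which is why horizontal distance carries more weight.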
What does this first principal component represent? Why does it explain so much of the variance in the data?
The first principal component tracks closely with latitude. That's not a big surprise. We'd expect most variation between these stations to be explained by how close they are to the equator.
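One way to put a number on "tracks closely" is the correlation between each station's PC1 score and its latitude. The sketch below does that on invented data where every feature cools linearly with latitude; the coefficients and the latitude range are assumptions, not values taken from the real stations.

```python
import numpy as np


def first_pc_scores(X):
    """Each row's coordinate along the first principal component."""
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, 0] * S[0]


rng = np.random.default_rng(2)
n_stations, n_features = 100, 500
latitude = rng.uniform(25.0, 49.0, size=n_stations)  # rough span of the continental US

# Made-up temperatures: colder as latitude increases, plus noise.
X = (30.0 - 0.8 * latitude)[:, None] + rng.normal(0.0, 1.0, size=(n_stations, n_features))

pc1 = first_pc_scores(X)
r = np.corrcoef(pc1, latitude)[0, 1]  # Pearson correlation
```

The sign of a principal component is arbitrary, so it's the magnitude of `r` that matters, not whether it comes out positive or negative.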
I wonder what's hiding in the next three principal components.
No patterns jump out at me here as clearly as the last one did, but there are two cells that piqued my interest.
This plot seems to show a pretty strong correlation, but the pattern disappears at lower longitudes, that is, farther west.
Wait a minute, let's look at a map.
Look at all those mountains! This shows us a likely reason why trends in the eastern US might not hold up in western states. Perhaps this is directly due to elevation. Or, maybe more uneven terrain disrupts what would be more regular weather patterns.
Whatever is different about the western weather stations, I think it would make sense to exclude them and rerun PCA to see if any more subtle patterns emerge. I won't revisualize everything here, but the variance explained by the first principal component increases to 95.8% after excluding the weather stations west of -100° longitude.
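The filtering step itself is just a boolean mask on longitude followed by a second PCA. Here's a sketch of the mechanics with random placeholder features, so it will not reproduce the 95.8% figure; `first_pc_share` and `CUTOFF_LONGITUDE` are names I made up for this example.

```python
import numpy as np


def first_pc_share(X):
    """Fraction of total variance explained by the first principal component."""
    centered = X - X.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return (s[0] ** 2) / np.sum(s ** 2)


CUTOFF_LONGITUDE = -100.0  # degrees east; more negative means farther west

rng = np.random.default_rng(3)
longitude = rng.uniform(-124.0, -67.0, size=120)  # made-up station longitudes
X = rng.normal(size=(120, 300))                   # placeholder features, no real signal

east = longitude >= CUTOFF_LONGITUDE  # True for stations we keep
X_east = X[east]

share_all = first_pc_share(X)
share_east = first_pc_share(X_east)  # PCA refit on the eastern stations only
```

Note that the mean-centering happens again inside the second call, so the eastern stations get their own principal components rather than reusing the ones fitted on all stations.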
There was another cell in the scatter matrix that stood out to me.
There seem to be four distinct clusters. The two on the right are more clearly separated. Within each cluster, there is a clear connection between PC4 and longitude. This trend repeats in each band of longitude.
Any idea what's causing this?
Timezones! The hourly data for each station is given in local time. Stations in the western part of a timezone have a later sunrise, so their hourly temperature curve is shifted relative to stations farther east. This means that stations at a similar longitude within a timezone have more similar hourly temperature patterns.
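To get a feel for the size of that shift, here's a back-of-the-envelope sketch. It assumes each zone's clock is set by its standard central meridian (75°W, 90°W, 105°W and 120°W) and that the sun moves 4 minutes per degree of longitude. It ignores daylight saving time, the fact that zone boundaries follow political lines, and seasonal effects on solar time; `solar_lead_minutes` is a name invented for this example.

```python
# Standard central meridians of the four continental US time zones,
# in degrees east (so western longitudes are negative).
ZONE_MERIDIAN = {
    "Eastern": -75.0,
    "Central": -90.0,
    "Mountain": -105.0,
    "Pacific": -120.0,
}

MINUTES_PER_DEGREE = 4.0  # 24 hours * 60 minutes / 360 degrees


def solar_lead_minutes(longitude, zone):
    """Minutes by which the sun runs ahead of the clock at this longitude.

    Positive: the station is east of its zone's meridian, so sunrise comes
    earlier on the clock. Negative: it is west of it, so sunrise comes later.
    """
    return (longitude - ZONE_MERIDIAN[zone]) * MINUTES_PER_DEGREE


near_boston = solar_lead_minutes(-71.0, "Eastern")   # roughly Boston's longitude
west_edge = solar_lead_minutes(-85.0, "Eastern")     # hypothetical station far west in the zone
```

Under these assumptions the two stations share a clock but their sun-driven temperature curves sit almost an hour apart, and the offset resets at each zone boundary, which would be consistent with a pattern that repeats in each band of longitude.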
I think PCA helped us learn a few things: