We'll look at COVID cases per day for each county in WI. We'll use three unsupervised learning techniques to understand multi-county patterns: KMeans clustering, agglomerative clustering, and principal component analysis (PCA).
First, let's import all that we need:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
%matplotlib inline
We'll be creating maps of WI. I already grabbed a shapefile of the counties from here: https://data-wi-dnr.opendata.arcgis.com/datasets/8b8a0896378449538cf1138a969afbc6_3. I renamed it "wi.zip". We rename "Saint Croix" to "St. Croix" for consistency with our later dataset.
wi = gpd.read_file("zip://wi.zip")
wi = wi.set_index(wi["COUNTY_NAM"].str.replace("Saint", "St.")).copy()
wi.head()
wi.boundary.plot(color="black")
Let's use wi-covid.csv that we used in previous lectures:
df = pd.read_csv("wi-covid.csv")
df.head()
For clustering, we'll want each county on its own row, with the historical data spread across columns, so that we can group rows based on year-long trends. Pandas's pivot function can automatically take three columns and put the values of one in the index, the values of another in the columns, and the values of the third in the cells of the final table:
df = pd.pivot(df, index="NAME", columns="DATE", values="POS_7DAYAVG")
df.head()
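To make the reshaping concrete, here's a tiny made-up example (hypothetical values, not our dataset) showing how pivot turns three columns into a table:
toy = pd.DataFrame({
    "NAME": ["Adams", "Adams", "Brown", "Brown"],     # hypothetical rows
    "DATE": ["2021-03-01", "2021-03-02"] * 2,
    "POS_7DAYAVG": [1.0, 2.0, 3.0, 4.0],
})
# one row per NAME, one column per DATE, cells hold POS_7DAYAVG
pd.pivot(toy, index="NAME", columns="DATE", values="POS_7DAYAVG")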
Now, let's use KMeans clustering to group similar rows. Note that for this example, we're just clustering based on total cases (for many use cases, it may make sense to scale by population or otherwise standardize the data, as sketched below).
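Here's a minimal sketch of what standardizing could look like with the StandardScaler we imported (we don't use it below; per-capita scaling would need a population column our dataset doesn't include):
# scale each date column to mean 0, std 1, so no single day dominates the distances
scaled = pd.DataFrame(StandardScaler().fit_transform(df), index=df.index, columns=df.columns)
# KMeans(n_clusters=8).fit_predict(scaled)   # would cluster on the scaled data instead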
km = KMeans(n_clusters=8)
km.fit(df)
clusters = km.predict(df)
clusters
Let's put those cluster IDs back in the GeoDataFrame so we can plot it. The cluster IDs are in an order corresponding to the DataFrame we fit to (df). This is probably the same as the order in the wi DataFrame, but we should explicitly assign the index to each cluster ID before adding it to a different DataFrame, just to be safe.
wi["cluster"] = pd.Series(clusters, index=df.index)
wi.head()
wi.plot(column=wi["cluster"], cmap="tab10", figsize=(8,8))
Compactly, we could have done all the fitting and plotting in just three lines (note that clusters will probably be similar, but the colors for each cluster might change each time we run this):
clusters = KMeans(n_clusters=8).fit_predict(df)
wi["cluster"] = pd.Series(clusters, index=df.index)
wi.plot(column=wi["cluster"], cmap="tab10", figsize=(8,8))
We can simply replace KMeans with AgglomerativeClustering to see how the latter algorithm groups the counties:
clusters = AgglomerativeClustering(n_clusters=8).fit_predict(df)
wi["cluster"] = pd.Series(clusters, index=df.index)
wi.plot(column=wi["cluster"], cmap="tab10", figsize=(8,8))
For our next analysis, we'll be looking at the transpose of the earlier data (dates as rows, counties as columns):
dft = df.T
print("shape:", dft.shape)
print("variance:", dft.values.var())
dft.head()
We have 352 rows, but of course there will be similarities across them. Can we approximate the table by finding a few key rows (called components), such that every row is approximately some weighted combination of those component rows? In particular, how much of the 3230.9 variance can be explained by these few rows?
A principal component analysis (PCA) can help us find such components and tell us how much of the variance those components can explain.
Passing 0.95 tells the PCA that we want it to find however many component rows are necessary to explain 95% of the variance in our original table (alternatively, we could pass an integer to require an exact number of components):
pca = PCA(0.95)
pca.fit(dft)
explained_variance_ratio_ tells us how many components are necessary, and how much variance they each explain. In this case, one component row can explain 90.6% of the variance, and a second component row can explain an additional 4.5%.
pca.explained_variance_ratio_
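If we want the running total rather than the individual ratios, a cumulative sum gives it directly (based on the numbers above, roughly 0.906 then 0.951):
pca.explained_variance_ratio_.cumsum()   # cumulative fraction of variance explained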
Let's see what those two rows look like in a table. components_ gives us a numpy array, but we can use the same column names from our original DataFrame.
components = pd.DataFrame(pca.components_, columns=dft.columns)
components
Let's visualize those components on a map. We'll use the coolwarm colormap (https://matplotlib.org/stable/gallery/color/colormap_reference.html). We want blues to be negative, reds to be positive, and gray to be about zero. This means we'll need to pass in a vmin and vmax that are equally distant from zero. Let's first find a value for this.
cap = abs(components.values).max()  # largest magnitude, so the color scale is symmetric about zero
cap
Let's add the rows of the components table as columns to the GeoDataFrame (wi).
wi["pc0"] = components.iloc[0]
wi["pc1"] = components.iloc[1]
wi.head()
fig, axes = plt.subplots(ncols=2, figsize=(10,5))
wi.plot(column=wi["pc0"], cmap="coolwarm", figsize=(8,8), ax=axes[0], vmin=-cap, vmax=cap)
wi.plot(column=wi["pc1"], cmap="coolwarm", figsize=(8,8), ax=axes[1], vmin=-cap, vmax=cap)
axes[0].set_title("Component 0")
axes[1].set_title("Component 1")
Although all 72 counties could in theory vary independently, we can take different combinations of just those two maps (added to a base map of the average per county) and explain 95% of all variance.
Let's see what that base map looks like (similar to component 0, it turns out):
mean = pd.Series(pca.mean_, index=dft.columns)
mean
wi["mean"] = mean
cap = mean.max()
ax = wi.plot(column=wi["mean"], cmap="coolwarm",
figsize=(5,5), vmin=-cap, vmax=cap, legend=True)
ax.set_title("mean")
How do we weight the combinations of component rows to approximately construct a given row in our original dataset? We can do this with component weights, as given by the PCA.transform method.
pc_coef = pd.DataFrame(pca.transform(dft), index=dft.index).round(2)
pc_coef.head()
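Since pc_coef has one row per date, plotting its two columns is a quick way (not in the original output) to see how each component's weight evolved over the year:
pc_coef.plot(figsize=(10, 4))   # one line per component weight, dates along the x axis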
Even though we have 2 numbers per day instead of the 72 per-county numbers per day, we can use those two numbers as weights on the component rows to approximately reconstruct those 72 numbers. Let's try it for a specific date: Mar 1st, 2021.
mar1 = "2021-03-01"
pc_coef.loc[mar1]
components
-231.3 * components.loc[0] + 4.62 * components.loc[1] + mean
Or, more concisely using the dot product:
approx = pc_coef.loc["2021-03-01"] @ components + mean
approx
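scikit-learn can also do this reconstruction for us: inverse_transform maps the weights back to the original per-county space (adding the mean back in). A sketch that should approximately match the approx series above:
recon = pd.DataFrame(pca.inverse_transform(pca.transform(dft)),
                     index=dft.index, columns=dft.columns)
recon.loc[mar1]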
wi["approx"] = approx
ax = wi.plot(column=wi["approx"], cmap="coolwarm",
figsize=(5,5), vmin=-cap, vmax=cap, legend=True)
ax.set_title("Reconstructed from 2 PC weights")
How does that compare to what the state actually looked like on March 1st?
wi["actual"] = dft.loc[mar1]
ax = wi.plot(column=wi["actual"], cmap="coolwarm",
figsize=(5,5), vmin=-cap, vmax=cap, legend=True)
ax.set_title("Actual")
Not bad!
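To put a rough number on "not bad", one simple check is the mean absolute difference between the reconstructed and actual values across counties (a sketch; the exact value depends on the data):
(wi["approx"] - wi["actual"]).abs().mean()   # average per-county reconstruction error for Mar 1st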
In these examples, we weren't trying to predict anything. We were looking for patterns and simplicity underlying our data; we were applying unsupervised learning. For the two clustering approaches, we tried to group similar counties. For the PCA, we tried to understand how counties vary together.