Question regarding last data science project: K-Means clustering - Handwriting Recognition using K-Means

https://www.codecademy.com/paths/data-science/tracks/dspath-unsupervised/modules/dspath-clustering/projects/clustering

I think there is a problem with this last exercise. Of course I could be mistaken or have set something up wrong, but here's what I'm noticing: the KMeans model is instantiated with default settings, meaning the centroids are randomly seeded. As a result, the centroids do not resolve to the same digits each time, so the mapping of cluster indices to particular digits (step 17) is not assured.
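
For example (just a quick sketch to illustrate, not part of the project code), fitting the model twice with default seeding will generally assign the cluster indices in a different order each time:

from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()

# two fits with default (random) centroid seeding
model_a = KMeans(n_clusters=10).fit(digits.data)
model_b = KMeans(n_clusters=10).fit(digits.data)

# the cluster indices come out in whatever order the centroids happen to converge,
# so the two runs generally disagree on which index stands for which digit
print(model_a.labels_[:20])
print(model_b.labels_[:20])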

Looking at the scikit-learn page referenced in the exercise, I found a different approach that allows for deterministic seeding of the centroids: initializing KMeans with the components of a PCA fit. I haven't really looked into what's going on under the hood, but it does seem to produce deterministic centroids. That said, it sometimes resolves to a centroid order other than the one in the converter dictionary below; running the script a few more times, it mostly settles into the order seen in converter.

converter = {0:6, 1:0, 2:2, 3:5, 4:4, 5:8, 6:3, 7:7, 8:9, 9:1}
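
One way to avoid reading the mapping off the plot by hand could be to build converter from the known labels: for each cluster, take the most common true digit among the samples assigned to it. Something like this (just a rough idea, not part of the exercise, assuming model, data and labels as defined in the code below):

import numpy as np

predict = model.predict(data)
converter = {}
for cluster in range(10):
    # majority vote: the most frequent true digit among samples in this cluster
    converter[cluster] = np.bincount(labels[predict == cluster]).argmax()
print(converter)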

The results are only meaningful when the centroid order matches the order in converter (the order can be checked on each run by looking at the centroids plot). Here is the code:

-------------------------------------------------------------------Code-----------------------------------------------------------------
import numpy as np
from matplotlib import pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

digits = datasets.load_digits()
labels = digits.target

data = digits.data

# use the PCA components for deterministic seeding of the centroids

pca = PCA(n_components=10).fit(data)
model = KMeans(init=pca.components_, n_clusters=10, n_init=1)
model.fit(data)

# predict the given data to compare to given labels

predict = model.predict(data)

# from test.html - this one is for all sevens

new_samples = np.array([
[0.08,3.88,4.03,3.80,3.72,2.35,0.81,0.00,0.53,7.47,7.62,7.61,7.62,7.62,7.61,7.14,0.00,1.44,2.28,2.28,3.27,6.54,7.62,7.38,0.00,0.00,0.00,0.00,1.36,7.31,7.62,4.49,0.00,0.00,4.64,6.86,7.24,7.62,7.62,7.62,0.00,0.00,5.95,7.62,7.62,7.09,6.10,5.03,0.00,0.00,5.16,7.62,6.54,1.14,0.00,0.00,0.00,2.43,7.62,6.93,0.76,0.00,0.00,0.00],
[0.00,3.98,5.32,4.80,4.57,4.10,3.50,2.18,0.00,6.37,5.98,6.09,6.30,7.22,7.60,7.59,0.00,0.00,0.00,0.00,0.61,6.47,7.34,3.20,0.30,3.92,6.04,7.52,7.60,7.61,6.08,2.64,0.99,6.75,6.01,7.38,7.53,5.66,6.69,5.24,0.00,0.00,3.27,7.62,4.29,0.00,0.00,0.00,0.00,0.68,7.16,5.77,0.07,0.00,0.00,0.00,0.00,4.11,7.46,1.41,0.00,0.00,0.00,0.00],
[0.74,1.52,1.59,2.74,3.03,2.27,0.07,0.00,4.87,7.62,7.62,7.23,6.24,7.62,6.62,2.78,0.00,0.38,0.38,0.00,0.68,5.94,7.37,2.86,0.13,2.51,3.04,3.11,6.59,7.28,2.25,0.00,1.06,6.83,7.61,7.62,7.62,7.62,7.62,4.83,0.00,2.79,7.53,2.72,0.00,0.38,1.90,1.37,0.23,6.73,4.66,0.00,0.00,0.00,0.00,0.00,0.31,5.10,0.82,0.00,0.00,0.00,0.00,0.00],
[0.00,3.70,4.56,4.57,5.33,4.87,4.25,1.29,0.00,3.70,4.56,4.57,4.48,6.07,7.62,3.32,0.00,0.00,0.00,0.00,0.74,7.14,4.94,0.00,0.00,2.03,6.69,7.52,7.44,7.61,4.34,1.67,0.00,0.81,2.88,4.49,7.58,4.69,5.71,3.73,0.00,0.00,0.51,6.83,5.06,0.00,0.00,0.00,0.00,0.00,4.96,6.90,0.53,0.00,0.00,0.00,0.00,0.00,6.46,2.41,0.00,0.00,0.00,0.00]
])

# predict labels for new samples

new_labels = model.predict(new_samples)

# centroids plot ------------------------------------------------------------------------

fig = plt.figure(figsize=(8,3))
plt.suptitle('Centroids')

for i in range(10):
    # Initialize subplots in a grid of 2x5, at the (i+1)th position
    ax = fig.add_subplot(2, 5, 1 + i)

    # Display the centroid image
    ax.imshow(model.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)

plt.show()

#---------------------------------------------------------------------------------------------

# Looking at the centroids plot, each cluster index is linked to the digit it visually appears to be:

converter = {0:6, 1:0, 2:2, 3:5, 4:4, 5:8, 6:3, 7:7, 8:9, 9:1}

# plot first 64 images------------------------------------------------------------------

fig = plt.figure(figsize=(6, 6))
plt.title(
    "First 64 images in dataset\n\n"
    "Lower left: actual label | Lower right: predicted label")

plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    bottom=False,       # ticks along the bottom edge are off
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off

plt.tick_params(
    axis='y',
    which='both',
    left=False,
    right=False,
    labelleft=False)

# Adjust the subplots

fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the 64 images

for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the (i+1)th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])

    # Display the image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')

    # Label each image:
    # lower left of image is the known target
    # lower right of image is the predicted class
    ax.text(0, 7, labels[i])
    ax.text(6, 7, str(converter[predict[i]]))

plt.show()

--------------------------------------------------------------------------------------------------

print("\nPredicted labels for new samples:\n"
“Did you enter?”)
for label in new_labels:
print(converter[label], end=" ")

'''

# all zeros from test.html - this set actually works

new_samples = np.array([
[0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,1.09,4.69,6.98,6.34,1.87,0.00,0.00,0.46,6.17,5.36,1.41,3.51,7.21,2.34,0.00,3.57,5.59,0.07,0.00,0.00,2.76,6.51,0.00,5.29,2.71,0.00,0.00,0.00,0.75,6.84,0.00,4.71,4.83,0.00,0.00,0.00,2.93,6.21,0.00,0.82,6.67,4.53,2.04,3.87,7.26,2.17,0.00,0.00,0.92,5.28,5.92,4.84,1.88,0.00],
[0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.45,2.27,2.88,1.58,0.07,0.00,0.00,1.20,6.91,6.05,5.25,6.99,6.64,2.40,0.00,4.84,4.38,0.00,0.00,0.07,2.98,6.97,0.00,5.82,2.57,0.00,0.00,0.00,0.82,7.50,0.00,4.20,3.79,0.00,0.00,0.00,3.32,5.45,0.00,3.19,6.89,4.68,3.63,2.92,7.19,2.34,0.00,0.07,1.96,3.48,4.40,5.31,3.22,0.00],
[0.00,0.00,0.00,1.07,3.80,1.67,0.00,0.00,0.00,0.23,3.72,7.15,5.40,7.14,5.63,1.67,0.00,3.80,6.30,1.66,0.00,0.38,4.26,5.68,0.38,7.15,1.36,0.00,0.00,0.00,1.52,6.09,2.12,5.93,0.00,0.00,0.00,0.00,1.51,6.09,1.36,6.78,0.08,0.00,0.00,0.00,2.20,5.78,0.00,5.93,5.93,4.26,3.80,4.33,6.39,4.55,0.00,0.53,2.58,3.73,3.80,3.72,2.26,0.37],
[0.00,0.00,0.00,0.00,1.67,4.86,2.95,0.00,0.00,0.00,0.07,3.20,7.06,4.17,7.13,0.90,0.00,0.23,5.55,6.66,2.13,0.00,4.49,4.03,0.00,2.42,6.23,0.22,0.00,0.00,1.90,6.00,0.00,4.17,4.02,0.00,0.00,0.00,1.52,6.07,0.00,3.27,6.61,2.42,0.00,0.00,3.34,5.85,0.00,0.00,3.19,6.77,5.94,6.78,6.76,2.04,0.00,0.00,0.00,0.61,2.28,1.27,0.00,0.00]
])
'''
----------------------------------------------------------End of Code--------------------------------------------------------------------------

The model seems to predict its own training data well.
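
As a rough check (not something from the project), the training accuracy can be computed by running the predictions through converter and comparing against the true labels:

import numpy as np

# assumes predict, labels and converter from the code above
mapped = np.array([converter[p] for p in predict])
print("Training accuracy: {:.1%}".format(np.mean(mapped == labels)))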

As far as recognizing handwritten digits drawn in test.html goes, the algorithm seems to perform poorly. The two included sets (new_samples) are identified accurately, but I had to draw the digits very carefully; many other attempts were complete failures.
