PCA Project question (steps 13 and 14)


I have a question about the last steps of PCA Project in “Data Science: Machine Learning Specialist” path.
In steps 13 and 14, the exercise asks us to use the 2 PCA components as input to a support vector classifier, find the score, and compare it with the score obtained by feeding 2 random features from the original standardized data matrix to the same classifier. It suggests that we “Notice the large difference in scores”… but the large difference is not in the PCA components’ favor at all:

Score for model with 2 PCA features: 0.8649036163772503
Score for model with 2 randomly selected features: 0.9993627529074398

What is that supposed to mean? That in this case it is better to use a support vector classifier with any random features from the original matrix, without doing PCA?
I can’t understand this output in a project that aims to show the advantages of using PCA.
Am I missing something?

Thanks for your help.


I am on the same page as you. I really don’t get it. This isn’t the first time I have been utterly confused by a project with seemingly nonsensical results, though. Let me know if you figure it out, please!

I think I figured it out, actually. Because of how the CSV file is written, the code is using the row index as feature 0 (or so I think). That feature has a correlation of almost 100% with the class because it’s just 1, 2, 3, 4, 5… and so on. I am going to message Codecademy about it.
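
For anyone who wants to check this guess on their own copy of the data, here is a minimal sketch. It does not use the project's actual file: the toy DataFrame, the column names `feature_a` and `Class`, and the assumption that the rows are sorted by class are all made up for illustration. It only shows how a CSV saved with its row index can hand the classifier a stray `Unnamed: 0` column, and how `index_col=0` keeps that column out of the features.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the project's dataset). The rows are assumed
# to be sorted by class, which is what would make a row index informative.
toy = pd.DataFrame({
    "feature_a": [0.3, 0.9, 0.1, 0.7, 0.2, 0.8],
    "Class": ["A", "A", "A", "B", "B", "B"],
})

# DataFrame.to_csv() writes the row index as an unnamed first column by default.
csv_text = toy.to_csv()

# Reading it back without index_col turns that index into a regular column.
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())

# With class-sorted rows, the stray index column tracks the label closely.
labels = (naive["Class"] == "B").astype(int).to_numpy()
leak_corr = np.corrcoef(naive["Unnamed: 0"].to_numpy(), labels)[0, 1]
print(round(leak_corr, 3))

# Declaring column 0 as the index keeps it out of the feature matrix.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

If the real file shows the same `Unnamed: 0` column, dropping it (or reading with `index_col=0`) before picking the 2 random features should make the comparison with the PCA score fairer.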

I have the same issue when I put the code into a Notebook and run it there (i.e., PCA underperforms).

Something is squirrely.

I feel the PCA section needs a lot of work. It doesn’t match how well written the rest of the course is, and it introduces concepts and code that weren’t even mentioned earlier.


I have a question about #13 and #14, too, but of a different kind. The part below, which was Codecademy’s original code, led to “ValueError: array length 13611 does not match index length 19020”:

principal_components_data = pd.DataFrame({
    'PC1': principal_components[:, 0],
    'PC2': principal_components[:, 1],
    'class': classes,
})

I ran the project in a Jupyter Notebook, so I didn’t import the “codecademylib3” library. But usually that isn’t an issue.

Any suggestions? Besides reporting to Codecademy, there might be a workaround.
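
One way to narrow this down is to compare the row counts of the two inputs before building the DataFrame. The error message suggests that `classes` is a pandas Series with 19020 rows while `principal_components` has 13611, so the two probably do not come from the same rows. The sketch below reproduces that situation with small hypothetical stand-ins (5 rows against 8); the sizes and label values are made up, and it assumes `classes` is a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: 5 rows of PCA output vs. an 8-row label Series.
principal_components = np.zeros((5, 2))
classes = pd.Series(["a"] * 8)

# Step 1: compare row counts before building the DataFrame.
n_rows_pca = principal_components.shape[0]
n_rows_classes = len(classes)
print(n_rows_pca, n_rows_classes)

# Step 2: mixing a Series with shorter arrays raises a ValueError.
error_message = None
try:
    pd.DataFrame({
        "PC1": principal_components[:, 0],
        "PC2": principal_components[:, 1],
        "class": classes,
    })
except ValueError as err:
    error_message = str(err)
print(error_message)

# Step 3: with labels taken from the same rows that went into PCA
# (simulated here by a new 5-row Series), construction succeeds.
# .to_numpy() sidesteps index alignment if the Series index is not 0..n-1.
classes_same_source = pd.Series(["a", "b", "a", "b", "a"])
principal_components_data = pd.DataFrame({
    "PC1": principal_components[:, 0],
    "PC2": principal_components[:, 1],
    "class": classes_same_source.to_numpy(),
})
print(principal_components_data.shape)
```

If the two counts differ in your notebook, it is worth checking which file or variable `classes` was created from, and whether an earlier cell left an old `classes` in memory.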

Thanks a lot. :smiley: