Tennis Ace Challenge Project (Python)

Congratulations on completing your project!

Compare your project to our solution code and share your project below! Your solution might not look exactly like ours, and that’s okay! The most important thing right now is to get your code working as it should (you can always refactor more later). There are multiple ways to complete these projects and you should exercise your creative abilities in doing so.

This is a safe space for you to ask questions about any sample solution code and share your work with others! Simply reply to this thread to get the conversation started. Feedback is a vital component in getting better with coding and all ability levels are welcome here, so don’t be shy!

About community guidelines: This is a supportive and kind community of people learning and developing their skills. All comments here are expected to keep to our community guidelines.


How do I share my own solutions?

  • If you completed the project off-platform, you can upload your project to your own GitHub and share the public link on the relevant project topic.
  • If you completed the project in the Codecademy learning environment, use the share code link at the bottom of your code editor to create a gist, and then share that link here.

Do I really need to get set up on GitHub?
Yes! Both of these sharing methods require you to get set up on GitHub, and trust us, it’s worth your time. Here’s why:

  1. Once you have your project in GitHub, you’ll be able to share proof of your work with potential employers, and link out to it on your CV.
  2. It’s a great opportunity to get your feet wet using a development tool that tech workers use on the job, every day.

Not sure how to get started? We’ve got you covered - read this article for the easiest way to get set up on GitHub.

Best practices for asking questions about the sample solution

  • Be specific! Reference exact line numbers and syntax so others are able to identify the area of the code you have questions about.

Check out my code:


This is my code, which turned out similar to the solution… (it would have been easier if I had known there was one)

import codecademylib3_seaborn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# load and investigate the data here:
data=pd.read_csv('tennis_stats.csv')
print(data.head())
print(data.columns)
print(data.describe())

# exploratory analysis
plt.scatter(data.FirstServeReturnPointsWon, data.Winnings)
plt.title('FirstServeReturnPointsWon vs Winnings')
plt.xlabel('FirstServeReturnPointsWon')
plt.ylabel('Winnings')

plt.show()
plt.clf()


plt.scatter(data.BreakPointsOpportunities, data.Winnings)
plt.title('BreakPointsOpportunities vs Winnings')
plt.xlabel('BreakPointsOpportunities')
plt.ylabel('Winnings')

plt.show()
plt.clf()


plt.scatter(data.BreakPointsSaved, data.Winnings)
plt.title('BreakPointsSaved vs Winnings')
plt.xlabel('BreakPointsSaved')
plt.ylabel('Winnings')
plt.show()
plt.clf()


plt.scatter(data.TotalPointsWon, data.Ranking)
plt.title('TotalPointsWon vs Ranking')
plt.xlabel('TotalPointsWon')
plt.ylabel('Ranking')
plt.show()
plt.clf()

plt.scatter(data.TotalServicePointsWon,data.Wins)
plt.title('TotalServicePointsWon vs Wins')
plt.xlabel('TotalServicePointsWon')
plt.ylabel('Wins')
plt.show()
plt.clf()

plt.scatter(data.Aces, data.Winnings)
plt.title('Aces vs Winnings')
plt.xlabel('Aces')
plt.ylabel('Winnings')
plt.show()
plt.clf()

plt.scatter(data.ServiceGamesWon, data.Winnings)
plt.title('ServiceGamesWon vs Winnings')
plt.xlabel('ServiceGamesWon')
plt.ylabel('Winnings')
plt.show()
plt.clf()


## single feature linear regression (FirstServeReturnPointsWon)
X=data[['FirstServeReturnPointsWon']]
y=data['Winnings']
X_train,X_val,y_train,y_val=train_test_split(X,y,train_size=0.8,test_size=0.2)
mdl=LinearRegression()
my_mdl=mdl.fit(X_train,y_train)
print('FirstServeReturnPointsWon vs Winnings Score > ' , my_mdl.score(X_val,y_val))
pred=my_mdl.predict(X_val)
plt.scatter(y_val,pred,alpha=0.4)
plt.title('Predicted Winnings vs. Actual Winnings - Single Feature')
plt.xlabel('Actual Winnings')
plt.ylabel('Predicted Winnings')
plt.show()
plt.clf()


## single feature linear regression (BreakPointsOpportunities)
X2=data[['BreakPointsOpportunities']]
y2=data['Winnings']
X2_train,X2_val,y2_train,y2_val=train_test_split(X2,y2,train_size=0.8,test_size=0.2)
mdl2=LinearRegression()
my_mdl2=mdl2.fit(X2_train,y2_train)
print('BreakPointsOpportunities vs Winnings Score > ',my_mdl2.score(X2_val,y2_val))
pred2=my_mdl2.predict(X2_val)
plt.scatter(y2_val,pred2,alpha=0.4)
plt.title('Predicted Winnings vs. Actual Winnings - Single Feature')
plt.xlabel('Actual Winnings')
plt.ylabel('Predicted Winnings')
plt.show()
plt.clf()



## two feature linear regression
features=['BreakPointsOpportunities', 'FirstServeReturnPointsWon']
X3=data[features]
y3=data['Winnings']
X3_train,X3_val,y3_train,y3_val=train_test_split(X3,y3,train_size=0.8,test_size=0.2)
mdl3=LinearRegression()
my_mdl3=mdl3.fit(X3_train,y3_train)
print('2 Features vs Winnings Score > ', my_mdl3.score(X3_val,y3_val))
pred3=my_mdl3.predict(X3_val)
plt.scatter(y3_val,pred3,alpha=0.4)
plt.title('Predicted Winnings vs. Actual Winnings - 2 Features')
plt.xlabel('Actual Winnings')
plt.ylabel('Predicted Winnings')
plt.show()
plt.clf()

## multiple features linear regression
features2=['BreakPointsOpportunities','FirstServeReturnPointsWon','ServiceGamesPlayed','ServiceGamesWon','BreakPointsSaved','DoubleFaults','ReturnGamesPlayed']

X4=data[features2]
y4=data['Winnings']
X4_train,X4_val,y4_train,y4_val=train_test_split(X4,y4,train_size=0.8,test_size=0.2)
mdl4=LinearRegression()
my_mdl4=mdl4.fit(X4_train,y4_train)
print('Predicting Winnings with Multiple Features Test Score > ' , my_mdl4.score(X4_val,y4_val))
pred4=my_mdl4.predict(X4_val)
plt.scatter(y4_val,pred4,alpha=0.4)
plt.title('Predicted Winnings vs. Actual Winnings - Multiple Features')
plt.xlabel('Actual Winnings')
plt.ylabel('Predicted Winnings')
plt.show()
plt.clf()
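One note on the splits in the code above: `train_test_split` without a `random_state` shuffles differently on every run, so the printed scores will drift between runs. A tiny sketch (on made-up data, not the tennis dataset) showing how pinning `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Same random_state -> identical split every run
X_a, _, y_a, _ = train_test_split(X, y, train_size=0.8, random_state=6)
X_b, _, y_b, _ = train_test_split(X, y, train_size=0.8, random_state=6)
print((X_a == X_b).all())   # True: the two splits match exactly
```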

This is my own solution to the Tennis Ace Project Challenge.

https://github.com/TheLegend192/tennis_ace_starting/blob/master/script.py

My solution

https://github.com/Goananda/Codecademy-project-Tennis-Ace/blob/master/Tennis%20Ace.ipynb


https://gist.github.com/8822d965cb552a3c4e2ff66179dda083

My Link!!

My solution:


Score on training data:
0.8686792385781952
Score on test data:
0.8670324678334735

Tennis Ace prediction project: https://gist.github.com/0524aabe1a129a445a3e75d8b8472257

Here is the link to my solution:

df = pd.read_csv('tennis_stats.csv')
df

x = df[['FirstServe', 'FirstServePointsWon',
       'FirstServeReturnPointsWon', 'SecondServePointsWon',
       'SecondServeReturnPointsWon', 'Aces', 'BreakPointsConverted',
       'BreakPointsFaced', 'BreakPointsOpportunities', 'BreakPointsSaved',
       'DoubleFaults', 'ReturnGamesPlayed', 'ReturnGamesWon',
       'ReturnPointsWon', 'ServiceGamesPlayed', 'ServiceGamesWon',
       'TotalPointsWon', 'TotalServicePointsWon']]
y = df[['Winnings']]
#print(x)
#print(y)

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.8, test_size = 0.2, random_state=6)

mlr = LinearRegression()

model = mlr.fit(x_train, y_train)
print(model.coef_)

plt.scatter(y[['Winnings']], x[['FirstServeReturnPointsWon']], alpha=.4)
plt.show()
plt.close()
plt.scatter(y[['Winnings']], x[['BreakPointsOpportunities']], alpha=.4)
plt.show()



# perform exploratory analysis here:
x1 = df[['FirstServeReturnPointsWon']]
y1 = df[['Winnings']] 
x_train1, x_test1, y_train1, y_test1 = train_test_split(x1, y1, train_size = 0.8, test_size = 0.2, random_state=6)
# split the data into training and tests
mlr1 = LinearRegression()
model1 = mlr1.fit(x_train1, y_train1)
# created linear regression model and trained it on the training data
score1train = model1.score(x_train1,y_train1)
print(score1train)
score1test = model1.score(x_test1,y_test1)
print(score1test)
y1_predict = mlr1.predict(x1)
plt.close()
plt.scatter(x1, y1, alpha=.4)
plt.plot(x1, y1_predict)

For the project, I tried to print model.coef_ to find which independent variables had the highest coefficients against winnings. My understanding was that the higher the absolute value, the more correlated the variable is with winnings.

But when I ran the single linear regression on the variable ‘FirstServePointsWon’ (which had the highest coef_ value against ‘Winnings’), the R-squared (.score) on both the train and test data came out very low.

I then chose a different variable with a lower coef_ value for the single linear regression, for example ‘Aces’ against ‘Winnings’, and the R-squared (.score) came out much higher.

Is my understanding of .coef_ incorrect? Just because the .coef_ is high, it doesn’t mean the Rsquare will be high?

My code with some… interesting colour choices :)
Hopefully I can start making 3D plots soon for the simpler MLRs where necessary.

The coefficient is the impact of changing your variable (like Aces) by 1 unit. In other words, it’s the partial effect on y (which is “Winnings” in your case). That does not necessarily mean your model is better, however.
R^2 tells you how much of the variation (in Winnings, in this case) is explained by your model.

Your example can be interpreted as:
FirstServePointsWon has a bigger partial effect (in an SLR) on the value of Winnings than Aces.
Aces gives us more information on how tennis players' wages vary.

The effect can be better seen if you plot one graph with:
(1) Scatter plot of FirstServePointsWon and Winnings
(2) Line of FirstServePointsWon and Predicted Winnings

and a second graph with Aces in place of FirstServePointsWon.

Also when you do an MLR, say Winnings = FirstServePointsWon + Aces + Losses, you may find different coefficient and intercept values (and R^2) because you have more information, and can therefore estimate more accurate coefficients/partial effects.

Hope that helped.
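To make the distinction concrete, here is a small synthetic illustration (made-up data, not the tennis dataset): a feature can have a large coefficient yet explain little of the variance, while a feature with a small coefficient can explain almost all of it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Feature 1: large slope but very noisy target -> big coef_, low R^2
x1 = rng.normal(size=(n, 1))
y1 = 100 * x1[:, 0] + rng.normal(scale=500, size=n)

# Feature 2: small slope but almost noiseless target -> small coef_, high R^2
x2 = rng.normal(size=(n, 1))
y2 = 2 * x2[:, 0] + rng.normal(scale=0.5, size=n)

m1 = LinearRegression().fit(x1, y1)
m2 = LinearRegression().fit(x2, y2)

print(m1.coef_[0], m1.score(x1, y1))   # large coefficient, low R^2
print(m2.coef_[0], m2.score(x2, y2))   # small coefficient, high R^2
```

So a big coef_ measures the size of the partial effect, while .score (R^2) measures how tightly the points hug the fitted line; the two are independent of each other.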

My code is at https://github.com/msmadscientist/codecademy_projects/blob/master/TennisStats.py. I did things a little differently. Eyeballing graphs seemed very inefficient, so I fitted each column (except for names) against Winnings and printed out its R^2 value. I could instantly tell which columns had the best correlation against Winnings - and I could prove it with more than “the graph looked linear to me”.
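The column-by-column scoring described above can be sketched roughly like this. Note this uses a tiny synthetic DataFrame in place of tennis_stats.csv, so the column names are stand-ins, not the real dataset's:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for tennis_stats.csv (real columns will differ)
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    'BreakPointsOpportunities': rng.normal(size=n),
    'DoubleFaults': rng.normal(size=n),
})
df['Winnings'] = 3 * df['BreakPointsOpportunities'] + rng.normal(scale=0.5, size=n)

# Fit each candidate column against Winnings and record its R^2
scores = {}
for col in df.columns.drop('Winnings'):
    model = LinearRegression().fit(df[[col]], df['Winnings'])
    scores[col] = model.score(df[[col]], df['Winnings'])

# Rank features by R^2, best first
for col, r2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f'{col}: {r2:.3f}')
```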

Hello all. Here is my Git project link, and that is a link to my Codecademy project. Please review my project; I would be glad to receive any comments. Thanks.

Here is my code for this project. I ran a couple of nested loops to try every combination; you can view the results in the variable explorer. My multi-feature regression ended up with R² = 0.87. I found 14 two-feature regressions vs Winnings with R² > 0.85, while at most one of the single-feature regressions reached R² > 0.85, depending on the run.

Let me know what you think.


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# load and investigate the data here:

df = pd.read_csv('tennis_stats.csv')  # read_csv already returns a DataFrame

def calc_R(x_values, y_values):
    x_train, x_test, y_train, y_test = train_test_split(x_values, y_values, train_size = 0.80, test_size = 0.20)
    slr = LinearRegression()
    slr.fit(x_train, y_train)
    # score the fitted model on the held-out test data, not the training data
    R2 = slr.score(x_test, y_test)
    return R2
#Skips string filled series
labels = df.columns[2:25]
print(list(labels))
# perform exploratory analysis here:





#Performs singular regression against Winnings
#I found the best fit was Return games played vs Winnings
rs_winnings = []
for i in range(len(labels)):
    x = df[[labels[i]]]
    y = df['Winnings']
    R = calc_R(x,y)
    if (R > 0.85):
        rs_winnings.append([R, labels[i]])

#Running singular regression on winnings as the y shows us one or zero parameters with an R > 0.85
print(rs_winnings)
print()

## perform single feature linear regressions here:
#I made a nested loop to perform single-feature regression on every parameter combo
#I stored this in a dictionary where each key is the parameter used as y and each value is a list of (R, parameter) pairs
#the best y-axis parameters can be read off from the dictionary
rs=[]
rs_dict={}
for i in range(len(labels)):
    for ii in range(len(labels)):
        if (i == ii):
            pass
        else:
            x = df[[labels[ii]]]
            y = df[[labels[i]]]
            R = calc_R(x,y)
            if R > 0.7:
                rs.append([R, labels[ii]])
    rs_dict[labels[i]] = rs
    rs = []
    

## perform two feature linear regressions here:
#runs nested loop to test each two parameter combo vs winnings
#14 two-feature linear regressions had R > 0.85, while only one or zero single-feature regressions exceeded that
rs_winnings_double = []
for i in range(len(labels)):
    for ii in range(len(labels)):
        if (i == ii):
            pass
        elif ((labels[i] == 'Winnings') | (labels[ii] == 'Winnings')):
            pass
        else:
            x = df[[labels[i], labels[ii]]]
            y = df['Winnings']
            R = calc_R(x,y)
            if (R > 0.85):
                rs_winnings_double.append([R, labels[i], labels[ii]])

## perform multiple feature linear regressions here:
labels_no_winning = list(labels)
labels_no_winning.remove('Winnings')
print(labels_no_winning)

X = df[labels_no_winning]
y = df['Winnings']
R = calc_R(X,y)
print(R)

#using all parameters for x values allows us to get a good R of 0.87
#these results are comparable to a few of the two feature linear regressions
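As an aside, the nested pair loop above can also be written with itertools.combinations, which yields each unordered pair exactly once and removes the need for the i == ii check. A minimal sketch on synthetic data (the columns 'A', 'B', 'C' are placeholders, not the real tennis columns):

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; swap in the real tennis_stats.csv columns
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=['A', 'B', 'C'])
df['Winnings'] = 2 * df['A'] + 3 * df['B'] + rng.normal(scale=0.5, size=n)

features = ['A', 'B', 'C']
results = []
for pair in combinations(features, 2):   # each unordered pair exactly once
    X_train, X_test, y_train, y_test = train_test_split(
        df[list(pair)], df['Winnings'], train_size=0.8, random_state=0)
    r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
    results.append((r2, pair))

results.sort(reverse=True)   # best-scoring pair first
print(results[0])
```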

Haven’t heard of .clf()! Better than creating subplots for quickly generating plots!
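For anyone else who hadn't met it: plt.clf() clears the current figure so the next plot starts from a blank canvas, rather than drawing on top of the previous one. A small self-contained sketch (using the non-interactive Agg backend instead of plt.show()):

```python
import matplotlib
matplotlib.use('Agg')          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

plt.scatter([1, 2, 3], [4, 5, 6])
plt.title('First plot')
n_before = len(plt.gca().collections)   # one scatter on the figure

plt.clf()                      # wipe the current figure, keep reusing it

plt.scatter([1, 2, 3], [6, 5, 4])
plt.title('Second plot')
n_after = len(plt.gca().collections)    # still one: the old scatter is gone
print(n_before, n_after)
```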