Tennis Ace Challenge Project (Python)

Hello Data Scientists and Machine Learning Specialists,

The following is my GitHub Repository of the completed Tennis Ace Challenge Project.

I found that if I refit the regression models multiple times, the results of the two-feature and multiple-feature regression models change on every run. Only the single-feature regression models are consistent, if we use an accuracy threshold of more than 70%.
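One likely cause, assuming the train/test split is redrawn on every run: `train_test_split` shuffles the rows randomly unless `random_state` is fixed, so each run fits and scores on a different split. Below is a minimal sketch on synthetic data. The data and the helper name `fit_and_score` are made up for illustration and are not taken from the repository.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (not the tennis dataset): two features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=200)


def fit_and_score(seed=None):
    """Hypothetical helper: split, fit, and return R^2 on the test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)


# Without a seed, every call draws a new split, so the score moves around.
unseeded_scores = [fit_and_score() for _ in range(5)]

# With a fixed seed, the split (and therefore the score) is reproducible.
seeded_a = fit_and_score(seed=6)
seeded_b = fit_and_score(seed=6)

# Cross-validation averages over several splits, which gives a steadier estimate.
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(unseeded_scores)
print(seeded_a, seeded_b)
print(cv_scores.mean())
```

Comparing models on the mean cross-validated score, or at least on the same fixed split, should make the comparison less dependent on one lucky or unlucky shuffle.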

Hope my findings are useful.
Appreciate your feedback!

Happy Coding!

:disguised_face: :cowboy_hat_face: :nerd_face:

Hello everyone,

Here is the link to my solution for the Tennis Ace Challenge Project (Python): Tennis Ace Challenge code and results (Google Colab notebook).

I am looking forward to hearing your suggestions and tips that will help to improve my approach and the code as well.

(In case you can’t access the notebook, feel free to contact me. I will be more than happy to answer).

May the Data be with you!

Hey Everyone,

This is the link to my Tennis Ace project on GitHub. I completed it in a Jupyter notebook.

At each step I iterated over the features to determine which had the highest test scores and were therefore the best features to use.
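For readers who want to try the same idea, here is a minimal sketch of such a loop for the single-feature step. It runs on synthetic data with placeholder column names (`strong`, `weak`, `noise`, `target`), so it illustrates the approach and is not the code from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the column names are placeholders, not the real dataset.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 5.0 * df["strong"] + 1.0 * df["weak"] + rng.normal(scale=1.0, size=n)

features = ["strong", "weak", "noise"]
test_scores = {}
for feature in features:
    # Same seed for every feature, so all models are scored on the same test rows.
    X_train, X_test, y_train, y_test = train_test_split(
        df[[feature]], df["target"], train_size=0.8, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    test_scores[feature] = model.score(X_test, y_test)

# Rank features from highest to lowest test R^2.
ranking = sorted(test_scores, key=test_scores.get, reverse=True)
print(ranking)
print(test_scores)
```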

Questions and feedback welcome.

Cheers,

Tom

Here is my code: Gist_Tennis
Please review this and give me feedback. Thank you in advance.

Check my solution for this project:

Tennis Aces Project

1 - How can a player have 0 wins and 0 losses and still have non-zero values for other statistics? Or how can a player have played 0 serving games and still have 2 wins and 2 losses? I assume some aspects of the data are faulty and that some type of cleaning might make for a more accurate predictive model. However, although inconsistencies like these are obvious, there may be other incorrect data that cannot be distinguished so easily.

2 - Each player appears multiple times through the years. For a given player, the values of most variables change from year to year, except for “Ranking”.
Might that variable correspond to the player’s ranking in the last year of recorded data? And if so, suppose the player’s most recent entry is, let’s say, 2013, and the most recent entry in the entire dataset is 2017: does the ranking of that player correspond to 2013 or to 2017?
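Both observations can be checked with a few lines of pandas. The sketch below uses made-up rows. `Player` and `Year` are assumed column names, while the other columns are the ones mentioned in this thread. It only flags the two inconsistencies described in question 1 and confirms the pattern from question 2. It cannot tell which year the ranking refers to.

```python
import pandas as pd

# Made-up rows; "Player" and "Year" are assumed column names.
players = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Year": [2011, 2012, 2013, 2016, 2017, 2015],
    "Ranking": [50, 50, 50, 12, 12, 300],
    "Wins": [0, 2, 10, 20, 25, 0],
    "Losses": [0, 2, 5, 8, 6, 0],
    "ServiceGamesPlayed": [12, 0, 150, 300, 320, 0],
    "Aces": [5, 3, 40, 200, 250, 0],
})

# Question 1, case a: no matches recorded, yet other statistics are non-zero.
no_matches = (players["Wins"] == 0) & (players["Losses"] == 0)
has_stats = (players[["ServiceGamesPlayed", "Aces"]] != 0).any(axis=1)

# Question 1, case b: matches recorded, yet zero serving games played.
has_matches = (players["Wins"] + players["Losses"]) > 0
no_service_games = players["ServiceGamesPlayed"] == 0

suspect = (no_matches & has_stats) | (has_matches & no_service_games)
print(players[suspect])

# Question 2: does each player carry a single Ranking value across all years?
per_player = players.groupby("Player").agg(
    distinct_rankings=("Ranking", "nunique"),
    last_year=("Year", "max"),
)
ranking_is_constant = bool((per_player["distinct_rankings"] == 1).all())
print(per_player)
print(ranking_is_constant)
```

Whether the flagged rows should be dropped or kept is a judgment call. Fitting the model both with and without them would show how much they matter.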

Here is my attempt at this project. I’d love feedback!

Hello Everyone, check out my solution to the Tennis Ace Project. I appreciate any feedback!

Hi. This is my project’s code. Any feedback is greatly appreciated. Thanks.

import codecademylib3_seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# load and investigate the data here:

df = pd.read_csv("tennis_stats.csv")
print(df.head())
print(df.info())

# perform exploratory analysis here:

plt.scatter(df["FirstServe"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["FirstServePointsWon"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["Aces"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["BreakPointsConverted"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["BreakPointsFaced"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["TotalPointsWon"], df["Ranking"])
plt.show()
plt.clf()
plt.scatter(df["Winnings"], df["Ranking"])
plt.show()
plt.clf()

# perform single feature linear regressions here:

X = df[["FirstServe"]]
y = df[["Winnings"]]
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 6)
line_fitter = LinearRegression()
line_fitter.fit(x_train, y_train)
# print the R^2 score on the held-out test set (a bare expression shows nothing when run as a script)
print(line_fitter.score(x_test, y_test))

y_predict = line_fitter.predict(x_test)
plt.scatter(x_test, y_test)
plt.plot(x_test, y_predict)
plt.show()
plt.clf()
plt.scatter(y_test, y_predict, alpha=0.4)
plt.show()
plt.clf()
#Number 5
feature5 = df[["TotalPointsWon"]]
outcome5 = df[["Winnings"]]
feature5_train, feature5_test, outcome5_train, outcome5_test = train_test_split(feature5, outcome5, train_size = 0.8, test_size = 0.2)
number_5 = LinearRegression()
number_5.fit(feature5_train, outcome5_train)
outcome5_predict = number_5.predict(feature5_test)
plt.scatter(outcome5_test, outcome5_predict, alpha=0.4)
plt.show()
plt.clf()

best_feature = df[["BreakPointsOpportunities"]]
best_outcome = df[["Winnings"]]
best_feature_train, best_feature_test, best_outcome_train, best_outcome_test = train_test_split(best_feature, best_outcome, train_size = 0.8, test_size = 0.2)
best_fitting = LinearRegression()
best_fitting.fit(best_feature_train, best_outcome_train)
best_outcome_predict = best_fitting.predict(best_feature_test)
plt.scatter(best_outcome_test, best_outcome_predict, alpha=0.4)
plt.show()
plt.clf()

#line_fitter.score(x_test, y_test)

# perform two feature linear regressions here:

#Number 6
x_number_6 = df[["Wins", "Ranking"]]
y_number_6 = df[["Winnings"]]
x_number_6_train, x_number_6_test, y_number_6_train, y_number_6_test = train_test_split(x_number_6, y_number_6, train_size = 0.8, test_size = 0.2)
number_6 = LinearRegression()
number_6.fit(x_number_6_train, y_number_6_train)
y_number_6_predict = number_6.predict(x_number_6_test)
plt.scatter(y_number_6_test, y_number_6_predict, alpha=0.4)
plt.show()
plt.clf()
print(number_6.score(x_number_6, y_number_6))

feature_6_2 = df[["FirstServePointsWon", "BreakPointsConverted"]]
outcome_6_2 = df[["Winnings"]]
feature_6_2_train, feature_6_2_test, outcome_6_2_train, outcome_6_2_test = train_test_split(feature_6_2, outcome_6_2, train_size = 0.8)
model_6_2 = LinearRegression()
model_6_2.fit(feature_6_2_train, outcome_6_2_train)
outcome_6_2_predict = model_6_2.predict(feature_6_2_test)
plt.scatter(outcome_6_2_predict, outcome_6_2_test, alpha=0.4)
plt.show()
plt.clf()
print(model_6_2.score(feature_6_2, outcome_6_2))

# perform multiple feature linear regressions here:

#Number 7
features_7_1 = df[["FirstServe", "BreakPointsConverted", "ServiceGamesWon"]]
outcomes_7_1 = df[["Winnings"]]
features_7_1_train, features_7_1_test, outcomes_7_1_train, outcomes_7_1_test = train_test_split(features_7_1, outcomes_7_1, train_size = 0.8, test_size = 0.2)
model_7_1 = LinearRegression()
model_7_1.fit(features_7_1_train, outcomes_7_1_train)
outcomes_7_1_predict = model_7_1.predict(features_7_1_test)
plt.scatter(outcomes_7_1_test, outcomes_7_1_predict, alpha=0.4)
plt.show()
plt.clf()
print(model_7_1.score(features_7_1, outcomes_7_1))

features_7_2 = df[["BreakPointsConverted", "Wins", "Ranking"]]
outcomes_7_2 = df[["Winnings"]]
features_7_2_train, features_7_2_test, outcomes_7_2_train, outcomes_7_2_test = train_test_split(features_7_2, outcomes_7_2, train_size = 0.8, test_size = 0.2)
model_7_2 = LinearRegression()
model_7_2.fit(features_7_2_train, outcomes_7_2_train)
outcomes_7_2_predict = model_7_2.predict(features_7_2_test)
plt.scatter(outcomes_7_2_test, outcomes_7_2_predict, alpha=0.4)
plt.show()
plt.clf()
print(model_7_2.score(features_7_2, outcomes_7_2))

My solution for this project:

In this project I spent quite a lot of time plotting the data in different ways, so in case anyone finds it interesting, here is the link to my GitHub: GitHub - Morsianeren/codecademy_courses: Repository for completing projects in Codecademy

Congrats on finishing the project.
It’s great that you added comments in your code, but it might be a good idea to show the results of your plots, functions, etc. so anyone reading the notebook can follow along with your analysis and see the relationships (or lack thereof) in the data. It might also be a good idea to add a quick blurb at the top of the notebook as to what the details of the project/dataset are, or add a readme file.

I used this loop to plot all potential predictors against wins and losses to explore correlations. Here you can see two of the resulting plots. Some had shapes resembling a normal distribution, where a relationship wasn’t clear, while others were closer to a line: as the numbers went up, losses would curve down while wins would curve up.

import pandas as pd
import matplotlib.pyplot as plt

# load and investigate the data here:

players = pd.read_csv('tennis_stats.csv')

print(players.head())

print(players.info())

# Potential predicting variables

potential_predictors = ['FirstServe', 'FirstServePointsWon', 'FirstServeReturnPointsWon', 'SecondServePointsWon', 'SecondServeReturnPointsWon', 'Aces', 'BreakPointsConverted', 'BreakPointsFaced', 'BreakPointsOpportunities', 'BreakPointsSaved', 'DoubleFaults', 'ReturnGamesPlayed', 'ReturnGamesWon', 'ReturnPointsWon', 'ServiceGamesPlayed', 'ServiceGamesWon', 'TotalPointsWon', 'TotalServicePointsWon']

# Outcomes

wins = players['Wins']
losses = players['Losses']

# perform exploratory analysis here:

for predictor in potential_predictors:
    X = players[predictor].values.reshape(-1, 1)
    # wins in green, losses in red, on the same axes
    plt.scatter(X, wins, c='g', alpha=0.1)
    plt.scatter(X, losses, c='r', alpha=0.1)
    plt.title(f'{predictor} vs. Wins / Losses')
    plt.xlabel(predictor)
    plt.ylabel('Wins/Losses')
    plt.show()
    plt.clf()
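As a numeric complement to the scatter plots, the strength of each linear relationship can be summarised with a Pearson correlation. This is a minimal sketch on synthetic values. Only the column names are borrowed from the list above, and the numbers are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the players table (values are made up).
rng = np.random.default_rng(2)
n = 200
players = pd.DataFrame({"BreakPointsOpportunities": rng.integers(0, 300, size=n)})
players["Wins"] = (
    0.1 * players["BreakPointsOpportunities"] + rng.normal(scale=2.0, size=n)
).round().clip(lower=0)
players["FirstServe"] = rng.uniform(0.4, 0.8, size=n)

predictors = ["BreakPointsOpportunities", "FirstServe"]

# Pearson correlation of each predictor with Wins, strongest (by magnitude) first.
correlations = players[predictors].corrwith(players["Wins"]).sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(correlations)
```

A correlation near zero matches the plots where no relationship was visible, although Pearson correlation only captures linear trends and can miss curved ones.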

My best model using multiple variables for the Winnings outcome:
['Aces', 'BreakPointsFaced', 'BreakPointsOpportunities', 'DoubleFaults', 'ReturnGamesPlayed', 'ServiceGamesPlayed']

I thought that maybe the computer could find a better-performing model for the Winnings outcome than my own logic could. So I coded a small random subset selector and let it run 1000 times (it doesn’t keep track of past subsets, so some trials were duplicates). It found one that scored 0.001 better:
['BreakPointsOpportunities', 'ServiceGamesWon', 'TotalServicePointsWon', 'FirstServePointsWon', 'ReturnGamesPlayed', 'BreakPointsFaced', 'BreakPointsSaved', 'ReturnGamesWon', 'SecondServePointsWon', 'ReturnPointsWon', 'DoubleFaults']

The random selector also found a really neat one for the Losses outcome!
['BreakPointsSaved', 'ServiceGamesPlayed', 'ReturnGamesWon', 'BreakPointsFaced', 'ReturnPointsWon', 'BreakPointsConverted', 'ReturnGamesPlayed', 'Aces', 'BreakPointsOpportunities', 'SecondServeReturnPointsWon', 'FirstServeReturnPointsWon']

Where can I place some bets on tennis tournaments for the losing player?

The code:
import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: winnings is the 'Winnings' column (it is not defined in the snippet as posted)
winnings = players['Winnings']

best_score = 0
best_subset = []
for i in range(1000):
    # pick a random subset size, then a random subset of predictors of that size
    num_subset = random.randint(1, len(potential_predictors))
    subset = random.sample(potential_predictors, num_subset)

    X = players[subset]
    y = winnings
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=8)
    regr = LinearRegression()
    regr.fit(X_train, y_train)
    score = round(regr.score(X_test, y_test), 3)
    # keep the subset with the highest test score seen so far
    if score > best_score:
        best_score = score
        best_subset = subset

# refit on the best subset found and plot predicted vs. actual values
X = players[best_subset]
y = winnings
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=8)
regr = LinearRegression()
regr.fit(X_train, y_train)
score = round(regr.score(X_test, y_test), 3)

m = regr.coef_
b = regr.intercept_
y_predict = regr.predict(X_test)

plt.scatter(y_test, y_predict, alpha=0.5)
plt.title(f'Best Model performance: {score}')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.clf()

print(best_subset)
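The repeated trials mentioned above could be avoided by remembering which subsets were already evaluated. Here is a minimal sketch of that variation on synthetic data. The columns `f0` to `f4`, the `seen` set, and the trial count are illustrative choices and not part of the code above.

```python
import random

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; column names f0..f4 are placeholders.
rng = np.random.default_rng(3)
n = 300
data = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"f{i}" for i in range(5)])
target = 4.0 * data["f0"] + 2.0 * data["f1"] + rng.normal(scale=1.0, size=n)
candidates = list(data.columns)

random.seed(0)  # make the sampled subsets reproducible
seen = set()    # subsets already evaluated, stored order-independently
best_score, best_subset = float("-inf"), None

for _ in range(200):
    size = random.randint(1, len(candidates))
    subset = frozenset(random.sample(candidates, size))
    if subset in seen:
        continue  # skip repeats instead of refitting the same model
    seen.add(subset)

    columns = sorted(subset)
    X_train, X_test, y_train, y_test = train_test_split(
        data[columns], target, train_size=0.8, random_state=8
    )
    score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
    if score > best_score:
        best_score, best_subset = score, columns

print(len(seen), "distinct subsets evaluated")
print(best_subset, round(best_score, 3))
```

One caveat with any search like this: the winning score is the maximum over many trials on the same test split, so it is an optimistic estimate. A difference of 0.001 between two subsets is probably within the noise of the split.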

Hi Everyone, Ben here from the UK. Just finished my Tennis Ace project submission for the Data Science Learning Path and I am now 78% complete. Any feedback would be hugely appreciated. Thank you all!!

My GitHub repository for the Tennis Ace project

Hi all,

I recently finished the Tennis Stats linear regression project. If anybody has feedback, I would love to hear your thoughts!

Gideon: Tennis Stats Linear Regression

Thank you!