Decision Trees Project - Find the Flag: DecisionTreeClassifier.score changes with train_test_split?

During the Decision Trees Project - Find the Flag, question 10 asks you to write a for loop that iterates over different max_depth values, each time creating, training, and testing a decision tree and recording the resulting DecisionTreeClassifier.score. In the next two questions these scores are stored and graphed. Finally, question 13 asks you to say something meaningful about this graph, and here I run into my issue.

Each time I run the code, the graph is markedly different, which prevents me from giving a meaningful answer to the question AND, more importantly, raises a question of my own:
Is this correct? Why is the graph markedly different on each run? Is it because train_test_split produces a different split each time, and the sample size is "too" small for different splits to yield a similar graph?

Below is the code:

import codecademylib3_seaborn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

flags = pd.read_csv('flags.csv', header = 0)
labels = flags[['Landmass']]
data = flags[["Red", "Green", "Blue", "Gold", "White", "Black", 
"Orange", "Circles", "Crosses", "Saltires", "Quarters", 
"Sunstars", "Crescent", "Triangle"]]

train_data, test_data, train_labels, test_labels = train_test_split(data, labels)

scores = []
for i in range(1, 20):
  # build a tree of depth i, train it, and record its accuracy on the test set
  tree = DecisionTreeClassifier(random_state = 1, max_depth = i)
  tree.fit(train_data, train_labels)
  scores.append(tree.score(test_data, test_labels))

plt.plot(range(1, 20), scores)
plt.show()
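To test my own hypothesis, I put together the small sketch below (the toy arrays just stand in for the flags data). It shows that train_test_split draws a fresh random split on every call unless you pass a fixed random_state, which would explain why the tree's score, and hence the graph, changes between runs:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data standing in for the flags dataset
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# with a fixed random_state the split is identical on every call
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X, y, random_state=1)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, random_state=1)
assert (X_tr1 == X_tr2).all() and (y_te1 == y_te2).all()

# without random_state, each call shuffles anew, so the split
# (and any score computed on it) can differ from run to run
```

So if I understand correctly, adding random_state to the train_test_split call in my code should at least make the graph reproducible, even if any single split is still noisy on a dataset this small.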