Complex JSON to CSV

Since the NHL season is about to start I am planning on downloading some stats and team info through the NHL API. However, I got stuck trying to get the first JSON response to a CSV for further analysis. I have Googled this quite a bit and the code for flattening the json/dict is from this post https://stackoverflow.com/questions/41180960/convert-nested-json-to-csv-file-in-python. However, the CSV that I get only has two cells. “copyright” (Cell A2) and “teams” (A3) so it doesnt flatten the dict as I want it to.

The desired result would be to like this (this is a shortened version):
id, name, link, venue_id, venue_name, venue_link…
1, New Jersey Devils, /api/v1/teams/1, Prudential Center, /api/v1/venues/null
2, New York Islanders, /api/v1/teams/2, 5026, Barclays Center, /api/v1/venues/5026
Notice how the first team did not have the “Venue.ID” which makes things even more difficult.

Here is the code that should work apart from not being able to have a CSV friendly dict.

import json
import csv
import requests

response = requests.get('https://statsapi.web.nhl.com/api/v1/teams')
json_data = response.json()

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = {}
        for i in item.keys():
            leaves.update(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = {}
        for i in item:
            leaves.update(get_leaves(i, key))
        return leaves
    else:
        return {key : item}

# First parse all entries to get the complete fieldname list
fieldnames = set()

for entry in json_data:
    fieldnames.update(get_leaves(entry).keys())

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.DictWriter(f_output, fieldnames=sorted(fieldnames))
    csv_output.writeheader()
    csv_output.writerows(get_leaves(entry) for entry in json_data)

If you want 1 row per team, a start would be to iterate through the teams

2 Likes

(type(teams_dict[‘teams’])) is a list and for example
type(teams_dict[‘teams’][1]) is a dict, so I have trouble looping through these two types of data structures and picking the fields I want.

I can get the “1st level” data such as team id, name etc out, but what about the inner data, for example timezone.id?
https://statsapi.web.nhl.com/api/v1/teams

I have been banging my head against this virtual wall for a better part of the week but havent gained much progress :confused:

dicts are collections of keyvalue pairs.
if you give a dict a key, it gives you the corresponding value back

mydict = {'a': 3}
print(mydict['a'])  # 3

If they’re nested then keep doing the same thing.

Thanks, but I am fully aware of that as I have done the Python 3 “basic course” here. In practice, if inside a list there are dicts as a value for the outer dicts the looping just isnt so easy. At least not for me.

Could you give an example how you would get any of the values from the inner dicts, for example venue.id?

Inside the dict you have dicts and lists. It’s the same thing, same operations. Nothing new.

If you’re able to obtain dicts from the outermost dict, then you have dicts, meaning you can obtain their values…and so on.

So I think we are both on the same boat here. Its easy to think its easy, but its not, and I assume you actually also have no idea how to get it done in practise :slight_smile: thanks anyway.

@leville,

I assure you with 100% certainty that @ionatan knows how to:

His point is that if you know how to access a value in a dictionary by it’s key, then you already posses the knowledge to access nested dictionary values. Here’s a hint if you don’t want to google it.

Hint
mydict = {
    'a': 3,
    'b': {
       'c': 4,
       'd': 5,
       'e': {
          'f': 6  
        }
    }


}
print(mydict['a'])  # 3
print(mydict['b']['e']['f']) # 6
1 Like

I’ve read through the topic and indeed @ionatan is correct about the fact that this JSON response is basically a dictionary with some keys being assigned value of other dictionaries and this goes a couple of “dimentios” down. (Although, I admit it was a bit of a confusing reply to read…:thinking:)
As for the moment, I have only completed Python3 course in 66%, so there could be a more optimized way to do this, but just using what I already learned I was able to accomplish all the described results.

First things first - I think there is no other way to build a code that works with JSON file from API without analyzing the structure of the reply. The structure of the data here and it’s formatting is very simple from python3 programmer’s point of view
we receive:

  1. a dictionary that consists of 2 keys (with names copyright and teams),
  2. the values of those keys both happen to also be dictionaries
  3. these have “inside” more specific keys and values such as team name, team id, api link but also keys of which value are other dictionaries e.g as the dictionary under key name venue
  4. And those, in turn, have their corresponding keys and values.
  • EDIT: If you’re not sure what do I mean, see the reply to your ‘‘timeZone’’ question in EDIT: TimeZone.

We’re interested in teams, so we look at the response[‘teams’].
It’s easy to see that some of the team information is available here, right away by just calling the dictionary with the appropriate key name, but the venue information, which we also want to retrieve selectively is included all together in the “venue” dictionary.
All we have to do is extract the values by using key names some if statements and just one loop.
I’ve written the code for you quickly (it was a nice exercise) and with extensive commentary so you should be able to clearly understand everything and be able to make changes (which you will probably want to make…).

For best readability copy and paste this code to your text editor to read it, it’s going to be way more comfortable, becuse of the lenght of the comment lines cause scroll bar on the bottom in the post… :confused:

import requests

###########################JSON TO PYTHON LST FOR NHL.COM API##############################
#======Loading API responce============================================================
response = requests.get('https://statsapi.web.nhl.com/api/v1/teams')
if response.status_code != 200: #checks if json was properly downloaded
	print("Promblem retreiving data from API, status code: "+response.status_code)
#=======intializing variables==========================================================
full_list = []
jsondata = response.json()	#selecting the actual data from JSON file
teams = jsondata['teams']	#choosing the 'teams' dictionary from the available data
for t in teams: #loading details of each team in teams data
#====== Initiating lists to hold read data ============================================
#i've used extra lists for easier debuging,
#in reality it can be done with just one list
	team_details = [] #will hold all wanted information about specific team
	venue_details = [] #will hold all wanted information about the specific team's venue
	complete_entry = [] #will hold compiled entry of info about the team and it's venue

#====== Creating Team Details List ====================================================
	if 'id' in t: #checks if there is an 'id' key in the current team data
		team_id = t['id'] #if it's found team_id is equal to the value of that team id
	else: #otherwise the team ID is set 'manually' to 0
		#print("No team id found, setting to 0")
		team_id = 0
	if 'name' in t: #checks for presence of 'team name', if not found set manually
		team_name = t['name']
	else:
		#print("No team name found, setting to Unknown")
		team_name = 'Unknown'
	if 'link' in t: #checks for presence of api link
		team_api  = t['link']
	else:
		#print("No team API found, setting to null")
		team_api = 'null'
	team_details.extend([team_id, team_name, team_api]) #combines the team details into a list
	#print(team_details)
#==========Team Details List Complete=========================================================

#====== Creating Venue Details List ==========================================================
	if 'venue' in t: #checks if there is venue entry for the team
		venue = t['venue'] #if there is we take those details into a variable
		if 'id' in venue: #checks venue details for ID
			venue_id = venue['id']
		else:
			#print("no venue id available, setting to 0")
			venue_id = 0
		if 'name' in venue: #checks venue details for name
			venue_name = venue['name']
		else:
			#print("Name not present, setting to Unknown")
			venue_name = 'Unknown'
		if 'link' in venue: #checks for api
			venue_link = venue['link']
		else:
			#print("no venue api, setting to null")
			venue_link = 'null'
		venue_details.extend([venue_id, venue_name, venue_link])
	else: #if there was no venue details at all it sets the complete details infromation to
		venue_details = [0,"No venue details found","No venue details found"]
	#print(venue_details)
#==========Venue Details List Complete====================================================	

#===========Building entry==============================================================
	complete_entry = team_details+venue_details
#===========Entry complete==============================================================

#==========Adding entry to the final list==================================================
	full_list.append(complete_entry)
#============FULL LIST COMPLETE===========================================================
#print(full_list)

#========QUERY EXAMPLE AND FORMAT==============================
#each team and it's venue information is at separate index
# each index has 6 sub index holding the information such as:
# 0 - team id, 1 - team name, 2 - team_api,
# 3 - venue_id, 4 - venue name, 5 - venue_api

#to print a complete information for a single team at index 0 (and leave an empty line under):
print(str(full_list[0])+".\n")
#to print team name and venue name of a team at index 10:
index10 = full_list[10]
print("Team "+index10[1]+", playing at home at "+index10[4]+".\n")


#to find team's venue name by typing some part of the team's name:
for x in full_list:
	team_src = 'Oil'#change str value to search for name of different team's venue, you can also drop an input command here instead
	current_team_name = x[1] #setting team's name to the currently loaded team's name
	if team_src in current_team_name: #checks if the search querry is in the current team name
		print("Found a match: "+current_team_name+ ". Searched phrase: "+team_src+".\n")
		print("Team's venu name: "+x[4]) # fith position holds the team name

#to call venue API by looking up venues name:
for x in full_list:
	venue_name = 'One'#change str value to search for api of different venue, you can drop an input here instead
	current_venue_name = x[4] #setting current venue name to the name currently loaded in the loop variable
	if venue_name in current_venue_name: #checks if the give venue name is included in the currently loaded venue name
		print("Found a match: "+current_venue_name)#if it finds a match it gives the information which name matched the querry
		print("Pulling out API: "+x[5]+"\n") #and prints the api which is at sixth position



#Feel free to make these into functions or whatever is more comfortable,
#I didn't bother, just wanted to show you the mechanics of worknig solutions

EDIT: timeZone

If you want to access the time zone ID or any other nested value you will need to apropriately extract it. To to this - again -we need to follow the reply structure.
Let’s see a whole respense for a single team using this code:

response = requests.get('https://statsapi.web.nhl.com/api/v1/teams')
jsondata = response.json()
only_teams_data = jsondata['teams']
our_example_team = only_teams_data[1]  #an example team, at index 1
print(our_example_team) 

The result:

{'id': 2, 'name': 'New York Islanders', 'link': '/api/v1/teams/2', 'venue': {'id': 5026, 'name': 'Barclays Center', 'link': '/api/v1/venues/5026', 'city': 'Brooklyn', 'timeZone': {'id': 'America/New_York', 'offset': -4, 'tz': 'EDT'}}, 'abbreviation': 'NYI', 'teamName': 'Islanders', 'locationName': 'New York', 'firstYearOfPlay': '1972', 'division': {'id': 18, 'name': 'Metropolitan', 'nameShort': 'Metro', 'link': '/api/v1/divisions/18', 'abbreviation': 'M'}, 'conference': {'id': 6, 'name': 'Eastern', 'link': '/api/v1/conferences/6'}, 'franchise': {'franchiseId': 22, 'teamName': 'Islanders', 'link': '/api/v1/franchises/22'}, 'shortName': 'NY Islanders', 'officialSiteUrl': 'http://www.newyorkislanders.com/', 'franchiseId': 22, 'active': True}

We can now see that we will need to acces the venue key which is directly in the team dictionary (you can recognize a dictionary by butterfly brackets { } ):
{[BEGINNING OF THE DATA] 'venue': {'id': 5026, 'name': 'Barclays Center', 'link': '/api/v1/venues/5026', 'city': 'Brooklyn', 'timeZone': {'id': 'America/New_York', 'offset': -4, 'tz': 'EDT'}}, [...REST OF THE DATA]}

We can assing that part of the responce to a variable using command:

venue_data = our_example_team['venue']

So now our venue_data holds the following data:
{'id': 5026, 'name': 'Barclays Center', 'link': '/api/v1/venues/5026', 'city': 'Brooklyn', 'timeZone': {'id': 'America/New_York', 'offset': -4, 'tz': 'EDT'}}

Our timeZone details are in a dictionary assigned as value of the timeZone key.
In order to acces the key we use the syntax:

dictionary_name['key-name']

Just like we did before to get the venue data.
In case of time zone code reads:

timezone_data = venue_data['timeZone']

Now our timezone_data variable holds:
{'id': 'America/New_York', 'offset': -4, 'tz': 'EDT'}

We’re home. Now we can use timezone[‘key-name’] to extract any information from this dictionary. For example:

timezone_name = timezone_data['id']
timezone_offset= timezone_data['offset']

Let’s print our extracted information:

print(timezone_name, timezone_offset)

The printed message will say:

America/New York -4

If you have any questions let me know!

EDIT Extracting more values:
If you’d like to extract any more values just retrieve its key parent value to a temporary dictionary and point it to the key name. Add another if statement to check if it’s key is present. I wrote If statements for each value to avoid errors in case some parts of the dataset were inconsistent or incomplete.

EDIT2 advanced: If you want to have a really flexible tool for looking into this kind of data you could probably build a function that iterates through all elements in the response JSON and checks for the number of received keys and records their names into a list and for each of them checks if the value is of the type dictionary
If it is a dictionary it’s name is recorded and then it’s key names are read and checked for the same condition. It also should check if the same key name was already recorded under that parent key name and only recorded new keys. All this would go on until the last key name in the last dictionary was recorded. All of these names (along with any other data recorded, e.g. corresponding keys datatype) are saved to a multi-dimensional list (with elements made of other lists). You can iterate a print of this list formatted in a very readable way and this would give you a complete understanding of the structure and have a full iterable keys reference that you can retrieve from any json.

1 Like

:heart_eyes:
Thank you so much @ljh1830181547. I will return your kindness on somebody else on this forum later.

I added a small piece of code to get the results on a CSV file and now I have a way to get the data I want. I am sure there is a better way to get the headers than typing them manually everytime I change which data is included, but that can wait as I really want to get this project moving forward ie. to start analyzing some actual results.

headers = ['team_id', 'team_name', 'team_api' ,'venue_id', 'venue_name', 'venue_link']

with open('191001-1-output.csv','w', newline='') as result_f:
    writer = csv.writer(result_f, dialect='excel')
    writer.writerow(headers)
    writer.writerows(full_list)
1 Like

No worries, my pleasure bro. :slight_smile:

I have also looked into other solutions, and you are 100% right that there is a better way.
It is called a recursive function.
I have been learning about it for a few hours already and honestly, it still feels so abstract…:confounded:
I understand the mechanics, but to apply it to the current example is difficult, not to say very.

So far I have managed to write a function that

  1. Reads dictionary keys
  2. Checks the values of those keys
  3. if the value is a list it converts it to a dictionary
  4. if the value is a dictionary it calls back on itself (that’s called recursion) providing newly discovered dictionary as an input, instead of the main dictionary
  5. if the value is a singular value it prints the key and the value

I’ve run it on an example dictionary I wrote by hand as it’s easier to see this way.

import requests
#======Loading API responce============================================================
response = requests.get('https://statsapi.web.nhl.com/api/v1/teams')
if response.status_code != 200: #checks if json was properly downloaded
	print("Promblem retreiving data from API, status code: "+response.status_code)
response = response.json()


#Example dictionary, small enough to understand what's up
mydic = {'disclaimer':'legal content', 'person':{'name':'baltazar', 'surname':'babuszko', 'nickname':'bladzioch',
 'favorite':{'foods':{'sovery':'mcdonalds', 'sweet':'candyflosk', 'spicy':'habaneros'}, 
 'car':'cheap skoda', 'boat':'the one on the lake', 'sweets':['sugar', 'sweetener', 'salt?']}}}
#print(mydic)


def l2d(dic): #LIST TO DICTIONARY CONVERTER -> INDICES BECOME KEYS
	if isinstance(dic, list):
		print("detected a list value, converting indexes to keys...")
		cdictionary = {}
		lindex = 0
		for each in dic: #takes every value in the list
			cdictionary.update({lindex:each}) #creates new dictionary entery, using index integers as dictionary keys
			lindex += 1 #updating list index to the next element
		print(cdictionary) #once complete debug print of the outcome dictionary
		return cdictionary	


def rf(dictionary, parent='none'):
	if isinstance(dictionary, dict): #if the received variable is a dictionary
		print("All keys available in the current dictionary at current depth: "+str(dictionary.keys())) #prints out its keys
		print("iterating keys values")
		for key, value in dictionary.items(): #cycles through all keys and corresponding values in the current dictionary
			if isinstance(value, list): #if value is a list, runs l2d converter and updates the value
				value = l2d(value)
			
			if isinstance(value, dict):# in case of every value that turns to be a dictionary as well
				print("***["+str(key)+"] key has a dictionary value, key's parent key: ["+str(parent)+"]" )
				print("opening ["+str(key)+"] dictionary to view keys") #prints notification
				rf(value, key) #calls itself (recursion) passing in arguments newly discovered dictionary and current parent key
			else:#if the keys value is not a dictionary, just print the value
				print("***["+str(key)+"] key, has a singular value: >>"+str(value)+"<< nested under parent key: ["+str(parent)+"]")

rf(mydic)
#rf(response)

If you plug your JSON result instead it will print out all keys and their values for each possible entry keeping their chronology and the parent key info. I recommend you first read mydic dictionary variable (or uncomment print(mydic)) then see function output on this dictionary, to understand how the function works, only then plug your JSON.

You could tweak this code to save the key values from the last else statement to a mono-dimensional list, notice how all the possible data runs on just three variables that you would need to retreive -key, value, parent, however, I am not sure how to make the code know, when information of separate entries is done and when it should move on to the next index…
Could have something to do with checking for change of the parent key… MAYBE.
It would be nice to get a real pro on the mic here. :smiley:
I’ll leave this tinkering for you. Let me know if you choose to follow this path at all and of course if you find results because I admit it is a difficult concept to implement and I am curious for the right solution too.

If you are interested in recursions, the best lesson I found so far on the subject is
MIT OpenCourseware - Recursion and Dictionaries
It has the “factorial of n” example, which every other recursion explanation uses, but 3 or 4 more, completely unique examples and overview of working with dictionaries and recursion.

Just found this, probably a better source to learn than anything else I found on the internet (couldn’t edit the last post anymore):
CodeCademy PRO - Learn Recursion with Python

Thanks for the link! I am sure that dictionaries would be the way to go because then you could just iterate through the keys. I think I will try to do the mini course tomorrow because that could potentially save a lot of time and blood pressure later.