Python Pandas List of Dictionaries to DataFrame

I’m new to the group and Python Pandas and I’m hoping someone can assist.

I have a folder full of nested json files (400+) which I’ve created a for loop to extract into a list of dictionaries.

import glob
import json
import os

# path of the folder holding the JSON files
path = r'xxxx'
files = os.listdir(path)
# reading all the json files
filenames = glob.glob(os.path.join(path, "*.json"))

def get_json(file):
    # load one JSON file into a Python dict
    with open(file) as f:
        data = json.load(f)
    return data

# one dict per file, in the same order as filenames
jsv2 = [get_json(i) for i in filenames]

The result is a list of nested dictionaries (all keys are included below). I’d like to build a pandas DataFrame that expands each file’s keys into columns, with the values as rows.

The jsv2 variable above produces the following nested structure. At a high level it goes as per the text below (so 6 levels of nesting per file, each level with variable keys). The detail below the typed-out JSON structure includes all the keys I’d like to expand:

Happy to share the merged data separately if it helps?

[
{File
{Country
{Course
{Times
{Race Detail
{Horse detail

The detailed breakdown and keys of each of the above is as follows:

[
    {0
        {GB
            {Kempton (AW)
                {7:20
                   {race_id :
                    date :
                    course_id :
                    course :
                    off_time :
                    race_name :
                    distance_round :
                    distance :
                    distance_f :
                    region :
                    pattern :
                    race_class :
                    type :
                    age_band :
                    rating_band :
                    prize :
                    field_size :
                    going_detailed :
                    rail_movements :
                    stalls :
                    weather :
                    going :
                    surface :
                    runners
                           {horse_id :
                            name :
                            dob :
                            age :
                            sex :
                            sex_code :
                            colour :
                            region :
                            breeder :
                            dam :
                            dam_region :
                            sire :
                            sire_region :
                            grandsire :
                            damsire :
                            damsire_region :
                            trainer :
                            trainer_location :
                            {trainer_14_days
                                runs :
                                wins :
                                percent :
                                owner :
                            prev_trainers
                            prev_owners :
                            comment :
                            spotlight :
                            quotes :
                            stable_tour :
                            number :
                            draw :
                            headgear :
                            headgear_first :
                            lbs :
                            ofr :
                            rpr :
                            ts :
                            jockey :
                            last_run :
                            form :
                            trainer_rtf :

Then each subsequent file has a similar structure to the above (see below, e.g. {1, {2 and so on):

    {1
        {GB
            {Kempton (AW)
                {7:50
    {2
        {GB
            {Kempton (AW)
                {8:20

I’ve been searching online over the last week and have tried json_normalize, explode, etc., but the prior levels of nesting get included in the column names and I end up with 6k+ columns.
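To illustrate with a toy dict (values invented): because the country/course/time keys differ per file, flattening the whole structure turns every key path into its own column, e.g.

import pandas as pd

data = {"GB": {"Kempton (AW)": {"7:20": {"race_id": 1},
                                "7:50": {"race_id": 2}}}}

pd.json_normalize(data)
# one row, columns: 'GB.Kempton (AW).7:20.race_id', 'GB.Kempton (AW).7:50.race_id'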

Please excuse my ignorance as I’m new to coding and still learning the ropes, this is likely quite straightforward for someone in the know :slightly_smiling_face:

Any assistance offered would be gratefully received.

How about using pd.DataFrame(data)? The keys would become the column names and the values would become the rows.

ex:

data = {key1: [values, ...], key2: [values, ...],
        key3: [ ], ...}

df = pd.DataFrame(data)
df.head()
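As a small side note (a tiny sketch with invented values): pd.DataFrame also accepts a list of flat dicts, creating one row per dict, which is handy once the nesting has been peeled away.

import pandas as pd

records = [{"race_id": 1, "going": "Standard"},
           {"race_id": 2, "going": "Soft"}]

pd.DataFrame(records)   # one row per dict; the keys become the columns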

Thanks for the reply. I can create a DataFrame; the problem I’m struggling with is that the data I want to access is nested several levels within it. For example, in the descriptor above, the {Country{Course{Times nestings are already included in the Race Detail nesting?

I can get it to work on one file using the following code. However, it only works on a dictionary, not on a list (a list returns TypeError: can only concatenate str (not "dict") to str), and the for loop that extracts the data from the files in the folder puts them into a list:

import json
import pandas as pd

# Opening JSON file
f = open("xxxxxxx")

# Reading from file
data = json.loads(f.read())
f.close()

cleaned_data = []

# walk the Country -> Course -> Off-time levels and collect each race dict
for country in data:
    for course in data[country]:
        for off_time in data[country][course]:
            cleaned_data.append(data[country][course][off_time])

df_racecards = pd.DataFrame(cleaned_data)
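A minimal sketch of how that loop might be extended to the list of per-file dicts (jsv2 from the first post), handling each list item as its own dictionary rather than passing the list itself:

cleaned_data = []

for file_data in jsv2:  # one dict per JSON file
    for country in file_data:
        for course in file_data[country]:
            for off_time in file_data[country][course]:
                cleaned_data.append(file_data[country][course][off_time])

df_racecards = pd.DataFrame(cleaned_data)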

Hm. I wonder if DataFrame.from_dict(data) would work for converting a nested dictionary to a df?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html

Or, using pd.DataFrame.from_dict(data, orient='columns')
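For reference, a minimal sketch (race times and values invented) of how from_dict treats a two-level dict: orient='columns' uses the outer keys as columns, while orient='index' uses them as rows.

import pandas as pd

data = {"7:20": {"race_name": "Handicap", "field_size": 8},
        "7:50": {"race_name": "Maiden", "field_size": 10}}

pd.DataFrame.from_dict(data, orient="columns")  # columns: 7:20, 7:50
pd.DataFrame.from_dict(data, orient="index")    # rows: 7:20, 7:50; columns: race_name, field_size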

Or, am I not understanding correctly?

Hi thanks again for the response.

When I try to create a DataFrame with from_dict and use json_normalize, I end up with the data expanded to columns and circa 5k rows (with column names carrying the original top-level nested dictionary keys)?

This code seems to work on a single dictionary and may explain what I’m trying to do a little better:

import json
import pandas as pd

# Opening JSON file
f = open("C:/Users/markp/OneDrive/Desktop/2022-08-10.json")

# Reading from file
data = json.loads(f.read())
f.close()

cleaned_data = []

# walk the Country -> Course -> Off-time levels and collect each race dict
for country in data:
    for course in data[country]:
        for off_time in data[country][course]:
            cleaned_data.append(data[country][course][off_time])

df_racecards = pd.DataFrame(cleaned_data)

# one row per runner, keyed by race_id
temp_df = df_racecards.set_index('race_id')['runners'].explode().reset_index()
# flatten each runner dict into columns and join on the shared default index
df1 = temp_df.join(pd.json_normalize(df_racecards['runners'].explode()))

This results in the following DataFrame; all of the JSON objects have the same data structure:

Did you pass any of the arguments for the pd.json_normalize() method?
See:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html

I wonder if adding the max_level parameter would fix this issue?
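For illustration, a minimal sketch of those arguments on a made-up record shaped like the keys listed above (all values invented, not tested against the real files):

import pandas as pd

runner = {"horse_id": 10, "name": "A",
          "trainer_14_days": {"runs": 5, "wins": 2, "percent": 40}}
race = {"race_id": 1, "going": "Standard", "runners": [runner]}

# max_level caps how deep nested dicts are flattened into dotted column names
pd.json_normalize(runner)               # ... trainer_14_days.runs, trainer_14_days.wins ...
pd.json_normalize(runner, max_level=0)  # keeps trainer_14_days as a single dict column

# record_path/meta give one row per runner and carry race-level fields alongside
pd.json_normalize([race], record_path=["runners"], meta=["race_id", "going"])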

Thanks again for the response lisalisaj, I’m really struggling to shape the data into something workable.

I’ve been trying things like transpose (df.T) followed by reset_index(), then wide_to_long and explode, and nothing I apply seems to work (I’ve been at this for a couple of weeks now, lol).

I can reshape data in things like Power Query / Power BI with ease, but it seems massively complicated in Pandas (although its raw power as a module is blatantly evident; mastering it would be fantastic, but I’m at a very early stage in the learning process).

Is there a course on Codecademy which goes into detail on json_normalize, explode, and working with nested lists/dictionaries? I have a yearly subscription and would be keen to take it and try to learn. The Pandas courses I’ve looked at are quite basic.

You have to choose what’s best. If Pandas doesn’t cut it right now, then you’re doing the right thing by using other tools to investigate the data. Have you tried loading the data into Excel or (free) Tableau? There’s also a neat (free) web tool called OpenRefine that might help you clean & transform the data.

No, not to my knowledge.
Maybe there’s some help on YouTube?

Through further massaging of the data I’ve managed to get into the nesting I need using the following code:

import pandas as pd
import glob
import os
import json
pd.set_option('display.max_rows', 5000)

# path of the folder
path = r'C:/Users/markp/OneDrive/Desktop/Daily Race Cards/Racecard_JSON/'
files = os.listdir(path)
# reading all the json files
filenames = glob.glob(os.path.join(path, "*.json"))

def get_json(file):
    with open(file, encoding= 'utf-8') as fh:
        data = json.load(fh)
        return data

cleaned_data = []

for file in filenames:
    jsv2 = get_json(file)
    cleaned_data.append(jsv2)

with open('merged_data.json', "w") as outfile:  # save to json file
    json.dump(cleaned_data, outfile)
    
length = len(cleaned_data)-1


# one row per JSON file; the columns are the country codes
df = pd.DataFrame(cleaned_data)
df = df[["GB"]].dropna()  # keep the GB column, dropping files without GB races
df = df.reset_index()
# expand the course level: one row per file/course, values are dicts of off-times
df = df.set_index('index')["GB"].apply(pd.Series).stack().reset_index()
# expand the off-time level: one row per race (the course column is dropped here)
df_2 = df.set_index('index')[0].apply(pd.Series).stack().reset_index().rename(columns= {"level_1": 'time'})
# expand each race dict into long key/value rows
df_3 = df_2.set_index('index')[0].apply(pd.Series).stack().reset_index()
df_3 = df_3.rename(columns = {'index': 'JSON File No.', 'level_1': 'Keys', 0: 'Values'})

This gives me the following dataframe with the rows repeated for each key and value:

I’ve then tried to pivot the data, e.g. transpose rows to columns, and it produces a ton of NaNs (likely because it’s trying to aggregate values, which I’d like to keep as strings and convert only once the row keys become columns themselves):

The code I used to pivot is the last two lines below:

df = pd.DataFrame(cleaned_data)
df = df[["GB"]].dropna()
df = df.reset_index()
df = df.set_index('index')["GB"].apply(pd.Series).stack().reset_index()
df_2 = df.set_index('index')[0].apply(pd.Series).stack().reset_index().rename(columns= {"level_1": 'time'})
df_3 = df_2.set_index('index')[0].apply(pd.Series).stack().reset_index()
df_3 = df_3.rename(columns = {'index': 'JSON File No.', 'level_1': 'Keys', 0: 'Values'})
df_3 = df_3.reset_index()
df_3 = df_3.pivot(index=['index'], columns = ["Keys"], values=['Values'])

What I basically want to do is make each unique element in the Keys column a new column, with the Values column becoming the row values.

I feel like I’m so close but can’t crack it. I’ve tried converting the column to string values using the line below, but it still didn’t work.

df_3 = df_3.astype({"Values": str})

Possibly unstack but maybe using a multi-index?
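One possible reading of the NaN problem: after the final reset_index() the 'index' column is just a running row number, so pivoting on it leaves each row with only one key column filled. A minimal sketch of the unstack-with-a-MultiIndex idea, assuming df_3 still has the 'JSON File No.', 'Keys' and 'Values' columns from the code above and that every race carries the same set of keys (race_no is a made-up helper column):

# number the races within each file so repeated keys land on separate rows
df_3['race_no'] = df_3.groupby(['JSON File No.', 'Keys']).cumcount()

# unstack 'Keys' from a MultiIndex: one row per (file, race), one column per key
wide = (df_3
        .set_index(['JSON File No.', 'race_no', 'Keys'])['Values']
        .unstack('Keys')
        .reset_index())

If some races are missing keys, the counter could misalign rows; in that case building the wide frame directly from the race dicts (as in the earlier df_racecards = pd.DataFrame(cleaned_data) step) may be more reliable.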