Iterating through big data, creating unique dataframes based on one column sort

Hello, this is my first post here and it's based on an issue I've run into and tried to solve at work. I'll try to summarize it precisely, as I'm having trouble wrapping my head around a preferred solution. #3 is the real stumper for me.

1.    Grab a large data file based on a parquet - no problem

2.    Select 5 columns from the parquet and create a dataframe - no problem

import pandas as pd

df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "tags__artifact"])

3.    But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Some of its values are unique, but many are duplicated across multiple rows of data. I wish to iterate through the master dataframe and create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done on it: simply the SUM of the "duration" column for that session_id. That SUM would be unique per session_id, so each sub dataframe would get a row added listing its own total along with the session_id. I'm thinking there is a nested loop that will work for me, but every effort so far has been a mess.

4.    Ultimately, I'd like a final dataframe that is a collection of these unique sub dataframes. I guess I'd need to define this final dataframe and append each new sub dataframe to it as I iterate through the data. I should be able to do that simply.

5.    Finally, write this final df to a new parquet file. That should be simple enough, so I won't need help with it.

But that is my challenge in a nutshell. The main design I'd need help with is #3. I've played with itertuples and iterrows, but I'm somehow missing exactly how to get what I want. I'll include a sample of the data I'm looking at, as it may prove useful to visualize the solution.

Thank You


Welcome to the forums!

I think the method you choose will depend on what exactly you're looking for. If you only care about the session_id and the sum of duration per session_id, that's pretty straightforward: your new DataFrame would contain just those two columns, and you would rid yourself of all the duplicates.
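For that straightforward case, a groupby-and-sum collapses everything to one row per session. A minimal sketch with made-up data (your real columns come from the parquet):

```python
import pandas as pd

# Toy stand-in for the parquet data
df = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "duration": [10.0, 20.0, 5.0],
})

# One row per session_id, with the summed duration
totals = df.groupby("session_id", as_index=False)["duration"].sum()
print(totals)
```

`as_index=False` keeps session_id as an ordinary column rather than the index, which is handy if you plan to write the result back to parquet.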

However, it looks like you want to keep the info from all columns and rows, but also add in a sum of duration. Is that correct?

If so, this may not be the best way to go:

If you do this, what will go in the other columns that do not contain the sum of duration? Nulls? I think perhaps a column rather than a row would be better in this case. You will have duplicates in that column (each record with the same session_id will carry the same sum), but that is preferable to intermittent rows that contain only one value and nulls for all the other columns.

What do you think? Again, your approach will be entirely unique to the end result you need.

Hi, yes I can see now why you would assume that. Let me clear up the goal here. I need all the data associated with a particular session_id. Where there are multiple rows of the same session_id, I need the data and the sum from the “duration” column values. So if there are 5 rows with same session ID, I need the dataframe to contain all the associated data, and a 6th row with the SUM of the duration totals from the 5 rows. When there is only 1 unique session_id row, I basically just leave it as is, and add it to the final dataframe. Thank you. Mark

I see. However, I still think it might be better to have a column rather than a row for the sum of duration. Is there a reason that wouldn’t work for you?

If you have a column, it could look like this:

ts    session_id  event  to_event  duration   sum_duration
123   001                          11780.0    117526.0
123   001                          233.0      117526.0
123   001                          105513.0   117526.0

rather than something like this:

index  ts    session_id  event  to_event  duration
1      123   001                          11780.0
2      123   001                          233.0
3      123   001                          105513.0
Total  null  001         null   null      117526.0

There is a way to add a column-total row to the bottom of a DataFrame, but the problem is that it will sum all numeric columns, not just the one:

df.loc['Column_Total'] = df.sum(numeric_only=True, axis=0)
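To see the pitfall concretely, here is a toy example; note how the total row sums every numeric column and leaves the non-numeric one null:

```python
import pandas as pd

df = pd.DataFrame({
    "session_id": ["a", "a"],
    "event": [1, 2],            # a numeric column we do NOT want totalled
    "duration": [10.0, 20.0],
})

# Appends a row that totals ALL numeric columns; session_id becomes NaN
df.loc["Column_Total"] = df.sum(numeric_only=True, axis=0)
print(df)
```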

To me, the new column solution is the best, as it will leave the DataFrame open for further calculations and also leave out any unwanted nulls or sums.

It’s a good suggestion. Do you have any thoughts on the best way to write the loop? I’ve played with some nested loops but not really getting the result that I need.

I would probably make a second DataFrame (we can call it df_sum) that is the original df, grouped by session_id and then summed. Then I’d assign df['new_column'] to the value of df_sum['duration'] where df_sum['session_id'] is the same as df['session_id'].
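That idea can be sketched two ways on toy data: the merge version I described (a separate df_sum matched back on session_id), and `groupby().transform`, which broadcasts each group's sum onto its rows in one step:

```python
import pandas as pd

df = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "duration": [10.0, 20.0, 5.0],
})

# Option 1: a second DataFrame of per-session sums, merged back on session_id
df_sum = df.groupby("session_id", as_index=False)["duration"].sum()
df_sum = df_sum.rename(columns={"duration": "sum_duration"})
merged = df.merge(df_sum, on="session_id", how="left")

# Option 2: transform broadcasts the per-group sum onto every row directly
df["sum_duration"] = df.groupby("session_id")["duration"].transform("sum")
print(df)
```

Either way you keep every original row and gain a sum_duration column, with no loop over session_ids.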

Iterating through pandas DataFrame objects is generally slow; iteration defeats the whole purpose of using a DataFrame. It is an anti-pattern and something you should only do once you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method instead of iterating through the DataFrame.

List comprehension example:

result = [(x, y, z) for x, y, z in zip(df['column1'], df['column2'], df['column3'])]