Data Consolidation: Dates all mixed up (DataFrames, NumPy, pandas)

Hi all,

I have a very large set of data in different columns in the format below.

DATE        Data    DATE.2      Data2   DATE.3      Data3   DATE.4      Data4 Data5
0   2018-01-01  2.4054  2018-01-02  9.77    2018-01-02  2695.81 2018-01-01  98  358
1   2018-01-02  2.4633  2018-01-03  9.15    2018-01-03  2713.06 2018-01-02  98  355
2   2018-01-03  2.4471  2018-01-04  9.22    2018-01-04  2723.99 2018-01-03  99  348
3   2018-01-04  2.4525  2018-01-05  9.22    2018-01-05  2743.15 2018-01-04  98  340
4   2018-01-05  2.4763  2018-01-08  9.52    2018-01-08  2747.71 2018-01-05  98  336
5   2018-01-08  2.4800  2018-01-09  10.08   2018-01-09  2751.29 2018-01-08  97  335
6   2018-01-09  2.5530  2018-01-10  9.82    2018-01-10  2748.23 2018-01-09  96  333

I need to do a wee bit of cleaning and consolidation so there is only one date column instead of four. You'll be able to see that the dates don't match up on each row, so I need to work out a way of getting the code to put N/A in a field if there is no data in the relevant date column for that day.

To give an example of the intended output:

    DATE        Data     Data2    Data3  Data4 Data5
0   2018-01-01  2.4054   N/A      N/A     98   358
1   2018-01-02  2.4633   9.77     2695.81 98   355

What I have done is as follows :

First I create a new DataFrame and consolidate the dates, then create the data columns with 'N/A' as the default value:

new = af.filter(['DATE', 'DATE.2', 'DATE.3', 'DATE.4'])
new['New Date'] = new['DATE']
new['New Date'].append(new['DATE.2']).reset_index(drop=True)
new['New Date'].append(new['DATE.3']).reset_index(drop=True)
new['New Date'].append(new['DATE.4']).reset_index(drop=True)
new = new.drop_duplicates(subset=['New Date'], keep='first')
new.drop(['DATE', 'DATE.2', 'DATE.3', 'DATE.4'], axis=1, inplace=True)
new['Data'] = 'N/A'
new['Data2'] = 'N/A'
new['Data3'] = 'N/A'
new['Data4'] = 'N/A'
new['Data5'] = 'N/A'

Here I’m attempting to insert the data from the original DataFrame (af) on the condition that the corresponding date (from the old frame) matches the consolidated date; otherwise the field should contain 'N/A'.

new['Data'] = np.where(af['DATE'] == new['New Date'], af['Data'], 'N/A')
new['Data2'] = np.where(af['DATE.2'] == new['New Date'], af['Data2'], 'N/A')
new['Data3'] = np.where(af['DATE.3'] == new['New Date'], af['Data3'], 'N/A')
new['Data4'] = np.where(af['DATE.4'] == new['New Date'], af['Data4'], 'N/A')
new['Data5'] = np.where(af['DATE.4'] == new['New Date'], af['Data5'], 'N/A')

Unfortunately, no matter what I do, when it gets to Data2 it leaves the whole column filled with 'N/A'. Any idea how to fix my problem?

Thanks in advance for your advice.


I’d be tempted to make a DataFrame based on the full date range, e.g. daterange = pd.DataFrame(index=pd.date_range('2018-01-01', '2018-01-31')) (or whatever your range is), and then just join each ['DATE', 'Data'] group to this DataFrame.

Maybe not the most efficient, but that’s the first thing that comes to mind, since that kind of join fills in NaNs automatically anyway. Do you have an example of the output you get with your current setup (maybe a Codebyte, repl or similar)?
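To make the idea concrete, here's a minimal sketch of that approach, using a cut-down two-column version of your data (column names taken from your post; the date range is just an assumption for the example):

```python
import pandas as pd

# Cut-down version of the original frame: two date/data pairs instead of four
af = pd.DataFrame({
    'DATE':   ['2018-01-01', '2018-01-02'], 'Data':  [2.4054, 2.4633],
    'DATE.2': ['2018-01-02', '2018-01-03'], 'Data2': [9.77, 9.15],
})

# One row per calendar day in the range
daterange = pd.DataFrame(index=pd.date_range('2018-01-01', '2018-01-03'))

# Left-join each (date, data...) group onto the date-range frame;
# dates with no data come through as NaN automatically
for date_col, data_cols in [('DATE', ['Data']), ('DATE.2', ['Data2'])]:
    part = af[[date_col] + data_cols].copy()
    part[date_col] = pd.to_datetime(part[date_col])
    daterange = daterange.join(part.set_index(date_col))

print(daterange)
```

This gives one row per day with NaN wherever a series has no value for that date; it also keeps days that appear in none of the columns, which may or may not be what you want.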

Hi, thanks for your response.

The output is as follows:

import numpy as np
import pandas as pd

af = pd.DataFrame({
    'DATE': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
             '2018-01-05', '2018-01-08', '2018-01-09'],
    'Data': ['2.4054', '2.4633', '2.4471', '2.4525', '2.4763', '2.4800', '2.5530'],
    'DATE.2': ['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
               '2018-01-08', '2018-01-09', '2018-01-10'],
    'Data2': ['9.77', '9.15', '9.22', '9.22', '9.52', '10.08', '9.82'],
    'DATE.3': ['2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
               '2018-01-08', '2018-01-09', '2018-01-10'],
    'Data3': ['2695.81', '2713.06', '2723.99', '2743.15', '2747.71', '2751.29', '2748.23'],
    'DATE.4': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08', '2018-01-09'],
    'Data4': ['98', '98', '99', '98', '98', '97', '96'],
    'Data5': ['358', '355', '348', '340', '336', '335', '333']})

new = af.filter(['DATE', 'DATE.2', 'DATE.3', 'DATE.4'])
new['New Date'] = new['DATE']
new['New Date'].append(new['DATE.2']).reset_index(drop=True)
new['New Date'].append(new['DATE.3']).reset_index(drop=True)
new['New Date'].append(new['DATE.4']).reset_index(drop=True)
new = new.drop_duplicates(subset=['New Date'], keep='first')
new.drop(['DATE', 'DATE.2', 'DATE.3', 'DATE.4'], axis=1, inplace=True)
new['Data'] = 'N/A'
new['Data2'] = 'N/A'
new['Data3'] = 'N/A'
new['Data4'] = 'N/A'
new['Data5'] = 'N/A'
new['Data'] = np.where(af['DATE'] == new['New Date'], af['Data'], 'N/A')
new['Data2'] = np.where(af['DATE.2'] == new['New Date'], af['Data2'], 'N/A')
new['Data3'] = np.where(af['DATE.3'] == new['New Date'], af['Data3'], 'N/A')
new['Data4'] = np.where(af['DATE.4'] == new['New Date'], af['Data4'], 'N/A')
new['Data5'] = np.where(af['DATE.4'] == new['New Date'], af['Data5'], 'N/A')
new

The code outputs:

    New Date    Data    Data2  Data3  Data4  Data5
0   2018-01-01  2.4054  N/A    N/A    98     358
1   2018-01-02  2.4633  N/A    N/A    98     355
2   2018-01-03  2.4471  N/A    N/A    99     348
3   2018-01-04  2.4525  N/A    N/A    98     340
4   2018-01-05  2.4763  N/A    N/A    98     336
5   2018-01-08  2.4800  N/A    N/A    97     335
6   2018-01-09  2.5530  N/A    N/A    96     333

If you look at the output you’ll see that it hasn’t actually included all the relevant dates; 2018-01-10 is missing in this example. This leads on to the second issue: your equality comparison works by index, i.e. row 1 of 'New Date' is compared to row 1 of 'DATE.2'. But you know these dates are out of sync, so the vast majority of comparisons evaluate to False (this is why the 'DATE' and 'DATE.4' columns are in sync with the 'New Date' column but the others are not).

You might need to find an alternative way to do your comparisons in this case, since comparing two Series like this is element-wise by position.
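A tiny example of the pitfall, with made-up dates: `==` between two Series compares position by position, whereas something like `isin` checks membership across the whole other Series.

```python
import pandas as pd

a = pd.Series(['2018-01-01', '2018-01-02', '2018-01-03'])
b = pd.Series(['2018-01-02', '2018-01-03', '2018-01-04'])

# Positional comparison: row 0 vs row 0, row 1 vs row 1, ...
# Every pair differs, so every result is False even though
# two of the dates appear in both Series.
print((a == b).tolist())    # [False, False, False]

# Membership test: is each value of a anywhere in b?
print(a.isin(b).tolist())   # [False, True, True]
```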

Managed to fix it:

# Split the original frame into one (DATE, data...) frame per date column,
# renaming each date column to 'DATE' so they can be merged on it
df_1 = af[['DATE', 'Data']]
df_2 = af[['DATE.2', 'Data2']].rename(columns={'DATE.2': 'DATE'})
df_3 = af[['DATE.3', 'Data3']].rename(columns={'DATE.3': 'DATE'})
df_4 = af[['DATE.4', 'Data4', 'Data5']].rename(columns={'DATE.4': 'DATE'})

# Outer merge keeps every date that appears in any frame,
# leaving NaN where a series has no value for that date
new = (df_1.merge(df_2, on='DATE', how='outer')
           .merge(df_3, on='DATE', how='outer')
           .merge(df_4, on='DATE', how='outer'))
new['Data'].fillna('N/A', inplace=True)
new['Data2'].fillna('N/A', inplace=True)
new['Data3'].fillna('N/A', inplace=True)
new['Data4'].fillna('N/A', inplace=True)
new['Data5'].fillna('N/A', inplace=True)
new
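As a small optional tidy-up (not a fix, the merge above already works): the five per-column `fillna` calls can be collapsed into one frame-wide call, and `sort_values` guarantees the rows come out in date order regardless of merge order. Sketched here on a cut-down two-column version of the data:

```python
import pandas as pd

# Cut-down stand-in for the real frame, just to make the sketch runnable
af = pd.DataFrame({
    'DATE':   ['2018-01-01', '2018-01-02'], 'Data':  [2.4054, 2.4633],
    'DATE.2': ['2018-01-02', '2018-01-03'], 'Data2': [9.77, 9.15],
})
df_1 = af[['DATE', 'Data']]
df_2 = af[['DATE.2', 'Data2']].rename(columns={'DATE.2': 'DATE'})

# Outer merge, then: sort by date, fill every missing cell in one
# call, and renumber the rows
new = (df_1.merge(df_2, on='DATE', how='outer')
           .sort_values('DATE')
           .fillna('N/A')
           .reset_index(drop=True))
print(new)
```

One caveat with 'N/A' as a fill value: it turns numeric columns into mixed-type object columns, so if you plan to do arithmetic afterwards it may be better to leave the NaNs in place.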