Trying to use loc on multiple criteria without success

Hi,

I’m not really sure why this is proving to be such a struggle, but I want to create a smaller subset from my main dataframe to work on.

I’m using:

import pandas as pd

mar_mort_df = pd.read_csv("mat_mort.csv")

temp = mar_mort_df.loc[(mar_mort_df['Sex'] == "Female") & (mar_mort_df['Year'] == '1987')]
print(temp)

and some sample rows from (the very large) mat_mort.csv look like this:

Region Code,Region Name,Country Code,Country Name,Year,Sex,Age group code,Age Group,Number,Percentage of cause-specific deaths out of total deaths,"Age-standardized death rate per 100 000 standard population","Death rate per 100 000 population"
EU,Europe,ALB,Albania,1987,All,Age85_over,[85+],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,All,Age_unknown,[Unknown],0.00000000,,,,
EU,Europe,ALB,Albania,1987,Female,Age_all,[All],26.00000000,0.33810143,1.74750634,1.74274415,
EU,Europe,ALB,Albania,1987,Female,Age00,[0],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,Female,Age01_04,[1-4],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,Female,Age05_09,[5-9],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,Female,Age10_14,[10-14],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,Female,Age15_19,[15-19],0.00000000,0.00000000,,0.00000000,
EU,Europe,ALB,Albania,1987,Female,Age20_24,[20-24],3.00000000,3.40909091,,2.12464589,
EU,Europe,ALB,Albania,1987,Female,Age25_29,[25-29],6.00000000,6.06060606,,4.30725054,

When I run this, it runs fine, but it returns an empty DataFrame, which suggests it isn’t finding any rows that match the criteria, and therefore isn’t including them in the new dataframe.

I suspect I’m making a very simple error, but any suggestions would be gratefully received!

Cheers

So, you’re trying to create a new DF based on certain conditions.
Maybe try np.where()? Also, is Year stored as a string? If not, lose the quotes around 1987.

Ex:

import numpy as np

# Note: np.where returns a tuple of positional index arrays, not a DataFrame
new_df = np.where((dataFrame['Sex'] == "Female") & (dataFrame['Year'] == 1987))

print(new_df)

#or, ditch the quotes around the year 1987:

temp = mar_mort_df.loc[(mar_mort_df['Sex'] == "Female") & (mar_mort_df['Year'] == 1987)]
print(temp)
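
If you’re not sure whether Year came in as text or as numbers, you can check its dtype first (assuming the same dataframe name as in your post):

print(mar_mort_df['Year'].dtype)   # 'object' generally means strings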

Thanks - I gave this a go, but no joy.

It’s so strange - this is (I think) effectively the same problem, but in this version it shows up as an empty array:

(array([], dtype=int64),)

Did you try just using print() with .loc to see what happens? Also, I think you might be missing some parens; the outer ones. See the second example below.

ex:

print(df.loc[(df['Sex'] == "Female") & (df['Year'] == 1987)])

#or, just `df.loc` to extract the rows meeting your conditions:

df.loc[((df['col1'] == 'X') & (df['col2'] == 'A'))]

See also the Statology tutorial on selecting DataFrame rows by multiple conditions with .loc.


Using print by itself produces the same null result.

The Statology link is exactly what I’m trying to achieve, and seems to be exactly how I’m doing it.

I’m wondering whether it’s something to do with the environment I’m in. I’m using Visual Studio Code (which updated today, and was a little grumpy in doing so), and I might try it in a Jupyter Notebook or on a different machine, because with so little code here, there are only so many things to try…

Ah, I’m not sure about VS Code vs. Jupyter. But it’s something to try.
(I use Google Colab.)

Are you sure that the search conditions are valid? Is 1987 actually in the Year column?
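
A quick way to check (assuming your dataframe is still called mar_mort_df) is to list the distinct values pandas actually sees in each column you're filtering on - that shows both the values and whether they're strings or numbers:

print(mar_mort_df['Year'].unique())
print(mar_mort_df['Sex'].unique())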


I hadn’t come across Google Colab and might give that a try out of interest, although initial attempts haven’t been great there, because I can’t get it to recognise that the csv file is in the same location - however, that’s another matter.

Yes, I believe the search criteria are valid and present.

In Jupyter, I get the error message TypeError: unhashable type: 'Series'.

And this suggests it’s to do with the ‘unhashable series’, which I’ve looked up online and sort of understand, but I’m not really clear on what the solution to this would be…

I wonder if the use of brackets is creating a series and thus the hash error…

And see:
https://www.learndatasci.com/solutions/python-pandas-typeerror-unhashable-type-series/

Did you try using .iloc?

ex:

subset_df = df.iloc[((df['xyz']=='yes') & (df['abc']=='blah')).values]

subset_df.head()

As for getting .csv files into Colab: if they’re under 25 MB, just upload the csv files to GitHub, view the csv file in “raw”, copy the address, and then in Colab do:

df = pd.read_csv("https://raw.githubusercontent.com/repo_name/etc.csv")

Or, add/import the csv file to your Google Drive and, while in the Colab notebook, select the files icon (left side), or you can connect/mount the Drive to your notebook:

from google.colab import drive
drive.mount('/content/drive')
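
Once it’s mounted, you can read the file straight from your Drive (the path below is just an example - adjust it to wherever your csv actually lives):

df = pd.read_csv('/content/drive/MyDrive/mat_mort.csv')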


I couldn’t get anywhere with iloc as it grumbles that there isn’t an index.

df.info() brings up:

<class 'pandas.core.frame.DataFrame'>
Index: 204897 entries, EU to AS
Data columns (total 12 columns):
 #   Column                                             Non-Null Count   Dtype  
---  ------                                             --------------   -----  
 0   Region_Code                                        204897 non-null  object 
 1   Region_Name                                        204897 non-null  object 
 2   Country_Code                                       204897 non-null  object 
 3   Country_Name                                       204897 non-null  int64  
 4   Year                                               204897 non-null  object 
 5   Sex                                                204897 non-null  object 
 6   Age_group_code                                     204897 non-null  object 
 7   Age_Group                                          203783 non-null  float64
 8   Number                                             198700 non-null  float64
 9   Percentage_cause-specific_deaths_of_total_deaths   9556 non-null    float64
 10  Age_standardized_death_rate_per_100k_standard_pop  189400 non-null  float64
 11  Death_rate_per_100k_pop                            0 non-null       float64
dtypes: float64(5), int64(1), object(6)
memory usage: 20.3+ MB

And it’s a bit odd that a bunch of those are showing as “object”, I think? I would have expected them to be strings? But I might be reading too much into that. I just tried df.convert_dtypes(), which generally makes them the field types they should be, but it still doesn’t work…

Thanks for the Google info. I got that working, and rather liked it - it’s very similar to Jupyter, but I’m already pretty heavily invested in the Google ecosystem, so it might well make sense.

The object dtype is what pandas uses to store strings.

For reference, if you want to change the dtype, you can use astype(),
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html
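
For example, if Year really does hold year strings like '1987' (just a sketch; it will error if the column contains anything non-numeric):

# Convert Year from strings to integers so df['Year'] == 1987 can match
df['Year'] = df['Year'].astype(int)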

In the pasted image of your code & errors above, you’re using .loc() when it should be .loc[] (use brackets, not parens).

My example from above:

df.loc[((df['col1'] == 'X') & (df['col2'] == 'A'))]
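
That paren/bracket mix-up is, I suspect, exactly where the unhashable-Series error comes from: calling .loc like a function makes pandas try to treat your boolean Series as an axis label, which it can't hash. A minimal illustration:

mask = (df['Sex'] == 'Female') & (df['Year'] == 1987)

df.loc(mask)   # raises TypeError: unhashable type: 'Series'
df.loc[mask]   # returns the matching rows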

Yea, Colab is built like Jupyter, but it’s cloud-based.


Useful to know about the objects point - thank you.

Yes, I tried astype() first, but couldn’t get anywhere with that, so tried convert_dtypes(), but in the event, that wasn’t the problem.

On trying df["Country_Code"].unique(), for example, I realised that the column headings have shifted along by one. So I can in fact find things that match specific criteria - it’s just a matter of the ‘right thing’ sitting in the ‘wrong column’.

I think this has been caused by the trailing commas at the end of every line, though I’m not totally sure. So I haven’t quite worked out how to fix it, or indeed exactly what I’m fixing, but that’s certainly what’s behind the searches returning nothing. (Apologies for the hastiness in posting an example with the wrong brackets.)
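
If it is the trailing commas, then each data row has one more field than the header row, so pandas has been treating the first column as the index - which would shift every heading along by one, and would explain the odd “Index: 204897 entries, EU to AS” in the df.info() output above. I think the fix might be index_col=False, which the pandas read_csv docs suggest for exactly this “malformed file with delimiters at the end of each line” case:

# index_col=False stops pandas using the first column as the index,
# so the headings should line up with the right columns again
mar_mort_df = pd.read_csv("mat_mort.csv", index_col=False)

temp = mar_mort_df.loc[(mar_mort_df['Sex'] == "Female") & (mar_mort_df['Year'] == 1987)]
print(temp)

With that, Year should come through as actual numbers, so comparing against 1987 without quotes ought to work.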