Biodiversity Project - unique value for scientific name

Hi fellow students and experts,

I have a question about the Biodiversity project that I’m currently working on.

There’s a column called ‘scientific_name’. Using the .nunique() method, I found that it has 5541 unique values, which is fewer than the total number of rows: 5824. This means that the column contains duplicates, but scientific names are supposed to be unique, so I dug deeper to see what the duplicates are. I used .duplicated() to find the rows with duplicates and printed them, but I don’t see any duplicates. Can anyone answer my question? You can find my repo below. Thanks!!!

.nunique() omits NaNs with dropna=True, which is the default.
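
To see the effect of that default, here is a minimal sketch on a made-up Series (the values are toy data, not taken from the project's file). Whether the project's column actually contains NaNs is something you would need to check yourself, for example with `.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one repeated name and one missing value
names = pd.Series(["Canis lupus", "Canis lupus", "Myotis lucifugus", np.nan])

n_default = names.nunique()               # dropna=True is the default, so NaN is not counted
n_with_nan = names.nunique(dropna=False)  # NaN is counted as one extra distinct value
n_missing = names.isna().sum()            # how many entries are missing

print(n_default, n_with_nan, n_missing)
```

If the count of missing entries is zero, NaNs are not what makes the unique count smaller than the row count.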

Oh, also, your code at line 102 (where you’re counting conservation statuses) needs to be tweaked. Don’t use .value_counts()
You need to group by “Conservation Status” and then take the nunique count of Scientific Names.
:slight_smile:

There are indeed some repeats and @lisalisaj’s advice may help you find them. Since your notebook file is public I’d suggest you remove cells where errors have occurred so that any viewers have an easier time reading as tracebacks can be scary looking.
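
One way to surface the repeats, as a rough sketch on toy data: the column names `scientific_name` and `common_names` and all values here are assumptions for illustration, not the project's actual rows. Called with no arguments, `.duplicated()` compares whole rows and flags only the later occurrences, which may be one reason nothing obvious showed up.

```python
import pandas as pd

# Toy frame (hypothetical values): the same scientific name appears in three rows
species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Myotis lucifugus", "Canis lupus"],
    "common_names": ["Gray Wolf", "Gray Wolf, Wolf", "Little Brown Bat", "Wolf"],
})

# Whole-row comparison: no two rows are identical, so nothing is flagged
whole_row = species[species.duplicated()]

# Compare on one column only; the first occurrence is still left out by default
later_only = species[species.duplicated(subset="scientific_name")]

# keep=False flags every row whose name occurs more than once
all_repeats = species[species.duplicated(subset="scientific_name", keep=False)]

print(all_repeats.sort_values("scientific_name"))
```

Sorting the `keep=False` result puts the repeated names next to each other, which makes it easier to see how the rows differ.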

Edit: Had a look at the data and it’s just as likely to be grouped by location as by anything to do with subspecies. I think you’d have to check where the data is sourced from to know for certain.
I’ve not looked into the origin of this data, but perhaps the following example is related: https://en.wikipedia.org/wiki/Subspecies_of_Canis_lupus

Thank you so much for your comment, lisalisaj!

I used groupby on ‘Conservation Status’ and got the same result. Don’t the two approaches give the same results? Is there another technical reason why you recommended groupby instead of value_counts?
:+1:

Thank you tgrtim!

As you said, there are indeed some repeats!! Also, thank you for your guidance on how to tackle them. My first thought was, ‘should I delete the duplicates and keep only one entry, since scientific names are supposed to be unique?’ I’m going to check whether this has anything to do with location or something else! I’m learning a lot!!!

:laughing:

I was just following the directions, which ask one to use groupby. :slight_smile:

Plus, you do get different results when this is used:
species.groupby('conservation_status').scientific_name.nunique().reset_index()
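
A small sketch of why the two can differ, using a made-up frame (the names and statuses below are illustrative only, not real conservation data): `.value_counts()` counts rows, while grouping and calling `.nunique()` counts distinct scientific names per status.

```python
import pandas as pd

# Toy frame (hypothetical values): "Canis lupus" appears twice under one status
species = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Myotis lucifugus", "Gymnogyps californianus"],
    "conservation_status": ["Endangered", "Endangered", "Species of Concern", "Endangered"],
})

# Counts rows per status, so the repeated name is counted twice
row_counts = species.conservation_status.value_counts()

# Counts distinct scientific names per status, so the repeat collapses to one
unique_counts = species.groupby("conservation_status").scientific_name.nunique().reset_index()

print(row_counts)
print(unique_counts)
```

The two only agree when no scientific name is repeated within a status.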

Hi Lisalisaj,

Thank you for your quick response!!!

Yes, I see the difference between species.groupby('conservation_status').scientific_name.nunique().reset_index() and species.groupby('conservation_status').scientific_name.count().reset_index().

That’s interesting… it looks like they would have the same output… I should dig deeper! Thanks!

Hi tgrtim & lisa,

I couldn’t just ignore the fact that there are ‘duplicates’ under scientific names before creating plots to find patterns and themes. So, based on the Wikipedia link you gave me, I wrote the note below and will move on. What do you guys think?

'According to Wikipedia, Canis (meaning ‘dog’ in Latin) lupus has 38 subspecies (https://en.wikipedia.org/wiki/Subspecies_of_Canis_lupus). Given that there’s no timestamp attached to each row, we can reasonably say that this dataset is cross-sectional. Thus, these two rows are not duplicates or mistakes but represent two different subspecies of Canis lupus. I will treat them as separate data points.'