My version of the Biodiversity Project

This project is the second part of the Portfolio Project requirements. I kindly ask all of you data coders to check what I did and what I could have done better.

I took a bit more time on this project than necessary (around a week), because as a marine biologist I liked the data.

I hope you will enjoy reading it as much as I enjoyed coding it :))

Good morning! I enjoyed reading over your project. I recently completed it myself as part of the BI Data Analytics career path. I agree with your ‘mostly fictional’ statement; I found a number of species that ‘magically appeared’ in some of the parks.

Your project is fantastic, and I can tell that you spent a lot of time on it. However, I found a few red flags in your process - I hope you can learn from these to produce even stronger work in the future.

Questions as I explore your code:

  • Is there a reason you chose not to use common names? For example, if you found duplicate Latin names for a number of species due to various common names, it might be worth mentioning that. I know that this project is for a National Parks person, but having the common names handy could be nice. (I see that you add them back later; maybe don’t delete them early on?)
  • Removing duplicated data
    • I see a big problem! You just deleted a bunch of observations, and now your results are inaccurate! Let’s take a closer look.
      • I would recommend doing more exploration before getting to this step, because I think you missed an important pattern. I would recommend taking time to write down your initial observations and any questions you have about the data.
        • If you were paying very close attention, when you printed obs_df.describe() you should have seen a min of 9 and thought – ‘Hmm, it seems strange that every single species has representation in each of the four parks. The min should be zero.’
        • Also, you should have noticed that in species.describe(), the count for Latin name is different from the unique values by 283. I would have expected these to match – since they don’t, that warrants further investigation before cleaning the data.
        • You would have seen more suspicious numbers had you done obs_df.describe(include='all') (there is a quick sketch of this after my code below).
          • For example, for ‘Latin name’ you would have seen a freq of 12 (if there were one observation at each park for each species, at most we would have expected 4; seeing 12 indicates duplicates).
          • If you divide the number of observations (23,296) by the number of parks (4), you get 5,824, but there are only 5,541 unique values for Latin name – indicates duplicates.
      • Checking for duplicates should be an extension of data exploration
        • In this dataset, there are four observations per species (one for each park). It might not have been obvious at first, but if you sorted your data, this pattern would become obvious.
        • Similar to looking for missing data, you should take the time to examine what the duplicates are, see if you can explain why they are there, and determine what you should do with them.
        • In my code below, see if you can find the reason for duplicate Latin names (which I left as scientific_name):

Here was my code for exploring duplicates:

Markdown cell:

Checking for duplicates

Duplicates in observations_csv.scientific_name

  • Given that there are 23,296 observations, splitting it between four parks should result in no fewer than 5,824 unique values for scientific_name, but there are 5,541 unique values, indicating duplicates.
  • Maybe I don’t quite understand what observations means
    • I assumed it was all of the instances a given species was found during a certain time-period
    • Maybe it could represent (for example) six different researchers covering unique areas of the park submitting their own reports, resulting in overlap on the same species within that park?
    • Either way, warrants further investigation
Code cell:

# (Imports and file loads added for completeness; file names assumed from the variable names)
import pandas as pd

observations_csv = pd.read_csv('observations.csv')
species_info_csv = pd.read_csv('species_info.csv')

# First, I will sort the data by 'scientific_name' and look at the first few instances of duplicates.
sorted_observations_csv = observations_csv.sort_values(by=['scientific_name', 'park_name'])
# print(sorted_observations_csv.head(20))
# Ok, that didn't work like I had hoped. Instead, I will:
#   - group by scientific name
#   - filter for results greater than four
grouped_observations_csv = observations_csv.groupby('scientific_name').count().reset_index()
print('First few rows when grouped by \'scientific_name\':\n', grouped_observations_csv.head())
print('\nLength of above grouped df:\n', len(grouped_observations_csv))
# Checking that there are still 5,541 unique species (there are)

# Seeing how many duplicates there are for 'park_name' and 'observations'; I hope they match
print('\nDuplicates in column for park_name:\n', grouped_observations_csv.park_name.value_counts())
print('\nDuplicates in column for observations:\n', grouped_observations_csv.observations.value_counts())
# I find these results interesting:
#   - eight observations for a species found 265 times, and twelve found nine times
#   - these are in multiples of four, so they are likely duplicated evenly across the parks

# Gathering a list of 'scientific_name' values to filter 'observations_csv' using '.isin()'
duplicated_observations_list = grouped_observations_csv.scientific_name[
    grouped_observations_csv['observations'] > 4].reset_index(drop=True)
print('\nLength of series containing duplicate rows:\n', len(duplicated_observations_list))
# Expecting 274 (265 + 9 from 'value_counts' above that were greater than four)
# Output: 274

duplicated_observations_df = observations_csv[
    observations_csv.scientific_name.isin(duplicated_observations_list)].sort_values(
    by=['scientific_name', 'park_name']).reset_index(drop=True)
# Investigating duplicates
print('\nFirst few rows of duplicate rows as dataframe:\n', duplicated_observations_df.head(16))
# Still no answers; try unique 'scientific_name' for trends or patterns?
# print(duplicated_observations_df.scientific_name.unique())
# No obvious trends; maybe also check category?

duplicated_observations_species_info = species_info_csv[
    species_info_csv.scientific_name.isin(duplicated_observations_list)].reset_index(drop=True)
# print(duplicated_observations_species_info.head())
# print(duplicated_observations_species_info.scientific_name.value_counts().head(20))
# Initial answer: there are nine triple duplicates of 'scientific_name' in
# 'species_info_csv', and 265 double duplicates. Why is that?

sorted_dup_spec_info_df = duplicated_observations_species_info.sort_values(by=['scientific_name', 'category'])
print('\nFirst few rows of dataframe containing sorted duplicates from species_info_csv:\n',
      sorted_dup_spec_info_df.head(8))
# There are duplicate 'scientific_name' values, and the corresponding 'common_names'
# are mostly different, but still very similar. For example, both 'common_names' rows
# for 'Agrostis gigantea' contain 'Redtop', but the second row contains more names.
# Is there a way I can merge these together, but only keep unique 'common_names'?
# Are we supposed to just 'know' which 'scientific_name' from 'observations_csv' is
# supposed to match up with the same column/row in 'species_info_csv'?
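To tie this back to my earlier bullets, here is a minimal sketch of the describe() checks I mentioned (same dataframes as above; note the column here is scientific_name rather than Latin name):

# Summary stats for every column, not just the numeric ones
print(observations_csv.describe(include='all'))
# For scientific_name, 'freq' tops out at 12; with one row per species per
# park we would expect at most 4, so something is duplicated

print(species_info_csv.describe(include='all'))
# 'count' and 'unique' for scientific_name differ by 283: duplicates again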

Major red flag: I think you should have merged/added the observations instead of dropping them.
* Duplicate rows are for the same species that were labeled differently. For example, in the species df, both ‘common_names’ rows for ‘Agrostis gigantea’ contain ‘Redtop’, but the second row contains more names. These duplicates are also in the observations df, but we only see the Latin names so the cause for duplication is less apparent.
* To illustrate why adding the values together is the right action: if we had duplicate rows for bison in Bryce Park that looked like [Latin name: ‘bison’, observations: 5] and [‘bison’, 284], those would add together to make [‘bison’, 289]. The way you did it, however, your dataframe now only includes 5 bison (keeping the first instance of a duplicate).
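Here is a minimal sketch of that approach, reusing the dataframes from my code above (the bison rows are made up for illustration):

# Sum the observation counts for each species/park pair instead of dropping rows
summed_observations = (observations_csv
                       .groupby(['scientific_name', 'park_name'], as_index=False)
                       .observations
                       .sum())
# Hypothetical duplicates ['bison', 'Bryce', 5] and ['bison', 'Bryce', 284]
# now collapse into a single row ['bison', 'Bryce', 289]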

Another possible source of error: using .drop_duplicates() without 'subset='
* The way you did it, you only removed exact duplicates. If there were [‘bison’, 5] and [‘bison’, 284] in Bryce Park, both would have stayed in the dataframe, while the second instance of [‘fox’, 25] and [‘fox’, 25] would have been dropped.
* By including 'subset=' on the species df, you would have seen that there were 283 duplicates in Latin name but zero duplicates for Common names – worth investigating further, because we would have expected the same number of duplicates.
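A quick sketch of the difference, using the counts mentioned above (same dataframes as before):

# Without subset=, a row only counts as a duplicate if EVERY column matches
print(observations_csv.duplicated().sum())

# With subset=, any repeated value in the chosen column counts
print(species_info_csv.duplicated(subset=['scientific_name']).sum())  # 283
print(species_info_csv.duplicated(subset=['common_names']).sum())     # 0, which is suspicious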

  • Analysis
    • Consider including more discussion about what you find, or at least a chart to make it easier to see differences (I see lots of large tables without any explanations or follow-up)
    • Consider running the whole notebook from start to finish every now and then. I see a lot of non-consecutive cell-execution numbers, indicating that you often work in small sections without making sure the notebook still runs as a whole.
    • Consider proportions when doing an analysis as opposed to just counts.
      • For example, early on you made a chart of total observations in each park and saw that one park had far more observations than the others. However, when you looked at the number of bats in each park, you did not account for the relative size of each park. Of course Yellowstone had the most bats; it will almost always have the most of any species. What you should analyze instead is: what is the proportion of bats to all other animals in each park? That will give you a more meaningful result. (There is a quick sketch of this after this list.)
      • Again, for the chart you made about protected vs. unprotected bats, what are the proportions? You came close by visually evaluating protected vs. non-protected, but take it one step further with one more calculation: what % of the total bat population is protected in each park?
  • Conclusions
    • Huge red flag: You said, “Mammals and Reptiles exhibited a statistically significant difference,” but you never ran any statistical tests!! Never, ever say there is a statistically significant difference without including the statistical test, and if you do include statistics, you need to explain why the test you chose is appropriate. This is not the ‘statistical inference’ you mentioned in the beginning (“Statistical inference will also be used to test if the observed values are statistically significant.”): you need a p-value to determine statistical significance. If you make inferences, you need to make that clear in your conclusions; otherwise people will think you are fabricating your results. If this were a homework assignment, this alone could result in an F on the whole paper. (See the second sketch after this list for what such a test could look like.)
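Two quick sketches of what I mean above. First, proportions instead of raw counts; bat_observations is a hypothetical dataframe holding only the bat rows of observations_csv:

# bat_observations: hypothetical df containing only the bat rows
# Each park's bat observations as a share of all of its observations
park_totals = observations_csv.groupby('park_name').observations.sum()
bat_totals = bat_observations.groupby('park_name').observations.sum()
print((bat_totals / park_totals * 100).round(2))

Second, what an actual significance test could look like. For category counts such as protected vs. not protected, a chi-square test (scipy's chi2_contingency) is a common choice; the counts below are placeholders that would need to be replaced with the real numbers from the species dataframe:

from scipy.stats import chi2_contingency

# Rows: Mammal, Reptile; columns: protected, not protected (placeholder counts)
contingency = [[30, 146],
               [5, 73]]
chi2, pval, dof, expected = chi2_contingency(contingency)
print(pval)  # only claim statistical significance if this is below your alpha (e.g. 0.05)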

What I like about your project:

  • Very clean, headings are well-labeled
  • Identified the type of missing data in the species dataframe and filled in values appropriately (a quick sketch of that approach follows this list). I was lazy and left those as NaN because I thought it would make analysis easier (you don’t have to != anything)
  • I liked how you did the word count to find the most common animal species group, and the counts in each park.
  • Future directions
    • I agree – it would be great to see changes in populations over time. I feel that would give us much more meaningful results.
    • I didn’t think of getting longitude and latitude of each observation – that would be cool to look at in Tableau.
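For anyone curious, the fill-in approach I skipped usually looks something like this; I am assuming the missing values sit in the species dataframe's conservation_status column, where NaN effectively means no special protection status:

# Replace missing conservation statuses with an explicit label (assumed column name)
species_info_csv['conservation_status'] = species_info_csv['conservation_status'].fillna('No Intervention')
print(species_info_csv.conservation_status.value_counts())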

Hi Shugarr,

Thanks for sharing your project. It’s very thorough and does a good job of showcasing your skills. It’s good that you’re clear about what you’re doing at each step and why.

How did you know that the observations were in the last 7 days? I missed that. I was wondering what the timeframe was and just avoided mentioning it!

We handled our duplicates differently, but that’s a subjective thing. I see that you’ve already received feedback around that on this thread.

I was interested to see that you changed the column headings to longer-form capitalized ones (e.g. Latin Name instead of scientific_name). I tend to go in the opposite direction so that I can use df.col_name rather than df[‘Column Name’] in my code, which feels cleaner and quicker. But is your preference so that it’s more intuitive to the end user? That’s a good thought.
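For anyone following along, here is a tiny illustration of that trade-off with a made-up dataframe:

import pandas as pd

df = pd.DataFrame({'scientific_name': ['Bison bison']})
print(df.scientific_name)   # attribute access works for snake_case column names

df = df.rename(columns={'scientific_name': 'Latin Name'})
print(df['Latin Name'])     # spaces and capitals force bracket access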

Thanks again for sharing!