Biodiversity Project + Questions for Experienced Data Scientists / Programmers

Hello everybody,

I am glad to share my project with you!

During the project I came up with some questions which might be answered by a more experienced developer:

Question 1:
I created my stacked bar charts for the counts of observations, as well as for the proportion of the total population of the considered category, by running a separate loop for each plot. In contrast, I created the pie charts via a separate function call with if-elif conditions at the beginning of the function.
Which way is more efficient/easier to read/… or is there a more elegant way to do this?
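For context, here is a minimal sketch of a third option: one plotting function parameterized by a flag, with neither per-plot loops nor if-elif branches (the data and column names are toy placeholders, not from my actual project):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names are made up for illustration only.
df = pd.DataFrame({
    "category": ["Bird", "Mammal", "Bird", "Mammal"],
    "park": ["Yellowstone", "Yellowstone", "Bryce", "Bryce"],
    "observations": [120, 80, 60, 40],
})

def plot_counts(df, group_col, ax, as_proportion=False):
    """One function covers both variants: raw counts or proportions."""
    counts = df.groupby(group_col)["observations"].sum()
    if as_proportion:
        counts = counts / counts.sum()
    counts.plot.bar(ax=ax)
    ax.set_ylabel("proportion" if as_proportion else "count")

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
plot_counts(df, "category", axes[0], as_proportion=False)
plot_counts(df, "category", axes[1], as_proportion=True)
fig.savefig("counts.png")
```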

Question 2:
I put much effort into object-oriented programming to write the code in a "clear" way. Is it really worth it, or could I just skip OOP?

Question 3:
Using sns.lmplot for a regression does not work with subplots or with modifying the axis ticks / tick labels. Why does it not work when it works for other Seaborn plots?

Question 4:
I always tried to write a "generalistic" function, such as for the pie charts. Sometimes it feels like it would be more efficient to write two separate functions (respectively for the status and the parks). What would you recommend? When should I decide to write two separate functions instead of one more generalistic one?

Thanks in advance and kind regards!


Congrats on finishing the project!

I guess I'm wondering why you imported Pandas, NumPy, SciPy, and Seaborn if you didn't use them but instead created classes for the data frame? Unless it's easier that way(?)

Pandas is pretty powerful for navigating around a data frame, and for analyzing and cleaning data.

Concerning lmplot… this might help?
See:

Hello and thanks for your feedback so far!
Tbh your comment is confusing me. I did use Pandas, NumPy, SciPy, and Seaborn in the class (methods). Your comment sounds like creating a class excludes using the modules above - but it does not.
Regarding my code, you can see that I instantiated the object "dataframe" from the class "DataframeClass". In the constructor I call the method "createDataframe()", which creates two dataframes from both CSV files using Pandas and subsequently merges them. The next call in the constructor is to the method "cleanDataframe()", which applies the Pandas methods
.drop_duplicates()
.dropna()
.drop()
.columns
…
to the created dataframe. Could you please point out more precisely why you think I did not use the modules?
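To illustrate the structure I mean, here is a minimal sketch (the file names, merge key, and column names are placeholders, not my actual project code):

```python
import pandas as pd

class DataframeClass:
    """Sketch of the class structure described above; the Pandas
    methods are called inside the class methods, not replaced by them."""

    def __init__(self, csv_a, csv_b, merge_on):
        self.createDataframe(csv_a, csv_b, merge_on)
        self.cleanDataframe()

    def createDataframe(self, csv_a, csv_b, merge_on):
        # Two dataframes from two CSV files, then merged (Pandas throughout).
        df_a = pd.read_csv(csv_a)
        df_b = pd.read_csv(csv_b)
        self.df = df_a.merge(df_b, on=merge_on)

    def cleanDataframe(self):
        # Plain Pandas cleaning methods applied to the merged dataframe.
        self.df = self.df.drop_duplicates().dropna()
```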

To lmplot: My problem was not that it did not show the plot at all, but that some matplotlib methods (such as subplot and setting the x ticks / tick labels) do not work, while they do work for other Seaborn plots.
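A minimal sketch of the behavior, assuming the cause is that sns.lmplot is a figure-level function (it builds its own FacetGrid and figure, so it ignores existing subplots), while the axes-level sns.regplot accepts an ax= argument and works with the usual matplotlib methods:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy regression data, not from the project.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 2 * df["x"] + rng.normal(size=50)

# sns.lmplot always creates its own figure, so it cannot be placed on
# an existing subplot. sns.regplot is the axes-level equivalent:
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
sns.regplot(x="x", y="y", data=df, ax=axes[0])
sns.regplot(x="x", y="y", data=df, ax=axes[1])

# Tick methods now behave as for any other matplotlib Axes.
axes[0].set_xticks([-2, 0, 2])
axes[0].set_xticklabels(["low", "mid", "high"])
fig.savefig("regression.png")
```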

Kind regards,
Jonas

I did this project so long ago. Perhaps they changed the requirements.

I meant: what is the advantage of creating classes here for these CSV files over using built-in Pandas methods to do EDA?:
df = pd.read_csv()
df.info()
df.shape
df.describe()
pd.isnull()

Etc.
Does creating classes use less memory, or is it more efficient or something? (I'm asking because I honestly don't know.) It just seems like a lot more time/work to create classes & functions to do EDA for this project.
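For reference, a minimal sketch of that procedural EDA flow, with a toy in-memory CSV standing in for the project files (the column names are made up):

```python
import io
import pandas as pd

# Toy CSV standing in for one of the project files; the second row
# has a deliberately missing value to show up in the null check.
csv = io.StringIO("species,park,observations\nBird,Bryce,68\nMammal,Bryce,\n")
df = pd.read_csv(csv)

df.info()                  # dtypes and non-null counts per column
print(df.shape)            # (rows, columns)
print(df.describe())       # summary statistics of numeric columns
print(df.isnull().sum())   # missing values per column
```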

Do you have a Jupyter (or Colab) Notebook for this project so one can see the output of the code in cells? Plus, with that you can use the markdown cells to describe what you're doing in your analysis (which is telling a story with the data).

I'm not sure why certain methods didn't work for lmplot. Did you check the documentation?
https://seaborn.pydata.org/generated/seaborn.lmplot.html

I used built-in Pandas methods in my class as well. As I described - using classes does not exclude using Pandas methods.
From the methods you mentioned, I used df = pd.read_csv(). I did not use the other mentioned methods such as df.info(), df.shape, df.describe(), or pd.isnull(), since I used the debugger to get this information. This does not mean that I did not use further Pandas methods to filter and modify the dataframes.

I am also just guessing that the code runs more efficiently using classes. But yes - since I put a lot of effort into creating the classes and dealing with their methods, I hoped to get the opinion of an experienced developer who might know this.

I tried to write the code in as few lines as possible. Since I do not need methods such as '.info()' when using PyCharm's debugger, I decided to stick with this IDE.

Unfortunately, it was the first thing I tried and I could not figure it out yet.

I would consider myself to be a pretty experienced data person, and I can say that in all the projects I have reviewed, no one used classes. So this is unique here (not a bad thing at all!). Like I said, I think it's a bit more time-consuming to do, and maybe that's why others didn't go this route.

But the point of this project isn't just writing fewer lines of code; it's EDA and hypothesis testing. This includes presentation of the data, which is why I asked about using Jupyter or Colab (which is based on Jupyter and cloud-based; it's an app that you can add to your Google Drive, and then you can push notebooks to your GitHub repo). So, when you're going through the data, whoever is viewing your project, whether they're another data person or someone with zero experience in data, can see your thought process as you work through the data sets (EDA), test hypotheses, and create graphs and plots. When one only looks at a .py file, one cannot see the code output.

I'm going to keep investigating this lmplot issue though, because I'm stumped.

Just to make sure that you did not get me wrong: I did not mean to say that you are not an experienced data scientist - the question concerning classes is rather directed at an experienced software developer.

I know, but after writing the code itself, I tried to optimize and generalize it as far as possible.

This might be the point to work on for me!
