Pandas vs SQL

After going through the Data Scientist path, it seems like Pandas and SQL can do almost the same things. At what place would you want to use SQL instead of Pandas and vice-versa?

EDIT: Is it always better to store data on our local machine as a sql file rather than as a csv file?

1 Like

I guess it depends on what you’re trying to do with the data & what your end goals are. You could use either I guess…whatever you’re comfortable with. For the most part you can perform the same tasks with both. I think Pandas is a little cleaner or, easier. (but that’s just my .02).
SQL is part of an entire RDBMS universe though. Pandas is a powerful (free) Python library (as you already know).

You would connect to a SQL server and work with the tables that way. You wouldn’t store the DBs & tables on your machine b/c most of them would be too large. Storing data is dependent upon how much space you have on your machine too (for personal projects you can store csv files on GitHub or Google Drive). For businesses, it’s easier for data (structured and unstructured) to be stored in a cloud data warehouse or data lake.
There’s tons of articles out there about data warehouses vs. data lakes too.

There’s also lots out there about Pandas vs. SQL. It’s a fun topic (IMO).

I like this one sql vs. pandas. and this one.
I really like this article as well.

This is neat, dataframe_sql might be worth looking into.https://github.com/zbrookle/dataframe_sql

4 Likes

That last link is the best one where you say “you really like”.
After reading all three, so far, without considering which language is more comfortable, the real technical difference for using either seems to be the source of the data and where you will be carrying out your computational operations.

You can also connect to a sql server with Python. I do it in my Colab notebooks using SQLAlchemy and the db driver Psycopg2
https://www.sqlalchemy.org

1 Like

Hello! It really depends on the end goal. If you’re trying tp use the data for a predictive model, may be better to do most of the aggregations in Pandas so that you can minimize the back-and-forth navigation between SQL and Python, and just have one base data model for all future analysis (that you would also do in Python).

If it’s for a one-off analysis using descriptive stats, I find SQL to be incredible easy: you can use their built-in window, avg, max, min function etc and then ship it in whatever format you prefer. Additionally, if you’re trying to build a dashboard, it’ll be better to do the aggregations in SQL and then import that in your Tableau/Looker etc. Depending on your use case, I can give you a more specific answer!

1 Like