Maintaining variables in data analysis

Hi Guys,

So, a bit of a general question / request for advice about using Python for data analysis.

So say I have a script like this:

df = pd.read_sql_query('SELECT * from DB', connection)

I run this, but then think… ah, I want to view some specific data, so I add some code and run the script again:

foo = df["bar"]

Then I think, I actually want to print this variable, so I add more code and run it again:

print(foo)

So at this point I have run 3 separate requests to my DB to retrieve the exact same data. I could see this manner of coding quickly becoming an issue if I'm importing a huge dataset or if I'm using an API with a request limit.

How do people generally get around this? My only thought was to use an IDE like Spyder and manipulate my imported variable using the IPython console. That seems like a bit of a workaround though.

What is “best practice”? Is there a way to keep a variable in Python's memory while you are amending and rerunning the program to manipulate this data?

You could use IPython or Python's IDLE.
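For example, in an IPython session the variables a script defines stay in memory after it finishes, so the expensive query only runs once. A minimal sketch, assuming a hypothetical load_data.py that builds df from the DB:

In [1]: %run load_data.py    # queries the DB once and defines df
In [2]: foo = df["bar"]      # no new query; df is still in memory
In [3]: print(foo)

By default %run copies the script's top-level variables into the interactive namespace, so you can keep manipulating df without re-running the query.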

I personally don't think this is a massive problem; 3 queries is not much. If you work with an API with restrictions, you need to take more time to think about what you want, then code, then run over the code in your mind to check you have everything, then run the script.
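Another common workaround when queries are expensive or rate-limited is to cache the result locally on the first run and reload it on later runs. A minimal sketch, assuming a hypothetical cache file query_cache.pkl and the same connection object as in your original snippet:

import os

import pandas as pd

CACHE_PATH = "query_cache.pkl"  # hypothetical local cache file

if os.path.exists(CACHE_PATH):
    # Later runs load the saved result instead of hitting the DB/API again
    df = pd.read_pickle(CACHE_PATH)
else:
    df = pd.read_sql_query('SELECT * from DB', connection)
    df.to_pickle(CACHE_PATH)  # save the result for the next run

Delete the cache file whenever you want to fetch fresh data.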

I don't work a lot with Python, but I do work a lot with databases. I run my test suite (code that verifies the code is working correctly) dozens of times per day. The test suite executes hundreds of queries (both to insert the data needed for the tests and to query whether I get the desired output).

The test suite takes somewhere between 2 and 20 seconds (depending on which project, and how large the test suite is).


@ashleyhowe7359274843,

The easiest way to deal with this is to use a Jupyter notebook when you're doing the exploratory analysis. Once you are sure everything is running the way you want, you can always switch over to a regular Python script. If you haven't used Jupyter notebooks before, I suggest you check them out.

As @stetim94 said, what you’re describing is usually not a big deal, but if you are limited by API restrictions, Jupyter notebooks would help a lot. Essentially each “cell” can run independently of the whole notebook. So in one cell you query data from your DB or API:

In [1]: df = pd.read_sql_query('SELECT * from DB', connection)

Then in the next cells you can do your other actions:

In [2]: foo = df["bar"]
In [3]: print(foo)

(print writes the contents of foo straight to the cell's output; since print() returns None, there is no Out prompt.)

If you go about it this way, you only query the database when cell 1 is run, rather than every time you add new code and rerun the whole script.
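If you would rather stay in Spyder (as you mentioned), you can get the same cell-based workflow in a plain .py file using # %% cell markers; Spyder (and VS Code) run each cell in the IPython console, and the namespace persists between cell runs. A minimal sketch:

# %% Cell 1: run this once; df stays in the console's memory
df = pd.read_sql_query('SELECT * from DB', connection)

# %% Cell 2: rerun as often as you like without re-querying
foo = df["bar"]
print(foo)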
