Given a single table, how can we split it into multiple tables?

Question

In this exercise, it mentions splitting the table into multiple tables. Given a single table, how can we split it into multiple tables?

Answer

In this context, the tables are dataframes. Because they are dataframes, we just need to split the dataframe into separate CSV files using methods from Pandas.

The following example will split a dataframe (“table”) into separate dataframes (“tables”), and save them into separate CSV files, to make them more manageable to work with.

Example

# First, read the original CSV and store in 
# a variable. This is the original "table".
data = pd.read_csv("filename.csv")

# Split the dataframe into separate ones, by columns.
# These are the "tables".
df1 = data[['col1', 'col2']]
df2 = data[['col3', 'col4', 'col5']]
df3 = data[['col6', 'col7', 'col8']]

# Now, we just need to store these into 
# their own CSV files.
# Note: index=False is used to remove the added index column.
df1.to_csv('table1.csv', index=False)
df2.to_csv('table2.csv', index=False)
df3.to_csv('table3.csv', index=False)
8 Likes

could you split the table using the function which pulls the first 3 and the last 3 columns, assuming it’s only 6 columns long using [-3:]/ [0:3] ?

Can anyone point to a good resource on theory for setting up a dataframe from scratch? It seems like we are getting contradictory messages about what makes for a ‘good table’ with ‘useful data’. In this lesson, it stresses that a dataframe that repeats information, such as customer name, products, etc., can ultimately result in tables that are big and unmanageable. However, in other lessons that focus on ‘tidy data’, it seems to stress that this kind of repetition is actually desirable for working with the data and doing analysis. So what’s the deal? Is it the difference between efficiency of data storage vs. efficiency of data usage? What is the actual best practice here?

2 Likes

Is your question about setting up a database from scratch? A database is a collection of tables (or dataframes in Pandas). For theory behind setting up a relational database from scratch, I found this resource provided a helpful overview: https://launchschool.com/books/sql/read/table_relationships

1 Like

Some of the programming principles include things like SOC: Separation of concerns and DRY: Do not repeat yourself.
Separating data into multiple tables helps to save storage space and makes management of data easier. Imagine a customer makes 50 orders . In a single table we’ll have 50 repetitions of that customers data like house address, email, phone number. But if we use multiple table. we’ll have just one line containing the customers information then have only the customers id in the orders table repeated which we can combine with the customers table to get info of a specific customer.

1 Like

Thanks for sharing this! I completely agree. Separating data into multiple tables is a smart way to manage storage and maintain data integrity. It’s amazing how much more efficient and organized databases can be with principles like SOC and DRY. The example really illustrates the benefits well.