Are there guidelines to splitting a table as provided in the example?

Question

In the context of this exercise, are there any guidelines and reasons for splitting the table as shown in the example?

Answer

Yes, although there might not necessarily be a one-fits-all solution to every possible database structure, there are some guidelines that can help you when you need to split a table into multiple tables.

A concept that most programmers use to refer to restructuring a database in this manner is “Database Normalization”. A very brief explanation of this concept is that it aims to accomplish that databases tables are structured so that they avoid data redundancy and keep the data accurate and consistent. “Data redundancy” is when we have a lot of repeated, or redundant, data.

In the example given in this exercise, it explains how we might split a table into multiple tables, which is essentially doing normalization. Initially, the table has many columns, and as a result, some of the values will most likely be repeated many times, which introduces data redundancy.

Say for example, we added 1 million rows to this table. Some values like customer_address might end up being stored many thousands of times, when we really only need to store it once per customer.

To avoid this issue, we would split the table so that the information is stored more concisely. If we need the customer information, we can obtain it from the customers table, by their customer_id. And, if we need the information for a subscription, all of its information is stored in the subscriptions table. We only need the customer_id and subscription_id in the orders table, and we can obtain their information from their respective tables.

18 Likes

Would you still keep a master table with all the redundancies?

5 Likes

It’s a hard question to answer without specifics but think of a master table with purchase orders. Ideally, any business would want to have repeat customers so while the order numbers are unique, the emails may be redundant. In that case, not keeping the table intact may be an issue later on right?

2 Likes

No, there’s no need. All the information that would be stored in a hypothetical “master table” is already in the database, and you can use queries to join together the data however you like. The data is not made more fragile or less accessible by being divided up into multiple tables. After all, viewing the contents of one table vs multiple tables is the same process: you write a select query.

29 Likes

Just exactly my thought. Also remember that the essence of having a database is in order to query specifics and not to view a large database in one shot. As long as your multiple tables are jointly queryable, then you are good

3 Likes

I am still a bit confused.

What exactly is the issue with redundancies in a table if we ultimately are going to be querying everything anyway?

Is it a storage issue?

1 Like

Probably yes. Because you save your hard disc from needless repeated values.

To add one row with one changed cell in your database you would have to repeat all the unnecessary columns for this. Imagine if there were tens or hundreds of columns in your one table database.

In another case, you would just add one row with 2 or 3 columns in a separate table and that’s it.

2 Likes

Why would “Some values like customer_address end up being stored many thousands of times,” ?? Wouldn’t they just belong to one customer? how can there be repeated address?

1 Like

Basically, the concept is that a new record is added every single time a customer makes a purchase. So let’s say customer A buys a subscription for a Fashion Magazine, which has an associated description, monthly price, length of subscription, etc. This record also includes the customer’s name, address, and anything else associated with their ID (for example, email and password for logging in to access the subscription, preferences for what types of emails they want to receive from the company, and so on).

With database normalization, you have two tables where Orders displays order ID, the customer ID (unique to the customer, presumably, where their personal information is stored with a matching customer ID as key in the Customers table), their subscription ID, and the purchase date:

Orders

  • Fashion Magazine, ID 3, Sub ID 5, 12/19/2018
  • Fashion Magazine, ID 3, Sub ID 6, 5/23/2019
  • Sewing Magazine, ID 3, Sub ID 7, 12/02/2019

Customers

  • Customer A, 123 Internet St, ID 3, customerA@email

Or, if it’s all one database, you could have something like the scenario below when customer A makes multiple purchases:

  • Customer A, 123 Internet St, ID 3, customerA@email, 12/19/2018, Fashion Magazine, Sub ID 5
  • Customer A, 123 Internet St, ID 3, customerA@email, 5/23/2019, Fashion Magazine, Sub ID 6
  • Customer A, 123 Internet St, ID 3, customerA@email, 12/02/2019, Sewing Magazine, Sub ID 7
  • et cetera
16 Likes

Thinking about this question here from a real world user case…

For databases with multiple chances for repeated data, would a data key be used? Perhaps something like a customer id or purchase id?

2 Likes

I may be misunderstanding what you’re asking but I think you’ve nailed what this example is demonstrating -

“customer_id” from the customers table is the data key being used to identify unique customers
“subscription_id” from the subscriptions table is the data key used for unique purchases

3 Likes

Thank you very much. This was so helpfull :slightly_smiling_face:

Can you recommend any resources where to dig deeper into this concept and learn how to apply and implement it?

2 Likes

Thank you all for helping me understand this concept. I really appreciate the time you put into these replies. This is my first time working with SQL.

Hi there,
This video on YT delves a bit deeper into data normalization for anyone wondering why it would be necessary.

5 Likes

Makes sense, but it is not the way things work in the real world. Most enterprises keep the master table and leave it untouched (as the original source of truth).
This data is transformed and the normalized data are then stored separately - hence, both the copies are maintained.
Any new entry comes into the master table from where the data pipeline applies the requisite transformations and sinks the output in the form of normalized data.

As an excel user who has just started learning SQL:

Is “querying a table” such as the customer_id table the same thing as performing a VLOOKUP?

ie using a unique key to pull data from another table