How is the best way to iterate over a large dataset?

Hi everyone,

I currently working on a project using mysql where I need to generate a large new table out of other large tables. In the end I would like to have a csv file as well as a working dataset in mysql.

I have a table (table1) which contains unique names in each row, a data table which stores every hour and date for the last 7 years and large table (table2) which contains the production and consumption of each name for each hour of the last 7 years. I would now like to generate a new table which contains for each name this data in the following manner:

date | time | name1_production | name1_consumtpion | name2_production | name2_consumtpion | …

To make sure the data is correctly progrest I would also like to check that the data and time is listed in my date table as well that the name in table1 and table2 are the same.

I am however not quite sure what the best approach for this might be. My dataset is quite big, were there are over 2’000 names and has therefore each over 120’000 data entries. Should I export the data to python using panda or should I progress this in sql? And how would be the best way to do this?

Many thanks already in advance for the help and I hope I was able to describe my problem in a understandable manner :slight_smile:

Best,
David

Mysql would (if done decently) very likely be faster (shorter run time), but if you are more experienced with panda and have nearly no mysql knowledge, you might complete the task faster (less development time)

Can you provide the structure of the tables you currently have? Sounds like a simple join is sufficient to export your data to csv.

Thank you very for your quick reply. Right now I am proably a bit more familiar with python but in any way I am no expert in either language of yet.

Yes sure. My tabe1 has the structure:

Coumn | data type
variable 1 … variable 6
name | text |
variable 8 … variable 11
name_id | int | primary key auto_increment

table2

Coumn | data type
DateTime | text
variable1 … variable 8
name | text |
generation | double
consumption | double
variable12 … varibale14
name_id_2 | int | primary key auto_increment

date_table

Coumn | data type
DateTime | text
Date | Date
Time | Time

Is this what you meant by the structures of the tables?

Thanks a lot for your help :slight_smile:

mostly interested in the foreign key constrains, how does table 1 know which entries it belongs to for table 2? And same for the date table

name_id_1 and name_id_2 are the names of your primary keys?

Ok sorry I am still a bit new to this. I created the database myself from a bunch of csv files. table2 is the original “raw” table from this datacollection and I passed then all unique names from this table2 to table1 with some other informations.

There are no foreign keys in either one table when I check the table inspector. I added the primary keys only after the data collection.

Yes name_id_1 and name_id_2 are the names for the primary key I created.

I hope this helps :slight_smile:

But if not for foreign keys, how do you intent to match records from the different tables?

The names are the same, so I was planning to do some where statements which check the date, time and name with a && condition.

I only see name in table1?

sounds horrible, please consider using joins

Sorry for my mistake, I have corrected it now. How could I achieve this with joins? Do I need to change something with my data structure first for this?

Well, ideally you would use join on primary and foreign keys, but it is possible to join on name too:

select * from table1
inner join table2 on table1.name = table2.name
2 Likes

Thank you very much, the search is in progress. However, it has been running already for a little over 14 hours… Now I was wondering if I can see somewhere the progress in percent of the progress or the estimated end time? And can I already check if the table is made according to my wishes while the progress is running?

this should take not 14 hours? I can’t answer those questions without getting fully involved in the project. I have no idea what you did

Then you would need to put the project in a repository on github, so I can clearly see what you are doing

3 Likes

Ok thanks I understand completly. I have however another syntax/methodic related questions if this is alright…

The query has finally finished after something over 15 hours. However I must have made a mistake in the query as I now got a table in the following form:

DateTime | production

Where I used the following code:


drop table if exists name_production;  

CREATE TABLE name_production AS(    
select table1.`DateTime`, table1.production from table2
inner join table1 on table2.name= table1.name
inner join time_table on time_table.txt = table1.`DateTime`
order by table1.`DateTime`);

However I would like somthing like:

DateTime | name1 (column of production) | name2 (column of production) | name3 (column of production) | …

How to I need my code for to achieve this?

Is the problem that my names in table2 are in rows? Should I somehow “transpose” this table first and then use the join statment? I think I also need to change the column “name” from production to the actual name which produces the production. This means to the name I actually compare in my statement.

I can’t keep working from these half examples, you need to give me something I can really work with. Given I am just a volunteer on the forum, I am getting close to the limit of how much heavy lifting I want to do :wink:

you intent to create a table with hundreds of columns? Or do I misunderstand? Because that is a dreadful way to work with databases

2 Likes

Sorry I didn’t mean to be a pain, I’m still new to this and therefore don’t know what information is needed. My approach was to explain the problem as logically as possible from my data set. I am really greatful for your time and effort and did not wanted to cause any distress.

My table1 in the descripton is the table “plant_information”, table2 is “act_gen_raw” and the time table is “dt”. I run the following query:

drop table if exists plants_production_test; 

CREATE TABLE plants_production_test AS(    
select dt.d as Date, dt.t as Time, act_gen_raw.ActualGenerationOutput as PowerSystemResourceName from plant_information
inner join act_gen_raw on plant_information.PowerSystemResourceName = act_gen_raw.PowerSystemResourceName
inner join dt on dt.txt = act_gen_raw.`DateTime`
#order by act_gen_raw.`DateTime`
LIMIT 10);

I put my data under the following link, where I extracted only some sample, which is why the tables are appended with “_sample”.

I run also:

-show create table plant_information;

‘plant_information’, ‘CREATE TABLE plant_information (\n ResolutionCode text,\n AreaCode text,\n AreaTypeCode text,\n AreaName text NOT NULL,\n MapCode text,\n GenerationUnitEIC text,\n PowerSystemResourceName text NOT NULL,\n ProductionType text NOT NULL,\n InstalledGenCapacity double DEFAULT NULL,\n id_plant int NOT NULL AUTO_INCREMENT,\n PRIMARY KEY (id_plant),\n KEY country (MapCode(500)),\n KEY installed (InstalledGenCapacity),\n KEY production_type (ProductionType(500)),\n KEY time_intervall (ResolutionCode(500)),\n KEY cta (AreaTypeCode(500))\n) ENGINE=InnoDB AUTO_INCREMENT=2154 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci’

-show create table act_gen_raw;

‘act_gen_raw’, ‘CREATE TABLE act_gen_raw (\n DateTime text,\n ResolutionCode text,\n AreaCode text,\n AreaTypeCode text,\n AreaName text,\n MapCode text,\n GenerationUnitEIC text,\n PowerSystemResourceName text,\n ProductionType text,\n ActualGenerationOutput double DEFAULT NULL,\n ActualConsumption double DEFAULT NULL,\n InstalledGenCapacity double DEFAULT NULL,\n UpdateTime text,\n id_plant int NOT NULL AUTO_INCREMENT,\n PRIMARY KEY (id_plant)\n) ENGINE=InnoDB AUTO_INCREMENT=110527853 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci’

-show create table dt;

‘dt’, ‘CREATE TABLE dt (\n txt text,\n d date DEFAULT NULL,\n t time DEFAULT NULL\n) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci’

I would however understand if you prefer not to spend anymore time on this problem :wink: And yes I itend to create hunderts of columns as it is my order of my supervisor to do so.

some quick googling lead me to something called pivot:

sql - How can I return pivot table output in MySQL? - Stack Overflow

which seems to do sort of what you want, I am not well versed in this kind of mysql matrix/pivot tables. But seems what you need

I think this is the last of my time I am going to put in, maybe someone will take over, but I doubt it. This is well outside the scope of codecademy

1 Like

Ok looks interesting :smiley: Thanks you very much your time and good to know :slight_smile: I was not aware of that.