Visualizing Data that's Before Our Eyes


#1

How does one go from this…

2018 FIFA World Cup Russia™ - Matches

to this…

========= RESTART: D:/cc/discuss/users/mtf/fifa_world_cup_venues.py =========
 1: Ekaterinburg Arena       Ekaterinburg
 2: Fisht Stadium            Sochi
 3: Kaliningrad Stadium      Kaliningrad
 4: Kazan Arena              Kazan
 5: Luzhniki Stadium         Moscow
 6: Mordovia Arena           Saransk
 7: Nizhny Novgorod Stadium  Nizhny Novgorod
 8: Rostov Arena             Rostov-On-Don
 9: Saint Petersburg Stadium St. Petersburg
10: Samara Arena             Samara
11: Spartak Stadium          Moscow
12: Volgograd Arena          Volgograd
>>> 

Why search when you can have it at your disposal? Data can be brought in from anywhere, but it may take a strange form. Some manual work is involved in dissecting it, but if we are sensible in our approach, it doesn’t have to take a lot of time or effort.

Ping this topic if you are interested in venturing down this avenue. I’d be glad to share how we got to this point, but only if there is interest.


#2

Hello, I would be interested in seeing how this type of data grab is done.


#3

‘grab’ is the operative word. Click the webpage, Ctrl+A, Ctrl+C, click text editor, File > New, Ctrl+V. Save as fifa_word_cup_raw.txt. What do you have now?

2018 FIFA World Cup Russia™

Kick-offs are shown in your time zone
Thursday 14 June
14 Jun 2018 - 18:00 Local time
Group A
Luzhniki Stadium
Moscow
Full-time
RUS
Russia
5-0
KSA
Saudi Arabia
Friday 15 June
15 Jun 2018 - 17:00 Local time
Group A
Ekaterinburg Arena
Ekaterinburg
Full-time
EGY
Egypt
0-1
URU
Uruguay
... // to over 500 lines.

Everything below line 499 was deleted from the file. It’s all the page nav. The data we’ll be scraping is everything from pretty much the top down to that line.

My next manual step was to insert tab characters where indicated by ->,

Group A
Luzhniki Stadium->
Moscow
Full-time
RUS->
Russia->
5-0->
KSA->
Saudi Arabia

Save the file as ...working.txt

The next step is just as tedious. Collapse the lines at all the tabs. Click, Delete, Click, Delete, … Work from the bottom of the file to make it a little easier. Save as you go.

When all is said and done, we are able to spin off another file and delete everything but the lines we collapsed, to get,

RUS\tRussia\t5-0\tKSA\tSaudi Arabia
EGY\tEgypt\t0-1\tURU\tUruguay
MAR\tMorocco\t0-1\tIRN\tIR Iran
... // 48 lines in total

The data to open this topic was gleaned in much the same manner. Spin off a new file, and delete all the lines that do not apply, then save.

Luzhniki Stadium\tMoscow
Ekaterinburg Arena\tEkaterinburg
Saint Petersburg Stadium\tSt. Petersburg
... // 48 lines in total
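For anyone who would rather not do the tab-and-collapse steps by hand, here is a rough sketch of how they might be scripted. It assumes every match in the raw text follows the layout seen in the excerpt above: a kick-off line such as "14 Jun 2018 - 18:00 Local time" followed by nine lines (group, stadium, city, status, then the five result fields). That layout is inferred from the two matches shown, so other parts of the page may not fit it. The names RAW, KICKOFF and collapse are made up for this illustration.

```python
import re

# Sample copied from the raw excerpt shown earlier in this post.
RAW = """\
Thursday 14 June
14 Jun 2018 - 18:00 Local time
Group A
Luzhniki Stadium
Moscow
Full-time
RUS
Russia
5-0
KSA
Saudi Arabia
Friday 15 June
15 Jun 2018 - 17:00 Local time
Group A
Ekaterinburg Arena
Ekaterinburg
Full-time
EGY
Egypt
0-1
URU
Uruguay
"""

# Assumption: a match record begins with a line shaped like this one.
KICKOFF = re.compile(r"^\d{1,2} \w{3} \d{4} - \d{2}:\d{2} Local time$")

def collapse(raw_text):
    """Return (venues, results) as lists of tab-separated rows."""
    lines = [ln.strip() for ln in raw_text.splitlines()]
    venues, results = [], []
    for i, line in enumerate(lines):
        # Only use a kick-off line if all nine following lines exist.
        if KICKOFF.match(line) and i + 9 < len(lines):
            group, stadium, city, status, *teams = lines[i + 1:i + 10]
            venues.append("\t".join([stadium, city]))
            results.append("\t".join(teams))
    return venues, results

venues, results = collapse(RAW)
print("\n".join(venues))
print("\n".join(results))
```

Running it on the full raw file would mean reading that file into a string first. Any record that does not match the assumed layout would need checking by eye.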

At this point we have two tab-separated files that can be quickly converted to CSV in Excel, if need be. But since this is all manual, I’ve just kept working with the files. Python can open them, but I didn’t take that route and instead pasted the data into a script as a string. Reading the data file would probably have been easier.

venues = "Luzhniki Stadium\tMoscow*\
Ekaterinburg Arena\tEkaterinburg*\
Saint Petersburg Stadium\tSt. Petersburg*\
... // 48 lines in total"

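I’ve manually inserted the *\ at the end of each row: the * gives us a unique character to split on, and the trailing backslash lets the string continue onto the next line of the script without a line break ending up in the string.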
See if you can get to this point and we’ll get into the rest of the Python.
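As an alternative to pasting the data into the script, the tab-separated file could be read directly. The sketch below is only one way that might look. The file name fifa_venues.txt and the helper read_rows are hypothetical (the venues file is never named in this thread), and a three-row sample is written to a temporary folder first so the example runs on its own.

```python
import csv
import os
import tempfile

# Hypothetical file name, created in a temp folder for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "fifa_venues.txt")

# Write a three-row sample so the sketch is self-contained.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("Luzhniki Stadium\tMoscow\n"
            "Ekaterinburg Arena\tEkaterinburg\n"
            "Saint Petersburg Stadium\tSt. Petersburg\n")

def read_rows(filename):
    """Read a tab-separated file into a list of tuples, skipping blank lines."""
    with open(filename, encoding="utf-8", newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t") if row]

rows = read_rows(path)
for stadium, city in rows:
    print("{:24} {}".format(stadium, city))
```

With the real file in place of the sample, no * or backslash markers would be needed, since each line of the file is already one row.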


#4

Thanks for going through this. I’m able to get the raw data but I’m a little confused as to the rationale for adding tab characters on certain lines and then the “\t” when collapsing lines. How do I determine if a line should get a tab character and what is the meaning of the backslash t?
Thanks


#5

The raw data is broken into separate lines. Adding the tab characters makes it possible to compose tabular data once those lines are collapsed. (Tabular data is easy to paste into a spreadsheet.) The tabs are embedded as the escape sequence, \t. See below for clarification…

[Image: group phase tabulated data (fifa_group_phase_1-124_tabulated)]

That is the top half of the game data, tabulated. The whitespace is tabs not spaces.

[Image: venues tabulated data (fifa_venues_1-124)]

Again, the whitespace is tabs separating two columns, each containing spaces.
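To make the meaning of the backslash t concrete, here is a small sketch using one row from the venues data. Inside a Python string, \t is an escape sequence that stands for a single tab character, so splitting on it recovers the two columns. The variable names are just for illustration.

```python
row = "Luzhniki Stadium\tMoscow"   # \t is written as two keystrokes but stored as one tab

print(len("\t"))          # 1
print(repr(row))          # 'Luzhniki Stadium\tMoscow'
fields = row.split("\t")  # split the row at the tab
print(fields)             # ['Luzhniki Stadium', 'Moscow']
```

The spaces inside "Luzhniki Stadium" are left alone because the split is on the tab only, which is why tabs rather than spaces were used to mark the columns.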

Python source code
venues = "Luzhniki Stadium	Moscow*\
Ekaterinburg Arena	Ekaterinburg*\
Saint Petersburg Stadium	St. Petersburg*\
Fisht Stadium	Sochi*\
Kazan Arena	Kazan*\
Spartak Stadium	Moscow*\
Mordovia Arena	Saransk*\
Kaliningrad Stadium	Kaliningrad*\
Samara Arena	Samara*\
Luzhniki Stadium	Moscow*\
Rostov Arena	Rostov-On-Don*\
Nizhny Novgorod Stadium	Nizhny Novgorod*\
Fisht Stadium	Sochi*\
Volgograd Arena	Volgograd*\
Mordovia Arena	Saransk*\
Spartak Stadium	Moscow*\
Saint Petersburg Stadium	St. Petersburg*\
Luzhniki Stadium	Moscow*\
Rostov Arena	Rostov-On-Don*\
Kazan Arena	Kazan*\
Samara Arena	Samara*\
Ekaterinburg Arena	Ekaterinburg*\
Nizhny Novgorod Stadium	Nizhny Novgorod*\
Saint Petersburg Stadium	St. Petersburg*\
Volgograd Arena	Volgograd*\
Kaliningrad Stadium	Kaliningrad*\
Spartak Stadium	Moscow*\
Rostov Arena	Rostov-On-Don*\
Fisht Stadium	Sochi*\
Nizhny Novgorod Stadium	Nizhny Novgorod*\
Ekaterinburg Arena	Ekaterinburg*\
Kazan Arena	Kazan*\
Samara Arena	Samara*\
Volgograd Arena	Volgograd*\
Kaliningrad Stadium	Kaliningrad*\
Mordovia Arena	Saransk*\
Fisht Stadium	Sochi*\
Luzhniki Stadium	Moscow*\
Saint Petersburg Stadium	St. Petersburg*\
Rostov Arena	Rostov-On-Don*\
Kazan Arena	Kazan*\
Ekaterinburg Arena	Ekaterinburg*\
Spartak Stadium	Moscow*\
Nizhny Novgorod Stadium	Nizhny Novgorod*\
Volgograd Arena	Volgograd*\
Samara Arena	Samara*\
Mordovia Arena	Saransk*\
Kaliningrad Stadium	Kaliningrad".split('*')
# Print each unique venue once, numbered from 1, in two aligned columns.
for i, x in enumerate(sorted(set(venues)), start=1):
    stadium, city = x.split('\t')
    print("{:2}: {:24} {}".format(i, stadium, city))

The first step is to create a list split on the rows, which is where the * comes into play.

The second step is to tabulate on the screen only the unique rows, each one split on the '\t' to form the two columns. The row numbers are added arbitrarily, for visual purposes only.
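To watch the two steps happen one at a time, the same idea can be tried on a four-row sample in which one row is repeated. This is only a sketch; the names sample, rows and unique are made up here, and the tabs are written as \t rather than typed in directly.

```python
# Four rows in the same "*\" layout as the full string (the first row repeats).
sample = "Luzhniki Stadium\tMoscow*\
Ekaterinburg Arena\tEkaterinburg*\
Fisht Stadium\tSochi*\
Luzhniki Stadium\tMoscow"

rows = sample.split('*')      # step 1: one list item per row
unique = sorted(set(rows))    # step 2: drop repeats, then sort alphabetically

print(len(rows), len(unique))  # 4 3
for number, row in enumerate(unique, start=1):
    stadium, city = row.split('\t')
    print("{:2}: {:24} {}".format(number, stadium, city))
```

The set removes the repeated Luzhniki Stadium row, and sorted turns the set back into an ordered list so the numbering comes out the same on every run.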


#6

Ok that makes sense now, thanks. I believe I am caught up with you…I have my two tab separated files and am ready to paste the data into Python.