‘grab’ is the operative word. Click the webpage, Ctrl+A, Ctrl+C, click text editor, File > New, Ctrl+V. Save as fifa_word_cup_raw.txt. What do you have now?
2018 FIFA World Cup Russia™
Kick-offs are shown in your time zone
Thursday 14 June
14 Jun 2018 - 18:00 Local time
Group A
Luzhniki Stadium
Moscow
Full-time
RUS
Russia
5-0
KSA
Saudi Arabia
Friday 15 June
15 Jun 2018 - 17:00 Local time
Group A
Ekaterinburg Arena
Ekaterinburg
Full-time
EGY
Egypt
0-1
URU
Uruguay
... // to over 500 lines.
Everything below line 499 was deleted from the file; it’s all page navigation. The data we’ll be scraping is everything from pretty much the top down to that line.
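If you would rather not do the trim by hand, a minimal sketch along these lines would work. The helper name keep_top_lines is made up for illustration, and the cut-off of 499 is simply the line number from my copy of the page; yours may differ.

```python
def keep_top_lines(text, last_line=499):
    """Return only the first `last_line` lines of `text`."""
    return "\n".join(text.splitlines()[:last_line])

# Stand-in for the pasted page text: 505 numbered lines.
sample = "\n".join(f"line {n}" for n in range(1, 506))
trimmed = keep_top_lines(sample)

# With the real file, something like this (untested against the live page):
# from pathlib import Path
# raw = Path("fifa_word_cup_raw.txt").read_text(encoding="utf-8")
# trimmed = keep_top_lines(raw)
```

Check the last few lines of the result by eye before trusting the cut-off.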
My next manual step was to insert tab characters where indicated by ->,
Group A
Luzhniki Stadium->
Moscow
Full-time
RUS->
Russia->
5-0->
KSA->
Saudi Arabia
Save the file as ...working.txt
The next step is just as tedious. Collapse the lines at all the tabs, so that each line ending in a tab is joined to the line below it. Click, Delete, Click, Delete, … Work from the bottom of the file to make it a little easier. Save as you go.
When all is said and done, we are able to spin off another file and delete everything but the lines we collapsed, to get,
RUS\tRussia\t5-0\tKSA\tSaudi Arabia
EGY\tEgypt\t0-1\tURU\tUruguay
MAR\tMorocco\t0-1\tIRN\tIR Iran
... // 48 lines in total
The data to open this topic was gleaned in much the same manner. Spin off a new file, and delete all the lines that do not apply, then save.
Luzhniki Stadium\tMoscow
Ekaterinburg Arena\tEkaterinburg
Saint Petersburg Stadium\tSt. Petersburg
... // 48 lines in total
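The click-and-delete pass could also be scripted. This is only a sketch of that idea, not what I actually did: it assumes the working file has a tab at the end of every line marked with -> earlier, and it guesses that a collapsed line with five fields is a result and one with two fields is a venue. The function name collapse_tabbed_lines is hypothetical.

```python
def collapse_tabbed_lines(lines):
    """Join every line that ends with a tab to the line that follows it."""
    collapsed = []
    buffer = ""
    for line in lines:
        buffer += line.rstrip("\n")
        if not buffer.endswith("\t"):
            # No trailing tab, so this logical line is complete.
            collapsed.append(buffer)
            buffer = ""
    if buffer:
        # The file ended on a tab; keep what we have.
        collapsed.append(buffer)
    return collapsed

# One match from the working file, tabs already inserted.
sample = [
    "Group A",
    "Luzhniki Stadium\t",
    "Moscow",
    "Full-time",
    "RUS\t",
    "Russia\t",
    "5-0\t",
    "KSA\t",
    "Saudi Arabia",
]

rows = collapse_tabbed_lines(sample)

# Assumption: four tabs means a result line, one tab means a venue line.
results = [row for row in rows if row.count("\t") == 4]
venues_rows = [row for row in rows if row.count("\t") == 1]
```

Lines with no tabs at all (group names, "Full-time", dates) fall out of both lists, which matches deleting them by hand.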
At this point we have two tab-separated files that can be quickly converted to CSV in Excel, if need be. But since this is all manual, I’ve just kept working with the files. Python can open them, but I didn’t take that route and instead pasted the data into a script as a string. Reading the data file would probably have been easier.
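For anyone who does want to take the file route, a minimal sketch using the standard csv module might look like this. The helper name read_tsv and the file name in the comment are placeholders, and an in-memory sample stands in for the real file so the snippet runs on its own.

```python
import csv
import io

def read_tsv(handle):
    """Read tab-separated rows from an open text handle, skipping blank lines."""
    return [row for row in csv.reader(handle, delimiter="\t") if row]

# In-memory stand-in for the venues file.
sample = io.StringIO(
    "Luzhniki Stadium\tMoscow\n"
    "Ekaterinburg Arena\tEkaterinburg\n"
)
venue_rows = read_tsv(sample)

# With a real file (hypothetical name):
# with open("venues.txt", newline="", encoding="utf-8") as f:
#     venue_rows = read_tsv(f)
```

Each row comes back as a list of fields, so venue_rows[0][1] is the city of the first venue. The pasted-string version I actually used follows.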
venues = "Luzhniki Stadium\tMoscow*\
Ekaterinburg Arena\tEkaterinburg*\
Saint Petersburg Stadium\tSt. Petersburg*\
... // 48 lines in total"
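I’ve manually inserted the *\ at the end of each line: first, so we have a unique character to split on; second, so we can leave the line breaks in the pasted text (the trailing backslash lets the string literal continue onto the next line).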
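As a preview of where this is heading, here is a small sketch of how a string built this way can be pulled apart. It uses only the three venues shown above, not the full data, and the name venue_rows is just for illustration.

```python
# The backslash at the end of each line continues the literal,
# so no newline characters end up inside the string itself.
venues = "Luzhniki Stadium\tMoscow*\
Ekaterinburg Arena\tEkaterinburg*\
Saint Petersburg Stadium\tSt. Petersburg"

# Split on the marker first, then split each entry on the tab.
venue_rows = [entry.split("\t") for entry in venues.split("*") if entry]

for stadium, city in venue_rows:
    print(f"{stadium} is in {city}")
```

The `if entry` guard drops the empty piece you would get if the last line also ended with a *.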
See if you can get to this point and we’ll get into the rest of the Python.