Projects to practice web scraping

I just completed the sole Codecademy course on web scraping and was hoping to try out some of my skills with a project. I feel like the course was not sufficient to cover more advanced topics like pagination and so on.

Would love to hear from anyone who had the same experience. Did a particular project/resource help you get a better grip on web scraping?

Hey Ajax,

I have a passion for web scraping. Here are a few resources:

Web Scraping Test Sites

YouTube Channels

Books

I also recommend learning Scrapy, which is a web scraping and web crawling framework created for downloading, editing, and saving data from the web. It is more difficult to learn than BeautifulSoup; however, it’s a Beast! :smiling_imp:

If you decide to learn Scrapy, I recommend this video tutorial series by BuildwithPython. It will get you from “Zero to Hero” in no time.

Finally, if you feel comfortable, post your code on GitHub. I’ll be more than happy to code review, collaborate and contribute.

Best regards,

This is incredible! Thank you so much for sharing these resources! I’ll try to post my code here or on GitHub for web scraping a few repos very soon.

It might also help to learn more about how webpages are structured, so you know how to access elements & attributes when scraping.
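To make that concrete, here is a small sketch using Python’s built-in html.parser to pull attribute values out of tags; the HTML snippet is made up for the example:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


# A made-up HTML snippet for illustration:
html = """
<div class="repo">
  <a href="https://github.com/cosmos/ibc-go">ibc-go</a>
  <a href="https://github.com/osmosis-labs/osmosis">osmosis</a>
</div>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['https://github.com/cosmos/ibc-go', 'https://github.com/osmosis-labs/osmosis']
```

Once you understand how tags, attributes, and nesting work, the BeautifulSoup selectors you use when scraping make much more sense.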

And, as always: read a website’s documentation about scraping their data first. Make sure it’s allowed or else it’s a good way to get your IP banned from their site.
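A quick way to do that check programmatically is Python’s built-in urllib.robotparser; the robots.txt content below is invented purely for illustration:

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, just for illustration:
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a generic crawler may fetch these (hypothetical) URLs:
print(parser.can_fetch('*', 'https://example.com/public/page'))   # True
print(parser.can_fetch('*', 'https://example.com/private/data'))  # False
```

In practice you would point RobotFileParser at the real site’s /robots.txt (via set_url() and read()) before scraping.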

Good point! Either that, or you could spoof the user-agent, and use rotating proxies! :smiling_imp:

No, I wouldn’t recommend that. Abide by a site’s data rules, period.

Thank you for sharing! Would you happen to know if it’s possible to scrape for specific words within a series of code on GitHub?

For example: if I wanted to check whether this link (github.com/cosmos/cosmos-sdk/x/auth/types) exists within this repo (ibc-go/account.go at main · cosmos/ibc-go · GitHub), is that possible?

Yes, of course! Here’s one approach:

import requests
from bs4 import BeautifulSoup
import re


def check_link(url, link):
    """Check whether `link` appears in the page at `url`."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    pattern = re.compile(link)
    match = re.search(pattern, str(soup))

    return bool(match)


if __name__ == '__main__':
    # GitHub repo URL
    url = 'https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go'

    # These are the links that you want to check for within this repo (url):
    l1 = 'github.com/cosmos/cosmos-sdk/x/auth/types'
    l2 = 'github.com/cosmos/cosmos-sdk/types'
    l3 = 'github.com/cosmos/cosmos-sdk/types/address'
    l4 = 'github.com/cosmos/cosmos-sdk/types/errors'
    l5 = 'github.com/cosmos/cosmos-sdk/x/auth/types'
    l6 = 'github.com/seraph776/'

    # Put links in a list:
    links = [l1, l2, l3, l4, l5, l6]
    
    # Loop through all the links and check:
    for link in links:
        print(check_link(url, link))

Output

True
True
True
True
True
False

This code might not work for all use cases, but it works for what you asked. I added some comments to the code; however, if you have any questions, please ask!

This is amazing, thank you so much for the script!
I’m a bit new to Python, so this might be a dumb question, but I was wondering what the if statement above is doing (and why the double underscores?)

Similarly here, I wasn’t sure what these 2 lines of code are doing.

Hey Ajax,

The __name__ == '__main__' check is not part of the program’s scraping functionality; it is just good practice to use it.
When the Python interpreter reads a source file, it defines a few special global variables. If the interpreter is running that module (the source file) as the main program, it sets the special __name__ variable to the value “__main__”. If the file is being imported from another module, __name__ is instead set to the module’s name. For example, let’s say you have the following two files in the same directory:

Directory Structure

Project/
├── foo.py
└── bar.py

foo.py

Let’s say foo.py has the following lines of code.

# foo.py

print('This is foo')
print(__name__)

If we run the script the output will be:

This is foo
__main__

Explanation: Since we are running foo.py as the main module (not importing it), its variable __name__ is set to __main__.

bar.py

Now, let’s open bar.py, import foo, and run this module.

# bar.py
import foo

print('This is bar')

The output will be:

This is foo
foo
This is bar

Explanation: That’s because we’re importing foo.py, and its __name__ variable is now set to the module’s name (foo). Does that make sense?

Regular Expression

With these two lines of code I am using a regular expression to match a specific pattern in the BeautifulSoup object.

pattern = re.compile(link)
match = re.search(pattern, str(soup))

Explanation: re.compile turns a regular expression pattern (here, a plain string) into a regular expression object, which can be used for matching. So the pattern that you want to find is github.com/cosmos/cosmos-sdk/x/auth/types (i.e., link). It is compiled into a regex object, and that pattern can then be searched for in the soup object that was created. Does that make sense?
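Here is a toy, self-contained version of those two lines, with a made-up string standing in for the page source:

```python
import re

# A made-up stand-in for str(soup):
page_source = '<a href="github.com/cosmos/cosmos-sdk/x/auth/types">auth</a>'

# Compile the link into a regex object, then search the "page" for it:
pattern = re.compile('github.com/cosmos/cosmos-sdk/x/auth/types')
match = re.search(pattern, page_source)

print(bool(match))  # True
```

One caveat worth knowing: because the link string is compiled as a regular expression, a dot in it matches any character, not only a literal “.”. Wrapping the string in re.escape() before compiling forces a purely literal match.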

Conclusion

I hope I answered all of your questions about my code example. Remember, there are no dumb questions. I consider myself a student of Python because I am always learning, and I am glad I could help. I have a passion for web scraping, so if you have any more questions or want me to look at some of your code, please feel free to ask.

I respectfully disagree. Abiding by a website’s rules for scraping is a matter of courtesy, not legality.

Oracle v. Rimini Street (2018)

“Taking data using a method prohibited by the applicable terms of use” (i.e., scraping) “when the taking itself generally is permitted, does not violate” the state computer crime laws.

LinkedIn v. HiQ Labs (2022)

Web scraping doesn’t qualify as accessing a protected computer without authorization.

Sandvig v. Sessions (2020)

The US District Court in Washington, DC, ruled that violating a website’s terms of service isn’t a crime under the Computer Fraud and Abuse Act.

Criminalizing terms-of-service violations risks turning each website into its own criminal jurisdiction and each webmaster into his own legislature. Such an arrangement, wherein each website’s terms of service “is a law unto itself”, would raise serious problems.

Conclusion

Web scraping publicly available data is not illegal, period! It only becomes illegal when scraping non-publicly-available data. Additionally, using proxies or a VPN to hide your location or bypass geographical restrictions is not illegal either. It’s honorable that you want to respect websites’ policies for scraping; however, it’s not a requirement…

Best regards,

I never said anything about legality. I said, ‘Abide by a site’s data rules, period.’
And it IS a requirement (not a suggestion) based on the site’s rules.
I.e., read their documentation.

No, it is not a requirement, because you can scrape a website without regard to a site’s data rules if that data is publicly available, and what you are doing is LEGAL!

Again, web scraping does not violate state computer crime laws, and violating Terms of Service is not enforceable. Therefore, there is no sense in abiding by a site’s data rules if those rules cannot be enforced; except for the sake of being courteous, which, unfortunately, is not a requirement… Now, I’ll concede it may be good practice to read a site’s documentation on web scraping; however, it is not “required” to abide by those rules to scrape a site clean. The website may not like it, and may try to block you, but that’s when you spoof user-agents and use rotating proxies!! :smiling_imp:

Now, I respect that you want to follow unenforceable rules set by websites; however, I disagree when you say it “IS a requirement (not a suggestion)” ~ because that’s an opinion, a misconception, and a matter of personal policy when it comes to web scraping.

Do you agree or disagree?

Best regards,

Everything you mentioned makes sense to me. Thank you so much for taking the time to explain things! Much appreciated.

I do have just one more question. If I wanted to check whether these links exist in multiple URLs, is the best way to do that by defining multiple if statements? For example, the above script checks for the existence of links in the URL (ibc-go/account.go at main · cosmos/ibc-go · GitHub).

What, in your opinion, is the best approach to check in the above-mentioned link and in (osmosis/modules.go at main · osmosis-labs/osmosis · GitHub)? So essentially iterating through 2 or more URLs.

The best approach would be to put the URLs you want to check in one list and the links you want to search for in another list, then iterate over them in a nested for-loop like so:

if __name__ == '__main__':

    # These are the links that you want to check for within each repo URL:
    l1 = 'github.com/cosmos/cosmos-sdk/x/auth/types'
    l2 = 'github.com/cosmos/cosmos-sdk/types'
    l3 = 'github.com/cosmos/cosmos-sdk/types/address'
    l4 = 'github.com/cosmos/cosmos-sdk/types/errors'
    l5 = 'github.com/cosmos/cosmos-sdk/x/auth/types'
    l6 = 'github.com/seraph776/'

    # Put links in a list:
    links = [l1, l2, l3, l4, l5, l6]

    # GitHub repo URLs
    urls = ['https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go',
            'https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go']

    # Loop through all the links:
    for link in links:
        # Loop through all the urls:
        for url in urls:
            # Check results:
            print(f'Checking if LINK: <{link}> is in URL: <{url}>: {check_link(url, link)}')

Output

Checking if LINK: <github.com/cosmos/cosmos-sdk/x/auth/types> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/x/auth/types> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/types> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/types> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/types/address> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/types/address> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: False
Checking if LINK: <github.com/cosmos/cosmos-sdk/types/errors> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/types/errors> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: False
Checking if LINK: <github.com/cosmos/cosmos-sdk/x/auth/types> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: True
Checking if LINK: <github.com/cosmos/cosmos-sdk/x/auth/types> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: True
Checking if LINK: <github.com/seraph776/> is in URL: <https://github.com/cosmos/ibc-go/blob/main/modules/apps/27-interchain-accounts/types/account.go>: False
Checking if LINK: <github.com/seraph776/> is in URL: <https://github.com/osmosis-labs/osmosis/blob/main/app/modules.go>: False
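As a side note, the nested loops above can also be expressed with itertools.product from the standard library, which yields every (link, url) pair in the same order; the values below are placeholders, not the real links and URLs:

```python
from itertools import product

# Placeholder values standing in for the real links and URLs:
links = ['link-a', 'link-b']
urls = ['url-1', 'url-2']

# product(links, urls) is equivalent to the nested for-loops:
for link, url in product(links, urls):
    print(link, url)
```

Which form you prefer is a matter of taste; product() just keeps the pairing logic in one line.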

Let me know if that works out for you! Again, if you have any more questions do not hesitate to ask!

This is super elegant. Thank you so much!