Web spider path

Hi,
I know what I want to make, but don’t have any coding background. In other words, I don’t know what I don’t know.

I’d like to make a web spider to crawl the internet. The spider should then index sites that have a certain technology. For example, ‘show me all the sites in the world hosting Shopify’. Those sites should then be exported to CSV.

I’m looking at Python & Scrapy. But I don’t know what else I’ll need. Any thoughts on what modules I’ll need in Codecademy to make this journey as smooth and direct as possible?

Thanks!!

The code for this can actually be super simple, depending on how ambitious a project you have in mind. If hiring someone to build it turns out to make the most sense, I don’t have time to take on a freelance contract for you …at least not for free.

Doing this for fun or a learning exercise?

I think this is a great choice for a fun project! …and there are lots of google-able tutorials on building a web scraper. The toughest part of the technical design will be deciding how specific you want to get about which web technologies a site uses, and which criteria will let you deduce that. However, since this is for fun & learning, I wouldn’t get too caught up in designing a precise end-product.
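To make “defining criteria” a bit more concrete, here is a minimal sketch in plain Python (no third-party libraries). It assumes a page counts as Shopify if its HTML mentions one of a few marker strings, then writes the matching URLs to CSV. The marker list and the function names `looks_like_shopify` and `export_matches` are my own assumptions for illustration, and the sample pages are made up. A real crawler would fetch the HTML itself and would need better fingerprints than this.

```python
import csv
import io

# Assumed fingerprints: strings that commonly show up in Shopify storefront HTML.
# This list is illustrative, not authoritative or complete.
SHOPIFY_MARKERS = ("cdn.shopify.com", "myshopify.com")


def looks_like_shopify(html):
    """Return True if any assumed marker string appears in the page HTML."""
    lowered = html.lower()
    return any(marker in lowered for marker in SHOPIFY_MARKERS)


def export_matches(pages, out):
    """Write a CSV row for every page that looks like Shopify.

    `pages` maps URL -> HTML text; `out` is any writable text file object.
    Returns the number of matching pages.
    """
    writer = csv.writer(out)
    writer.writerow(["url", "technology"])
    count = 0
    for url, html in pages.items():
        if looks_like_shopify(html):
            writer.writerow([url, "Shopify"])
            count += 1
    return count


if __name__ == "__main__":
    # Made-up sample pages standing in for HTML a crawler would download.
    sample_pages = {
        "https://shop.example": '<script src="https://cdn.shopify.com/s/app.js"></script>',
        "https://blog.example": "<html><body>Just a blog</body></html>",
    }
    buffer = io.StringIO()
    found = export_matches(sample_pages, buffer)
    print(found, "matching site(s)")
    print(buffer.getvalue())
```

Swapping `io.StringIO()` for `open("sites.csv", "w", newline="")` would write a real file. Scrapy or Beautiful Soup would replace the hand-written sample pages with pages fetched from the web.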

Doing this for commercial work?

We need to get way more specific about your deliverables (a.k.a. what precisely do they want this to help them do at the end of the day …and what do they assume that will look like?). Since you haven’t provided much of that information, I’m willing to wager that you are getting way too deep into this problem for your expertise and there’s likely a better way.

Disclaimer: I want way more people to learn how to code, but if you’re already under a deadline …you can’t afford the time it takes to build up your knowledge for a project this wide in scope.

However, there is hope! Again, I can only run with how little you’ve shared. You want a more specific answer? Try asking a more specific question.

I’m willing to wager that you’re working on more of a data science project that can just as easily be solved without “reinventing the wheel” yourself. If you’re just trying to help your clients make more data-informed decisions, then I wouldn’t invest a ton in computing resources to crawl “the entire internet” doing your own primary research. Why not leverage the data others have already collected on this problem?

There are many teams out there that have already answered these types of questions and have data widely available for you to leverage. You might be better off preparing a dataset (a.k.a. spreadsheet) with this data that you’ve manually acquired yourself. That way, you can answer the questions like “How many sites in the world are using Shopify? How many sites in the Top 1000 most visited sites are using Shopify?” and on and on.
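Once you have a spreadsheet like that, answering those questions takes only a few lines. Here is a rough sketch, assuming the data is saved as CSV with `domain`, `rank`, and `technology` columns. The column names, the `count_sites` helper, and the three sample rows are all invented for illustration and are not real data.

```python
import csv
import io


def count_sites(rows, technology, max_rank=None):
    """Count rows using `technology`, optionally only those ranked <= max_rank."""
    total = 0
    for row in rows:
        if row["technology"].strip().lower() != technology.lower():
            continue
        if max_rank is not None and int(row["rank"]) > max_rank:
            continue
        total += 1
    return total


# Made-up placeholder rows; a real dataset would come from an existing source.
SAMPLE_CSV = """domain,rank,technology
shop-one.example,120,Shopify
blog-two.example,450,WordPress
store-three.example,2300,Shopify
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

if __name__ == "__main__":
    print("Sites using Shopify:", count_sites(rows, "Shopify"))
    print("...within the top 1000:", count_sites(rows, "Shopify", max_rank=1000))
```

To use a real file, replace `io.StringIO(SAMPLE_CSV)` with `open("your_file.csv", newline="")`.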

What people who don’t code often don’t realize is that most technologies “stand on the shoulders of giants.” Every hand-coded solution likely leverages many libraries & frameworks where other people have already done the heavy lifting. Here’s an extreme example: I didn’t have to build a crawler to get a “good enough” answer to that one. It took one Google search:

How many sites in the world are using Shopify? 1,661,942
(Source: https://trends.builtwith.com/)

Hi!

Thanks so much for your thoughtful reply. That’s awesome. I love how much detail you put into it.

BuiltWith is a great example of what I’d like to do. It’s fairly old technology, so there should be existing libraries of code. I don’t need a polished solution like BuiltWith, but I would like to learn to crawl the web and export the results.

Since I’m 100% new to coding I don’t know what I need to learn. Obviously there are some basics like Python syntax. And the course on Beautiful Soup would probably be very helpful. But what other Python courses would be ‘required’ learning? Do I need to know about databases? I’m guessing the answer is yes, but I don’t know which course is the right one. I also don’t know how to access and implement 3rd party libraries at the moment.

Appreciate any recommendations on what to study!