In Python, I'm attempting to write a simple web crawler. The recursion and depth-tracking parts of this problem are tripping me up right now.
def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for word in text:
        self._index[word] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for link in links:
            if link not in self._listOfCrawled:
                self.crawl(link, maxDepth)
Given a url and a maxDepth for how many levels of sites from there I want to follow, I add the url to the set of crawled sites, download all the text and links from the page, and index every word against that url. I then want to crawl each link found on that page in the same way. The problem is that by the time I make the recursive call on the next url, self._depth has already reached maxDepth, so the crawl stops after only one more page. I hope I've stated this clearly; essentially, my question is: how do I perform all of the recursive calls at the current level before incrementing self._depth?
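For what it's worth, the shape I suspect I'm after is something like the sketch below, where the current depth is passed down as an argument instead of being stored on the instance. It uses the same crawler_util helpers and instance attributes as my code above; I'm not sure whether this is actually the right fix, which is part of what I'm asking.

def crawl(self, url, maxDepth, depth=0):
    # Tentative idea: track depth per call rather than mutating self._depth.
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for word in text:
        self._index[word] = url
    if depth < maxDepth:
        for link in crawler_util.linksFromURL(url):
            if link not in self._listOfCrawled:
                # Every sibling link is crawled with the same depth + 1,
                # so one branch can't use up the depth budget for the others.
                self.crawl(link, maxDepth, depth + 1)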