In Python, I'm attempting to create a simple web crawler. The recursion and depth-tracking aspects of this problem are tripping me up right now.
    def crawl(self, url, maxDepth):
        self._listOfCrawled.add(url)
        text = crawler_util.textFromURL(url).split()
        for each in text:
            self._index[each] = url
        links = crawler_util.linksFromURL(url)
        if self._depth < maxDepth:
            self._depth = self._depth + 1
            for i in links:
                if i not in self._listOfCrawled:
                    self.crawl(i, maxDepth)
Given a url and a maxDepth for how many links away from the start I want to follow, I add the url to the set of crawled sites, then download all the text and links from that page. For every link found on the page, I want to search for words and further links in the same way. The problem is that by the time the recursive call reaches the next url, self._depth has already hit maxDepth, so the crawl stops after only one more page. I hope I've stated it clearly; basically, my question is: how do I make all of the recursive calls happen, and where should I put self._depth += 1?
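One common way around this (a sketch of what I think I need, not my actual crawler) is to pass the current depth down as an argument instead of mutating shared state like self._depth, so each branch of the recursion carries its own counter. Here LINKS is a hypothetical in-memory link graph standing in for crawler_util, just to make the idea runnable:

```python
# Hypothetical link graph replacing crawler_util.linksFromURL for illustration.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": [],
    "e": [],
}

def crawl(url, max_depth, depth=0, crawled=None):
    """Visit url, then recurse into its links until max_depth is reached."""
    if crawled is None:
        crawled = set()
    crawled.add(url)
    if depth < max_depth:
        for link in LINKS.get(url, []):
            if link not in crawled:
                # depth + 1 is local to this call, so sibling
                # branches of the crawl are unaffected by it.
                crawl(link, max_depth, depth + 1, crawled)
    return crawled

print(sorted(crawl("a", 1)))  # ['a', 'b', 'c'] — direct links only
print(sorted(crawl("a", 2)))  # ['a', 'b', 'c', 'd', 'e'] — two levels deep
```

Because depth is a parameter rather than an attribute on self, finishing one branch doesn't leave a bumped-up counter behind for the next branch to see.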