Scraping XML data from multiple links on the same webpage


#1

I can’t figure out how to get Scrapy to crawl multiple links on the same webpage. This is for data mining congressional legislation from the Government Publishing Office, specifically this page: https://www.gpo.gov/fdsys/bulkdata/BILLSTATUS/115/hr.

In any given Congress, around 10,000 bills are introduced, so I need the code to look for bill numbers beyond 10,000 to ensure that every possible bill is mined. Setting the upper bound at 20,000 would ensure that happens.

Notice the end of the URL. I’d need it to go from /115/hr1 to /115/hr20000.
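
Roughly, I’m imagining something like this, though I haven’t gotten it working (just a sketch; the spider name and the parse callback are placeholders):

import scrapy

class BillsSpider(scrapy.Spider):
    # Placeholder name; generates one request per possible bill number
    name = "bills"
    start_urls = [
        f"https://www.gpo.gov/fdsys/bulkdata/BILLSTATUS/115/hr{n}"
        for n in range(1, 20001)
    ]

    def parse(self, response):
        # Placeholder: pull whatever fields the bill status data exposes
        yield {"url": response.url}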


#2

Am I missing something? Why not just iterate through each of the links and process them? That would be fairly easy with a combination of requests and BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "https://www.gpo.gov/fdsys/bulkdata/BILLSTATUS/115/hr"

r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')

# Keep only the links that point at the bulk-data XML files
for link in soup.find_all('a'):
    href = link.get('href')
    if href is not None and href.endswith('.xml') and href.startswith('bulkdata'):
        print(href)

From here you just grab the URLs and pull the data you need.
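
For example, something like this to fetch and parse one of them (untested sketch; it assumes the relative hrefs resolve against https://www.gpo.gov/fdsys/, and the href value and element name below are hypothetical, so check a real file for the actual structure):

from urllib.parse import urljoin
import xml.etree.ElementTree as ET
import requests

# A hypothetical href collected from the loop above
href = "bulkdata/BILLSTATUS/115/hr/BILLSTATUS-115hr1.xml"
xml_url = urljoin("https://www.gpo.gov/fdsys/", href)

root = ET.fromstring(requests.get(xml_url).content)
# Illustrative element name; inspect a real file for the actual tags
print(root.findtext(".//title"))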


#3

The only problem is that I need it to use Scrapy as well, because the spider will be mining other websites too. I’m using XPath, but I’m having a tough time incorporating bs4 with Scrapy.


#4

Why would you incorporate bs4 then? Scrapy is perfectly suited for the job at hand; the trick is to understand XPath so you can scrape all the data. But you’re more likely to get a good answer from the Scrapy community than here on the Codecademy forum.
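
For what it’s worth, the same link extraction can be done in Scrapy alone; here’s a rough sketch (the spider name and the yielded item fields are placeholders):

import scrapy

class BillStatusSpider(scrapy.Spider):
    name = "billstatus"  # placeholder name
    start_urls = ["https://www.gpo.gov/fdsys/bulkdata/BILLSTATUS/115/hr"]

    def parse(self, response):
        # Same filter as the bs4 version: hrefs that start with
        # "bulkdata" and end in ".xml" (XPath 1.0 has no ends-with,
        # hence the substring trick)
        for href in response.xpath(
            '//a[starts-with(@href, "bulkdata") and '
            'substring(@href, string-length(@href) - 3) = ".xml"]/@href'
        ).getall():
            yield response.follow(href, callback=self.parse_bill)

    def parse_bill(self, response):
        # Placeholder: extract the fields you need from the XML here
        yield {"url": response.url}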

