I am getting a weird error when trying to write bs4 output to a file


#1

I am getting a weird error when trying to write output to a file, and I am stuck. Thanks for your help.
import urllib2
import bs4

# open the web page
response = urllib2.urlopen('http://npr.org')

# get the response text
raw_html = response.read()

# use BeautifulSoup to convert raw text into a BeautifulSoup object
soup = bs4.BeautifulSoup(raw_html)

# convert soup to plain text
plain_text = soup.get_text()

f = open('html.txt', 'w')
f.write(plain_text )
f.close()

The error is:

Traceback (most recent call last):
  File "C:\Users\Owner\Documents\Markrov chain\Fetch_data.py", line 21, in <module>
    f.write(plain_text )
UnicodeEncodeError: 'ascii' codec can't encode characters in position 19698-19700: ordinal not in range(128)


#2

First off: when you post code, make sure it's intact so that others can copy it, run it, and get the same result.

I don't understand how unicode works, and I don't know what it means to decode/encode unicode, but I believe that what you need to do is this:

f.write(plain_text.encode('utf-8'))

Also, I tried doing the same thing in Python 3, where unicode is the default string class (it was renamed to str), and there it "worked on its own".
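
If you would rather not call encode by hand, another option (a sketch, not something I have run against your script) is to open the file through io.open with an explicit encoding, which as far as I know behaves the same on Python 2 and Python 3. The sample string and the temp-file path below are made up for illustration; in your script plain_text would come from soup.get_text().

```python
import io
import os
import tempfile

# Stand-in for soup.get_text(); contains non-ASCII characters on purpose.
plain_text = u'Caf\u00e9 news \u2026'

# Hypothetical output path inside a fresh temp directory.
out_path = os.path.join(tempfile.mkdtemp(), 'html.txt')

# io.open accepts an encoding, so the file object turns the
# unicode text into UTF-8 bytes when writing.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(plain_text)

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

The with block also closes the file for you, so there is no separate f.close() to forget.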

I believe (keyword: believe, this might not be correct) that file.write expects to receive bytes that can be written to the file immediately. A unicode object is not that: it is a sequence of characters, and how many bytes each character takes depends on the encoding. When you hand file.write a unicode object in Python 2, it tries to convert it to bytes for you using the default 'ascii' codec, which fails as soon as the text contains a non-ASCII character; that is the UnicodeEncodeError in your traceback. So what you need is the bytes for your unicode string (they will not correspond 1:1 to characters, since in UTF-8 a character can take one to four bytes). You get those by calling plain_text.encode('utf-8'); those bytes can then be written to the file and all is well.

Python 2's str class is just a sequence of bytes, one byte per item, nothing fancy whatsoever. It only gets treated as ASCII when Python has to convert between str and unicode implicitly.
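
To make the characters-versus-bytes point concrete, here is a small sketch with a made-up sample string (not taken from the npr.org page). It compares the number of characters with the number of UTF-8 bytes and shows that the 'ascii' codec refuses the same text.

```python
# Eight characters: n, a, i-with-diaeresis, v, e, space, euro sign, 5
text = u'na\u00efve \u20ac5'

# Encoding turns the characters into bytes.
utf8_bytes = text.encode('utf-8')

char_count = len(text)        # number of characters
byte_count = len(utf8_bytes)  # number of bytes after encoding

# The 'ascii' codec cannot represent the two non-ASCII characters,
# so it raises the same kind of error as in the traceback.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Here char_count is 8 and byte_count is 11, because the accented letter takes two bytes in UTF-8 and the euro sign takes three.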


#3

This works. Thank you so much.