UnicodeEncodeError: 'gbk' codec can't encode character u'\xbd' in position 212: illegal multibyte sequence


#1

I have been working on parsing this XML file in Python. After finally figuring out how to start pulling out the different attributes with ElementTree, I got this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xbd' in position 212: illegal multibyte sequence

The feed is an eCommerce site's product feed. One of their product descriptions has some strange characters in it, which is causing the error. What I want to know is whether there is a way to force Python to take the character in as is, so I can get the description value and keep moving on. The site is quite large, so I think there will probably be more product descriptions that cause the same error, and getting in touch with the company to fix them will be really hard.

Here is my code:
import xml.etree.ElementTree as etree

# Raw string so the backslashes in the Windows path are not treated as escapes.
tree = etree.parse(r'C:\Users\FOO\Desktop\catalog1.xml')

for node in tree.iter():
    print node.attrib.get('TableFieldID'), node.attrib.get('Value')

Here is the part of the XML Feed that is causing the problem:
TableFieldID="caption" Value="Measures 4"W x 1½"D x 11"H. ~"/>

Just noticed that CodeCademy converted the text to HTML. However, it is the HTML code for " that is causing the problem.


#2

I'm no expert on the subject, so please forgive my naivete. The print statement above is where the issue would need to be trapped. A try: except: approach would work well here: the 99% of cases that are fine slip by, and those that aren't get trapped.

That's where you will have a chance to take a closer look and manage the encoding on a case-by-case basis, then feed the result back into the flow.

>>> u'\xbd'
'½'
>>>
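
A minimal sketch of that try/except idea, assuming the console really does expect gbk as the error message says. The helper name encode_for_output is made up for illustration. Instead of skipping a value that fails, it falls back to replacing only the characters the codec cannot represent:

```python
def encode_for_output(text, encoding="gbk"):
    """Encode text for a console; unencodable characters become '?'."""
    try:
        # The common case: every character exists in the target encoding.
        return text.encode(encoding)
    except UnicodeEncodeError:
        # Only the rare problem values reach this branch.
        return text.encode(encoding, "replace")


caption = u'Measures 4"W x 1\xbd"D x 11"H.'
print(encode_for_output(caption))
```

The rest of the description survives this way; only the ½ is lost. Whether a "?" in its place is acceptable depends on what the output is used for.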

#3

@mtf thanks for the advice. Using try: except: works, but it skips the whole product description. What I don't understand is why running "print node.attrib" in the for loop doesn't give me an error, while printing the value of the attribute does. Why is this?

Doesn't generate an error:
for node in tree.iter():
    print node.attrib

Prints: Measures 5"w x 2\xbd"D x 12"H

Which is fine, because at least it isn't causing any errors and stopping the program. Any ideas on how to print the value without it causing an error?
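
A likely explanation, assuming Python 2 as in the code above: printing the whole dict shows the repr of each value, and the repr writes non-ASCII characters as escapes such as \xbd. That output is plain ASCII, which any console codec can encode. Printing the value itself sends the real ½ character to the console, and gbk has no byte sequence for it. If the escaped form is good enough, it can be produced on purpose. The helper name escaped below is hypothetical:

```python
def escaped(text):
    """Return ASCII-only bytes with non-ASCII characters written as escapes."""
    return text.encode("ascii", "backslashreplace")


value = u'Measures 5"w x 2\xbd"D x 12"H'
print(escaped(value))
```

This keeps every description in the output and never raises, at the cost of showing \xbd instead of ½.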


#4

Why are you encoding it to gbk? Use Unicode (fix your system/terminal settings).
Alternatively, if you must use an encoding that doesn't support characters in your text, don't print those characters, or view the output with another program. Your browser, if nothing else, probably handles Unicode without any modifications, so you could write to a file and display it there. (Make sure you explicitly use a Unicode encoding when writing the file, so that odd system settings aren't used as the default.)
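
As a rough sketch of the write-to-file route: io.open accepts an explicit encoding on both Python 2 and Python 3, so the system default never comes into play. The function name write_utf8 and the file name captions.txt are assumptions made for the example:

```python
import io
import os
import tempfile


def write_utf8(path, lines):
    """Write unicode lines to a UTF-8 file, independent of console settings."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + u"\n")


# Written to the temp directory here; any writable path would do.
target = os.path.join(tempfile.gettempdir(), "captions.txt")
write_utf8(target, [u'Measures 4"W x 1\xbd"D x 11"H.'])
```

The resulting file can then be opened in a browser or an editor that reads UTF-8.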


#5

@ionatan To be honest, I didn't know I was, and I'm not really sure how to change this. I am so new to all this and I didn't know I could, so sorry for my silly questions. Also, do you know of any resources that can tell me how to do this?

Are you saying that if I just "return" the command instead of printing it, I wouldn't get an error?

Thanks for the 411!

Thanks a ton for your feedback and questions. This really helps me a lot to better learn and figure out how to parse files.


#6

Whatever your Python's stdout is pointed at is likely indicating that output should be encoded in gbk; it would be better to use a terminal that indicates utf-8. I don't know what you mean by return, but yes: unless you switch out the application that is invoking Python with incompatible locale settings, you'll have to output to something else, such as a file that you then open with something able to read utf-8.
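
To see what stdout is asking for, something like the following may help. It is only a diagnostic sketch; can_encode is a made-up helper name, and the fallback to "ascii" is an assumption for the case where no encoding is reported:

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# sys.stdout.encoding can be None when output is piped (notably on Python 2).
console_encoding = sys.stdout.encoding or "ascii"
print(console_encoding, can_encode(u"\xbd", console_encoding))
```

If the reported encoding is gbk, the check returns False for ½. Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another way to change what stdout uses, though the terminal still has to be able to display the result.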


#7

Thanks for pointing me in the right direction. I was able to fix this by adding the following code to my program:

import codecs
import sys
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
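
For anyone wondering what that wrapper does, here is a small sketch using io.BytesIO as a stand-in for the console's byte stream (that substitution is an assumption made so the result can be inspected). The writer encodes every unicode string to UTF-8 bytes before passing it on:

```python
import codecs
import io

# io.BytesIO stands in for the console's byte stream in this sketch.
raw = io.BytesIO()
writer = codecs.getwriter("utf8")(raw)

# The writer accepts unicode text and hands UTF-8 bytes to the stream.
writer.write(u'1\xbd"D')

encoded = raw.getvalue()
print(repr(encoded))
```

Note that the fix above fits Python 2, where sys.stdout accepts bytes. On Python 3 sys.stdout is already a text stream, so wrapping it the same way would likely fail; from Python 3.7 on, sys.stdout.reconfigure(encoding="utf-8") is the closer equivalent.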