I have been working on parsing this XML file in Python, and I finally figured out how to start pulling out the different attributes with ElementTree. Then I got this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\xbd' in position 212: illegal multibyte sequence
The feed is an eCommerce site's product feed. One of the product descriptions has some strange characters in it, which is causing the error. Is there a way to force Python to take the character in as is, so I can get the description value and move on? The site is quite large, so there will probably be more product descriptions that cause the same error, and getting the company to fix them would be really hard.
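To see which character the error message is pointing at, a quick check like the one below may help. It is only a sketch: the names `char`, `name`, and `gbk_ok` are made up for illustration, and it assumes the offending character is exactly the `u'\xbd'` reported in the traceback.

```python
import unicodedata

char = u'\xbd'  # the character named in the error message

# Look up the official Unicode name of the character
name = unicodedata.name(char)

# Check whether the gbk codec is able to encode it
try:
    char.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

print(name)
print(gbk_ok)
```

This should report that `u'\xbd'` is the one-half fraction sign (½) and that `gbk` has no encoding for it.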
Here is my code:
import xml.etree.ElementTree as etree

# Raw string so the backslashes in the Windows path are taken literally
tree = etree.parse(r'C:\Users\FOO\Desktop\catalog1.xml')
for node in tree.iter():
    # Print each node's field ID and value (Python 2 print statement)
    print node.attrib.get('TableFieldID'), node.attrib.get('Value')
Here is the part of the XML Feed that is causing the problem:
TableFieldID="caption" Value="Measures 4"W x 1½"D x 11"H. ~"/>
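Edit: I just noticed that CodeCademy converted the text in the snippet above to HTML, so the entities show up as literal characters. In the actual feed the inch marks are written as the HTML code for ", and I think that code is what is causing the problem.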
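One possible way to keep going past characters the console cannot show is to replace them before printing. This is a minimal sketch, not a definitive fix: the helper name `printable` is hypothetical, and it assumes the error comes from printing to a console whose encoding is `gbk`, as the traceback suggests. The sample caption is copied from the snippet above.

```python
import sys


def printable(value, encoding=None):
    """Return value with characters the target encoding lacks replaced by '?'."""
    if value is None:
        # attrib.get() returns None when the attribute is missing
        return u''
    # Fall back to the console encoding, then to ASCII, if none is given
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # 'replace' substitutes '?' for anything the codec cannot encode
    return value.encode(encoding, 'replace').decode(encoding)


caption = u'Measures 4"W x 1\xbd"D x 11"H.'
safe = printable(caption, 'gbk')
print(safe)
```

In the loop, this would mean printing `printable(node.attrib.get('Value'))` instead of the raw value. The value held in memory is unchanged, so the full description is still available for writing to a file with an encoding such as UTF-8.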