As I said above, I need to crawl a large number of Chinese pages encoded in UTF-8.
The solution I found on Google is:
Replace '\xa0' (the non-breaking space, &nbsp;) with an ordinary space:
print(item['detail'][i].replace(u'\xa0 ', u' '))
It still fails. The error output will probably look familiar to the experts here:
D:\Users\15806.DESKTOP-A9HK574\Anaconda\python.exe C:/Users/15806.DESKTOP-A9HK574/Desktop/工作站-代码/python项目/网页爬虫初步/e-book.py
Traceback (most recent call last):
  File "C:/Users/15806.DESKTOP-A9HK574/Desktop/工作站-代码/python项目/网页爬虫初步/e-book.py", line 57, in <module>
    system_write(getBook(i)[0],getBook(i)[1])
  File "C:/Users/15806.DESKTOP-A9HK574/Desktop/工作站-代码/python项目/网页爬虫初步/e-book.py", line 51, in system_write
    f.writelines(data)
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 0: illegal multibyte sequence
Process finished with exit code 1
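For reference, the same error can be reproduced without any crawling. This is only a minimal sketch using the standard library; the variable names are made up for illustration. It shows that the gbk codec has no mapping for the non-breaking space, while utf-8 encodes it without complaint.

```python
# U+00A0 (non-breaking space) is the character named in the traceback.
nbsp = '\xa0'

# Encoding it with gbk is expected to raise UnicodeEncodeError.
try:
    nbsp.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# utf-8 can represent every Unicode character, including this one.
utf8_bytes = nbsp.encode('utf-8')

print(gbk_ok, utf8_bytes)
```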
Note: you do not need to read the Chinese parts.
It is the same familiar error as before.
I have already replaced the offending \xa0, yet the error is still raised. The function that fails is as follows:
def system_write(title,page):
    road=str(title.replace(" ", "_")+'.txt')
    #尝试替换utf-8中gbk没有的\xa0字符为空格
    #Try to replace the \xa0 character, which exists in UTF-8 but not in GBK, with a space
    data=page.replace(u'\xa0 ', u' ')
    #print('part 1 can work')
    with open(road,'a+') as f:
        #f.writelines('\n')
        f.writelines(data)
        f.seek(0)
        cNames=f.readlines()
    print(road+' 已下载完成')
    #The message above means "download complete"
The output file does appear in the folder with the expected name, but it is empty when opened, which suggests the error is raised as soon as the Chinese text is written.
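One thing that might be worth trying, offered as an unconfirmed guess rather than a known fix: the pattern u'\xa0 ' in the function contains a trailing space, so it only matches a non-breaking space that is followed by an ordinary space, and a lone \xa0 survives the replace. In addition, open() is called without an encoding, so the platform default (gbk, according to the traceback) is used. The sketch below assumes UTF-8 output is acceptable; the name system_write_utf8 is hypothetical and not part of the original script.

```python
def system_write_utf8(title, page):
    # Hypothetical variant of system_write; the UTF-8 choice is an assumption.
    road = title.replace(" ", "_") + '.txt'
    # Replace every non-breaking space; note the pattern has no trailing space.
    data = page.replace('\xa0', ' ')
    # An explicit encoding avoids falling back to the platform default codec.
    with open(road, 'a', encoding='utf-8') as f:
        f.write(data)
    return road
```

Either change alone may be enough: with every \xa0 replaced, gbk no longer meets that character, and with encoding='utf-8' the character could be written even without the replace.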
Could someone who knows this please help?
[although I think you may not encounter this problem because you are using GBK anyway]
Finally:
My English is not very good, so this was written with Google Translate. Some passages may not read fluently; please guess at the meaning.
I have added an English translation for each Chinese comment.