Reading text file error on emoji

In the off-platform-project-generative-chatbot project for the **Build Chatbots with Python Skill Path**, the script that reads the file crashes because of an emoji:

import re

data_path = "./twitter-project/weather.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')

lines = [re.sub(r"(?:\@|https?\://)\S+", "", line).strip() for line in lines]
print(lines)

It returns this error:

line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f927' in position 94: character maps to <undefined>

I tried adding errors='ignore' to the open() call, but no luck:

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8', errors='ignore') as f:
  lines = f.read().split('\n')

lines = [re.sub(r"(?:\@|https?\://)\S+", "", line).strip() for line in lines]
print(lines)

Thank you for your help.

Hey Animusxcash,

I am relatively new to Python, but I tried to help you. The following is what I found regarding your question.

Source: python - How to open a text file that has emojis in it? - Stack Overflow
Author: Mark Tolonen


You are getting a UnicodeEncodeError, likely from your print statement. The file is being read and interpreted correctly, but you can only print characters that your console encoding and font actually support. The error indicates the character isn’t supported in the current encoding.

For example:

Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 10:35:05) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\U0001F681')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f681' in position 0: character maps to <undefined>

But print a character the terminal encoding supports, and it works:

>>> print('\U000000E0')
à

My console encoding was cp437, but if I use a Python IDE that supports UTF-8 encoding, then it works:

>>> print('\U0001f681')
🚁

You may or may not see the character correctly. You need to be using a font that supports the character; otherwise, you get some default replacement character.
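
If switching consoles is not an option, one workaround is to escape whatever the console cannot encode before printing. The following is only a minimal sketch: safe_for_console is a hypothetical helper name, and the sample lines are made up for illustration rather than taken from weather.txt. It relies on the standard "backslashreplace" error handler of str.encode.

```python
import sys

def safe_for_console(text, encoding=None):
    """Return text with characters the encoding cannot represent
    replaced by backslash escapes (hypothetical helper)."""
    # Fall back to UTF-8 if stdout has no usable encoding attribute.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Encoding with backslashreplace never raises UnicodeEncodeError;
    # decoding again gives a str that is safe to print in that encoding.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# Made-up sample data standing in for the lines read from the file.
sample_lines = ["Feeling sick today \U0001f927", "\u00e0 la carte"]
for line in sample_lines:
    print(safe_for_console(line))
```

On a cp437 console the emoji would be printed as the escape \U0001f927 instead of crashing, while à is printed normally. Note that errors='ignore' on open() only affects decoding the file, so it does not change what print() can encode. On Python 3.7 or later, calling sys.stdout.reconfigure(encoding="utf-8") before printing is another option, assuming the terminal itself can display UTF-8.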


Let me know if this was of any help!

Kind regards,

Benjamin

Hmm, thanks for this. I will look into the issue and how to fix it.