FAQ: Generating Text with Deep Learning - Preprocessing for seq2seq

This community-built FAQ covers the “Preprocessing for seq2seq” exercise from the lesson “Generating Text with Deep Learning”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Natural Language Processing

FAQs on the exercise Preprocessing for seq2seq

Hi, regarding the second exercise, Preprocessing for seq2seq: may I know why the input and target docs are processed with different methods? For instance,

input_docs.append(input_doc)
VS
target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
target_doc = " " + target_doc + " "
target_docs.append(target_doc)

Why isn't the target doc just appended with target_docs.append(target_doc), like the input?

Similarly,
for token in re.findall(r"[\w']+|[^\s\w]", input_doc):

VS

for token in target_doc.split():

Why not use .split() for both? I tried replacing the regex with .split() and printing input_doc, and what gets printed is the same as before. Is there any reason for the difference? Thanks.
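A small sketch may help show where the two approaches differ. The sample sentences and the helper name `tokenize` are made up for illustration and are not from the lesson. On raw text with punctuation attached to words, `.split()` and the regex give different tokens; once the target has been re-joined with single spaces, `.split()` recovers the same tokens the regex found, which is presumably why the plain call is enough on the target side.

```python
import re

# Pattern quoted in the thread: a run of word characters/apostrophes,
# or a single character that is neither whitespace nor a word character.
TOKEN_PATTERN = r"[\w']+|[^\s\w]"

def tokenize(text):
    """Hypothetical helper: return the regex tokens of `text`."""
    return re.findall(TOKEN_PATTERN, text)

input_doc = "We'll see!"  # made-up sample sentence
split_tokens = input_doc.split()
regex_tokens = tokenize(input_doc)

# Target side: tokenize once, then store the tokens joined by single spaces.
target_doc = " ".join(tokenize("¿Quién ganó?"))

print(split_tokens)        # ["We'll", 'see!']
print(regex_tokens)        # ["We'll", 'see', '!']
print(target_doc)          # ¿ Quién ganó ?
print(target_doc.split())  # ['¿', 'Quién', 'ganó', '?']
```

Printing input_doc itself looks the same either way because neither call modifies the string; only the token lists produced from it differ.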

I think this is because we need to separate the punctuation from the words. Not really sure, though.

What exactly does the regex pattern r"[\w']+|[^\s\w]" do in the script, i.e. in

target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))

where an example of target_doc could be: ¡Hemos ganado!

More specifically, I understand that [\w']+ captures runs of one or more word characters, but what does the | symbol before [^\s\w] do? I know [^\s\w] captures anything that's not a word character or a space.
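One way to see what the | contributes is to run each side of it on its own and then together. This is only an illustrative sketch using the example sentence from the question; the variable names are made up.

```python
import re

text = "¡Hemos ganado!"

# Left alternative alone: runs of word characters (and apostrophes).
words_only = re.findall(r"[\w']+", text)

# Right alternative alone: single non-space, non-word characters.
symbols_only = re.findall(r"[^\s\w]", text)

# With |, each position may match either alternative, so words and
# symbols both come back, in the order they appear in the text.
combined = re.findall(r"[\w']+|[^\s\w]", text)

print(words_only)    # ['Hemos', 'ganado']
print(symbols_only)  # ['¡', '!']
print(combined)      # ['¡', 'Hemos', 'ganado', '!']
```

Joining `combined` with spaces gives "¡ Hemos ganado !", so every punctuation mark ends up as its own token.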

The ultimate goal of this lesson is to find the number of tokens generated for the input (English) and the output (Spanish).
We need the number of unique tokens for English as well as for Spanish because we have to convert these tokens into vectors (a numerical form of the data). We may also create a one-hot encoding for each token.

input (English) → number of tokens (unique words) → encode the tokens into numbers (so each number denotes a word)

target (Spanish) → number of tokens (unique words) → encode the tokens into numbers
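The "encode the tokens into numbers" step could look roughly like the sketch below. The token set, the names `token_to_index` and `one_hot`, and the use of plain lists are assumptions made for illustration; the lesson may do this differently.

```python
# Made-up vocabulary standing in for the lesson's set of input tokens.
tokens = sorted({"We'll", "see", "."})

# Give every unique token an integer index.
token_to_index = {token: index for index, token in enumerate(tokens)}

def one_hot(token):
    """Hypothetical helper: a vector with a 1 at the token's index."""
    vector = [0] * len(token_to_index)
    vector[token_to_index[token]] = 1
    return vector

# Encode one tokenized sentence as a list of indices.
encoded = [token_to_index[token] for token in ["We'll", "see", "."]]

print(token_to_index)  # {'.': 0, "We'll": 1, 'see': 2}
print(encoded)         # [1, 2, 0]
print(one_hot("see"))  # [0, 0, 1]
```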
Some languages, like Spanish, use additional symbols and punctuation marks, so the number of tokens generated from an English sentence and from its Spanish translation may not be equal.
Hence, we use methods such as regular expressions to generate the tokens.

num_encoder_tokens = len(input_tokens)  # 18
num_decoder_tokens = len(target_tokens)  # 27

As we can see here, the total number of tokens for the target (Spanish) text is greater than for English.
One more thing: when we translate English into Spanish, the word order and the number of tokens may differ from the English.
For example: We will see (3 tokens) → Después veremos (2 tokens).
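To make the counting concrete, here is a minimal sketch on two made-up sentence pairs (not the lesson's data, so the totals are not the 18 and 27 above). It collects the unique tokens per language and also records the per-sentence token counts, which shows that a sentence and its translation need not have the same length.

```python
import re

TOKEN_PATTERN = r"[\w']+|[^\s\w]"

# Hypothetical (English, Spanish) pairs used only for illustration.
pairs = [
    ("We will see.", "Después veremos."),
    ("We won!", "¡Hemos ganado!"),
]

input_tokens, target_tokens = set(), set()
lengths = []  # (English token count, Spanish token count) per pair

for english, spanish in pairs:
    english_tokens = re.findall(TOKEN_PATTERN, english)
    spanish_tokens = re.findall(TOKEN_PATTERN, spanish)
    input_tokens.update(english_tokens)
    target_tokens.update(spanish_tokens)
    lengths.append((len(english_tokens), len(spanish_tokens)))

num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

print(lengths)             # [(4, 3), (3, 4)]
print(num_encoder_tokens)  # 6
print(num_decoder_tokens)  # 7
```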

Hi @georgeanton145263425, regarding your question: I can only answer it partially.

What does this pattern mean?
r"[\w']+|[^\s\w]"
Let's split it into parts and number them:
r" [\w']  +   |   [^\s\w] "
    (1)  (2) (3)    (4)

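(1) [\w'] uses square brackets, which represent a set of characters. What set?
That is indicated by \w, which stands for word characters: letters, digits, and the underscore. For ASCII text that is 0-9a-zA-Z_, and in Python 3 it also covers Unicode letters such as é.

(2) + is the "Kleene plus". It is a quantifier that matches one or more occurrences of the preceding pattern, in this case one or more occurrences of the characters in the set from (1).

(3) | stands for alternation: the regex matches either the expression to its left or the one to its right.

(4) [^\s\w]: the only unknowns so far are ^ and \s. Inside square brackets, a leading ^ negates the set (outside brackets it would anchor the beginning of the string). \s stands for any Unicode whitespace character. So [^\s\w] matches a single character that is neither whitespace nor a word character, such as a punctuation mark.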

Just use the syllabus to go back to regular expressions in chapter 2.

Now my question concerns the first part.

Why is there an apostrophe after the w in [\w']?
I didn't find that in the lesson or by googling it. I only know \W as the opposite of \w.

Can somebody help me out with the apostrophe?

Kind regards,

Manuel
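
On the apostrophe: it is not a modifier of \w. Inside the brackets it is simply a second, literal member of the set, so [\w'] means "a word character or an apostrophe". The sketch below, with a made-up sentence, compares the pattern with and without it; the likely intent is to keep contractions such as "didn't" together as one token, though that is an inference rather than something the lesson states.

```python
import re

text = "I didn't win."  # made-up sample sentence

# Apostrophe included in the set: the contraction stays in one piece.
with_apostrophe = re.findall(r"[\w']+|[^\s\w]", text)

# Apostrophe left out: it falls through to [^\s\w] and splits the word.
without_apostrophe = re.findall(r"\w+|[^\s\w]", text)

# \w also matches the underscore and accented letters in Python 3.
word_chars = re.findall(r"\w+", "José_1")

# A leading ^ inside brackets negates the set.
negated = re.findall(r"[^\s\w]", "a b!")

print(with_apostrophe)     # ['I', "didn't", 'win', '.']
print(without_apostrophe)  # ['I', 'didn', "'", 't', 'win', '.']
print(word_chars)          # ['José_1']
print(negated)             # ['!']
```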