There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.
If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.
Hi, regarding the 2nd lesson, preprocessing with seq2seq: may I know why different methods are used to process the input and the target? For instance,
why is the target doc not simply appended with append(target_doc), like the input?
Also similarly,
for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
VS
for token in target_doc.split():
Why not just use .split() for both? I tried replacing it and printing input_doc before and after the change, and the output is the same. Is there any reason for the difference? Thanks.
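A minimal sketch of where the two approaches differ, assuming (as the line quoted in the next question suggests) that target_doc has already been rebuilt with " ".join(re.findall(...)) before it is split. The sample sentence is made up for illustration.

```python
import re

PATTERN = r"[\w']+|[^\s\w]"

input_doc = "We'll see you later!"  # hypothetical sample sentence

# The regex separates punctuation from words; split() only cuts on whitespace.
regex_tokens = re.findall(PATTERN, input_doc)
split_tokens = input_doc.split()

print(regex_tokens)  # ["We'll", 'see', 'you', 'later', '!']
print(split_tokens)  # ["We'll", 'see', 'you', 'later!']

# Once the regex tokens are re-joined with single spaces,
# a plain split() recovers exactly the same tokens.
rejoined = " ".join(regex_tokens)
print(rejoined.split() == regex_tokens)  # True
```

Printing input_doc itself looks the same either way, because neither call modifies the string; the difference only shows up in the tokens that are collected.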
What exactly does the regex pattern r"[\w']+|[^\s\w]" do in the script, i.e. in
target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc)), where an example of target_doc could be: ¡Hemos ganado!
More specifically, I understand that [\w']+ captures word characters that occur one or more times, but what does the | symbol before [^\s\w] do? I know that [^\s\w] captures anything that is not a word character or a space.
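As a rough illustration, the two halves of the pattern can be run separately and then together on the example sentence from the question. The variable names below are only for this sketch.

```python
import re

target_doc = "¡Hemos ganado!"

# Left alternative: runs of word characters (and apostrophes).
words_only = re.findall(r"[\w']+", target_doc)
print(words_only)  # ['Hemos', 'ganado']

# Right alternative: single characters that are neither whitespace nor word characters.
symbols_only = re.findall(r"[^\s\w]", target_doc)
print(symbols_only)  # ['¡', '!']

# With | the regex tries the left alternative first, then the right one,
# at each position, so words and punctuation come out in their original order.
both = re.findall(r"[\w']+|[^\s\w]", target_doc)
print(both)  # ['¡', 'Hemos', 'ganado', '!']

# Joining with spaces pads the punctuation so it becomes its own token.
print(" ".join(both))  # ¡ Hemos ganado !
```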
The ultimate goal of this lesson is to find the number of tokens generated in the input (English) and the output (Spanish).
We need to find the maximum number of tokens for English as well as for Spanish because we have to convert these tokens into vectors (a numerical form of the data). We may also create a one-hot encoding for each token.
input (English) → max no. of tokens (unique words) → encode the tokens into numbers (so each number denotes a word)
target (Spanish) → max no. of tokens (unique words) → encode the tokens into numbers
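The pipeline above could be sketched roughly as follows for the input side. The sample sentences and the names input_tokens, input_features_dict, and one_hot are assumptions made for this illustration, not necessarily what the lesson uses.

```python
import re

PATTERN = r"[\w']+|[^\s\w]"

input_docs = ["We will see.", "We won!"]  # made-up sample sentences

# Collect the unique tokens across all input sentences.
input_tokens = set()
for doc in input_docs:
    for token in re.findall(PATTERN, doc):
        input_tokens.add(token)

# Sort for a stable order, then give every token an integer index.
input_tokens = sorted(input_tokens)
input_features_dict = {token: index for index, token in enumerate(input_tokens)}

print(input_tokens)         # ['!', '.', 'We', 'see', 'will', 'won']
print(input_features_dict)  # {'!': 0, '.': 1, 'We': 2, 'see': 3, 'will': 4, 'won': 5}

def one_hot(token):
    """Return a list of zeros with a single 1 at the token's index."""
    vector = [0] * len(input_tokens)
    vector[input_features_dict[token]] = 1
    return vector

print(one_hot("see"))  # [0, 0, 0, 1, 0, 0]
```

The same steps would be repeated for the target (Spanish) sentences with their own token set.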
Some languages, like Spanish, use additional symbols or punctuation marks, so the number of tokens generated from an English sentence and the number generated from its Spanish translation may not be equal.
Hence, we use methods such as regular expressions to generate the tokens.
As we can see here, the total number of tokens for the target (Spanish) text is greater than for the English text.
One more thing: when we translate English into Spanish, the word order and the number of words may differ from the English.
E.g.: We will see (3 tokens) → Después veremos (2 tokens).
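A small sketch that counts tokens on both sides. The first pair is the example above; the second pair is added here as an illustration, using the Spanish sentence quoted earlier in the thread.

```python
import re

PATTERN = r"[\w']+|[^\s\w]"

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(PATTERN, text)

pairs = [
    ("We will see", "Después veremos"),
    ("We won!", "¡Hemos ganado!"),  # illustrative pairing
]

for english, spanish in pairs:
    en_tokens = tokenize(english)
    es_tokens = tokenize(spanish)
    print(len(en_tokens), en_tokens, "|", len(es_tokens), es_tokens)

# 3 ['We', 'will', 'see'] | 2 ['Después', 'veremos']
# 3 ['We', 'won', '!'] | 4 ['¡', 'Hemos', 'ganado', '!']
```

Either side can end up longer, which is why the token counts are computed separately for the input and the target.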
Hi @georgeanton145263425, regarding your question:
I can only partially answer it.
What does this pattern mean?
r"[\w']+|[^\s\w]"
Let's split it up into parts and name those parts:
r"  [\w']   +    |   [^\s\w]  "
     (1)   (2)  (3)    (4)
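(1) [\w'] uses square brackets, which represent a set of characters. Which set?
That is indicated by \w, which stands for word characters: letters, the digits 0-9, and the underscore. In Python 3 this includes accented letters such as é, not only a-zA-Z.
(2) + is the “Kleene plus”. It is a quantifier that matches one or more occurrences of the preceding pattern, in this case one or more characters from the set in (1).
(3) | stands for alternation: the regex matches either the pattern on its left or the pattern on its right.
(4) [^\s\w]: the only unknowns so far are ^ and \s. As the first character inside square brackets, ^ negates the set (outside brackets it would mean “beginning”). \s stands for any Unicode whitespace character. So (4) matches a single character that is neither whitespace nor a word character, such as a punctuation mark.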
Just use the syllabus to go back to regular expressions in chapter 2.
Now my question concerns the first part.
Why is there an apostrophe after the w in [\w']?
I didn’t find that in the lesson or by googling it. I only know \W as the opposite of \w.
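One likely reading, sketched below: inside square brackets the apostrophe is just a literal character added to the set, unrelated to \W. Including it presumably keeps contractions such as "don't" together as one token. The sample sentence is made up for illustration.

```python
import re

sentence = "I don't know."  # hypothetical sample sentence

# Apostrophe included in the set: the contraction stays in one piece.
with_apostrophe = re.findall(r"[\w']+|[^\s\w]", sentence)
print(with_apostrophe)  # ['I', "don't", 'know', '.']

# Apostrophe left out: it is matched by [^\s\w] and splits the word in three.
without_apostrophe = re.findall(r"\w+|[^\s\w]", sentence)
print(without_apostrophe)  # ['I', 'don', "'", 't', 'know', '.']
```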