Machine Translation → Off-Platform Projects

Hello, I am going through the code given by Codecademy for the Machine Translation project, and I have a couple of questions:

  1. `target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))` is separating the sentences using RegEx, but to me it reads as: find all [one or more letters or apostrophes (`[\w']+`)] OR [`^` beginning with a space and then a word]. Basically I read it as gibberish. How did it separate the document into sentences?
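For what it's worth, the pattern splits words from punctuation (tokenization), not sentences, and inside a character class `[...]` the `^` means "not" rather than "beginning of string". A minimal sketch of what the two alternatives actually match (the sample sentence is made up for illustration):

```python
import re

# The pattern has two alternatives joined by |:
#   [\w']+   -> one or more word characters or apostrophes (a whole word)
#   [^\s\w]  -> a single character that is neither whitespace nor a word
#               character (i.e. punctuation); inside [...] the ^ negates
tokens = re.findall(r"[\w']+|[^\s\w]", "Don't worry, be happy!")
print(tokens)            # ["Don't", 'worry', ',', 'be', 'happy', '!']
print(" ".join(tokens))  # Don't worry , be happy !
```

Joining the tokens back with spaces is what puts a space around each punctuation mark, so every token later gets its own slot in the one-hot matrices.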

  2. ```
     I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2
     To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
     2022-02-07 12:51:59.268826: I tensorflow/core/common_runtime/] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
     2022-02-07 12:52:00.885362: I tensorflow/compiler/mlir/] None of the MLIR Optimization Passes are enabled (registered 2)
     ```

I am getting these messages when I run the code, and I don't understand what they mean or what to do about them. My code still runs: I can see the fractions of the epochs completing, but according to the Codecademy prompt I should see translations of the words appear, and there's no output. Here is the link to the prompt; I am stuck at "The Test Model and Function" #3.
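As an aside: lines prefixed with `I` are informational log messages from TensorFlow, not errors, so they don't stop the code from running. If they are cluttering the output, a common way to quiet them (assuming TensorFlow 2.x, which reads the `TF_CPP_MIN_LOG_LEVEL` environment variable) is a sketch like:

```python
import os

# Must be set before TensorFlow is imported:
#   '0' = all messages, '1' = hide INFO, '2' = hide INFO + WARNING,
#   '3' = hide INFO + WARNING + ERROR
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# import tensorflow as tf  # import TensorFlow *after* setting the variable
```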

  3. How do you guys stay motivated when you get stuck while learning how to code? I love Codecademy and it's the best way of learning for me, but I get so frustrated that I can't go to the equivalent of a teacher's office hours and really sit down with my computer and show what's not working. Searching on Google is just not the same, especially when you're still a beginner. Any tips on navigating that?
```python
import numpy as np  # numpy does linear algebra, random numbers, everything mathematical
import re  # RegEx, Regular Expressions; allows us to search for patterns in text

# Importing our translations
# for example: "spa.txt" or "spa-eng/spa.txt"
data_path = "por.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:  # 'r' means read-only mode
    lines = f.read().split('\n')  # it splits on '\n', i.e. each line ("enter")

# Building empty lists to hold sentences
input_docs = []
target_docs = []
# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()

# Adjust the number of lines so that
# preprocessing doesn't take too long for you
for line in lines[:10000]:  # teacher says you can go up to 123,000 lines
    # Input and target sentences are separated by tabs
    # because you need to separate them
    input_doc, target_doc = line.split('\t')[:2]
    # Appending each input sentence to input_docs
    input_docs.append(input_doc)
    # Splitting words from punctuation
    target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
    # Redefine target_doc below
    # and append it to target_docs:
    target_doc = '<START> ' + target_doc + ' <END>'
    target_docs.append(target_doc)

    # Now we split up each sentence into words
    # and add each unique word to our vocabulary set
    for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
        if token not in input_tokens:
            input_tokens.add(token)
    for token in target_doc.split():
        if token not in target_tokens:
            target_tokens.add(token)

input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))

# Create num_encoder_tokens and num_decoder_tokens:
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

max_encoder_seq_length = max(
    [len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
max_decoder_seq_length = max(
    [len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs])

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):
    for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
        # Assign 1. for the current line, timestep, & word
        # in encoder_input_data:
        encoder_input_data[line, timestep, input_features_dict[token]] = 1.
    for timestep, token in enumerate(target_doc.split()):
        decoder_input_data[line, timestep, target_features_dict[token]] = 1.
        if timestep > 0:
            decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.
```
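To make those three zero-tensors concrete, here is a tiny self-contained sketch of the same one-hot scheme on a made-up two-sentence "corpus" (the variable names mirror the script above, but the data is invented for illustration):

```python
import numpy as np

docs = ["hi .", "go !"]  # two tiny tokenized "sentences"
tokens = sorted({t for d in docs for t in d.split()})  # ['!', '.', 'go', 'hi']
features_dict = {token: i for i, token in enumerate(tokens)}

max_seq_length = max(len(d.split()) for d in docs)  # 2
# shape: (number of docs, longest sentence, vocabulary size)
one_hot = np.zeros((len(docs), max_seq_length, len(tokens)), dtype='float32')

for line, doc in enumerate(docs):
    for timestep, token in enumerate(doc.split()):
        one_hot[line, timestep, features_dict[token]] = 1.

print(one_hot[0, 0])  # "hi" -> [0. 0. 0. 1.]
```

Each word becomes a vector that is all zeros except for a single 1. at that word's index, which is exactly what the encoder and decoder matrices in the script hold, one sentence per row.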
