Parsing with regular expressions Project

I’m struggling to finish the following project (https://www.codecademy.com/paths/data-science/tracks/natural-language-processing-dsp/modules/parsing-with-regular-expressions-dsp/projects/nlp-regex-parsing-project)

I’m stuck on question 12:

  • Loop through each part-of-speech tagged sentence in pos_tagged_text and noun phrase chunk each sentence using your RegexpParser’s .parse() method. Append the result to np_chunked_text.

And my code is like this:

from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

# import text of choice here
text = open("dorian_gray.txt", encoding='utf-8').read().lower()

# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text
#print(single_word_tokenized_sentence)

# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = []

# create a for loop through each word tokenized sentence here
for word in word_tokenized_text:
  # part-of-speech tag each sentence and append to list of pos-tagged sentences here
  single_pos_sentence = pos_tagged_text.append(word)

# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[100]
#print(single_pos_sentence)

# define noun phrase chunk grammar here
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN>}"

# create noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(np_chunk_grammar)

# define verb phrase chunk grammar here
vp_chunk_grammar = "VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

# create verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = []
vp_chunked_text = []

# create a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
  # chunk each sentence and append to lists here

  np_chunked_text\
  .append(np_chunk_parser\
  .parse(pos_tagged_sentence))
  vp_chunked_text\
  .append(vp_chunk_parser\
  .parse(pos_tagged_sentence))

And the error is:

Traceback (most recent call last):
  File "script.py", line 53, in <module>
    .parse(pos_tagged_sentence))
  File "/usr/local/lib/python3.6/dist-packages/nltk/chunk/regexp.py", line 1208, in parse
    chunk_struct = parser.parse(chunk_struct, trace=trace)

Can anyone see what the problem is?

Many thanks in advance

Looks like you didn’t define your noun phrase chunk grammar.

Actually, I defined both noun and verb phrase chunk grammars. I guess they didn’t come through properly in the post, but they’re defined in command.py.

Well, the error is thrown from trying to use the RegexpParser, but it’s hard to tell exactly what the issue is without seeing your exact code formatting.

Try hitting the code formatting button in the post editor and pasting your exact code inside triple backticks so we can read it better.

from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

# import text of choice here
text = open("dorian_gray.txt", encoding='utf-8').read().lower()

# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text

#print(single_word_tokenized_sentence)

# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = []

# create a for loop through each word tokenized sentence here
for word in word_tokenized_text:
  # part-of-speech tag each sentence and append to list of pos-tagged sentences here
  single_pos_sentence = pos_tagged_text.append(word)
  

# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[100]
print(single_pos_sentence)


# define noun phrase chunk grammar here
np_chunk_grammar = "NP:{<DT>?<JJ>*<NN>}"

# create noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(np_chunk_grammar)

# define verb phrase chunk grammar here
vp_chunk_grammar = "VP:{<DT>?<JJ>*<NN><VB\
.*><RB.?>?}"

# create verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = []

vp_chunked_text = []


# create a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
  # chunk each sentence and append to lists
  np_chunked_text\
  .append(np_chunk_parser\
  .parse(pos_tagged_sentence))
  vp_chunked_text\
  .append(vp_chunk_parser\
  .parse(pos_tagged_sentence))

Thanks! I figured it out once I was able to run the code myself and see the full error.

You didn’t use nltk's pos_tag() function to tag your words with their part of speech back in step 5 (check the hint if you need some help). Instead, your loop appended each untagged “word” element of word_tokenized_text straight into pos_tagged_text and assigned the return value of .append() (which is None) to a new variable called single_pos_sentence.

When you go back to fix this, I’d suggest a couple of changes for clarity:

  • Maybe think of a better way to loop through the elements and append them to pos_tagged_text. It doesn’t really make sense to reassign the variable single_pos_sentence on each iteration of the loop.
  • I’d also consider changing word to sentence, just for readability, since you are looping through each word tokenized sentence in word_tokenized_text.
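
Putting those suggestions together, here is a minimal sketch of what that tagging loop could look like (same variable names as the project’s starter code; everything else in your script can stay as it is):

from nltk import pos_tag

# create a list to hold part-of-speech tagged sentences
pos_tagged_text = []

# loop through each word tokenized sentence...
for word_tokenized_sentence in word_tokenized_text:
  # ...and tag it with pos_tag() before appending, so every element of
  # pos_tagged_text is a list of (word, tag) tuples that .parse() can work with
  pos_tagged_text.append(pos_tag(word_tokenized_sentence))

# pick out a single tagged sentence once, after the loop
single_pos_sentence = pos_tagged_text[100]
print(single_pos_sentence)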

Happy coding and good luck with NLP!


It was missing pos_tag, indeed!

Thank you very much! :)

Hello!

I am also sort of stuck here, at step 15 actually. It does not return any error though. It just doesn’t return what I’m expecting, which is the 30 most common VP chunks, as it does for the 30 most common NP chunks. All I get is a pair of square brackets.

I have read my code over and over again and I cannot find any problem. I’ve also tried to inspect the file where the counter is built, and I haven’t found anything either, but I don’t fully understand it.

So, here is my code:

from nltk import pos_tag, RegexpParser
from tokenize_words import word_sentence_tokenize
from chunk_counters import np_chunk_counter, vp_chunk_counter

# import text of choice here
text = open("dorian_gray.txt",encoding='utf-8').read().lower()

# sentence and word tokenize text here
word_tokenized_text = word_sentence_tokenize(text)

# store and print any word tokenized sentence here
single_word_tokenized_sentence = word_tokenized_text[4]
print(single_word_tokenized_sentence)

# create a list to hold part-of-speech tagged sentences here
pos_tagged_text = []

# create a for loop through each word tokenized sentence here
for word_tokenized_sentence in word_tokenized_text:
  # part-of-speech tag each sentence and append to list of pos-tagged sentences here
  pos_tagged_text.append(pos_tag(word_tokenized_sentence))

# store and print any part-of-speech tagged sentence here
single_pos_sentence = pos_tagged_text[4]
print(single_pos_sentence)

# define noun phrase chunk grammar here
np_chunk_grammar = "NP:{<DT>?<JJ>*<NN>}"

# create noun phrase RegexpParser object here
np_chunk_parser = RegexpParser(np_chunk_grammar)

# define verb phrase chunk grammar here
vp_chunk_grammar = "VP:{<DT>?<JJ>*<NN><VP><RB>?}"

# create verb phrase RegexpParser object here
vp_chunk_parser = RegexpParser(vp_chunk_grammar)

# create a list to hold noun phrase chunked sentences and a list to hold verb phrase chunked sentences here
np_chunked_text = []
vp_chunked_text = []

# create a for loop through each pos-tagged sentence here
for pos_tagged_sentence in pos_tagged_text:
  # chunk each sentence and append to lists here
  np_chunked_text.append(np_chunk_parser.parse(pos_tagged_sentence))
  vp_chunked_text.append(vp_chunk_parser.parse(pos_tagged_sentence))

# store and print the most common NP-chunks here
most_common_np_chunks = np_chunk_counter(np_chunked_text)
print(most_common_np_chunks)

# # store and print the most common VP-chunks here
most_common_vp_chunks = vp_chunk_counter(vp_chunked_text)
print(most_common_vp_chunks)

And here is what I get:

['those', 'who', 'find', 'ugly', 'meanings', 'in', 'beautiful', 'things', 'are', 'corrupt', 'without', 'being', 'charming', '.']
[('those', 'DT'), ('who', 'WP'), ('find', 'VBP'), ('ugly', 'JJ'), ('meanings', 'NNS'), ('in', 'IN'), ('beautiful', 'JJ'), ('things', 'NNS'), ('are', 'VBP'), ('corrupt', 'JJ'), ('without', 'IN'), ('being', 'VBG'), ('charming', 'VBG'), ('.', '.')]
[((('i', 'NN'),), 962), ((('henry', 'NN'),), 200), ((('lord', 'NN'),), 197), ((('life', 'NN'),), 170), ((('harry', 'NN'),), 136), ((('dorian', 'JJ'), ('gray', 'NN')), 127), ((('something', 'NN'),), 126), ((('nothing', 'NN'),), 93), ((('basil', 'NN'),), 85), ((('the', 'DT'), ('world', 'NN')), 70), ((('everything', 'NN'),), 69), ((('anything', 'NN'),), 68), ((('hallward', 'NN'),), 68), ((('the', 'DT'), ('man', 'NN')), 61), ((('the', 'DT'), ('room', 'NN')), 60), ((('face', 'NN'),), 58), ((('the', 'DT'), ('door', 'NN')), 56), ((('love', 'NN'),), 55), ((('art', 'NN'),), 52), ((('course', 'NN'),), 51), ((('the', 'DT'), ('picture', 'NN')), 46), ((('the', 'DT'), ('lad', 'NN')), 45), ((('head', 'NN'),), 44), ((('round', 'NN'),), 44), ((('hand', 'NN'),), 44), ((('sibyl', 'NN'),), 41), ((('the', 'DT'), ('table', 'NN')), 40), ((('the', 'DT'), ('painter', 'NN')), 38), ((('sir', 'NN'),), 38), ((('a', 'DT'), ('moment', 'NN')), 38)]

How come? Can someone help me, please?
Thanks!

Ok, I found the problem! It was the regex:
I had

VP:{<DT>?<JJ>*<NN><VP><RB>?}

instead of

VP:{<DT>?<JJ>*<NN><VB><RB>?}

VP instead of VB

bah


Yes, I was just looking through this and noticed that.

Ideally, however, your verb phrase would look something like this:

"VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}"

Allowing for different types of verbs and making the adverb optional.
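
To make that concrete, here is a quick sketch with a hand-tagged toy sentence (the sentence is just made up for illustration, not from the project text):

from nltk import RegexpParser

vp_chunk_parser = RegexpParser("VP: {<DT>?<JJ>*<NN><VB.*><RB.?>?}")

# 'ran' is tagged VBD, which a plain <VB> would not match but <VB.*> does
toy_sentence = [("the", "DT"), ("dog", "NN"), ("ran", "VBD"), ("quickly", "RB")]
print(vp_chunk_parser.parse(toy_sentence))
# the whole toy sentence comes back chunked as a single VP subtree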


Thank you! Actually, that’s what I was about to ask, now that I looked at the hint.
So:

  • does the .* “allow for different types of verbs” tense-wise, or also verbs with apostrophes (like “I’m”), for example? Does it stand for any character, or for any tag or variant of a tag?

  • in the case of RB, I thought it was already optional with the ? outside the <>, like the determiner, right?

I think that I’ll get it as soon as I understand the . part.

Sorry, you are right, your adverb was already optional. Totally glossed over your ? 🤦

In regex, the . is a wildcard that represents any single character (letter, number, symbol or whitespace).
The * is a quantifier that matches the preceding character 0 or more times.

So <VB.*> allows for 0 or more of any character after VB, while <RB.?> allows for 0 or 1 character after RB. These are both there to catch any type of verb or adverb tag.

By “type”, I’m referring to base form, tense, 3rd person, etc.

Here is the link to the P.O.S. tags that the lesson is referring to:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

As you can see, there are numerous different verb and adverb tags that might slip through your chunking if you only had <VB><RB>.
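
If it helps, you can check this against the tags themselves with plain re, since each <...> in the chunk grammar is essentially a regex that has to match the whole tag (just a quick sketch to illustrate the . and * behaviour):

import re

# <VB> only matches the bare tag VB, while <VB.*> matches anything starting with VB
for tag in ["VB", "VBD", "VBZ", "VBG"]:
  print(tag, bool(re.fullmatch("VB", tag)), bool(re.fullmatch("VB.*", tag)))

# <RB.?> allows at most one extra character after RB, which still covers RBR and RBS
for tag in ["RB", "RBR", "RBS"]:
  print(tag, bool(re.fullmatch("RB.?", tag)))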


Ok, @el_cocodrilo, just one more question: why * in the case of VB and ? in the case of RB?

Excellent question. Honestly, I think it would work fine either way for both VB and RB. I’m not sure why Codecademy chose to do VB with the * and RB with the ?. In fact, in this exercise, they use VB.?:
https://www.codecademy.com/courses/natural-language-processing/lessons/nlp-regex-parsing-intro/exercises/chunk-filtering

On the other hand, let’s say that you want to match any type of noun. In that case, you have NN, NNS, NNP, and NNPS. This is a circumstance where you would need <NN.*> rather than <NN.?>, since there may be more than one character after NN.
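
The same kind of quick check shows the difference for the noun tags (again just a sketch with plain re):

import re

# NNPS has two characters after NN, so NN.? misses it while NN.* catches all four
for tag in ["NN", "NNS", "NNP", "NNPS"]:
  print(tag, bool(re.fullmatch("NN.?", tag)), bool(re.fullmatch("NN.*", tag)))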

Regardless, it is always good to brush up on your regex. I know I sometimes have to refresh since I just don’t use it often enough to remember how to do some of the more complicated combinations.


Right! Thank you, @el_cocodrilo , you’ve been very helpful!