Chatbot Bag of Words

Hello everyone! I would like some input from anyone that is available. Anything helps.
I am trying to understand the Bag of Words features vector topic. I don’t understand how to read a word tree either.
I don’t think sufficient explanation was given on this topic and the diagram just made it even more confusing and challenging to understand.
I don’t understand how the words in the diagram were numbered and ordered, or what the numbers refer to. See the image below.

image

I had to go on YouTube to try to understand what the numbers were representing. I understand the 0’s and 1’s aspect of the vector (representing frequency) but I don’t understand the 0123456 part of it. I thought that the words were in alphabetical order but then I quickly realized they weren’t. These concepts are completely new to me and the only time I have ever heard of a vector was when I was learning about forces in my physics class. For more detail on the topic, please see the link.

https://www.codecademy.com/paths/build-chatbots-with-python/tracks/retrieval-based-chatbots/modules/language-and-topic-modeling-chatbots/lessons/language-model-bag-of-words/exercises/bow-vectors-iii

Pretty sure the 0123456 represents the number value of the words. I could be wrong though. We’ll see what other people have to say :joy:

What do you mean by “number value of the words”?

It’s like a mathematical function, each word has its own value. For example, if I had the words lemon, banana, apple, they wouldn’t be numbered in alphabetical order. Instead, 0 would be lemon, banana 1, and apple 2.
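To make that concrete, here is a minimal sketch, assuming each new word simply gets the next free number in the order it is first seen (the variable names are made up for illustration):

```python
# Assumption: indices are handed out in order of first appearance.
words = ["lemon", "banana", "apple"]

word_to_index = {}
for word in words:
    if word not in word_to_index:
        # The next free index equals the number of words stored so far.
        word_to_index[word] = len(word_to_index)

print(word_to_index)  # {'lemon': 0, 'banana': 1, 'apple': 2}

# Alphabetical numbering would have produced a different mapping:
alphabetical = {word: i for i, word in enumerate(sorted(words))}
print(alphabetical)  # {'apple': 0, 'banana': 1, 'lemon': 2}
```

So the numbers only say where a word sits in the collection, not anything about spelling or meaning.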

Hey @adrianburton15130151.

It’s just as @wafflejz said. Those numbers inside the features dictionary represent an index value. This image is a .gif image, right? You can see the animation where it goes through the sentence ‘Help my fly fish fly away’. It’s a features dictionary storing numbered indices that are later used to access the ‘Vector of Test Document’.

Think about it this way:
You have a Python dictionary as follows:

features_dictionary = { 'all': 0, 'my': 1, 'fish': 2, 'fly': 3, 'away': 4, 'help': 5, 'me': 6, }

and then you use this features_dictionary to access the vector created. So, if you found the word ‘fish’, then you need to increase the counter value in the vector, right?

vector = [0, 0, 0, 0, 0, 0, 0]
vector[features_dictionary['fish']] += 1

See how features_dictionary is being used to access the index value of vector array?
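Putting the pieces together for the whole test sentence, a rough sketch could look like this. The function name `text_to_bow_vector` and the lowercase-and-split tokenizing are my own assumptions for illustration, not necessarily what the lesson uses:

```python
features_dictionary = {'all': 0, 'my': 1, 'fish': 2, 'fly': 3,
                       'away': 4, 'help': 5, 'me': 6}

def text_to_bow_vector(text, features_dictionary):
    """Count how often each known word appears in text (hypothetical helper)."""
    # One counter slot per word in the features dictionary, all starting at 0.
    vector = [0] * len(features_dictionary)
    # Assumption: lowercasing and splitting on whitespace is enough tokenizing here.
    for word in text.lower().split():
        if word in features_dictionary:  # words not in the dictionary are skipped
            vector[features_dictionary[word]] += 1
    return vector

print(text_to_bow_vector("Help my fly fish fly away", features_dictionary))
# [0, 1, 1, 2, 1, 1, 0]
```

The 2 at index 3 is there because ‘fly’ appears twice, and the 0’s at index 0 and 6 are there because ‘all’ and ‘me’ never appear in the test sentence.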

I hope that helps understand this better.

hi @goku-kun
Thank you, I can tell that you really put effort into your explanation but something just isn’t clicking.

In “All my fish fly away help me” the 0123456 are just the index numbers for the words in the sentence. That is noted.
And in this code below:
vector = [0, 0, 0, 0, 0, 0, 0]
vector[features_dictionary['fish']] += 1
You are adding a value of one to the position where fish would be placed, which is index 2. Also noted.

OOHHHHH!
I get it! So in the gif they are actually comparing the two sentences! I thought that the first sentence and the numbers beneath it somehow turned into the second sentence and the numbers for the second sentence were for the first one. I don’t know how I came to that conclusion.
Thank you SO MUCH @goku-kun and @wafflejz, you helped me out a lot! :slight_smile:

@zoe.bachman
Maybe I’m just stupid, tired or both but I feel like this entire topic needs more explanatory content in the passage before the instructions to actually explain what is being taught. Many of us are seeing these completely foreign concepts for the first time in our lives. It’s like a new language and I have no idea what I’m doing. But maybe that’s just me.

Rather than thinking about it as two sentences, I’d suggest thinking of one of them as a word collection which stores each word as a key and its index in the vector list as the value. So, if you have a paragraph you want to analyze and you want to count how many times each word is used in that paragraph, you can look up the index via the word collection and then increment the value at that index in the vector list.

Basically, the word collection is a bucket which holds a number of words and their corresponding indices in the vector list.
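One way to convince yourself of this is to go the other direction and read a finished vector back through the word collection. This is only an illustrative sketch; `index_to_word` and `counts` are names I made up:

```python
features_dictionary = {'all': 0, 'my': 1, 'fish': 2, 'fly': 3,
                       'away': 4, 'help': 5, 'me': 6}

# Counts for the test sentence "Help my fly fish fly away".
vector = [0, 1, 1, 2, 1, 1, 0]

# Flip the word collection around: index -> word instead of word -> index.
index_to_word = {index: word for word, index in features_dictionary.items()}

# Pair every position in the vector with the word that owns that position.
counts = {index_to_word[i]: count for i, count in enumerate(vector)}
print(counts)
# {'all': 0, 'my': 1, 'fish': 1, 'fly': 2, 'away': 1, 'help': 1, 'me': 0}
```

Each position in the vector belongs to exactly one word, and the word collection is the only thing that tells you which one.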

I understand completely now, thank you.

Happy to pass on your feedback to the content creators!
