Markov Chain not randomizing output


#1

Hello. After much tinkering, I’ve finally been able to get the MarkovChain module to produce output from the HTML that I parsed. I am collecting lyrics from an album to use as the source for my chain. When I run MarkovChain on the text returned by fetch_words() in fetch_data.py, it grabs 20 consecutive words and prints them in their original order instead of producing a random sequence of 20 words.

My run.py file:

from markov_python.cc_markov import MarkovChain
import fetch_data


url = 'http://www.darklyrics.com/lyrics/metallica/andjusticeforall.html#2'
source = fetch_data.fetch_words(url)


markov = MarkovChain(3)
markov.add_string(source)

mc = markov.generate_text()

print(mc)


My fetch_data.py:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def fetch_words(url):

    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    text = soup.find("div", {"class": "lyrics"})
    data = text.get_text(" ", strip=True)
    lyrics = str(data)
    return lyrics


This outputs:

['the', 'blatant', 'disarray', 'disfigure', 'the', 'public', "eye's", 'disgrace', 'defying', 'common', 'place', 'unending', 'paper', 'chase', 'unending', 'deafening', 'painstaking', 'reckoning', 'this', 'vertigo']

These words are all in the same order as they appear in the song. I don’t know how to modify cc_markov.py to make it randomize the output. Does anyone know what to do? Thank you!

The cc_markov.py code: (obviously not mine)

import re
import random
from collections import defaultdict, deque

"""
Codecademy Pro Final Project supplementary code

Markov Chain generator
  This is a text generator that uses Markov Chains to generate text
  using a uniform distribution.

  num_key_words is the number of words that compose a key (suggested: 2 or 3)
"""

class MarkovChain:

  def __init__(self, num_key_words=2):
    self.num_key_words = num_key_words
    self.lookup_dict = defaultdict(list)
    self._punctuation_regex = re.compile(r'[,.!;?:\-\[\]\n]+')
    self._seeded = False
    self.__seed_me()

  def __seed_me(self, rand_seed=None):
    if self._seeded is not True:
      try:
        if rand_seed is not None:
          random.seed(rand_seed)
        else:
          random.seed()
        self._seeded = True
      except NotImplementedError:
        self._seeded = False

  """
  " Build Markov Chain from data source.
  " Use add_file() or add_string() to add the appropriate format source
  """
  def add_file(self, file_path):
    content = ''
    with open(file_path, 'r') as fh:
      self.__add_source_data(fh.read())

  def add_string(self, str):
    self.__add_source_data(str)

  def __add_source_data(self, str):
    clean_str = self._punctuation_regex.sub(' ', str).lower()
    tuples = self.__generate_tuple_keys(clean_str.split())
    for t in tuples:
      self.lookup_dict[t[0]].append(t[1])

  def __generate_tuple_keys(self, data):
    if len(data) < self.num_key_words:
      return

    for i in range(len(data) - self.num_key_words):
      yield [ tuple(data[i:i+self.num_key_words]), data[i+self.num_key_words] ]

  """
  " Generates text based on the data the Markov Chain contains
  " max_length is the maximum number of words to generate
  """
  def generate_text(self, max_length=20):
    context = deque()
    output = []
    if len(self.lookup_dict) > 0:
      self.__seed_me(rand_seed=len(self.lookup_dict))

      idx = random.randint(0, len(self.lookup_dict)-1)
      chain_head = list(self.lookup_dict.keys())[idx]
      context.extend(chain_head)

      while len(output) < (max_length - self.num_key_words):
        next_choices = self.lookup_dict[tuple(context)]
        if len(next_choices) > 0:
          next_word = random.choice(next_choices)
          context.append(next_word)
          output.append(context.popleft())
        else:
          break
      output.extend(list(context))
    return output

#2

Hey, did you figure this out? Just wondering. I can see myself running into this issue as well. Once I get my MC to generate SOMETHING…


#3

If I run the program over and over again, I will occasionally get something randomized. However, more often than not, it will spit out words in the correct order as they appear in the song. I’m still stumped.


#4

OP has probably moved on by now but I also had this problem and my answer might be useful for others.

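Basically you need to lower num_key_words to 1. Your run.py passes 3 (MarkovChain(3)), and the class default is 2. Take your example line with two-word keys: the program looks for words that follow ‘the blatant’, which can only be ‘disarray’, and then for words that follow ‘blatant disarray’, which can only be ‘disfigure’. Three-word keys are even more restrictive. The starting key is picked at random, but after that almost every key has exactly one possible next word, so the output mostly replays the song from wherever it started.
If you change num_key_words to 1, it will look for words that follow ‘the’, which could be ‘blatant’ or ‘public’, and words that follow ‘unending’, which could be ‘paper’ or ‘deafening’.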
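Here is a rough sketch to check this for yourself. It only uses the 20 words from the output posted in #1, not the full lyrics, and build_lookup and branching_keys are helper names I made up for illustration. They imitate how cc_markov.py builds its lookup table but are not part of the module.

```python
from collections import defaultdict

# The 20 words from the output posted in #1, in song order.
words = ("the blatant disarray disfigure the public eye's disgrace defying "
         "common place unending paper chase unending deafening painstaking "
         "reckoning this vertigo").split()


def build_lookup(words, num_key_words):
    """Map each run of num_key_words words to the words seen right after it.

    Hypothetical helper: same idea as the lookup_dict in cc_markov.py.
    """
    lookup = defaultdict(list)
    for i in range(len(words) - num_key_words):
        key = tuple(words[i:i + num_key_words])
        lookup[key].append(words[i + num_key_words])
    return lookup


def branching_keys(lookup):
    """Return only the keys that have more than one distinct follower."""
    return {key: followers for key, followers in lookup.items()
            if len(set(followers)) > 1}


# Only keys with several followers give random.choice anything to choose from.
for n in (1, 2, 3):
    print(n, branching_keys(build_lookup(words, n)))
```

On this excerpt, only the one-word keys have any branching (‘the’ and ‘unending’). The full lyrics will have more repeated phrases, which is probably why a run with longer keys occasionally looks randomized, but the smaller the key, the more often the chain gets a real choice.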
Hope this helps