Forum Archive

Markov chain text?

chrisburdick

Is there any way in Pythonista to create a Markov chain text generator (or something resembling one)? Something, in other words, that will take user-provided text (either via imported document or copy/paste) and use that as a source library for generating random semi-coherent sentences?

omz

I have sample code for Markov chain text somewhere, just need to clean it up a bit.

omz

This generates some interesting, pseudo-random text from Tolstoi's Anna Karenina:

#!python3

# Adapted from this blog post: http://agiliq.com/blog/2009/06/generating-pseudo-random-text-with-markov-chains-u/

import random
import os
import urllib.request

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()


    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words


    def triples(self):
        """Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """

        if len(self.words) < 3:
            return

        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

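    def generate_markov_text(self, size=25):
        # Note: size is not used; generation stops at the end of a sentence.
        # Pick a random starting position whose word is capitalized.
        while True:
            seed = random.randint(0, self.word_size-3)
            seed_word = self.words[seed]
            if seed_word[0].isupper():
                break
        w1, w2 = self.words[seed], self.words[seed+1]
        gen_words = [w1]
        # Walk the chain until a word ends a sentence. The last pair in the
        # text may have no follower, so stop there instead of raising KeyError.
        while not w2.endswith('.') and (w1, w2) in self.cache:
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
            gen_words.append(w1)
        gen_words.append(w2)
        return ' '.join(gen_words)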

def main():
    if not os.path.exists('anna_karenina.txt'):
        print('Downloading book...')
        urllib.request.urlretrieve('http://www.gutenberg.org/files/1399/1399-0.txt', 'anna_karenina.txt')

    with open('anna_karenina.txt', 'r', encoding='utf-8') as f:
        markov = Markov(f)
        print(markov.generate_markov_text())

if __name__ == '__main__':
    main()
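
The original question also asked about pasted text rather than a file. The following is only a minimal sketch of the same two-word-prefix idea working on a plain string; the names build_chain and generate_sentence and the max_words cap are hypothetical and not part of the script above.

```python
import random


def build_chain(text):
    """Map each pair of consecutive words to the list of words that follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - 2):
        chain.setdefault((words[i], words[i + 1]), []).append(words[i + 2])
    return chain


def generate_sentence(chain, max_words=30, rng=random):
    """Walk the chain from a random pair until a word ends with '.'.

    max_words is a hypothetical safety cap so the walk always terminates.
    """
    if not chain:
        return ''
    # Prefer pairs whose first word is capitalized; fall back to any pair.
    starts = [pair for pair in chain if pair[0][0].isupper()] or list(chain)
    w1, w2 = rng.choice(starts)
    out = [w1, w2]
    while not w2.endswith('.') and len(out) < max_words:
        followers = chain.get((w1, w2))
        if not followers:  # dead end: this pair only occurs at the very end
            break
        w1, w2 = w2, rng.choice(followers)
        out.append(w2)
    return ' '.join(out)


if __name__ == '__main__':
    sample = 'What a lovely day. What a lovely walk we had. What a day.'
    print(generate_sentence(build_chain(sample)))
```

In Pythonista, the string passed to build_chain could presumably come from the clipboard module (clipboard.get()) instead of the sample text. With very short input the output will mostly repeat the source, since most word pairs have only one follower.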

Phuket2

@omz, it was fun to look at this code and run it. But then I went to http://www.gutenberg.org to see other books. What a messy site for such an important resource. Surprising that no one has offered to redo it using Bootstrap or another framework. Just saying...

chrisburdick

Thanks, ole! I'll give it a go!