This codewalk describes a program that generates random text using
a Markov chain algorithm. The package comment describes the algorithm
and the operation of the program. Please read it before continuing.
doc/codewalk/markov.go:6,44
Modeling Markov chains
A chain consists of a prefix and a suffix. Each prefix is a set
number of words, while a suffix is a single word.
A prefix can have an arbitrary number of suffixes.
To model this data, we use a map[string][]string.
Each map key is a prefix (a string) and its values are
lists of suffixes (a slice of strings, []string).
Here is the example table from the package comment
as modeled by this data structure:
map[string][]string{
" ": {"I"},
" I": {"am"},
"I am": {"a", "not"},
"a free": {"man!"},
"am a": {"free"},
"am not": {"a"},
"a number!": {"I"},
"number! I": {"am"},
"not a": {"number!"},
}
While each prefix consists of multiple words, we
store prefixes in the map as a single string.
It would seem more natural to store the prefix as a
[]string, but we can't do this with a map because the
key type of a map must implement equality (and slices do not).
Therefore, in most of our code we will model prefixes as a
[]string and join the strings together with a space
to generate the map key:
The complete state of the chain table consists of the table itself and
the word length of the prefixes. The Chain struct stores
this data.
doc/codewalk/markov.go:76,79
The NewChain constructor function
The Chain struct has two unexported fields (those that
do not begin with an upper case character), and so we write a
NewChain constructor function that initializes the
chain map with make and sets the
prefixLen field.
This is constructor function is not strictly necessary as this entire
program is within a single package (main) and therefore
there is little practical difference between exported and unexported
fields. We could just as easily write out the contents of this function
when we want to construct a new Chain.
But using these unexported fields is good practice; it clearly denotes
that only methods of Chain and its constructor function should access
those fields. Also, structuring Chain like this means we
could easily move it into its own package at some later date.
doc/codewalk/markov.go:82,84
The Prefix type
Since we'll be working with prefixes often, we define a
Prefix type with the concrete type []string.
Defining a named type clearly allows us to be explicit when we are
working with a prefix instead of just a []string.
Also, in Go we can define methods on any named type (not just structs),
so we can add methods that operate on Prefix if we need to.
doc/codewalk/markov.go:60
The String method
The first method we define on Prefix is
String. It returns a string representation
of a Prefix by joining the slice elements together with
spaces. We will use this method to generate keys when working with
the chain map.
doc/codewalk/markov.go:63,65
Building the chain
The Build method reads text from an io.Reader
and parses it into prefixes and suffixes that are stored in the
Chain.
The io.Reader is an
interface type that is widely used by the standard library and
other Go code. Our code uses the
fmt.Fscan function, which
reads space-separated values from an io.Reader.
The Build method returns once the Reader's
Read method returns io.EOF (end of file)
or some other read error occurs.
doc/codewalk/markov.go:88,100
Buffering the input
This function does many small reads, which can be inefficient for some
Readers. For efficiency we wrap the provided
io.Reader with
bufio.NewReader to create a
new io.Reader that provides buffering.
doc/codewalk/markov.go:89
The Prefix variable
At the top of the function we make a Prefix slice
p using the Chain's prefixLen
field as its length.
We'll use this variable to hold the current prefix and mutate it with
each new word we encounter.
doc/codewalk/markov.go:90
Scanning words
In our loop we read words from the Reader into a
string variable s using
fmt.Fscan. Since Fscan uses space to
separate each input value, each call will yield just one word
(including punctuation), which is exactly what we need.
Fscan returns an error if it encounters a read error
(io.EOF, for example) or if it can't scan the requested
value (in our case, a single string). In either case we just want to
stop scanning, so we break out of the loop.
doc/codewalk/markov.go:92,95
Adding a prefix and suffix to the chain
The word stored in s is a new suffix. We add the new
prefix/suffix combination to the chain map by computing
the map key with p.String and appending the suffix
to the slice stored under that key.
The built-in append function appends elements to a slice
and allocates new storage when necessary. When the provided slice is
nil, append allocates a new slice.
This behavior conveniently ties in with the semantics of our map:
retrieving an unset key returns the zero value of the value type and
the zero value of []string is nil.
When our program encounters a new prefix (yielding a nil
value in the map) append will allocate a new slice.
For more information about the append function and slices
in general see the
Slices: usage and internals article.
doc/codewalk/markov.go:96,97
Pushing the suffix onto the prefix
Before reading the next word our algorithm requires us to drop the
first word from the prefix and push the current suffix onto the prefix.
When in this state
p == Prefix{"I", "am"}
s == "not"
the new value for p would be
p == Prefix{"am", "not"}
This operation is also required during text generation so we put
the code to perform this mutation of the slice inside a method on
Prefix named Shift.
doc/codewalk/markov.go:98
The Shift method
The Shift method uses the built-in copy
function to copy the last len(p)-1 elements of p to
the start of the slice, effectively moving the elements
one index to the left (if you consider zero as the leftmost index).
p := Prefix{"I", "am"}
copy(p, p[1:])
// p == Prefix{"am", "am"}
We then assign the provided word to the last index
of the slice:
The Generate method is similar to Build
except that instead of reading words from a Reader
and storing them in a map, it reads words from the map and
appends them to a slice (words).
Generate uses a conditional for loop to generate
up to n words.
doc/codewalk/markov.go:103,116
Getting potential suffixes
At each iteration of the loop we retrieve a list of potential suffixes
for the current prefix. We access the chain map at key
p.String() and assign its contents to choices.
If len(choices) is zero we break out of the loop as there
are no potential suffixes for that prefix.
This test also works if the key isn't present in the map at all:
in that case, choices will be nil and the
length of a nil slice is zero.
doc/codewalk/markov.go:107,110
Choosing a suffix at random
To choose a suffix we use the
rand.Intn function.
It returns a random integer up to (but not including) the provided
value. Passing in len(choices) gives us a random index
into the full length of the list.
We use that index to pick our new suffix, assign it to
next and append it to the words slice.
Next, we Shift the new suffix onto the prefix just as
we did in the Build method.
doc/codewalk/markov.go:111,113
Returning the generated text
Before returning the generated text as a string, we use the
strings.Join function to join the elements of
the words slice together, separated by spaces.
doc/codewalk/markov.go:115
Command-line flags
To make it easy to tweak the prefix and generated text lengths we
use the flag package to parse
command-line flags.
These calls to flag.Int register new flags with the
flag package. The arguments to Int are the
flag name, its default value, and a description. The Int
function returns a pointer to an integer that will contain the
user-supplied value (or the default value if the flag was omitted on
the command-line).
doc/codewalk/markov.go:119,121
Program set up
The main function begins by parsing the command-line
flags with flag.Parse and seeding the rand
package's random number generator with the current time.
If the command-line flags provided by the user are invalid the
flag.Parse function will print an informative usage
message and terminate the program.
doc/codewalk/markov.go:123,124
Creating and building a new Chain
To create the new Chain we call NewChain
with the value of the prefix flag.
To build the chain we call Build with
os.Stdin (which implements io.Reader) so
that it will read its input from standard input.
doc/codewalk/markov.go:126,127
Generating and printing text
Finally, to generate text we call Generate with
the value of the words flag and assigning the result
to the variable text.
Then we call fmt.Println to write the text to standard
output, followed by a carriage return.
doc/codewalk/markov.go:128,129
Using this program
To use this program, first build it with the
go command:
$ go build markov.go
And then execute it while piping in some input text:
$ echo "a man a plan a canal panama" \
| ./markov -prefix=1
a plan a man a plan a canal panama
Here's a transcript of generating some text using the Go distribution's
README file as source material:
$ ./markov -words=10 < $GOROOT/README
This is the source code repository for the Go source
$ ./markov -prefix=1 -words=10 < $GOROOT/README
This is the go directory (the one containing this README).
$ ./markov -prefix=1 -words=10 < $GOROOT/README
This is the variable if you have just untarred a
doc/codewalk/markov.go
An exercise for the reader
The Generate function does a lot of allocations when it
builds the words slice. As an exercise, modify it to
take an io.Writer to which it incrementally writes the
generated text with Fprint.
Aside from being more efficient this makes Generate
more symmetrical to Build.
map[string][]string
. Each map key is a prefix (astring
) and its values are lists of suffixes (a slice of strings,[]string
).Here is the example table from the package comment as modeled by this data structure: While each prefix consists of multiple words, we store prefixes in the map as a single
string
. It would seem more natural to store the prefix as a[]string
, but we can't do this with a map because the key type of a map must implement equality (and slices do not).Therefore, in most of our code we will model prefixes as a
[]string
and join the strings together with a space to generate the map key:Chain
struct stores this data.Chain
struct has two unexported fields (those that do not begin with an upper case character), and so we write aNewChain
constructor function that initializes thechain
map withmake
and sets theprefixLen
field.This is constructor function is not strictly necessary as this entire program is within a single package (
main
) and therefore there is little practical difference between exported and unexported fields. We could just as easily write out the contents of this function when we want to construct a new Chain. But using these unexported fields is good practice; it clearly denotes that only methods of Chain and its constructor function should access those fields. Also, structuringChain
like this means we could easily move it into its own package at some later date.Prefix
type with the concrete type[]string
. Defining a named type clearly allows us to be explicit when we are working with a prefix instead of just a[]string
. Also, in Go we can define methods on any named type (not just structs), so we can add methods that operate onPrefix
if we need to.Prefix
isString
. It returns astring
representation of aPrefix
by joining the slice elements together with spaces. We will use this method to generate keys when working with the chain map.Build
method reads text from anio.Reader
and parses it into prefixes and suffixes that are stored in theChain
.The
io.Reader
is an interface type that is widely used by the standard library and other Go code. Our code uses thefmt.Fscan
function, which reads space-separated values from anio.Reader
.The
Build
method returns once theReader
'sRead
method returnsio.EOF
(end of file) or some other read error occurs.Readers
. For efficiency we wrap the providedio.Reader
withbufio.NewReader
to create a newio.Reader
that provides buffering.Prefix
slicep
using theChain
'sprefixLen
field as its length. We'll use this variable to hold the current prefix and mutate it with each new word we encounter.Reader
into astring
variables
usingfmt.Fscan
. SinceFscan
uses space to separate each input value, each call will yield just one word (including punctuation), which is exactly what we need.Fscan
returns an error if it encounters a read error (io.EOF
, for example) or if it can't scan the requested value (in our case, a single string). In either case we just want to stop scanning, so webreak
out of the loop.s
is a new suffix. We add the new prefix/suffix combination to thechain
map by computing the map key withp.String
and appending the suffix to the slice stored under that key.The built-in
append
function appends elements to a slice and allocates new storage when necessary. When the provided slice isnil
,append
allocates a new slice. This behavior conveniently ties in with the semantics of our map: retrieving an unset key returns the zero value of the value type and the zero value of[]string
isnil
. When our program encounters a new prefix (yielding anil
value in the map)append
will allocate a new slice.For more information about the
append
function and slices in general see the Slices: usage and internals article.When in this state the new value for
p
would be This operation is also required during text generation so we put the code to perform this mutation of the slice inside a method onPrefix
namedShift
.Shift
method uses the built-incopy
function to copy the last len(p)-1 elements ofp
to the start of the slice, effectively moving the elements one index to the left (if you consider zero as the leftmost index). We then assign the providedword
to the last index of the slice:Generate
method is similar toBuild
except that instead of reading words from aReader
and storing them in a map, it reads words from the map and appends them to a slice (words
).Generate
uses a conditional for loop to generate up ton
words.chain
map at keyp.String()
and assign its contents tochoices
.If
len(choices)
is zero we break out of the loop as there are no potential suffixes for that prefix. This test also works if the key isn't present in the map at all: in that case,choices
will benil
and the length of anil
slice is zero.rand.Intn
function. It returns a random integer up to (but not including) the provided value. Passing inlen(choices)
gives us a random index into the full length of the list.We use that index to pick our new suffix, assign it to
next
and append it to thewords
slice.Next, we
Shift
the new suffix onto the prefix just as we did in theBuild
method.strings.Join
function to join the elements of thewords
slice together, separated by spaces.flag
package to parse command-line flags.These calls to
flag.Int
register new flags with theflag
package. The arguments toInt
are the flag name, its default value, and a description. TheInt
function returns a pointer to an integer that will contain the user-supplied value (or the default value if the flag was omitted on the command-line).main
function begins by parsing the command-line flags withflag.Parse
and seeding therand
package's random number generator with the current time.If the command-line flags provided by the user are invalid the
flag.Parse
function will print an informative usage message and terminate the program.Chain
we callNewChain
with the value of theprefix
flag.To build the chain we call
Build
withos.Stdin
(which implementsio.Reader
) so that it will read its input from standard input.Generate
with the value of thewords
flag and assigning the result to the variabletext
.Then we call
fmt.Println
to write the text to standard output, followed by a carriage return.Generate
function does a lot of allocations when it builds thewords
slice. As an exercise, modify it to take anio.Writer
to which it incrementally writes the generated text withFprint
. Aside from being more efficient this makesGenerate
more symmetrical toBuild
.