Download this week’s example code here.
Notes forthcoming, following this outline:
From Markov to Chomsky: where n-gram analysis fails
What aspects of the syntax of human language does n-gram analysis not capture? Some are easy to catch, like these snippets from the output of a Markov chain algorithm (an order 4, character-level, source text The Waste Land):
And I ready, and gramophone.)
“What four army faces sings, slimy belly on the Metropole.
Anyone who has generated character-level texts with a Markov chain will recognize this problem: because an n-gram analysis only takes into account the immediate context of a word/character/token, there’s no way for it to recognize the fact that a quotation mark or parenthesis is occurring without its pair.
The same problem is demonstrated below, but in a slightly different way:
- That dog laughed.
- That dog over there laughed.
- That dog with the smelly ears laughed.
- That dog I told you about yesterday—the insolent one—laughed.
A human being can understand the structure of these sentences, and understand what the sentences mean. They’re all about laughing, and there’s a dog that’s doing the laughing.
But an n-gram analysis of these sentences won’t be able to tell us these (seemingly) simple facts: it can only tell us which words occur in which order. A partial word-level order 3 n-gram analysis of the sentences given above would yield the following:
that dog laughed over there laughed smelly ears laughed insolent one laughed
If you tried to figure out the text’s meaning based purely on n-gram analysis, you’d come up with some pretty strange conclusions: the dog only shows up once, and the other n-grams seem to indicate that “over there,” “smelly ears” and some mysterious “insolent one” is doing the laughing the rest of the time.
Context-Free Grammars
From the examples above, we can see it takes more than just a knowledge of how words occur in sequence to understand (and generate) human language. More specifically, language has a hierarchical organization: sentences are composed of clauses, which themselves might be composed of clauses. A sentence, for example, is composed of a noun phrase and a verb phrase, and noun phrases and verb phrases can have their own internal structure, even containing other noun phrases or verb phrases. Linguists illustrate this kind of structure with tree diagrams:
(generate your own tree diagrams with phpSyntaxTree)
Clearly, we need a model that captures the hierarchical and recursive nature of syntax. One such model is the context-free grammar.
A grammar is a set of rules. A context-free grammar defines rules in a particular way: as a series of productions. Our goal is to use these rules to generate sentences that have the structure of English. Here’s a mini-grammar of English that we can use for our initial experiments:
S -> NP VP NP -> the N VP -> V VP -> V NP N -> amoeba N -> dichotomy N -> seagull V -> shook V -> escaped V -> changed
Key: NP = noun phrase; VP = verb phrase; S = sentence; N = noun; V = verb
Each rule specifies a “expansion”: it states that the symbol on the left side of the arrow, whenever it occurs in the process of producing a sentence, can be rewritten with the series of symbols on the right side of the arrow. Symbols to the left of the arrow are non-terminal; symbols to the right side of the arrow can be a mix of terminal and non-terminal. (“Terminal” here simply means that there is no rule that specifies how that symbol should be re-written.)
A symbol that appears on the left side more than once (like VP
, N
and V
above) can be rewritten with the right side of any of the given rules. We’ll sometimes write such rules like this, for convenience, but it has the same meaning as the three N
rules above:
N -> amoeba | dichotomy | seagull
Generating sentences
Okay! So let’s generate a simple sentence from our grammar. We have to choose a non-terminal symbol to start the process: this symbol is called the axiom, and it’s usually S
(or whatever the “top level” rule is).
After we choose the axiom, the process for generating a symbol is simple. The process can be described like this:
- Replace each non-terminal symbol with an expansion of that symbol.
- If the expansion of a symbol itself contains non-terminal symbols, replace each of those symbols with their expansions.
- When there are no non-terminal symbols left, stop.
For this example, we’ll use S
as our axiom, and stipulate that whenever we encounter a non-terminal symbol with more than one production rule, we’ll chose among those options randomly. Here’s what the generation might look like:
S
(apply rule: S -> NP VP)NP VP
(apply rule: NP -> the N)the N VP
(apply rule: N -> amoeba)the amoeba VP
(apply rule: VP -> V)the amoeba V
(apply rule: V -> escaped)the amoeba escaped
This isn’t a very sophisticated example, true, but the machinery is in place for producing very complex sentences. Of course, no one wants to do all this by hand, so let’s look at a program written in Java that automates the process of generating sentences from context-free grammars.
ContextFree.java
The ContextFree class is the heart of this week’s example code. An example of how to use the class is given in the main
method of the class, which I’ll talk you through:
ContextFree cf = new ContextFree(); cf.addRule("S", "NP VP"); cf.addRule("NP", "the N"); cf.addRule("N", "cat"); cf.addRule("N", "dog"); cf.addRule("N", "weinermobile"); cf.addRule("N", "duchess"); cf.addRule("VP", "V the N"); cf.addRule("V", "sees"); cf.addRule("V", "chases"); cf.addRule("V", "lusts after"); cf.addRule("V", "blames"); cf.expand("S"); System.out.println();
After instantiating the object, we add rules to the grammar by calling the addRule
method, whose first argument is a non-terminal symbol and whose second argument is a possible expansion for that symbol. The expand
method takes the axiom, and then “expands” that axiom out into a full sentence, following a set of steps very similar to the one we followed above. The above grammar, for example, will create output that looks like this:
the cat lusts after the weinermobile
Keeping track of the rules
The ContextFree class stores its grammar as ContextRule objects. The ContextRule class is conceptually very similar to the MarkovGram object from last week’s example: just a simple class, intended to manage a list of strings and return a random string on demand. In this case, the ContextRule object stores a non-terminal symbol, and a list of all possible expansions for that symbol. Here’s the source code in full:
import java.util.ArrayList; public class ContextRule { private ArrayList<String> expansions; private String rule; public ContextRule(String rule_) { expansions = new ArrayList<String>(); rule = rule_; } public void addExpansion(String expansion) { expansions.add(expansion); } public String getRandomExpansion() { int randomIndex = (int)(Math.random() * expansions.size()); return expansions.get(randomIndex); } public void dump() { System.out.print(rule + " -> "); System.out.println(expansions); } }
A ContextRule object knows how to do three things: add a new expansion to its list of expansions (addExpansion
), return a random item from the list of expansions (getRandomExpansion
) and dump
its contents to the screen. (The dump
method is there mainly for debugging purposes.)
The ContextFree object keeps its ContextRule objects in a HashMap, with the non-terminal symbols as the keys and the ContextRule objects as values. Here’s an excerpt from the code of the ContextFree class, which shows how ContextRule objects are added to the HashMap:
private HashMap<String, ContextRule> rules; public void addRule(String rule, String expansion) { if (rules.containsKey(rule)) { ContextRule cr = rules.get(rule); cr.addExpansion(expansion); } else { ContextRule cr = new ContextRule(rule); cr.addExpansion(expansion); rules.put(rule, cr); } }
(Remember we’re calling this function like this: contextfree.addRule("S", "NP VP");
)
This snippet of code (which should by now look familiar) first checks to see if we already have a ContextRule object in the HashMap for the given rule; if so, that ContextFree object is fetched, and its addExpansion
method is called to add the new expansion to the list of possibilities for the given rule. If the given rule isn’t found, a new ContextRule object is created with the given expansion and added to the HashMap.
A call to dump
will output a representation of what’s in the rules HashMap. For the code given above, dump
will print out something like this:
S -> [NP VP] VP -> [V the N] NP -> [the N] N -> [cat, dog, weinermobile, duchess] V -> [sees, chases, lusts after, blames]
Recursive expansion
The expand
method of the ContextFree class is where the heavy work gets done: taking an axiom and turning it into a full sentence. But how does it work? Here’s the source code:
public void expand(String current) { if (!rules.containsKey(current)) { renderExpansion(current); } else { ContextRule cr = rules.get(current); String[] toExpand = cr.getRandomExpansion().split(" "); for (String s: toExpand) { expand(s); } } }
The first thing this method does is check to see if the given symbol is a non-terminal symbol—in other words, if it’s a key in the rules
HashMap. If it isn’t—if it’s a non-terminal symbol, like “the” or “cat” in the example grammar above—the method passes the symbol on to renderExpansion
, which simply prints the symbol to output. (We’ll see how to redefine renderExpansion
later, in order to give it more interesting behavior.)
If the symbol is non-terminal, then we want to expand that symbol with one of the expansions given in the grammar. We retrieve the ContextRule for the given symbol (rules.get(current)
), get a random expansion from that rule (cr.getRandomExpansion()
) and use split
to break the rule up into its component parts. For example (still using the grammar above for reference), if the symbol passed to expand
was S
, the toExpand
array would have NP
as its first element and VP
as its second element.
Here’s where it gets tricky
The expand
method then loops over every element in toExpand
and calls itself to expand that symbol. This is called recursion: when a function calls itself. Recursion is a handy tool in programming, especially for tasks that are self-similar—such as generating sentences from a context-free grammar.
The idea here is that the expansion of one symbol—say, S
—will set off a chain reaction of expansions of other symbols, until there are no more symbols to be expanded. A call to expand("S")
will itself call expand("NP")
and expand("VP")
; a call to expand("NP")
will itself call expand("the")
and expand("N")
; and so forth.
The diagram below illustrates the whole process (again, using our tiny sample grammar as an example). Solid arrows indicate a call to expand
, dashed arrows indicate a return from expand
, and the numbers indicate in which order the calls are performed.
The order of calls is important. The expand
method expands the left-most symbol to its full depth (e.g., all the way down to renderExpansion
) before it moves on to any more symbols.
Infinite recursion
The astute reader will notice that there’s the potential here for an infinite loop. Let’s say that the only expansion for the symbol NP
looked like this:
NP -> NP or NP
Each attempt to expand NP
would lead to another attempt to expand NP
, which would lead to another attempt to expand NP
, and so forth. The expansion would never complete, and Java would eventually complain that it had run out of memory.
One way to ensure that your context-free grammars will always completely expand is to check to make sure that every symbol has at least one expansion that will eventually lead to terminal symbols. For example, if we added the following rule to our grammar, we’d be okay:
NP -> the dog
(Of course, this grammar could potentially generate extraordinarily long strings, like the dog or the dog or the dog or the dog or the dog or the dog ...
. But it will eventually finish.)
ContextFilter.java
Okay, so the ContextFilter class has functions to add rules to a grammar and to generate sentences from that grammar. So far, though, we’ve had to specify those rules inside the program, calling addRule
by hand. “It would be nice,” I hear you say, “if we could specify the grammar in a separate file, then write a program to read that file in and build the ContextFree object accordingly. That way, we can make changes to the grammar without recompiling our program.”
Well, you’re in luck! The ContextFilter class does just that. It’s a (very) simple example of a parser: a program that reads a file and builds some sort of data structure from its contents. In this case, we’re building a parser for a specification of a context-free grammar. There’s nothing special about the format I chose for this file—it’s not an accepted standard for specifying CFGs—but it’s feature complete and fairly easy to parse. Here’s test.grammar from this week’s example code:
# clauses S -> NP VP S -> Interj NP VP NP -> Det N NP -> Det N that VP NP -> Det Adj N VP -> Vtrans NP # transitive verbs have direct objects VP -> Vintr # intransitive verbs have no direct object # terminals Interj -> oh, | my, | wow, | damn, Det -> this | that | the N -> amoeba | dichotomy | seagull | trombone | corsage | restaurant | suburb Adj -> bald | smug | important | tame | overstaffed | luxurious | blue Vtrans -> computes | examines | foregrounds | prefers | interprets | spins Vintr -> coughs | daydreams | whines | slobbers | vocalizes | sneezes
Each line in the file specifies a rule. The non-terminal symbol is on the left-hand side of the string “->
” and the expansion is on the right-hand side. If a non-terminal symbol has multiple expansions, those expansions can either be specified on separate lines (as with NP
), or on the same line, using the pipe character (|
) as a separator. White space around ->
and |
is ignored. Grammar files can also have comments: anything following the #
character will be ignored, up until the end of the line.
This format is simple enough that we can parse it using regular expressions. Here’s the eachLine
method from ContextFilter. In the code, cfg
refers to a ContextFree object that has already been initialized.
public void eachLine(String line) { line = line.replaceAll("#.*$", ""); // remove comments Pattern rulePattern = Pattern.compile("(\\w+) *-> *(.*)"); // Sym -> Expansion Matcher m = rulePattern.matcher(line); if (m.find()) { String rule = m.group(1); String expansion = m.group(2); String[] expansions = expansion.split("\\s*\\|\\s*"); for (String s: expansions) { cfg.addRule(rule, s); } } }
The first thing that happens in this method is a call to replaceAll("#.*$", "")
: this will replace every #
character followed by zero or more other characters (.*
) up until the end of the line ($
) with nothing.
The pattern passed into Pattern.compile
on the next line is a little bit more complex. Here’s a reading:
(\\w+)
– The first group of characters I’m looking for is one or more alphanumeric characters*-> *
– You’ll find that group just before you find the sequence of characters->
, which may (or may not) be surrounded by some space characters.(.*)
– The second group of characters I’m looking for comes after all of that. Actually, the second group is just whatever characters you find after*-> *
—I’m not picky.
Then we create a Matcher object m
with the results of that regular expression. If all goes according to plan, m.group(1)
will have the non-terminal rule, and m.group(2)
will have the expansion for that rule.
We still have one more thing we need to check for, though: multiple expansions on the same line. A call to split
will split up the second group, using the pipe character as a delimiter. More specifically, \\s*\\|\\s*
translates to “a pipe character, maybe but not necessarily with whitespace to either side.” Why the double backslash in front of the pipe character? Because the pipe character in a regular expression usually has a special meaning: alternation. The double backslash in front tells Java to ignore the special meaning and just look for a literal pipe character.
Finally, we loop through expansions
and add them to our ContextFree object using addRule
. Call the program like this:
$ java ContextFilter <test.grammar
… and get output like this:
oh, the bald corsage computes the important trombone
PDEGenerator.java
Of course, natural language isn’t the only thing that can be modelled with context-free grammars. Programming languages are often specified as context-free grammars as well (in something called Backus-Naur Form—here’s what the BNF of Java looks like). A variant of context-free grammars called l-systems are often used to model plant growth. (See Dan Shiffman’s tutorial for a quick intro to l-systems.)
Computer programs can also be generated with context-free grammars. PDEGenerator is a quick and dirty program that does just that: it uses a context-free grammar to generate Processing sketches that draw recursive geometric patterns. Here’s some sample output (source code here):
Extending ContextFree
The first thing you’ll notice when looking at the PDEGenerator class is that there isn’t much to it: it defines only two methods (main
and renderExpansion
) but seems to do behave exactly like ContextFree object—we can use the addRule
and expand
methods, even though they aren’t defined in PDEGenerator. What’s going on here?
The key word here is extends
—it’s up on the first line of the class definition: PDEGenerator extends ContextFree
. The extends
means that we’re creating a new class, but that we want that class to be able to do everything that another class can do—in essence, borrowing the functionality of that class. If we define a method in our new class that has the same name and arguments as a method in the old class, the new method overrides the old one, so that the new method will be called for all objects of the new class instead of the old method. The renderExpansion
method in ContextFree simply prints out whatever it’s given; the renderExpansion
method in ContextFree does something much more elaborate.
(This is, incidentally, how both Processing’s PApplet class and our TextFilter class work. Processing applets are actually just classes that extend PApplet, overriding PApplet’s methods—like setup
and draw
—as necessary.)
Of course, we can also add new methods to classes, thereby extending the functionality of the class. The technical term for all of this in object-oriented programming is inheritance, and if you’re interested in the nitty-gritty, Sun’s Java inheritance tutorial is as good a place as any to start.
Reading for next week
- Yudkowsky, E. “An intuitive explanation of Bayes’ Theorem.”
- Graham, P. “A plan for spam.”
- (optional) Breyer, L. “Can a Bayesian spam filter play chess?”
Homework
Due March 24th.
Modify, augment, or replace one of the in-class example programs. Some ideas:
- Augment
test.grammar
to model some part of English syntax (or the syntax of some other language) that isn’t currently part of the grammar. Ideas: proper nouns, quotations, parenthetical remarks, adverbs, adverbial phrases, questions, etc. What’s easy to model, and what’s hard to model? Modify or extend the ContextFree class as necessary. - Build a grammar that generates something other than English sentences. For example: stories, images, sound, computer programs, other grammars, etc. Modify or extend the ContextFree class as appropriate.
- Use regular expressions (or other means) to parse data out of a file, as we did in ContextFilter.java. You could write a parser for an alternate context-free grammar syntax (e.g., add in weighting data, see RiTa’s RiGrammar class), or a parser for some arbitrary file format (something as simple as tab- or comma-separated data is fine).
-
Pingback from ITPindia » Blog Archive » Cookalator on March 24, 2009 at 3:05 am
Reply
You must be logged in to post a comment.
4 comments
Comments feed for this article
Trackback link: http://www.decontextualize.com/teaching/a2z/context-free-as-in-speech/trackback/