Grammar {#estgram}
=====================

# Overview {#estgramoverview}

To aid speech recognition and general text processing, the speech
tools library provides a number of techniques for processing various
types of grammars.

These consist of a continually expanding set of classes and related
programs. These are at various levels of development, but all have
fairly stable and working basic features.

## N-grams {#estgramngrams}

N-gram language models are supported by the EST_Ngrammar class and
its associated routines and programs. Support is provided for
building, tuning, smoothing and running n-gram models. N-grams
themselves can be stored internally in either a *dense* or a
*sparse* representation. In the dense case all possible states are
represented, each with a probability distribution, while the sparse
case only holds those states for which data is available. The sparse
representation will be much smaller when the model is very large,
but we consider the dense case to be the most fully developed.
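
The trade-off can be illustrated as follows (a standalone sketch of the
idea, not the EST_Ngrammar implementation):

\code{.py}
import itertools
from collections import defaultdict

vocab = ["a", "b", "c"]
order = 3  # trigram: a state is the previous order-1 tokens

# Dense: every possible history gets a distribution over the vocabulary,
# so storage is |V|**(order-1) distributions regardless of the data.
dense = {hist: {w: 0.0 for w in vocab}
         for hist in itertools.product(vocab, repeat=order - 1)}

# Sparse: only histories actually observed in the data are stored.
sparse = defaultdict(dict)
for h1, h2, w in [("a", "b", "c"), ("b", "c", "a")]:  # observed trigrams
    sparse[(h1, h2)][w] = sparse[(h1, h2)].get(w, 0.0) + 1.0

print(len(dense), len(sparse))  # 9 possible histories vs 2 observed
\endcode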

Formally, n-grams can be viewed as a special case of weighted finite
state machines. Our implementation reflects that where possible
(backoff options break this simple view), so a start state is provided
and traversal operations are given. This method of using n-grams is
by far the most efficient, as only one new piece of information is
required at each stage: no vectors of tokens need be collected (or
shifted) and presented to the n-gram class. However, as this finite
state machine view cannot always reasonably be used, we also support
access through a vector of tokens.
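
The difference between the two access styles can be sketched as follows
(again a standalone illustration, not the EST_Ngrammar API):

\code{.py}
# Trigram probabilities keyed by (prev_prev, prev) history.
probs = {("#", "#"): {"a1": 0.9}, ("#", "a1"): {"f": 0.8}}

# State-machine style: keep a state and feed one token at a time.
state = ("#", "#")
p = probs[state]["a1"]
state = (state[1], "a1")        # the transition consumes a single token

# Vector style: present the whole n-gram each time.
def prob(ngram):
    *history, predictee = ngram
    return probs[tuple(history)][predictee]

p2 = prob(["#", "a1", "f"])
\endcode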

### Building ngram language models {#estgramngrambuild}

The program \ref ngram_build_manual estimates n-gram language
models from data. The data can be in a number of formats, and the
resulting models can be saved in either an ASCII (easier for humans
to read) or a binary (quicker to load) format.

#### Vocabularies

The vocabulary of an n-gram must be predefined. This is required to
allow an efficient internal representation. This implementation
supports two vocabularies, one for the n-1 tokens in an n-gram
and one for the nth token, as this "predictee" could potentially
be drawn from a different class. We also support the notion of an
out-of-vocabulary word, so any token found in the input data
that is not in the vocabulary may be mapped to that token.

When building n-grams there are options for what to do with n-grams
that contain out-of-vocabulary tokens. They may be mapped to
the specified out-of-vocabulary word, the n-gram can be ignored, or
the whole sentence containing the out-of-vocabulary word can
be ignored.
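
The three strategies can be sketched as follows (a standalone
illustration of the idea; the mode names here are made up and are not
options of \ref ngram_build_manual):

\code{.py}
vocab = {"a1", "f", "r", "#"}
OOV = "_OOV_"

def handle_oov(ngrams, mode):
    """mode is one of 'map', 'skip_ngram', 'skip_sentence' (hypothetical names)."""
    if mode == "skip_sentence" and any(t not in vocab for ng in ngrams for t in ng):
        return []                                    # drop the whole sentence
    out = []
    for ng in ngrams:
        if mode == "skip_ngram" and any(t not in vocab for t in ng):
            continue                                 # drop just this n-gram
        out.append([t if t in vocab else OOV for t in ng])  # 'map' mode
    return out
\endcode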

#### N-gram data input formats

The input data can be in a number of formats, depending on how
much preprocessing you wish to do before building. The most
basic form is to submit the n-grams themselves, that is, n tokens
on each line. For example, for a tri-gram model of phones it might
look like

    0 # a1
    # a1 f
    a1 f r
    f r i
    r i k
    i k aa1
    k aa1 n
    aa1 n @

In this case the data preparation stage must create each n-gram
itself, single-stepping through the data one token at a time. We
call this format `ngram_per_line`.

A second format is `sentence_per_line`, where each line of the file
is a complete "sentence". N-grams for each n-tuple will be
automatically created and accumulated. In this case the input
file might look like

    a1 f r i k aa1 n @
    ei1 b r @ h a m

In this mode, n-grams for the tokens at the start of a sentence are
created by padding with a defined `prev_tag` (and, if necessary, a
`prev_prev_tag`). Thus, given the above sentence-per-line file, a
`prev_tag` of "#" and a `prev_prev_tag` of "0", the first few
tri-grams accumulated are

    0 # a1
    # a1 f
    a1 f r

If the n-gram size requires looking back further, the `prev_prev_tag` is
repeated indefinitely. Likewise, an end tag is appended to the end
of every sentence (i.e. the end of every line).
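
The padding scheme can be sketched as follows (a standalone
illustration that reproduces the tri-grams above, not the
\ref ngram_build_manual code itself; the end tag value is assumed):

\code{.py}
def sentence_ngrams(tokens, n, prev_tag="#", prev_prev_tag="0", end_tag="#"):
    # Pad with one prev_tag, preceded by prev_prev_tag repeated as far back
    # as the n-gram order requires; append the end tag.
    padded = [prev_prev_tag] * (n - 2) + [prev_tag] + tokens + [end_tag]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

for ng in sentence_ngrams("a1 f r i k aa1 n @".split(), 3)[:3]:
    print(" ".join(ng))
# 0 # a1
# # a1 f
# a1 f r
\endcode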

A third data input format is `sentence_per_file`, where line breaks
are no longer significant and n-grams are created for all n-tuples
in the file. The beginning and end of the file are given the same
special treatment as the beginning and end of a line are in the
`sentence_per_line` case.

#### Smoothing and Backoff

We support a number of different techniques to deal with lack of data
in a training set.

Good-Turing smoothing \cite churchgale1991 is supported, allowing
smoothing of n-grams whose frequency is less than M. We also
support *backoff*, where the (n-1)-grams are (recursively) built to
provide an estimate of the probability distributions for unseen
n-grams.
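
The core of Good-Turing smoothing is to replace each observed count r
(for r below the threshold M) with an adjusted count
r* = (r+1) N(r+1) / N(r), where N(r) is the number of distinct n-grams
seen exactly r times. A minimal sketch of that adjustment (not the EST
implementation, which follows \cite churchgale1991):

\code{.py}
from collections import Counter

def good_turing_adjust(counts, M):
    """counts: dict mapping n-gram -> frequency; adjust frequencies below M."""
    freq_of_freq = Counter(counts.values())          # N(r)
    adjusted = {}
    for ngram, r in counts.items():
        if r < M and freq_of_freq.get(r + 1, 0) > 0:
            adjusted[ngram] = (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        else:
            adjusted[ngram] = float(r)               # leave higher counts alone
    return adjusted
\endcode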

### Testing ngram language models {#estngramtesting}

\ref ngram_test_manual computes language model entropy/perplexity
on test data. The test data may be in any of the formats described
above.
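
Entropy here is the mean negative log probability the model assigns to
the test tokens, and perplexity is two raised to that entropy. A
standalone sketch of the computation (not the \ref ngram_test_manual
code):

\code{.py}
import math

def entropy_and_perplexity(log2_probs):
    """log2_probs: log base-2 probability of each test token under the model."""
    H = -sum(log2_probs) / len(log2_probs)   # entropy in bits per token
    return H, 2.0 ** H                       # perplexity = 2**H

H, ppl = entropy_and_perplexity([math.log2(p) for p in (0.5, 0.25, 0.125)])
print(H, ppl)   # 2.0 bits, perplexity 4.0
\endcode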

## SCFGs {#estngramscfg}

Stochastic context-free grammars are context-free grammars
with probabilities on their rules. In this implementation we assume SCFGs
are always in Chomsky Normal Form (CNF), where rules may only be binary
or unary branching. When binary, both daughters must be non-terminals;
when unary, the daughter must be a terminal.

The implementation here is primarily based on \cite pereira1992inside,
thus allowing unsupervised training of SCFGs as well
as seeding with a bracketed corpus, which can vastly reduce
training time and improve results. Training uses the inside-outside
algorithm.

The support is split into four parts: making grammars, training grammars,
testing grammars and parsing.

A grammar file consists of a set of rules. Each rule is a bracketed
list consisting of a probability and a mother non-terminal, followed by
either two non-terminals or one terminal. A simple example is

    (0.5 S A D)
    (0.5 S C B)
    (0.79 B S C)
    (0.21 B a)
    (0.79 D S A)
    (0.21 D b)
    (1 A b)
    (1 C a)

The mother non-terminal in the first rule is the distinguished symbol.
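
For illustration, such a rule file can be read into (probability,
mother, daughters) tuples with a few lines of standalone code (this is
not how the EST_SCFG class loads grammars, just a sketch of the format):

\code{.py}
def read_rules(path):
    rules = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            fields = line.strip("()").split()   # e.g. "(0.79 B S C)"
            prob, mother, daughters = float(fields[0]), fields[1], fields[2:]
            rules.append((prob, mother, daughters))
    return rules
\endcode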

Grammars may be constructed by hand, by the program \ref scfg_make_manual, or
by some other external process. The program \ref scfg_make_manual constructs
a full grammar given a list (or number) of terminals and non-terminals.
The rules can be assigned equal probabilities or random ones. The
"probabilities" may be actual probabilities or log probabilities.
For example, given a file `wp19` with a list of terminals, a grammar
suitable for training with 15 non-terminals may be created by the command

    scfg_make -nonterms 15 -terms wp19 -domain prob \
       -values random -o wsj_15_19.gram

The non-terminal or terminal names will be automatically generated if a
number is given, or will be as specified if a file name is given. In
the case of a filename being given, the file should contain no brackets,
just whitespace-separated tokens.

A corpus consists of a number of sentences; each sentence must be
contained within a set of parentheses. The sentences themselves may
additionally contain further bracketing (for training and testing).
Each sentence is read by the Lisp reader, so comments (semicolon to
end of line) may be included. For example

\code{.lisp}
((((n n) punc ((cd n) j) punc)
  (md (v (dt n) (in (dt j n)) (n cd)))
  punc))
(((n n) (v ((n) (in ((n n) punc (dt n v n))))) punc))
((((n n) punc (((cd n) j) cc ((j n) (in (n n n n)))) punc)
  (v (v (((dt j n) (in (dt j j n))))))
  punc))
...
\endcode

Training is done by estimating the inside and outside probabilities of
the rules based on their current probabilities and a corpus. The corpus
may optionally include internal bracketing, which is used by the training
algorithm to preclude some possible parses, typically making the training
faster (and sometimes more accurate). After each training
pass the grammar rule probabilities are updated and the process starts
again. Note that, depending on the number of training sentences, training
passes may take a very long time. After each pass the cross entropy
for the current version of the grammar is printed. This should normally
decrease until the "best" grammar has been found.

The program \ref scfg_train_manual takes an initial grammar and a corpus
and, by default, will train for 100 passes. Because it can take
prohibitively long to run the desired number of passes, an option is
available to select only an N percent chunk of the training set on each
pass, cycling through the remaining N percent chunks of the corpus on
successive passes. Experiments have shown that this not only produces
much faster training, but the accuracy of the fully trained grammar is
very similar. Given the choice between a 10-day and a 48-hour training
run, it is highly recommended.

After every N passes the current state of the grammar may be saved; the
number of passes between saves is specified by the `-checkpoint`
option. The grammar is saved to the output file name with the pass
number appended.

Because the partitioned training will select different partitions
depending on the pass number, you can optionally specify the starting
pass number, making it much easier to continue training after some
interruption.
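
The partition selection can be thought of as follows (a standalone
sketch of the idea, not the \ref scfg_train_manual code):

\code{.py}
def partition_for_pass(corpus, pass_number, percent):
    """Select the chunk of the corpus used on a given pass."""
    n_chunks = 100 // percent                    # e.g. 10 chunks for 10%
    chunk = pass_number % n_chunks               # which chunk this pass uses
    size = len(corpus) // n_chunks
    return corpus[chunk * size:(chunk + 1) * size]

# Because the chunk depends only on the pass number, restarting training
# at the recorded pass number continues cycling through the same sequence.
\endcode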

Testing is done by the program \ref scfg_test_manual, which takes a
grammar and a corpus. That corpus may be fully bracketed or not. By
default the mean cross entropy value from analysing these sentences will
be printed, along with the number of sentences that fail to parse.

Alternatively a *bracketing accuracy* may be calculated; this is the
percentage of phrases in a parsed sentence that are compatible with the
bracketing in the corpus example.
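
Here "compatible" means the phrase does not cross any bracket in the
reference; a standalone sketch of that check (my reading of the measure,
not the \ref scfg_test_manual code):

\code{.py}
def crosses(span, ref_span):
    """True if the two (start, end) spans overlap without nesting."""
    (s1, e1), (s2, e2) = span, ref_span
    return (s1 < s2 < e1 < e2) or (s2 < s1 < e2 < e1)

def bracketing_accuracy(parsed_spans, reference_spans):
    ok = [p for p in parsed_spans
          if not any(crosses(p, r) for r in reference_spans)]
    return 100.0 * len(ok) / len(parsed_spans)

print(bracketing_accuracy([(0, 5), (1, 3)], [(0, 5), (2, 4)]))  # 50.0
\endcode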

The fourth program, \ref scfg_parse_manual, provides a mechanism for
parsing one or more sentences. The corpus this time should contain no
bracketing except around the beginning and end of the sentence itself.
Two forms of parse are produced: a full form with start and end points
for each phrase, the related non-terminal and the probability, and a
simple form where only the bracketing is given. Note that only one
parse (or none) is given. For any phrase only the best example tree is
given, though the probability is given as the sum of all possible
derivations of that non-terminal for that phrase.

    scfg_parse -grammar wsj_15_19.gram -corpus news.data -brackets -o news.out

Note that the input must consist of strings of terminals for the given
grammar. For real parsing of real text it is likely the grammar uses
part-of-speech tags as terminals while the data is actual words, not
part-of-speech tags. If you want to parse text then you can use the
Festival script `festival/examples/scfg_parse_text`, which takes in
arbitrary text, runs the part-of-speech tagger on it after standard
tokenizing rules, and parses the output, saving the parse to the
specified file.

## WFSTs {#estngramwfst}

The speech tools contain a small but growing library of
basic functions for building and manipulating weighted finite
state transducers. Although not complete, it already provides many of
the basic operations and compilers one needs for using these devices.

Given a WFST, the following operations are supported: \ref determinise,
\ref minimize, \ref complement.

Given two WFSTs, the following operations are supported: \ref intersection,
\ref union, \ref difference, \ref concatenation and \ref compose.

In addition to these operations, compilers are provided for a number of
basic input formats: regular expressions, regular grammars, context-free
grammars (with depth restriction) and Kay/Kaplan/Koskenniemi two-level
morphology rules.

Still missing is a complete treatment of the weights through some
basic operations (e.g. minimization doesn't preserve weights). Also,
techniques for learning WFSTs from data, or at least for weighting
existing FSTs from data, will be added in later versions.

In general, input symbols are of the form X or X/Y. When a bare X is
given it is (except when using the WFST as a transducer) treated
as X/X. X/Y denotes an input/output symbol pair, so using single symbols
will mostly cause the WFST mechanisms to act as if they were finite
state machines.

The two main programs are \ref wfst_build_manual and \ref wfst_run_manual.
\ref wfst_run_manual runs in both recognition and transduction modes.
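
Conceptually, running a (deterministic) transducer over an input either
checks acceptance (recognition) or also emits the output symbols
(transduction). A standalone sketch, independent of the EST_WFST
representation and of wfst_run's actual options:

\code{.py}
# transitions: (state, input symbol) -> (next state, output symbol, weight)
transitions = {(0, "a"): (1, "x", 0.5), (1, "b"): (0, "y", 0.5)}
finals = {0}

def transduce(symbols, start=0):
    state, output, weight = start, [], 0.0
    for sym in symbols:
        if (state, sym) not in transitions:
            return None                      # recognition fails
        state, out, w = transitions[(state, sym)]
        output.append(out)
        weight += w
    return (output, weight) if state in finals else None

print(transduce(["a", "b"]))   # (['x', 'y'], 1.0) -- accepted
print(transduce(["a", "a"]))   # None -- rejected
\endcode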

\ref wfst_build_manual builds WFSTs from description files or through
combination of existing ones. The output may optionally be
determinized, or determinized and minimized.

### Kay/Kaplan/Koskenniemi morphological rules {#estngramwfstkaykaplan}

One of the major drivers of interest in WFSTs has been their
use in morphology @cite kaplan94. Hence we provide a method for
compiling Kay/Kaplan/Koskenniemi type (restricted) context-sensitive
rewrite rules. The exact form is given in the example below.

This example covers basic letter-to-letter mappings but also epenthesis
for e-insertion in words like `churches` and `boxes`.
\code{.lisp}
(KKrules
 engmorph
 (Alphabets
  ;; Input Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z #)
  ;; Output Alphabet
  (a b c d e f g h i j k l m n o p q r s t u v w x y z + #)
 )
 (Sets
  (LET a b c d e f g h i j k l m n o p q r s t u v w x y z)
 )
 (Rules
  ;; The basic rules
  ( a => nil --- nil)
  ( b => nil --- nil)
  ( c => nil --- nil)
  ( d => nil --- nil)
  ( e => nil --- nil)
  ( f => nil --- nil)
  ( g => nil --- nil)
  ( h => nil --- nil)
  ( i => nil --- nil)
  ( j => nil --- nil)
  ( k => nil --- nil)
  ( l => nil --- nil)
  ( m => nil --- nil)
  ( n => nil --- nil)
  ( o => nil --- nil)
  ( p => nil --- nil)
  ( q => nil --- nil)
  ( r => nil --- nil)
  ( s => nil --- nil)
  ( t => nil --- nil)
  ( u => nil --- nil)
  ( v => nil --- nil)
  ( w => nil --- nil)
  ( x => nil --- nil)
  ( y => nil --- nil)
  ( z => nil --- nil)
  ( # => nil --- nil)
  ( _epsilon_/+ => (or LET _epsilon_/e) --- nil)

  ;; Epenthesis
  ;; churches -> church+s
  ;; boxes -> box+s
  (e/+ <=> (or (s h) (or s x z) (i/y) (c h))
       ---
       (s))
 )
)
\endcode

A substantially larger example of morphographemic rules is distributed with
the Festival speech synthesis system in `festival/lib/engmorph.scm`.
This is based on the English description in @cite ritchie92.

For a definition of the semantics of the basic types of rule
(surface coercion, context restriction and combined rules),
see @cite ritchie92. Note that these rules are run in
parallel (the transducers are intersected), so the rules can
interact in ways that the author might not intend. A good rule
debugger is really required in order to write a substantial set
of rules in this formalism.

The rule compilation method used differs from Kay and Kaplan, and
also from @cite mohri96, and actually follows the method used
in @cite ritchie92, though in this case, unlike @cite ritchie92,
the technique is followed through to true WFSTs. The actual
compilation method should be described somewhere.

The above may be compiled into a WFST by the following command
(assuming the rules are in the file `engmorph.scm`):

    wfst_build -type kk -o engmorph.wfst -detmin engmorph.scm

This rule compiler has also been used for finding equivalent transducers
for restricted forms of decision trees (following @cite sproat96) and
may be viewed as mostly stable.

### Regular expressions {#estngramwfstregex}

A simple method for building WFSTs from regular expressions is also
provided.

An example is
\code{.lisp}
((a b c)
 (a b c)
 (and a (+ (or b c)) d))
\endcode

This consists of the input alphabet and the output alphabet,
followed by a Lisp s-expression containing the regular expression. The
supported operators are `and`, `or`, `+`, `*` and `not`.

Compilation is by the following command:

    wfst_build -type regex -o t1.wfst -detmin t1.regex

### Regular Grammars {#estngramwfstregulargrammars}

A compilation method also exists for regular grammars. These grammars
do not need to be in a normal form; in fact, no check is made that they
are regular: if they contain center-embedding, the construction algorithm
will go into a loop and eventually run out of storage. The
correction for that is to add a depth limit, which would then
allow WFST approximations of context-free grammars, which would
be useful.

An example regular grammar is
\code{.lisp}
(RegularGrammar
 engsuffixmorphosyntax
 ;; Sets
 (
  (V a e i o u y)
  (C b c d f g h j k l m n p q r s t v w x y z)
 )
 ;; Rules

 (
  ;; A word *must* have a suffix to be recognized
  (Word -> # Syls Suffix )
  (Word -> # Syls End )

  ;; This matches any string of characters that contains at least one vowel
  (Syls -> Syl Syls )
  (Syls -> Syl )
  (Syl -> Cs V Cs )
  (Cs -> C Cs )
  (Cs -> )

  (Suffix -> VerbSuffix )
  (Suffix -> NounSuffix )
  (Suffix -> AdjSuffix )
  (VerbSuffix -> VerbFinal End )
  (VerbSuffix -> VerbtoNoun NounSuffix )
  (VerbSuffix -> VerbtoNoun End )
  (VerbSuffix -> VerbtoAdj AdjSuffix )
  (VerbSuffix -> VerbtoAdj End )
  (NounSuffix -> NounFinal End )
  (NounSuffix -> NountoNoun NounSuffix )
  (NounSuffix -> NountoNoun End )
  (NounSuffix -> NountoAdj AdjSuffix )
  (NounSuffix -> NountoAdj End )
  (NounSuffix -> NountoVerb VerbSuffix )
  (NounSuffix -> NountoVerb End )
  (AdjSuffix -> AdjFinal End )
  (AdjSuffix -> AdjtoAdj AdjSuffix)
  (AdjSuffix -> AdjtoAdj End)
  (AdjSuffix -> AdjtoAdv End) ;; isn't any Adv to anything

  (End -> # ) ;; word boundary symbol *always* present

  (VerbFinal -> + e d)
  (VerbFinal -> + i n g)
  (VerbFinal -> + s)

  (VerbtoNoun -> + e r)
  (VerbtoNoun -> + e s s)
  (VerbtoNoun -> + a t i o n)
  (VerbtoNoun -> + i n g)
  (VerbtoNoun -> + m e n t)

  (VerbtoAdj -> + a b l e)

  (NounFinal -> + s)

  (NountoNoun -> + i s m)
  (NountoNoun -> + i s t)
  (NountoNoun -> + s h i p)

  (NountoAdj -> + l i k e)
  (NountoAdj -> + l e s s)
  (NountoAdj -> + i s h)
  (NountoAdj -> + o u s)

  (NountoVerb -> + i f y)
  (NountoVerb -> + i s e)
  (NountoVerb -> + i z e)

  (AdjFinal -> + e r)
  (AdjFinal -> + e s t)

  (AdjtoAdj -> + i s h)
  (AdjtoAdv -> + l y)
  (AdjtoNoun -> + n e s s)
  (AdjtoVerb -> + i s e)
  (AdjtoVerb -> + i z e)

 )
)
\endcode

The above is a simple morpho-syntax for English.

# Programs {#estngramprograms}

The following executable programs are provided for grammars:

 - \ref scfg_make_manual: Make a set of rules for an SCFG.
 - \ref scfg_train_manual: Train a set of rules for an SCFG.
 - \ref scfg_parse_manual: Perform parsing using a pre-trained grammar.
 - \ref scfg_test_manual: Test the output of a parser.
 - \ref wfst_build_manual: Build a weighted finite state transducer.
 - \ref wfst_run_manual: Run a weighted finite state transducer.
 - \ref ngram_build_manual: Train an n-gram from text.
 - \ref ngram_test_manual: Calculate the perplexity etc. of an n-gram.

# Classes {#estngramclasses}

 - EST_SCFG
 - EST_SCFG_Rule
 - EST_SCFG_traintest
 - EST_bracketed_string
 - EST_Ngrammar
 - EST_WFST