CYK algorithm
From Wikipedia, the free encyclopedia
The Cocke-Younger-Kasami (CYK) algorithm (alternatively called CKY) determines whether a string can be generated by a given context-free grammar and, if so, how it can be generated. This is known as parsing the string. The algorithm employs bottom-up parsing and dynamic programming.
The standard version of CYK operates on context-free grammars given in Chomsky normal form (CNF). Any context-free grammar may be written thus. [1]
The importance of the CYK algorithm stems from the fact that it constructively proves that the membership problem for context-free languages is decidable (see also: theory of computation) and the fact that it does so quite efficiently. The worst case asymptotic time complexity of CYK is Θ(n3), where n is the length of the parsed string. This makes it one of the most efficient (in those terms) algorithms for recognizing general context-free languages. Specialized and faster algorithms exist for certain subsets of the context-free languages.
Contents |
[edit] Standard form
The algorithm as given in pseudocode is as follows:
[edit] As pseudocode
Let the input string consist of n letters, a1 ... an. Let the grammar contain r terminal and nonterminal symbols R1 ... Rr. This grammar contains the subset Rs which is the set of start symbols. Let P[n,n,r] be an array of booleans. Initialize all elements of P to false. For each i = 1 to n For each unit production Rj -> ai, set P[i,1,j] = true. For each i = 2 to n -- Length of span For each j = 1 to n-i+1 -- Start of span For each k = 1 to i-1 -- Partition of span For each production RA -> RB RC If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true If any of P[1,n,x] is true (x is iterated over the set s, where s are all the indices for Rs) Then string is member of language Else string is not member of language
[edit] As prose
In informal terms, this algorithm considers every possible subsequence of the sequence of words and sets P[i,j,k] to be true if the subsequence of words starting from i of length j can be generated from Rk. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks to see if there is some production P → Q R such that Q matches the first part and R matches the second part. If so, it records P as matching the whole subsequence. Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire sentence is matched by the start symbol.
S | ||||||
VP | ||||||
S | ||||||
VP | PP | |||||
S | NP | NP | ||||
NP | V, VP | Det. | N | P | Det | N |
she | eats | a | fish | with | a | fork |
[edit] Extensions
[edit] Generating a parse tree
It is simple to extend the above algorithm to not only determine if a sentence is in a language, but to also construct a parse tree, by storing parse tree nodes as elements of the array, instead of booleans. Since the grammars being recognized can be ambiguous, it is necessary to store a list of nodes (unless one wishes to only pick one possible parse tree); the end result is then a forest of possible parse trees. An alternative formulation employs a second table B[n,n,r] of so-called backpointers.
[edit] Parsing non-CNF context-free grammars
It is also possible to extend the CYK algorithm to handle some context-free grammars which are not written in CNF; this may be done to improve performance, although at the cost of making the algorithm harder to understand.
[edit] Parsing weighted context-free grammars
It is also possible to extend the CYK algorithm to parse strings using weighted and stochastic context-free grammars. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).
[edit] See also
[edit] References
- ^ Sipser, Michael (1997), Introduction to the Theory of Computation (1st ed.), IPS, p. 99, ISBN 0-534-94728-X
- John Cocke and Jacob T. Schwartz (1970). Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.
- T. Kasami (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.
- Daniel H. Younger (1967). Recognition and parsing of context-free languages in time n3. Information and Control 10(2): 189–208.