# CYK algorithm

### From Wikipedia, the free encyclopedia

The **Cocke-Younger-Kasami (CYK) algorithm** (alternatively called CKY) determines whether a string can be generated by a given context-free grammar and, if so, how it can be generated. This is known as parsing the string. The algorithm employs bottom-up parsing and dynamic programming.

The standard version of CYK operates on context-free grammars given in Chomsky normal form (CNF). Any context-free grammar may be written thus. ^{[1]}

The importance of the CYK algorithm stems from the fact that it constructively proves that the membership problem for context-free languages is decidable (see also: theory of computation) and the fact that it does so quite efficiently. The worst case asymptotic time complexity of CYK is Θ(n^{3}), where *n* is the length of the parsed string. This makes it one of the most efficient (in those terms) algorithms for recognizing general context-free languages. Specialized and faster algorithms exist for certain subsets of the context-free languages.

## Contents |

## [edit] Standard form

The algorithm as given in pseudocode is as follows:

### [edit] As pseudocode

Letthe input string consist ofnletters,a_{1}...a_{n}.Letthe grammar containrterminal and nonterminal symbolsR_{1}...R_{r}. This grammar contains the subset R_{s}which is the set of start symbols.LetP[n,n,r] be an array of booleans. Initialize all elements of P to false.For eachi = 1 to nFor eachunit production R_{j}-> a_{i},setP[i,1,j] = true.For eachi = 2 to n-- Length of spanFor eachj = 1 to n-i+1-- Start of spanFor eachk = 1 to i-1-- Partition of spanFor eachproduction R_{A}-> R_{B}R_{C}IfP[j,k,B] and P[j+k,i-k,C]thenset P[j,i,A] = trueIfany of P[1,n,x] is true (x is iterated over the set s, where s are all the indices for R_{s})Thenstring is member of languageElsestring is not member of language

### [edit] As prose

In informal terms, this algorithm considers every possible subsequence of the sequence of words and sets P[i,j,k] to be true if the subsequence of words starting from i of length j can be generated from R_{k}. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks to see if there is some production P → Q R such that Q matches the first part and R matches the second part. If so, it records P as matching the whole subsequence. Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire sentence is matched by the start symbol.

S |
||||||

VP | ||||||

S |
||||||

VP | PP | |||||

S |
NP | NP | ||||

NP | V, VP | Det. | N | P | Det | N |

she | eats | a | fish | with | a | fork |

## [edit] Extensions

### [edit] Generating a parse tree

It is simple to extend the above algorithm to not only determine if a sentence is in a language, but to also construct a parse tree, by storing parse tree nodes as elements of the array, instead of booleans. Since the grammars being recognized can be ambiguous, it is necessary to store a list of nodes (unless one wishes to only pick one possible parse tree); the end result is then a forest of possible parse trees. An alternative formulation employs a second table B[n,n,r] of so-called *backpointers*.

### [edit] Parsing non-CNF context-free grammars

It is also possible to extend the CYK algorithm to handle some context-free grammars which are not written in CNF; this may be done to improve performance, although at the cost of making the algorithm harder to understand.

### [edit] Parsing weighted context-free grammars

It is also possible to extend the CYK algorithm to parse strings using weighted and stochastic context-free grammars. Weights (probabilities) are then stored in the table P instead of booleans, so P[i,j,A] will contain the minimum weight (maximum probability) that the substring from i to j can be derived from A. Further extensions of the algorithm allow all parses of a string to be enumerated from lowest to highest weight (highest to lowest probability).

## [edit] See also

## [edit] References

**^**Sipser, Michael (1997),*Introduction to the Theory of Computation*(1st ed.), IPS, p. 99, ISBN 0-534-94728-X

- John Cocke and Jacob T. Schwartz (1970). Programming languages and their compilers: Preliminary notes. Technical report, Courant Institute of Mathematical Sciences, New York University.
- T. Kasami (1965). An efficient recognition and syntax-analysis algorithm for context-free languages. Scientific report AFCRL-65-758, Air Force Cambridge Research Lab, Bedford, MA.
- Daniel H. Younger (1967). Recognition and parsing of context-free languages in time
*n*^{3}.*Information and Control*10(2): 189–208.