Chomsky Hierarchy, CFG, and CKY Algorithm

According to Wikipedia, the CYK algorithm is a parsing algorithm for context-free grammars, first published by Itiroo Sakai in 1961. The algorithm is named after its later rediscoverers, Cocke, Younger, and Kasami. It employs bottom-up parsing and dynamic programming.

The standard version of the algorithm operates on context-free grammars given in Chomsky normal form (CNF). However, any context-free grammar can be algorithmically transformed into a CNF grammar expressing the same language.

The pseudocode goes as follows:

let the input be a string I consisting of n characters: a1 ... an.

let the grammar contain r nonterminal symbols R1 ... Rr, with 
start symbol R1.

let P[n,n,r] be an array of booleans. Initialize all elements of 
P to false.

let back[n,n,r] be an array of lists of backpointing triples. 
Initialize all elements of back to the empty list.

for each s = 1 to n
    for each unit production Rv → as
        set P[1,s,v] = true

for each l = 2 to n :-- Length of span
    for each s = 1 to n-l+1 :-- Start of span
        for each p = 1 to l-1 :-- Partition of span
            for each production Ra → Rb Rc
                if P[p,s,b] and P[l-p,s+p,c] then
                    set P[l,s,a] = true, 
                    append <p,b,c> to back[l,s,a]

if P[n,1,1] is true then
    I is member of language
    return back :-- by retracing the steps through back, 
    one can easily construct all possible parse trees 
    of the string.
else
    return "not a member of language"

Here's a step-by-step breakdown of the pseudocode:

Initialization:

The boolean table P and the back-pointer table back are created, with every entry of P set to false and every entry of back set to the empty list.

Unit Productions:

For each position s and each terminal production Rv → as, P[1,s,v] is set to true; every single word is labelled with the nonterminals that can derive it.

Parsing:

Spans of length l = 2 ... n are processed in order of increasing length. For each start position s, every partition point p splits the span into a left part of length p and a right part of length l-p.

For each production of the form Ra → Rb Rc:

If Rb derives the left part (P[p,s,b] is true) and Rc derives the right part (P[l-p,s+p,c] is true), then Ra derives the whole span: P[l,s,a] is set to true and the triple <p,b,c> is appended to back[l,s,a].

Result Check:

P[n,1,1] records whether the start symbol R1 derives the entire input of length n.

Result Reporting:

If it does, the string is a member of the language and back is returned; retracing the steps through back yields all possible parse trees. Otherwise "not a member of language" is returned.

The CYK algorithm is a bottom-up parsing algorithm that can determine whether a given string belongs to a context-free language and construct parse trees for valid strings. The dynamic programming table (P) is used to store intermediate parsing results, and the back array is used for backtracking to construct parse trees.


Probabilistic CYK (for finding the most probable parse)

In informal terms, this algorithm considers every possible substring of the input string and sets P[l,s,v] to be true if the substring of length l starting from s can be generated from the nonterminal Rv. Once it has considered substrings of length 1, it goes on to substrings of length 2, and so on.

For substrings of length 2 and greater, it considers every possible partition of the substring into two parts, and checks to see if there is some production A → B C such that B matches the first part and C matches the second part. If so, it records A as matching the whole substring.

Once this process is completed, the input string is generated by the grammar if the substring containing the entire input string is matched by the start symbol.
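
In the probabilistic variant named above, the boolean entries are replaced by probabilities: each cell keeps, for every nonterminal, the probability of its best derivation over that span, together with the split and children that achieved it (a Viterbi-style maximization). A minimal sketch, reusing the illustrative grammar encoding from the earlier recognizer but with rule probabilities attached:

from collections import defaultdict

def pcyk(words, terminal_rules, binary_rules, start="S"):
    """Viterbi-style CYK: probability of the best parse (0.0 if none) plus back pointers.

    terminal_rules: dict word -> {A: P(A -> word)}
    binary_rules:   dict (B, C) -> {A: P(A -> B C)}
    """
    n = len(words)
    best = defaultdict(dict)   # best[(l, s)][A] = probability of the best A over that span
    back = {}                  # back[(l, s, A)] = (p, B, C) achieving that probability

    # length-1 spans: lexical rules
    for s, word in enumerate(words):
        for A, prob in terminal_rules.get(word, {}).items():
            best[(1, s)][A] = prob

    # longer spans: maximize over partition points and binary rules
    for l in range(2, n + 1):
        for s in range(0, n - l + 1):
            for p in range(1, l):
                left, right = best[(p, s)], best[(l - p, s + p)]
                for (B, C), heads in binary_rules.items():
                    if B in left and C in right:
                        for A, rule_prob in heads.items():
                            prob = rule_prob * left[B] * right[C]
                            if prob > best[(l, s)].get(A, 0.0):
                                best[(l, s)][A] = prob
                                back[(l, s, A)] = (p, B, C)

    return best[(n, 0)].get(start, 0.0), back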




Now the sentence "she eats a fish with a fork" is analyzed using the CYK algorithm, with the following grammar in CNF:

S   :-> NP VP      NP  :-> SHE
VP  :-> VP PP      V   :-> EATS
VP  :-> V NP       P   :-> WITH
VP  :-> EATS       N   :-> FISH
PP  :-> P NP       N   :-> FORK
NP  :-> DET N      DET :-> A

In the resulting table P[i,j,k], i is the number of the row (starting at the bottom at 1) and j is the number of the column (starting at the left at 1).


[Table: CYK chart for "she eats a fish with a fork" (figure omitted)]
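
As a quick check, this grammar can be encoded in the dictionary format of the cyk_recognize sketch from the pseudocode section (again an illustrative format, not a library call):

terminal_rules = {
    "she":  {"NP"},
    "eats": {"V", "VP"},
    "with": {"P"},
    "a":    {"DET"},
    "fish": {"N"},
    "fork": {"N"},
}
binary_rules = {
    ("NP", "VP"): {"S"},
    ("VP", "PP"): {"VP"},
    ("V", "NP"):  {"VP"},
    ("P", "NP"):  {"PP"},
    ("DET", "N"): {"NP"},
}

words = "she eats a fish with a fork".split()
print(cyk_recognize(words, terminal_rules, binary_rules, start="S"))  # expected: True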


Dynamic Programming

It"s a very ubiquitous method of reducing the runtime algorithms by discovering solutions to subproblems along the way to the solution of the main problem; to optimally plan a multi-stage process

What are the subproblems for the parsing task?

A well-formed substring table is a data structure containing partial constituency structures. It may be represented as either a chart or a graph.

Example

The brown dog

NP  → DT Nom
Nom → JJ NN
DT  → the, etc.

Put into a chart,

         The      Brown     Dog
The      DT_1               NP_5
Brown              JJ_2     Nom_4
Dog                         NN_3
Numbers indicate order in which symbol was entered into table.
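
A simple way to realize such a well-formed substring table in code is a mapping from spans to the symbols recognized over them; the chart above could be stored, for instance, as the following sketch:

# spans are (start, end) word positions; values are the symbols recognized over each span
chart = {
    (0, 1): {"DT"},   # The
    (1, 2): {"JJ"},   # brown
    (2, 3): {"NN"},   # dog
    (1, 3): {"Nom"},  # brown dog
    (0, 3): {"NP"},   # The brown dog
}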

Setting up the CKY algorithm

CKY Assumptions: the grammar is in Chomsky Normal Form, so every rule is either A → B C (two nonterminals) or A → a (a single terminal).

Critical observation: any portion of the input string spanning i to j can be split at k, and structure can then be built using sub-solutions spanning i to k and sub-solutions spanning k to j.

•0 The •1 Brown •2 Dog •3

Simple Grammar

S  → NP VBZ        DT  → the
S  → NP VP         NN  → chef
VP → VP PP         NNS → fish
VP → VBZ NP        NNS → chopsticks
VP → VBZ PP        VBP → fish
VP → VBZ NNS       VBZ → eats
VP → VBZ VP        IN  → with
VP → VBP NP
VP → VBP PP
NP → DT NN
NP → DT NNS
PP → IN NP

•0 the •1 chef •2 eats •3 fish •4 with •5 the •6 chopsticks •7

Let's Walk Through an Example

First we need to build a table: an (n + 1) × (n + 1) matrix, where n is the number of words in the input.

     0    1    2    3    4    5    6    7
0
1
2
3
4
5
6


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  [0,1]
1:  [1,2]
2:  [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

This illustrates the numbering of cells: each [i,j] represents a span.


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:
1:  [1,2]
2:
3:
4:
5:
6:

Notice how the spans (e.g. [1,2]) differ from the word indices (e.g. "chef" , 2).


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]
1:  [1,2]
2:  [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

"The" is labeled DT


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]
1:  NN [1,2]
2:  [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

"chef" is labelled NN


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]
1:  NN [1,2]
2:  [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

Found an NP: [0,1] ,[1,2]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]
1:  NN [1,2]
2:  VBZ [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

"eats" is labelled VBZ


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]
1:  NN [1,2]
2:  VBZ [2,3]
3:  [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

Found an S: [0,2] ,[2,3]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]
1:  NN [1,2]
2:  VBZ [2,3]
3:  NNS [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

"fish" is lablled NNS


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]
1:  NN [1,2]
2:  VBZ [2,3]
3:  NNS, VBP [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

"fish" is also labelled VBP because of the verb "eats" or VBZ


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

Found a VP: [2,3], [3,4]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  [4,5]
5:  [5,6]
6:  [6,7]

Found an S: [0,2],[2,4]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  IN [4,5]
5:  [5,6]
6:  [6,7]

"with" is labelled IN


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  IN [4,5]
5:  DT [5,6]
6:  [6,7]

"the" is labelled DT


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  IN [4,5]
5:  DT [5,6]
6:  NNS [6,7]

"chopsticks" is labelled NNS


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  IN [4,5]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found an NP: [5,6], [6,7]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found a PP: [4,5], [5,7]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]
3:  NNS, VBP [3,4]   VP [3,7]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found a VP: [3,4], [4,7]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]   VP [2,7]
3:  NNS, VBP [3,4]   VP [3,7]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found a VP:[2,3], [3,7]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]   VP1, VP2 [2,7]
3:  NNS, VBP [3,4]   VP [3,7]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found another VP: [2,4], [4,7]


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]   S [0,7]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]   VP1, VP2 [2,7]
3:  NNS, VBP [3,4]   VP [3,7]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found an S node: [0,2], [2,7] (where VP spans from [2,3], [3,7])


0 the 1 chef 2 eats 3 fish 4 with 5 the 6 chopsticks 7
0:  DT [0,1]   NP [0,2]   S [0,3]   S [0,4]   S1, S2 [0,7]
1:  NN [1,2]
2:  VBZ [2,3]  VP [2,4]   VP1, VP2 [2,7]
3:  NNS, VBP [3,4]   VP [3,7]
4:  IN [4,5]   PP [4,7]
5:  DT [5,6]   NP [5,7]
6:  NNS [6,7]

Found a second S node, also [0,2], [2,7] (where the VP spans [2,4], [4,7])

* The recognition algorithm returns `true` when a root node (S) is found in cell `[0,n]`.

Another Version of the CKY Algorithm Pseudocode

CKY Recognition: fill each cell table[i,j] with the nonterminals covering the span from i to j; the input is accepted if the start symbol ends up in table[0, n].
CKY Parsing: additionally record back pointers so the parse trees can be rebuilt.

function CKY-Parse(words, grammar) returns table, back
    for j ← 1 to LENGTH(words) do:  (loop over columns)
        table[j-1,j] ← {A | A → words[j] ∈ grammar}  (add all POS for word j)
        for i ← j-2 downto 0 do:  (loop over rows, backwards)
            for k ← i+1 to j-1 do:  (loop over split points)
                for all {A | A → B C ∈ grammar, B ∈ table[i,k], C ∈ table[k,j]}:  (matching productions)
                    table[i,j] ← table[i,j] ∪ {A}  (add the new constituent)
                    back[i,j,A] ← back[i,j,A] ∪ {(k,B,C)}  (add back pointer)
    if S ∈ table[0, LENGTH(words)] then
        return buildtree(back[0, LENGTH(words), S])  (follow back pointers to build all parse trees)
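
To make this version concrete, here is a runnable Python sketch that fills the [i,j] table and back pointers for the simple grammar and the walkthrough sentence; the function names and grammar encoding are illustrative, not from any parsing library.

from collections import defaultdict

BINARY = {                       # (B, C) -> heads A with A -> B C
    ("NP", "VBZ"): {"S"},
    ("NP", "VP"):  {"S"},
    ("VP", "PP"):  {"VP"},
    ("VBZ", "NP"): {"VP"},
    ("VBZ", "PP"): {"VP"},
    ("VBZ", "NNS"): {"VP"},
    ("VBZ", "VP"): {"VP"},
    ("VBP", "NP"): {"VP"},
    ("VBP", "PP"): {"VP"},
    ("DT", "NN"):  {"NP"},
    ("DT", "NNS"): {"NP"},
    ("IN", "NP"):  {"PP"},
}
LEXICAL = {                      # word -> POS tags A with A -> word
    "the": {"DT"}, "chef": {"NN"}, "fish": {"NNS", "VBP"},
    "chopsticks": {"NNS"}, "eats": {"VBZ"}, "with": {"IN"},
}

def cky_parse(words):
    n = len(words)
    table = defaultdict(set)     # table[(i, j)] = nonterminals spanning words i..j
    back = defaultdict(list)     # back[(i, j, A)] = list of (k, B, C) splits
    for j in range(1, n + 1):
        table[(j - 1, j)] = set(LEXICAL.get(words[j - 1], set()))
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for (B, C), heads in BINARY.items():
                    if B in table[(i, k)] and C in table[(k, j)]:
                        for A in heads:
                            table[(i, j)].add(A)
                            back[(i, j, A)].append((k, B, C))
    return table, back

def build_trees(words, back, i, j, A):
    """Follow back pointers to enumerate every tree for A over the span [i, j]."""
    if j == i + 1 and not back[(i, j, A)]:
        return [(A, words[i])]               # lexical leaf
    trees = []
    for k, B, C in back[(i, j, A)]:
        for left in build_trees(words, back, i, k, B):
            for right in build_trees(words, back, k, j, C):
                trees.append((A, left, right))
    return trees

words = "the chef eats fish with the chopsticks".split()
table, back = cky_parse(words)
if "S" in table[(0, len(words))]:            # recognition check at the root cell [0,7]
    parses = build_trees(words, back, 0, len(words), "S")
    print(len(parses))                       # expected: 2, matching S1 and S2 in the walkthrough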

Facts about the CKY algorithm

Chomsky Normal Form grammar: there can be only two kinds of rules, binary rules A → B C over nonterminals and lexical rules A → a with a single terminal on the right-hand side.

Grammar equivalence: the CNF grammar is weakly equivalent to the original grammar; it generates exactly the same strings, although its parse trees differ.

Symbol naming conventions: the new nonterminals introduced during conversion are often named following X-bar-theory-style conventions.

Refer to the NLTK tree transforms module, which implements these transformations on parse trees.
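
As a pointer, NLTK's Tree class exposes transform methods that perform this kind of conversion on trees; a small sketch follows (the method names and options are recalled from memory and should be checked against the NLTK documentation):

from nltk import Tree

# a tree over the sample CFG in the next section: S -> NP VP PUNC is non-binary,
# and VP -> VBZ is a unit production
t = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBZ sleeps)) (PUNC .))")

t.collapse_unary(collapsePOS=True)  # collapse unit productions such as VP -> VBZ
t.chomsky_normal_form()             # binarize, introducing new intermediate symbols
print(t)
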
A Sample of CFG

S   → NP VP PUNC   (non-binary)
S   → S and S      (mixed)
NP  → DT NP        (OK)
NP  → NN           (unit production)
NN  → dog          (OK)
NN  → cat          (OK)
VP  → VBZ NP       (OK)
VP  → VBZ          (unit production)
VBZ → sleeps       (OK)
VBZ → eats         (OK)
DT  → the          (OK)

Conversion of CFG to CNF: Step 1

Non-CNF grammar || CNF grammar    || Action
--------------- || -------------- || ---------------
NP :-> NN       ||                || (unit production removed)
NN :-> dog      ||                ||
NN :-> cat      ||                ||
                || NP :-> dog     || (collapse rule)
                || NP :-> cat     || (collapse rule)
VP :-> VBZ      ||                || (unit production removed)
VBZ :-> sleeps  || VBZ :-> sleeps || (carry over; VBZ is still used in VP :-> VBZ NP)
VBZ :-> eats    || VBZ :-> eats   || (carry over)
                || VP :-> sleeps  || (collapse rule)
                || VP :-> eats    || (collapse rule)

Conversion of CFG to CNF: Step 2

Non-CNF grammar || CNF grammar  || Action
--------------- || ------------ || ------------
S :-> S and S   ||              ||
                || S  :-> S X1  || (new symbol)
                || X1 :-> X2 S  || (new symbol)
                || X2 :-> and   ||

Conversion of CFG to CNF: Step 3

Non-CNF grammar  || CNF grammar    || Action
---------------- || -------------- || ------------
S :-> NP VP PUNC ||                ||
                 || S  :-> NP X3   || (new symbol)
                 || X3 :-> VP PUNC ||
NP :-> DT NP     || NP :-> DT NP   || (carry over)
VP :-> VBZ NP    || VP :-> VBZ NP  || (carry over)
DT :-> the       || DT :-> the     || (carry over)

Finally, the CFG in CNF

NP  → dog       S   → NP X3
NP  → cat       X3  → VP PUNC
VP  → sleeps    NP  → DT NP
VP  → eats      VP  → VBZ NP
VBZ → sleeps    S   → S X1
VBZ → eats      DT  → the
X1  → X2 S
X2  → and
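
The conversion steps above can be mechanized. The following is a rough Python sketch under stated simplifications: nonterminals are assumed uppercase and terminals lowercase, it covers collapsing unit productions and binarizing long rules, but it does not handle the mixed rule S → S and S (which additionally needs a fresh preterminal such as X2 → and), and the generated Xn names will not match X1/X3 in the tables exactly.

def collapse_units(rules):
    """Remove unit productions A -> B by copying B's rules up to A (Step 1 above)."""
    units = [(a, rhs[0]) for a, rhs in rules if len(rhs) == 1 and rhs[0].isupper()]
    out = [(a, rhs) for a, rhs in rules if not (len(rhs) == 1 and rhs[0].isupper())]
    changed = True
    while changed:                       # repeat in case of chains of unit productions
        changed = False
        for a, b in units:
            for head, rhs in list(out):
                if head == b and (a, rhs) not in out:
                    out.append((a, rhs))
                    changed = True
    return out

def binarize(rules):
    """Replace rules with more than two right-hand symbols by chains of new Xn symbols (Steps 2-3)."""
    out, counter = [], 0
    for head, rhs in rules:
        while len(rhs) > 2:
            counter += 1
            new_sym = "X%d" % counter
            out.append((head, [rhs[0], new_sym]))   # A -> B Xn
            head, rhs = new_sym, rhs[1:]            # Xn -> the rest
        out.append((head, rhs))
    return out

sample = [
    ("S", ["NP", "VP", "PUNC"]),
    ("NP", ["DT", "NP"]), ("NP", ["NN"]),
    ("NN", ["dog"]), ("NN", ["cat"]),
    ("VP", ["VBZ", "NP"]), ("VP", ["VBZ"]),
    ("VBZ", ["sleeps"]), ("VBZ", ["eats"]),
    ("DT", ["the"]),
]
for head, rhs in binarize(collapse_units(sample)):
    print(head, "->", " ".join(rhs))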