Identifying Collocation

31 Dec 2014

I've been reading the book Foundations of Statistical Natural Language Processing. I should have read it a long time ago, so I've decided to read cover-to-cover. It's actually pretty damn hard to understand :S

Collocation is a set of words that seems to happen together, though we usually focus on only 2 words. Word collocation can have totally different meaning from the 2 words it's composed from. Or it can have the similar meaning, e.g. strong tea. There are some terminology around different types of collocation. It's not really critical to me, so I'm not gonna write about it.

Collocation can be 2 words next to each other or 2 words with some words in between. That's why it's a bit hard to detect collocation.

Here are some approaches to identifying collocation:

Word distance as normal distribution

The first method offered in the book is to compute the distance between a pair of two words within a 12-word span (Please note here that we just need to choose some arbitrary limit because using 100-word span would require very costly computation). For any pair, we can compute the mean and the standard deviation. Intuitively, any pair that have low standard deviation and high mean is likely to be a collocation.

The caveat in this method is that, if the 2 words independently occur very often, they likely to occur together. The example is “new companies”. We should consider the frequency of each word independently as well. This brings us to the next approach.

The t test

This is one of the hypothesis testing methods. Basically, we have some samples and we have our prior belief, which is the null hypothesis. And we want to know how likely it is that the samples are in accordance with our prior belief.

First, we define the null hypothesis (that we want to ultimately reject):

There is no association between 2 words, w1 and w2. Therefore, P(w1 w2) is P(w1) P(w2), and its distribution is Bernoulli Trial

From a corpus with N words, we can compute P(w1) with c1/N, where c1 is the count of w1 in the corpus. P(w2) can be computed in the same way.

I want to note that it makes sense to assume Bernoulli Trial distribution (or binomial distribution, same thing) because we assume no prior knowledge. Say, if we independently random a pair N times, then yeah it will be the Bernoulli Trial distribution

Ok, so the t statistics is computed with this formula: t = (x - u) / sqrt(s^2 / N), where u is the mean of the null hypothesis, which is P(w1) P(w2). x and s^2 is the mean and the variance from the sample data, respectively. And yes, the t test just assumes that the sample and the null hypothesis are in the normal distribution.

We already have u, but we don't have x and s^2. Now this is a very important part because we make a bunch of assumption that will make this shit works. Otherwise, we wouldn't be able to fill the value of all parameters. I guess it's also where everything breaks.

The mean x from the sample is (the number of interesting pairs / the number of all pairs). The variance from the sample is x (1-x) since we assume Bernoulli Trial distribution for the sample.

Please note that (1) the t test only works with the normal distribution and (2) Binomial distribution can be thought of as the normal distribution with mean = p = P(success) and s^2 = p(1-p). However, it's not that good when p is really low or really high.

One more thing to note is that, to find the mean of a distribution, we use the method named Maximum likelihood estimation. But it's just a fancy name really. Basically, we find a probability model that best explains the data. For example, if we toss a coin 10 times and yield head 10 times, then, assuming Bernoulli Trial, it makes more sense that the coin has P(head)=1.0 than having P(head)=0.5.

And there you go. You have all the parameters to compute the t statistic, and you can reject the null hypothesis with error probability.

The caveat of this approach is that it assumes the normal distribution. It doesn't work well with the reality though. So, this brings us to the next approach.

Chi-Square test

Chi-square test is more appropriate for this task because the null hypothesis is around independence. Please note that chi-square test is more specific that it only tests for independence. But, for the t test, the null hypothesis can be just about anything.

When the task is like that, it doesn't need to assume any distribution. It basically compares the probability of 2 words happening together against the probability of each word happening independently.

For 2-word collocation, we can formulate the problem into 2x2 tables: (w1, ~w1) x (w2, ~w2)

The X^2 statistic (called chi-square) is sigma(i,j) of ((O(i,j) - E(i,j))^2)/E(i,j).

O(i,j) is the actual count from the sample data, while E(i,j) is the expected count that assumes the independence.

For example, O(1,1) is the count of w1w2 in the sample data; E(i,j) is the expected count, which is P(w1) x P(w2) x N. We can use maximum likelihood estimate for P(w1) and P(w2), which is the number of w1 / N and the number of w2 / N.

i=1,j=2 is for the pair w1 ~w2… so you can compute everything as above.

After you get the value X^2, you can look up the table to see if you can reject the null hypothesis or not.

The caveat of the chi-square test is very hard to understand; It turns out that, if the count in any cell is less than 20, it won't be very good. And you bet that a collocation doesn't happen very often, so the count is usually low.

Likelihood ratio

The likelihood ratio seems to be the best one among the approaches offered here.

The formulation is a bit different. I still cannot understand why the formulations are different from other approaches above. However, it still test the same hypothesis.

The likelihood ratio requires 2 hypotheses to be tested. So, let's formulate 2 hypotheses:

  • Hypothesis 1: P(w2|w1) = p = P(w2|~w1) — saying that w2 and w1 happen independently
  • Hypothesis 2: P(w2|w1) = p1 and P(w2|~w1) = p2 and p1 != p2 — saying that w1 causes w2

Let's throw in some notation: c1, c2, and c12 are for the count of w1, w2 ,and w2w1.


  • p = c2 / N — since we claim that w2 happens independently
  • p1 = c12 / c1 — according to Bayes, P(w2|w1) = P(w2 intersect w1) / P(w1)
  • p2 = (c2 - c12) / (N - c1) — according to Bayes as well, the number of w2 that is not collocated with w1 divided by the number of ~w1.

Now we should estimate the likelihood of each theory. Again, we assume Bernoulli Trial (or Binomial Distribution); It kind of makes sense if we assume all words happen independently. This can also be where the approach breaks.

The likelihood is a weird concept.For H1, we assume that P(w2|w1) = P(w2|~w1) = p. Assuming Binomial Distribution, we use p to compute the probability that c1, c2, and c12 will happen. That probability is called the likelihood. It's the probability that the parameter p is true given the data.

The likelihood of Hypothesis 1 L(H1) consists of 2 parts:

  1. The likelihood L(w2|w1), which is C(c1, c12) p^c12 (1-p)^(c1-c12) where C(n,k) means the number of ways of choosing k things out of n things.
  2. The likelihood L(w2|~w1), which is `C(c2 - c12, N - c1) p(c2-c12) (1-p)(N-c1-c1+c12)

From the book, L(H1) is L(w2|w1) * L(w2|~w1). L(H2) is computed in the same way with using p1 and p2 instead.

I'm not sure why we choose the product over other settings. The product of L(w2|w1) and `L(w2|~w1) doesn't have a meaning. It can mean we sample w2 twice, but why would we sample it twice? Nevertheless, using sum has a caveat that L(H1) might be more than 1.

The product will have this effect; For example, 0.3 * 0.3 > 0.4 * 0.2, but 0.3 + 0.3 = 0.4 + 0.2. It prefers the 2 constutients to be similar.

Anyway, we will choose H1 if L(H1) > L(H2).

There are 3 flaws that I can see:

  1. Our likelihood is only comes from maximising the probability of the product of 2 likelihoods. It just has no meanings. —- Edit. When we want to maximising 2 values, we maximise its rectangular area geometrically instead; OK, that probably better than a sum, which has no meaning geometrically. — Man, this is just human finding meaning in some other random things and saying that it is sensible.
  2. Why would we only consider L(w2|w1) and L(w2|~w1)? Should we choose L(H1) that maximising the whole text? For example, maybe we should incorporate L(~w2|~w1) an L(~w2|w1). That would be a great exercise for me to incorporate it.
  3. Using Bernoulli Trial is weird. It simply assumes that we generate each bigram independently. That simply not true. For example, if we have rule out, we wouldn't add another rule out after it; That just doesn't make sense. However, building a more complex probability model might be too difficult and gain too little… So, Bernoulli Trial is good enough.