Collocation Extraction by dainiusjocas

Statistics of the project

Numbers found in Brown Corpus

Words                   1008320
Words with type "NN"    168026
Words with type "JJ"    68728
Words with type "VB"    33985
Bigrams found           300340

10 most frequent bigrams

Bigram              Count
af/NN_af/NN         299
high/JJ_school/NN   78
world/NN_war/NN     69
white/JJ_house/NN   68
fiscal/JJ_year/NN   67
old/JJ_man/NN       67
peace/NN_corps/NN   57
long/JJ_time/NN     50
young/JJ_man/NN     50
time/NN_time/NN     47

You can see more bigrams that are frequent here.

First of all, the results were not cleaned, because cleaning was not a requirement, so they are very raw. The most frequent bigram is “af/NN_af/NN”, which is strange because it has no meaning, but the word “Af” occurs very often in the corpus file. You can check that the corpus contains this word with the UNIX shell utility grep: typing “grep -i ">af<" [data_file]” in the terminal prints each line where the word occurs. The other most frequent bigrams look good and are likely to be true collocations. In conclusion, apart from noise such as “af”, the results of simple frequency counting look reliable.
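The simple frequency counting discussed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `count_bigrams`, the input format (a list of `(word, tag)` pairs), and the choice to keep only adjacent JJ+NN and NN+NN pairs are all assumptions made for the example. It also ignores sentence boundaries.

```python
from collections import Counter


def count_bigrams(tagged_words, patterns=(("JJ", "NN"), ("NN", "NN"))):
    """Count adjacent word pairs whose tags match one of `patterns`.

    `tagged_words` is a list of (word, tag) pairs. The default patterns
    are an assumption for this sketch, not taken from the project.
    """
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_words, tagged_words[1:]):
        if (t1, t2) in patterns:
            # Same key layout as in the tables: word/TAG_word/TAG, lowercased.
            counts[f"{w1.lower()}/{t1}_{w2.lower()}/{t2}"] += 1
    return counts


# Toy input (not real corpus data).
tokens = [
    ("The", "AT"), ("high", "JJ"), ("school", "NN"),
    ("Af", "NN"), ("Af", "NN"), ("high", "JJ"), ("school", "NN"),
]
print(count_bigrams(tokens).most_common(1))  # [('high/JJ_school/NN', 2)]
```

Without any cleaning step, a noise token such as "Af" is counted like any other word, which is how a bigram like af/NN_af/NN can reach the top of the list.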

10 bigrams with largest chi-square value

Bigram              Chi-square value
new/JJ_time/NN      13.365136
new/JJ_af/NN        11.17747
new/JJ_man/NN       9.893212
new/JJ_state/NN     9.193513
time/NN_time/NN     9.18815
such/JJ_time/NN     8.548816
time/NN_af/NN       7.684192
af/NN_time/NN       7.684192
new/JJ_way/NN       7.333554
such/JJ_af/NN       7.15

You can see more bigrams that have a large chi-square value here.

To compute chi-square values for bigrams, this formula was used:

chi_square_value = (head * tail) / total_number_of_bigrams
where head is the number of occurrences of the specific head in all bigrams, and tail is the number of occurrences of the specific tail in all bigrams.

The chi-square value here works like the expected count of a specific bigram: you multiply the probability of finding the head by the probability of finding the tail in the bigram list, and then by the total number of bigrams. When you have both the chi-square value and the simple frequency, you can investigate whether a bigram is more than just a coincidence.
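As a rough sketch, the formula above can be computed from a table of bigram counts like this. The function name `expected_counts` and the input layout (a dict keyed by `(head, tail)` pairs) are hypothetical, and the sketch assumes that head and tail are counted over bigram occurrences rather than distinct bigrams, which the text does not state explicitly.

```python
from collections import Counter


def expected_counts(bigram_counts):
    """Return head * tail / total for every bigram.

    `bigram_counts` maps (head, tail) -> observed count.
    Assumption: head and tail totals are summed over occurrences.
    """
    total = sum(bigram_counts.values())
    head_totals = Counter()
    tail_totals = Counter()
    for (head, tail), count in bigram_counts.items():
        head_totals[head] += count
        tail_totals[tail] += count
    return {
        (head, tail): head_totals[head] * tail_totals[tail] / total
        for (head, tail) in bigram_counts
    }


# Toy counts (not real corpus data): 6 bigram occurrences in total.
toy = {
    ("high/JJ", "school/NN"): 3,
    ("high/JJ", "time/NN"): 1,
    ("long/JJ", "time/NN"): 2,
}
values = expected_counts(toy)
print(values[("long/JJ", "time/NN")])  # 2 * 3 / 6 = 1.0
```

Because the value depends only on how often the head and the tail occur overall, two bigrams with swapped words can share the same value, as time/NN_af/NN and af/NN_time/NN do in the table.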

At first glance, the table of chi-square values shows that bigrams with words like "new" and "time" must be very frequent before they can be judged collocations, because they have really large chi-square values.
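One possible way to make this comparison concrete is to divide the observed count of a bigram by its chi-square (expected) value. The ratio and the helper name `observed_to_expected` are assumptions for illustration, not part of the project. The numbers come from the two tables on this page for time/NN_time/NN, the only bigram that appears in both.

```python
def observed_to_expected(observed, expected):
    """Ratio above 1 means the bigram occurs more often than the formula predicts."""
    return observed / expected


# time/NN_time/NN: observed count 47, chi-square (expected) value 9.18815.
ratio = observed_to_expected(47, 9.18815)
print(round(ratio, 2))
```

A ratio well above 1 suggests the pair occurs together more often than the head and tail frequencies alone would explain.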
