| Statistic | Count |
|---|---|
| Words | 1008320 |
| Words with type "NN" | 168026 |
| Words with type "JJ" | 68728 |
| Words with type "VB" | 33985 |
| Bigrams found | 300340 |

| Bigram | Count |
|---|---|
| af/NN_af/NN | 299 |
| high/JJ_school/NN | 78 |
| world/NN_war/NN | 69 |
| white/JJ_house/NN | 68 |
| fiscal/JJ_year/NN | 67 |
| old/JJ_man/NN | 67 |
| peace/NN_corps/NN | 57 |
| long/JJ_time/NN | 50 |
| young/JJ_man/NN | 50 |
| time/NN_time/NN | 47 |
You can see more bigrams that are frequent here.
First of all, note that the results were not cleaned, because there was no such requirement, so they are very raw. The most frequent bigram is “af/NN_af/NN”, which is strange because it has no meaning, but the word “Af” occurs very often in the corpus file. We can check whether the corpus really contains this word by running the UNIX shell utility grep: typing grep -i ">af<" [data_file] in the terminal prints each line where the word occurs. The other most frequent bigrams look good and are likely to be true collocations. In conclusion, the results of simple frequency counting are reliable.
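The raw frequency counting discussed above can be sketched roughly as follows. This is only an illustration, not the code that produced the table: the input format (a list of (word, tag) pairs), the function name count_bigrams, and the rule that both words must carry one of the listed tags are all assumptions.

```python
from collections import Counter

def count_bigrams(tagged_tokens, allowed_tags=("NN", "JJ", "VB")):
    """Count adjacent word pairs whose tags are both in allowed_tags.

    tagged_tokens is an assumed input format: a list of (word, tag) pairs.
    Keys look like "high/JJ_school/NN", mirroring the table above.
    """
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in allowed_tags and t2 in allowed_tags:
            counts[f"{w1.lower()}/{t1}_{w2.lower()}/{t2}"] += 1
    return counts

# Toy input, not taken from the corpus.
sample = [("High", "JJ"), ("school", "NN"), ("is", "VB"), ("a", "AT"),
          ("high", "JJ"), ("school", "NN")]
top = count_bigrams(sample).most_common(2)
```

No cleaning is applied here either, so a noise token such as "af" would be counted like any other word.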
| Bigram | Chi-square value |
|---|---|
| new/JJ_time/NN | 13.365136 |
| new/JJ_af/NN | 11.17747 |
| new/JJ_man/NN | 9.893212 |
| new/JJ_state/NN | 9.193513 |
| time/NN_time/NN | 9.18815 |
| such/JJ_time/NN | 8.548816 |
| time/NN_af/NN | 7.684192 |
| af/NN_time/NN | 7.684192 |
| new/JJ_way/NN | 7.333554 |
| such/JJ_af/NN | 7.15 |
You can see more bigrams that have a large chi-square value here.
To compute chi-square values for bigrams, this formula was used:
chi_square_value = (head * tail) / total_number_of_bigrams
where head is the number of occurrences of a specific head in all bigrams, and tail is the number of occurrences of a specific tail in all bigrams.
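The chi-square value is like the expected count of a specific bigram, because you multiply the probability of finding the head by the probability of finding the tail in the bigram list, and then by the total number of bigrams: (head / N) * (tail / N) * N = (head * tail) / N, where N is the total number of bigrams. When you have both the chi-square value and the simple frequency, you can investigate whether a bigram is more than just a coincidence.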
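As a small worked sketch of this computation, the snippet below applies (head * tail) / total_number_of_bigrams to a handful of invented counts. The function name expected_counts and the dictionary representation are hypothetical, and it assumes that head and tail are counted over bigram occurrences (tokens), which the text does not state explicitly.

```python
from collections import Counter

def expected_counts(bigram_counts):
    """Apply (head * tail) / total_number_of_bigrams to every bigram.

    bigram_counts maps (head, tail) pairs to observed frequencies
    (an assumed representation, chosen for this illustration).
    """
    total = sum(bigram_counts.values())
    head_totals, tail_totals = Counter(), Counter()
    for (head, tail), n in bigram_counts.items():
        head_totals[head] += n   # occurrences of this head in all bigrams
        tail_totals[tail] += n   # occurrences of this tail in all bigrams
    return {
        (head, tail): head_totals[head] * tail_totals[tail] / total
        for (head, tail) in bigram_counts
    }

# Toy counts, invented for the example.
observed = {
    ("high/JJ", "school/NN"): 6,
    ("new/JJ", "school/NN"): 2,
    ("new/JJ", "time/NN"): 12,
}
expected = expected_counts(observed)
```

With these toy numbers, "high school" is observed 6 times against an expected 6 * 8 / 20 = 2.4, while "new school" is observed 2 times against an expected 14 * 8 / 20 = 5.6, so only the first looks like more than a coincidence.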
At first glance, the table of chi-square values shows that bigrams containing words like "new" and "time" have to be very frequent in order to count as collocations, because their chi-square values are really large.