Instructions and Requirements
TOPIC: "Collocation Extraction"
PAPER: Dekang Lin, "Extracting Collocations from Text Corpora".
PROJECT: Investigate various methods of extracting collocations from a corpus and finding relations between them.
Steps:
- Download the Brown Corpus.
- Search for contiguous two-word collocations (bigrams), and accept only those whose part-of-speech tags match one of the following patterns: NN_NN, JJ_NN, VB_NN.
- Simple frequency counting: For each collocation, keep track of its count and the location within the corpus of every occurrence. Sort the collocations by frequency count, and take the top 1000.
- Chi-square ranking: Proceed as in the simple frequency count, but at the end compute the chi-square value of each collocation, and order the collocations by decreasing chi-square value.
- Look for similar collocations:
- Take the list of the top N collocations found by simple frequency counting (2a), choose two of them as targets, and measure how strongly each target correlates with the rest.
- For each collocation, take 200 words of the context around each occurrence and obtain counts of the words occurring in the context window of that collocation.
- Measure the similarity between each of these context distributions and that of the target collocation using one of the following metrics: Manhattan distance, Euclidean distance, cosine similarity, or Information Radius.
- Report the top 10 "closest" collocations for each target collocation.
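
The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution. It assumes the corpus is already loaded as a plain list of (word, tag) pairs (for example, built with list() from NLTK's nltk.corpus.brown.tagged_words()), that tags must match NN, JJ, and VB exactly (so variants such as NNS are skipped), that "200 words of context" means 100 words on each side of the bigram, and that cosine is the chosen metric. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Accepted tag patterns from the assignment: NN_NN, JJ_NN, VB_NN.
# Assumption: exact tag match, so Brown variants such as NNS are skipped.
PATTERNS = {("NN", "NN"), ("JJ", "NN"), ("VB", "NN")}


def find_collocations(tagged):
    """Map each accepted bigram to the list of corpus positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if (t1, t2) in PATTERNS:
            positions[(w1.lower(), w2.lower())].append(i)
    return positions


def top_by_frequency(positions, n=1000):
    """Return the n most frequent bigrams, most frequent first."""
    return sorted(positions, key=lambda b: len(positions[b]), reverse=True)[:n]


def chi_square_scores(positions, tagged):
    """Pearson chi-square for each bigram from a 2x2 table over adjacent pairs.

    Simplification: occurrences of the same word pair with other tags
    are counted as non-collocation pairs.
    """
    words = [w.lower() for w, _ in tagged]
    n = len(words) - 1                 # number of adjacent word pairs
    first = Counter(words[:-1])        # how often a word opens a pair
    second = Counter(words[1:])        # how often a word closes a pair
    scores = {}
    for (w1, w2), pos in positions.items():
        o11 = len(pos)                 # w1 followed by w2
        o12 = first[w1] - o11          # w1 followed by another word
        o21 = second[w2] - o11         # another word followed by w2
        o22 = n - o11 - o12 - o21      # neither
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
    return scores


def context_counts(pos_list, tagged, window=200):
    """Count words in the window around every occurrence of one bigram.

    Assumption: the window is split evenly, window // 2 words per side.
    """
    half = window // 2
    counts = Counter()
    for i in pos_list:
        left = tagged[max(0, i - half):i]
        right = tagged[i + 2:i + 2 + half]
        counts.update(w.lower() for w, _ in left)
        counts.update(w.lower() for w, _ in right)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def closest(target, candidates, positions, tagged, k=10):
    """Return the k candidates whose context distribution is most similar to target's."""
    tvec = context_counts(positions[target], tagged)
    sims = [(cosine(tvec, context_counts(positions[c], tagged)), c)
            for c in candidates if c != target]
    return sorted(sims, reverse=True)[:k]
```

A typical run would call find_collocations once, pass its result to top_by_frequency and chi_square_scores to obtain the two rankings, and then call closest for each of the two chosen targets with the top-N list as candidates. Swapping cosine for Manhattan distance, Euclidean distance, or Information Radius only requires replacing the cosine function and, for the distance measures, sorting in ascending rather than descending order.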