Collocation Extraction by dainiusjocas

Instructions and Requirements

TOPIC: "Collocation Extraction"

PAPER: Dekang Lin, "Extracting Collocations from Text Corpora".

PROJECT: Investigate various methods of extracting collocations from a corpus and finding relations between them.

Steps:

  1. Download the Brown Corpus.
  2. Search for contiguous collocations of length 2 (bigrams), and accept only collocations that match one of the following part-of-speech patterns: NN_NN, JJ_NN, VB_NN.
    1. Simple frequency counting: For each collocation, keep track of its count and the location within the corpus of every occurrence. Sort the collocations by frequency count, and take the top 1000.
    2. Chi-square ranking: Count the collocations as in the simple frequency method, then compute the chi-square value of each collocation and order the collocations by decreasing chi-square value.
  3. Look for similar collocations:
    1. Take the list of the top N collocations found by simple frequency counting (step 2.1), choose two of them as targets, and measure the correlation between each target and the rest.
    2. For each collocation, take 200 words of the context around each occurrence and obtain counts of the words occurring in the context window of that collocation.
    3. Measure the similarity between each of these distributions and the distribution of the target collocation using one of the following metrics: Manhattan distance, Euclidean distance, Cosine, or Information Radius.
    4. Report the top 10 "closest" collocations for each target collocation.
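
The steps above can be sketched in a few functions. This is a minimal illustration under several assumptions, not the project's actual implementation: the input is a list of (word, tag) pairs, tags are matched exactly against the three patterns, the chi-square statistic is the standard Pearson form over a 2x2 table of adjacent word pairs, and the "200 words of context" are read as 100 tokens on each side of the bigram. All function names and the toy text are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Accepted (first tag, second tag) patterns from step 2. Exact tag matching
# is an assumption; variants such as NNS or VBD are not matched here.
PATTERNS = {("NN", "NN"), ("JJ", "NN"), ("VB", "NN")}


def extract_bigrams(tagged):
    """Map each accepted bigram to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        if (t1, t2) in PATTERNS:
            positions[(w1.lower(), w2.lower())].append(i)
    return dict(positions)


def chi_square(bigram, words):
    """Pearson chi-square from the 2x2 table of adjacent word pairs."""
    w1, w2 = bigram
    pairs = list(zip(words, words[1:]))
    n = len(pairs)
    o11 = sum(1 for a, b in pairs if a == w1 and b == w2)
    o12 = sum(1 for a, b in pairs if a == w1 and b != w2)
    o21 = sum(1 for a, b in pairs if a != w1 and b == w2)
    o22 = n - o11 - o12 - o21
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0


def context_counts(starts, words, window=200):
    """Count words within window // 2 tokens on each side of every occurrence."""
    half = window // 2
    counts = Counter()
    for i in starts:
        counts.update(words[max(0, i - half):i])   # left context
        counts.update(words[i + 2:i + 2 + half])   # right context, after the bigram
    return counts


def manhattan(p, q):
    return sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))


def euclidean(p, q):
    return math.sqrt(sum((p.get(k, 0) - q.get(k, 0)) ** 2 for k in set(p) | set(q)))


def cosine(p, q):
    """Cosine similarity (larger means closer, unlike the distances above)."""
    dot = sum(p[k] * q[k] for k in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def information_radius(p, q):
    """IRad = D(p || m) + D(q || m) with m = (p + q) / 2; inputs must be non-empty."""
    tp, tq = sum(p.values()), sum(q.values())
    total = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
        mk = (pk + qk) / 2
        if pk:
            total += pk * math.log2(pk / mk)
        if qk:
            total += qk * math.log2(qk / mk)
    return total


if __name__ == "__main__":
    # Toy tagged text, invented for illustration (not taken from the Brown Corpus).
    tagged = [("the", "AT"), ("stock", "NN"), ("market", "NN"), ("fell", "VBD"),
              ("while", "CS"), ("the", "AT"), ("stock", "NN"), ("market", "NN"),
              ("rose", "VBD"), ("in", "IN"), ("heavy", "JJ"), ("rain", "NN")]
    words = [w.lower() for w, _ in tagged]
    found = extract_bigrams(tagged)
    by_freq = sorted(found, key=lambda b: len(found[b]), reverse=True)[:1000]
    by_chi2 = sorted(found, key=lambda b: chi_square(b, words), reverse=True)
    contexts = {b: context_counts(found[b], words, window=4) for b in found}
    target = by_freq[0]
    closest = sorted((b for b in contexts if b != target),
                     key=lambda b: manhattan(contexts[target], contexts[b]))[:10]
    print(by_freq, by_chi2, closest)
```

For the real corpus, one option is to feed `extract_bigrams` the (word, tag) pairs that NLTK exposes as `nltk.corpus.brown.tagged_words()`, assuming NLTK and its Brown data are installed. Note that `chi_square` rescans the corpus for every bigram, which is fine for a sketch but would need precomputed counts at full scale, and that ranking by `cosine` should sort in descending order because it is a similarity rather than a distance.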
