At first glance, the easiest solution seemed to be writing a Ruby script to get the required results. However, with half of the project done, Ruby showed unacceptable performance when working with large XML files: processing the Brown Corpus, which is 27 MB, used up 4 GB of RAM, and the script usually stopped responding. The problem was solved by switching to the XML parsing engine provided by Java, which performed reasonably well.
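To illustrate why the choice of parser matters for large files, here is a minimal sketch of streaming XML parsing in Java. It assumes the StAX API (`javax.xml.stream`), which may or may not be the engine this project actually uses; the class name `StreamingElementCounter`, the method `countElements`, and the `<w>`/`<c>` element names in the sample are hypothetical and chosen only for illustration. A streaming reader visits one event at a time and never builds the whole document tree in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the project: counts elements with a given
// name while reading the XML as a stream of events.
class StreamingElementCounter {

    static int countElements(InputStream in, String elementName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        int count = 0;
        try {
            // Pull one parsing event at a time; no document tree is kept.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Tiny made-up sample; the real corpus markup may differ.
        String xml = "<s><w>The</w><w>jury</w><c>.</c></s>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "w")); // prints 2
    }
}
```

For a real corpus file, the `ByteArrayInputStream` would be replaced by a `FileInputStream`; memory use then depends mostly on what the counting logic itself stores, not on the size of the file.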
I found it difficult to validate whether the results are reliable.
Creative Commons Attribution 3.0 License
dj (dainiusjocas@gmail.com)
Dainius Jocas (dainiusjocas@gmail.com)
You can download this project in either zip or tar formats.
You can also clone the project with Git by running:
$ git clone git://github.com/dainiusjocas/Computational-Linguistics