lmgrep - Lucene Based grep-like Utility

TL;DR
What if `grep` supported the functionality of a proper search engine like Elasticsearch, without the need to install any servers or index the files before searching? `lmgrep` aims to provide you exactly that. It is installed as a single executable file without any dependencies, provides a command-line interface, starts up instantly, and works on macOS, Linux, and, yes, even Windows.
See the source code here.
My motivation
Have you ever wished that `grep` supported tokenization, stemming, etc., so that you don't have to write wildcard regular expressions all the time? I've shared that wish, and one fine day I tried to scratch that itch by exposing the Lucene query syntax as a CLI utility. `lmgrep` is the result of my effort. Give it a try and let me know how it goes.
Full-text Search vs. `grep`
I’m perfectly aware that comparing Lucene and `grep` is like comparing apples to oranges. However, I think that `lmgrep` is best compared with the very tool that inspired it, namely `grep`.
Anyway, what does `grep` do? `grep` reads a line from `stdin`, examines the line to see if it should be forwarded to `stdout`, and repeats until `stdin` is exhausted¹. `lmgrep` tries to mimic exactly that functionality. Of course, `grep` has many more options, but that is the essence of the tool.
Several notable advantages of `lmgrep` over `grep`:
- Lucene query syntax is better suited for full-text search;
- Boolean operators allow constructing complex, well-designed queries;
- Text analysis can be customized to the language of the documents;
- Fuzzy text searches;
- A flexible text analysis pipeline that includes lowercasing, ASCII-folding, stemming, etc.;
- Regular expressions can be combined with other Lucene query components;
- Search matches can span multiple lines, i.e. the search is not line-oriented.
Several notable limitations of `lmgrep` when compared to `grep`:
- `grep` is faster when it comes to raw speed for large text files;
- `grep` has a smaller memory footprint;
- Not all options of `grep` are supported.
Why Lucene?
Lucene is a Java library that provides indexing and search features. Lucene has been in development for more than 20 years, and it is the library that powers many search applications. Also, many developers are already familiar with the Lucene query syntax and know how to leverage it to solve complicated information retrieval problems.
However powerful Lucene is, it is not well-suited for a CLI application. The main problem is the startup time of the JVM. To reduce the startup time, I compiled `lmgrep` with the `native-image` tool provided by GraalVM. This way, the startup time is around 0.01s on Linux, macOS, and Windows.
How does `lmgrep` work?
`lmgrep` by default expects two parameters: a search query and a GLOB pattern (similar to a regexp) that selects the files to search. I assume that the dear reader doesn’t want to be tortured with an explanation of how file names are matched against the GLOB, so I’ll skip it. Instead, I’ll focus on explaining how the search works within a file.
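For example, an invocation could look like `lmgrep "test" "**/*.txt"`: the first positional argument is the Lucene query and the second is the GLOB for file discovery (the exact options are documented in the README; this example is only illustrative).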
`lmgrep` creates a Lucene Monitor (Monitor) object from the provided search query. Then the text file is split into lines². Each line of text is passed to the Monitor for searching. The Monitor creates an in-memory Lucene index with a single document constructed from the line of text, and then runs the search query on that in-memory index in the good ol' Lucene way³. `lmgrep` takes the hits, formats them, and sends the results to `STDOUT`. That is how `lmgrep` does the full-text search.
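To make that loop concrete, here is a minimal sketch written directly against the `lucene-monitor` Java API (this is not `lmgrep`'s actual source; the query string, field name, and file path are illustrative assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.monitor.MatchingQueries;
import org.apache.lucene.monitor.Monitor;
import org.apache.lucene.monitor.MonitorQuery;
import org.apache.lucene.monitor.QueryMatch;
import org.apache.lucene.queryparser.classic.QueryParser;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;

public class LineMonitorSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // The query is parsed once and reused for every line of every file.
        MonitorQuery query = new MonitorQuery(
                "q0", new QueryParser("text", analyzer).parse("lucene AND monitor"));
        try (Monitor monitor = new Monitor(analyzer);
             BufferedReader reader = Files.newBufferedReader(Path.of("input.txt"))) {
            monitor.register(query);
            String line;
            int lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                // Each line becomes a single-document, in-memory index.
                Document doc = new Document();
                doc.add(new TextField("text", line, Field.Store.NO));
                MatchingQueries<QueryMatch> matches =
                        monitor.match(doc, QueryMatch.SIMPLE_MATCHER);
                if (matches.getMatchCount() > 0) {
                    System.out.println(lineNumber + ":" + line);
                }
            }
        }
    }
}
```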
The overall searching approach is similar to that of the Percolator in Elasticsearch. `lmgrep` just limits the number of stored search queries to one and treats every text line as a document. A cool thing compared with the Percolator is that `lmgrep` provides the exact offsets of the matched terms, while Elasticsearch does not expose offsets when highlighting.
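As a sketch of where those offsets come from: the Monitor can be asked for highlights instead of plain matches. The snippet below continues the example above (same `monitor` and `doc`) and assumes the `HighlightsMatch` matcher from the `lucene-monitor` module:

```java
// Continues the sketch above (add: import org.apache.lucene.monitor.HighlightsMatch;).
// Prints [startOffset, endOffset) for each matched term in the line.
MatchingQueries<HighlightsMatch> highlights =
        monitor.match(doc, HighlightsMatch.MATCHER);
for (HighlightsMatch match : highlights.getMatches()) {
    for (String field : match.getFields()) {
        for (HighlightsMatch.Hit hit : match.getHits(field)) {
            System.out.println(field + ": [" + hit.startOffset + ", " + hit.endOffset + ")");
        }
    }
}
```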
The described procedure may seem somewhat inefficient. However, the query parsing for all the lines (and files) is done only once. Also, the searching itself is efficient thanks to Lucene in general and, when queries are complicated, thanks to the Presearcher of the Lucene Monitor in particular: the Presearcher extracts terms from the search query, and if none of these terms are present in the index, the full query is not executed at all. Of course, many optimizations can be (and will be) implemented for `lmgrep`, such as batching of the documents. In general, the performance is limited by the Lucene Monitor.
What about the text analysis pipeline? By default, `lmgrep` uses the `StandardTokenizer` to tokenize text. Then the tokens are passed through several token filters in the following order: `LowerCaseFilter`, `ASCIIFoldingFilter`, and `SnowballFilter`, which is given the `EnglishStemmer`. The same analysis pipeline is used for both indexing and querying. All the components of the analysis pipeline are configurable via CLI flags; see the README. However, the order of the token filters is, as of now, not configurable. Moreover, various filters are not exposed at all (e.g. the stop words filter, `WordDelimiterGraphFilter`, etc.). Supporting a more flexible analysis pipeline configuration is left for future releases. The more users the tool has, the faster new features will be implemented ;)
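For illustration, the default chain described above corresponds to a hand-built Lucene `Analyzer` roughly like the following sketch (an approximation, not `lmgrep`'s actual implementation):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

public class DefaultPipelineSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer source = new StandardTokenizer();
                TokenStream stream = new LowerCaseFilter(source);           // lowercasing
                stream = new ASCIIFoldingFilter(stream);                    // ASCII folding
                stream = new SnowballFilter(stream, new EnglishStemmer());  // stemming
                return new TokenStreamComponents(source, stream);
            }
        };
        // "Grépped" should come out as "grep": lowercased, ASCII-folded, stemmed.
        try (TokenStream ts = analyzer.tokenStream("text", "Grépped FILES")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term);
            }
            ts.end();
        }
    }
}
```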
Prehistory of `lmgrep`
Almost every NLP project that I’ve worked on had a component called a dictionary annotator. Also, the vast majority of the projects used Elasticsearch in one way or another. The more familiar I got with Elasticsearch, the more of my NLP workload shifted towards implementing it inside Elasticsearch. One day I discovered a tool called Luwak (a cool name, isn’t it?) and read more about it. It kind of opened my eyes: the dictionary annotator can be implemented using Elasticsearch, and the dictionary entries can be expressed as Elasticsearch queries. Thankfully, Elasticsearch has the Percolator that hides all the complexity of managing temporary indices, batching search requests, etc.
Then I was given an NLP project where one of the requirements was to implement data analysis using the AWS serverless stack: Lambda for text processing and DynamoDB for storage. Of course, one of the required NLP components was a dictionary annotator. Since Elasticsearch was not available (because it is not serverless) and I still wanted to continue working with dictionary entries as search queries, I decided to leverage the Luwak library. From the experiences of that project, the Beagle library was born. `lmgrep` is loosely based on Beagle.
When thinking about how to implement `lmgrep`, I wanted it to be based on Lucene because of the full-text search features. To provide a good experience, the start-up time had to be small. To achieve that, `lmgrep` had to be compiled with the `native-image` tool of the GraalVM. I tried, but `native-image` doesn’t support the Method Handles that Lucene uses. Some more hacking was needed. I was lucky to discover a toy project where a blog search was implemented on AWS Lambda, backed by Lucene compiled with the `native-image` tool. I cloned the repo, ran `mvnw install`, included the artefacts in the dependencies list, and `lmgrep` compiled with the `native-image` tool successfully.
Then the most complicated part was to prepare executable binaries for different operating systems. Plenty of CPU, RAM, VirtualBox with Windows and macOS virtual machines, and here we go.
Did I say how much I enjoyed trying to get stuff done on Windows? Not at all. How come multiple different(!) command prompts are needed to get GraalVM to compile an executable? Now I know that it would have been a lot better to suffer the pain and set up a GitHub Actions pipeline to compile the binaries and upload them to the release pages.
What is missing?
- The analysis pipeline is not as flexible as I’d like (UPDATE 2021-04-24: implemented);
- Leverage the multicore CPUs by executing the search in parallel;
- Batch documents for matching;
- Let me know if anything else is missing!
What would be cool ways to use `lmgrep`?
- `tail` logs to `lmgrep` and raise alerts;
- Give an alias for `lmgrep` with various options tailored for code search (Java Example);
- Why not expose a sci script as a TokenFilter?
- Why not an ngrams token filter, so that the search would be resilient to typing errors?
- Static website search, e.g. an AWS Lambda that runs `lmgrep` over all the files on demand, without upfront indexing.
Summary
`lmgrep` scratched my itch. It was exciting to get it working. I hope that you’ll also find it interesting and maybe useful. Give it a try, let me know how it goes, and, most importantly, any feedback on how to improve `lmgrep` is welcome.
Footnotes
1. https://ideolalia.com/essays/composition-is-interpretation.html
2. There is no necessity to split text files into lines; it is done just to mimic how `grep` operates.
3. Of course, the description is over-simplified, but it is accurate enough to convey the overall idea.