Apache Lucene is a high-performance, full-featured text search engine library written in Java.
PIM-Lucene is a project to create an extension of Lucene that offloads specific queries to UPMEM's PIM (Processing In Memory) hardware.
UPMEM is a French company offering a PIM product that can accelerate data-intensive applications. The PIM hardware is a DIMM module in which each memory chip embeds small processors with fast access to the memory bank. More information about UPMEM is available on the company website and in the UPMEM SDK documentation.
Our goal is to create a non-intrusive extension of the Lucene code base, providing an option to use PIM for specific queries (or part of queries) without impacting Lucene's performance or functionality. When using the PIM extension, the standard Lucene index is created but a new index specific to PIM is also created and stored in the PIM system. A PimIndexWriter object is the new interface for writing the Lucene index augmented with the PIM index.
The first query being ported to PIM is the phrase query. A PimPhraseQuery object can be used in place of a PhraseQuery object in order to use PIM to execute the query. When using a PimPhraseQuery, the system may or may not execute the query using PIM (e.g., depending on the PIM system availability, the PIM load vs CPU load).
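For readers unfamiliar with phrase queries, the following minimal sketch illustrates what such a query computes: the documents in which the query terms occur at consecutive positions. It is a toy in-memory positional index written for illustration only. It is neither Lucene's PhraseQuery nor the PIM implementation, and the class and method names (PhraseMatchSketch, addDocument, searchPhrase) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy positional index: term -> (docId -> positions). Illustration only, not Lucene's index layout. */
public class PhraseMatchSketch {
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    /** Splits the text on whitespace, lowercases it, and records the position of every term. */
    public void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns the sorted ids of documents that contain the terms at consecutive positions. */
    public List<Integer> searchPhrase(String... terms) {
        List<Integer> hits = new ArrayList<>();
        if (terms.length == 0) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.get(terms[0].toLowerCase());
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            if (matchesInDoc(entry.getKey(), entry.getValue(), terms)) {
                hits.add(entry.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }

    /** Checks whether some occurrence of the first term is followed by the remaining terms in order. */
    private boolean matchesInDoc(int docId, List<Integer> starts, String[] terms) {
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                Map<Integer, List<Integer>> byDoc = postings.get(terms[i].toLowerCase());
                List<Integer> positions = (byDoc == null) ? null : byDoc.get(docId);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PhraseMatchSketch index = new PhraseMatchSketch();
        index.addDocument(0, "processing in memory hardware");
        index.addDocument(1, "memory in processing");
        // Only document 0 contains the three terms in this exact order.
        System.out.println(index.searchPhrase("processing", "in", "memory"));
    }
}
```

The position check is the data-intensive part of the work: every candidate document requires reading the position lists of all query terms, which is the kind of memory-bound access pattern that PIM hardware is meant to accelerate.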
This project is currently under development. The implementation of PimPhraseQuery is functional, and its current performance (QPS) relative to standard Lucene is reported in the benchmarks section. The next step is to improve the computation of the score's lower bound in order to reduce the work imbalance between the PIM cores.
- Install OpenJDK 17 or 18.
- Clone PIM Lucene's git repository.
- Run git submodule update --init.
- Make sure cunit is installed on your system (sudo apt install libcunit1-dev).
- Run the Gradle launcher script (gradlew).
We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at https://jdk.java.net/ and learning more about Java, before returning to this README.
The machine used has the following characteristics:
The dataset is the English Wikipedia dataset, and the query set consists of 1036 phrase queries extracted from the luceneutil repository. The setup and details of the benchmarks are found here. Both standard Lucene and PIM-Lucene are run on the same server.
The speedup in throughput (QPS) for various numbers of search threads and top docs is as follows: