
Apache Lucene is a Java library used for the full text search of documents, and is at the core of search servers such as Solr and Elasticsearch. It can also be embedded into Java applications, such as Android apps or web backends.

While Lucene’s configuration options are extensive, they are intended for use by database developers on a generic corpus of text. If your documents have a specific structure or type of content, you can take advantage of either to improve search quality and query capability.

As an example of this sort of customization, in this Lucene tutorial we will index the corpus of Project Gutenberg, which offers thousands of free e-books. We know that many of these books are novels. Suppose we are especially interested in the dialogue within these novels. Neither Lucene, Elasticsearch, nor Solr provides out-of-the-box tools to identify content as dialogue. In fact, they will throw away punctuation at the earliest stages of text analysis, which runs counter to being able to identify portions of the text that are dialogue. It is therefore in these early stages that our customization must begin.

Pieces of the Apache Lucene Analysis Pipeline

The Lucene analysis JavaDoc provides a good overview of all the moving parts in the text analysis pipeline.

At a high level, you can think of the analysis pipeline as consuming a raw stream of characters at the start and producing “terms”, roughly corresponding to words, at the end. The standard analysis pipeline can be visualized as a sequence of stages that turn this character stream into terms. We will see how to customize this pipeline to recognize regions of text marked by double-quotes, which I will call dialogue, and then bump up matches that occur when searching in those regions.

When documents are initially added to the index, the characters are read from a Java InputStream, and so they can come from files, databases, web service calls, etc.
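To make the punctuation problem concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: the class `DialogueSketch` and its methods `naiveTerms` and `dialogueTerms` are hypothetical names invented for this illustration. The first method imitates a tokenizer that keeps only letters and digits, so the quotation marks disappear along with all other punctuation. The second uses each double quote to toggle a flag before discarding it, which is roughly the kind of information a customized early analysis stage would need to preserve. The sketch assumes straight ASCII double quotes that are always balanced, which real e-book text may not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration in plain Java; none of these names are Lucene APIs.
final class DialogueSketch {

    // Mimics a tokenizer that discards punctuation: only letters and digits
    // survive, so the quotation marks vanish before anyone can inspect them.
    static List<String> naiveTerms(String input) {
        return tokenize(input, false);
    }

    // Same tokenization, but each straight double quote toggles a flag before
    // it is dropped, and only terms seen while the flag is set are kept.
    static List<String> dialogueTerms(String input) {
        return tokenize(input, true);
    }

    private static List<String> tokenize(String input, boolean dialogueOnly) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inDialogue = false;
        // The appended space guarantees that the last term is flushed.
        for (char c : (input + " ").toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
                continue;
            }
            // Any other character ends the current term, if there is one.
            if (current.length() > 0) {
                if (!dialogueOnly || inDialogue) {
                    terms.add(current.toString());
                }
                current.setLength(0);
            }
            if (c == '"') {
                inDialogue = !inDialogue; // assumption: quotes are balanced
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "He paused. \"Call me Ishmael,\" he said.";
        System.out.println(naiveTerms(text));    // [he, paused, call, me, ishmael, he, said]
        System.out.println(dialogueTerms(text)); // [call, me, ishmael]
    }
}
```

The naive output contains no trace of which words were spoken, which is why the distinction has to be captured while the raw characters are still available rather than after terms have been produced.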
