Apache Lucene® Integration
<!--
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
Apache Lucene® is a widely used Java full-text search engine. This section describes how Geode integrates with Apache Lucene. We assume that the reader is familiar with Apache Lucene's indexing and search functionalities.
The Apache Lucene integration:
- enables users to create Lucene indexes on data stored in Geode
- provides high availability of indexes using Geode's HA capabilities to store the indexes in memory
- optionally stores indexes on disk
- updates the indexes asynchronously to minimize the impact on write latency
- provides scalability by partitioning index data
- colocates indexes with data
For more details, see Javadocs for the classes and interfaces that implement Apache Lucene indexes and searches, including LuceneService, LuceneQueryFactory, LuceneQuery, and LuceneResultStruct.
Using the Apache Lucene Integration
You can interact with Apache Lucene indexes through a Java API, through the gfsh command-line utility, or by means of the cache.xml configuration file.
To use Apache Lucene to create and use indexes, you will need two pieces of information:
- The name of the region to be indexed and searched
- The names of the fields you wish to index
Key Points
- Apache Lucene indexes are supported only on partitioned regions. Replicated region types are not supported.
- Lucene indexes reside on servers. There is no way to create a Lucene index on a client.
- Only top-level fields of objects stored in the region can be indexed.
- A single index supports a single region. Indexes do not support multiple regions.
- Heterogeneous objects in a single region are supported.
Creating an Index
Create the index before creating the region.
When no analyzer is specified, the org.apache.lucene.analysis.standard.StandardAnalyzer will be used.
Java API Example to Create an Index
// Get LuceneService
LuceneService luceneService = LuceneServiceProvider.get(cache);

// Create the index on fields with default analyzer
// prior to creating the region
luceneService.createIndexFactory()
    .addField("name")
    .addField("zipcode")
    .create(indexName, regionName);

Region region = cache.createRegionFactory(RegionShortcut.PARTITION)
    .create(regionName);
Gfsh Example to Create an Index
For details, see the gfsh create lucene index command reference page.
gfsh>create lucene index --name=indexName --region=/orders --field=customer,tags
// Create an index, specifying a custom analyzer for the second field
// Note: "DEFAULT" in the first analyzer position uses the default analyzer
// for the first field
gfsh>create lucene index --name=indexName --region=/orders
--field=customer,tags --analyzer=DEFAULT,org.apache.lucene.analysis.bg.BulgarianAnalyzer
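The positional pairing of fields and analyzers in the second command can be pictured with a small plain-Java model. This is only a sketch of the idea, not gfsh's or Geode's implementation: the class name AnalyzerPairing and the method pair are hypothetical, and the check that both lists have the same length is an assumption made for the sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: this is not how gfsh or Geode is implemented.
class AnalyzerPairing {
    // The analyzer this section names as the one used when none is specified.
    static final String STANDARD = "org.apache.lucene.analysis.standard.StandardAnalyzer";

    // Pair each field with the analyzer in the same list position.
    // The keyword DEFAULT stands for the default analyzer.
    static Map<String, String> pair(List<String> fields, List<String> analyzers) {
        // Assumption of this sketch: one analyzer entry per field.
        if (fields.size() != analyzers.size()) {
            throw new IllegalArgumentException("expected one analyzer per field");
        }
        Map<String, String> byField = new LinkedHashMap<>();
        for (int i = 0; i < fields.size(); i++) {
            String analyzer = analyzers.get(i);
            byField.put(fields.get(i), "DEFAULT".equals(analyzer) ? STANDARD : analyzer);
        }
        return byField;
    }

    public static void main(String[] args) {
        Map<String, String> result = pair(
            Arrays.asList("customer", "tags"),
            Arrays.asList("DEFAULT", "org.apache.lucene.analysis.bg.BulgarianAnalyzer"));
        // Prints one "field -> analyzer" line per field, in field order.
        result.forEach((field, analyzer) -> System.out.println(field + " -> " + analyzer));
    }
}
```

In this model, customer is paired with the standard analyzer and tags with the Bulgarian analyzer, which matches the intent described in the comments of the gfsh example.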
The value __REGION_VALUE_FIELD identifies the case in which the region's entry value is itself a single primitive type rather than an object with named fields. Use it as the value of the --field option, since a primitive value has no field name to reference. __REGION_VALUE_FIELD supports entry values of type String, Long, Integer, Float, and Double.
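As a quick way to reason about which entry values qualify, the following plain-Java sketch checks a value against the five supported types listed above. The class RegionValueFieldCheck and its method are hypothetical helpers written for this illustration and are not part of the Geode API; the constant simply repeats the literal value used with the --field option.

```java
// Illustrative helper only: not part of Geode.
class RegionValueFieldCheck {
    // The literal value passed to --field for primitive entry values.
    static final String REGION_VALUE_FIELD = "__REGION_VALUE_FIELD";

    // True when the value's type is one of the types listed as supported:
    // String, Long, Integer, Float, or Double.
    static boolean isSupportedValue(Object value) {
        return value instanceof String
            || value instanceof Long
            || value instanceof Integer
            || value instanceof Float
            || value instanceof Double;
    }

    public static void main(String[] args) {
        Object[] samples = {"97006", 97006, 97006L, 1.5f, 2.5d, (short) 7, true};
        for (Object sample : samples) {
            System.out.println(sample.getClass().getSimpleName()
                + " usable with " + REGION_VALUE_FIELD + ": " + isSupportedValue(sample));
        }
    }
}
```

Types outside the list, such as Short or Boolean in the sample data, are reported as unsupported by this check.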
XML Configuration to Create an Index
<cache
    xmlns="http://geode.apache.org/schema/cache"
    xmlns:lucene="http://geode.apache.org/schema/lucene"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://geode.apache.org/schema/cache
        http://geode.apache.org/schema/cache/cache-1.0.xsd
        http://geode.apache.org/schema/lucene
        http://geode.apache.org/schema/lucene/lucene-1.0.xsd"
    version="1.0">
  <region name="region" refid="PARTITION">
    <lucene:index name="myIndex">
      <lucene:field name="a"
                    analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
      <lucene:field name="b"
                    analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
      <lucene:field name="c"
                    analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
      <lucene:field name="d" />
    </lucene:index>
  </region>
</cache>
Queries
Gfsh Example to Query using a Lucene Index
For details, see the gfsh search lucene command reference page.
gfsh>search lucene --name=indexName --region=/orders --queryStrings="John*"
--defaultField=customer --limit=100
Java API Example to Query using a Lucene Index
LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
    .setLimit(10)
    .create(indexName, regionName, "name:John AND zipcode:97006", defaultField);

Collection<Person> results = query.findValues();
Destroying an Index
Since a region destroy operation does not cause the destruction of any Lucene indexes, destroy any Lucene indexes prior to destroying the associated region.
Java API Example to Destroy a Lucene Index
luceneService.destroyIndex(indexName, regionName);
An attempt to destroy a region with a Lucene index will result in an IllegalStateException, issuing an error message similar to:
java.lang.IllegalStateException: The parent region [/orders] in colocation chain
cannot be destroyed, unless all its children [[/indexName#_orders.files]] are
destroyed
at org.apache.geode.internal.cache.PartitionedRegion
.checkForColocatedChildren(PartitionedRegion.java:7231)
at org.apache.geode.internal.cache.PartitionedRegion
.destroyRegion(PartitionedRegion.java:7243)
at org.apache.geode.internal.cache.AbstractRegion
.destroyRegion(AbstractRegion.java:308)
at DestroyLuceneIndexesAndRegionFunction
.destroyRegion(DestroyLuceneIndexesAndRegionFunction.java:46)
Gfsh Example to Destroy a Lucene Index
For details, see the gfsh destroy lucene index command reference page.
The error message that results from an attempt to destroy a region prior to destroying its associated Lucene index will be similar to:
Region /orders cannot be destroyed because it defines Lucene index(es)
[/ordersIndex]. Destroy all Lucene indexes before destroying the region.
Changing an Index
Changing an index requires rebuilding it. Follow these steps, in order, to change an index:
- Export all region data
- Destroy the Lucene index
- Destroy the region
- Create a new index
- Create a new region without the user-defined business logic callbacks
- Import the region data with the option to turn on callbacks. The callbacks invoke the Lucene async event listener, which indexes the imported data.
- Alter the region to add the user-defined business logic callbacks
Additional Gfsh Commands
See the gfsh describe lucene index command reference page for the command that prints details about a specific index.
See the gfsh list lucene index command reference page for the command that prints details about the Lucene indexes created for all members.
Requirements and Caveats
- Join queries between regions are not supported.
- Nested objects are not supported.
- Lucene indexes will not be stored within off-heap memory.
- Lucene queries from within transactions are not supported.
On an attempt to query from within a transaction, a LuceneQueryException is thrown, issuing an error message on the client (accessor) similar to:
Exception in thread "main" org.apache.geode.cache.lucene.LuceneQueryException:
Lucene Query cannot be executed within a transaction
at org.apache.geode.cache.lucene.internal.LuceneQueryImpl
.findTopEntries(LuceneQueryImpl.java:124)
at org.apache.geode.cache.lucene.internal.LuceneQueryImpl
.findPages(LuceneQueryImpl.java:98)
at org.apache.geode.cache.lucene.internal.LuceneQueryImpl
.findPages(LuceneQueryImpl.java:94)
at TestClient.executeQuerySingleMethod(TestClient.java:196)
at TestClient.main(TestClient.java:59)
- Lucene indexes must be created prior to creating the region. If an attempt is made to create a Lucene index after creating the region, the error message will be similar to:
Member | Status
192.0.2.0(s2:97639)<v2>:1026 | Failed: The lucene index must be created before region
192.0.2.0(s3:97652)<v3>:1027 | Failed: The lucene index must be created before region
192.0.2.0(s1:97626)<v1>:1025 | Failed: The lucene index must be created before region
- The order of server creation with respect to index and region creation is important. The cluster configuration service cannot work if servers are created after index creation, but before region creation, as Lucene indexes are propagated to the cluster configuration after region creation. To start servers at multiple points within the start-up process, use this ordering:
- start server(s)
- create Lucene index
- create region
- start additional server(s)
- An invalidate operation on a region entry does not invalidate a corresponding Lucene index entry. A query on a Lucene index that contains values that have been invalidated can return results that no longer exist. Therefore, do not combine entry invalidation with queries on Lucene indexes.
- Lucene indexes are not supported for regions that have eviction configured with a local destroy. Eviction can be configured with overflow to disk, but only the region data is overflowed to disk, not the Lucene index. On an attempt to create a region with eviction configured to do local destroy (with a Lucene index), an UnsupportedOperationException will be thrown, issuing an error message similar to:
[error 2017/05/02 16:12:32.461 PDT <main> tid=0x1]
java.lang.UnsupportedOperationException:
Lucene indexes on regions with eviction and action local destroy are not supported
Exception in thread "main" java.lang.UnsupportedOperationException:
Lucene indexes on regions with eviction and action local destroy are not supported
at org.apache.geode.cache.lucene.internal.LuceneRegionListener
.beforeCreate(LuceneRegionListener.java:85)
at org.apache.geode.internal.cache.GemFireCacheImpl
.invokeRegionBefore(GemFireCacheImpl.java:3154)
at org.apache.geode.internal.cache.GemFireCacheImpl
.createVMRegion(GemFireCacheImpl.java:3013)
at org.apache.geode.internal.cache.GemFireCacheImpl
.basicCreateRegion(GemFireCacheImpl.java:2991)
- Be aware that using the same field name in different objects where the field has different data types may have unexpected consequences. For example, if an index on the field SSN has the following entries:
- Object_1 object_1 has String SSN = "1111"
- Object_2 object_2 has Integer SSN = 1111
- Object_3 object_3 has Float SSN = 1111.0
Integers and floats will not be converted into strings. They remain as IntPoint and FloatPoint within Lucene. The standard analyzer will not try to tokenize these values. The standard analyzer will only try to break up string values. So, a string search for "SSN: 1111" will return object_1. An IntRangeQuery for upper limit : 1112 and lower limit : 1110 will return object_2. And, a FloatRangeQuery with upper limit : 1111.5 and lower limit : 1111.0 will return object_3.
- Backups should only be made for regions with Lucene indexes when there are no puts, updates, or deletes in progress. A backup might cause an inconsistency between region data and a Lucene index. Both the region operation and the associated index operation cause disk operations, yet those disk operations are not done atomically. Therefore, if a backup were taken between the persisted write to a region and the resulting persisted write to the Lucene index, then the backup represents inconsistent data in the region and Lucene index.
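To make the earlier caveat about one field name holding different data types more concrete, the SSN example can be imitated in plain Java. This is a sketch that only models the matching behavior described in that caveat; it does not use Lucene or Geode classes, the class and method names are hypothetical, and the inclusive range bounds are an assumption chosen so that the float range starting at 1111.0 matches the value 1111.0.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: no Lucene or Geode classes are involved.
class MixedTypeFieldModel {
    // Entry name -> value stored in the SSN field, mirroring the SSN example.
    static Map<String, Object> entries() {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("object_1", "1111");   // String
        data.put("object_2", 1111);     // Integer
        data.put("object_3", 1111.0f);  // Float
        return data;
    }

    // A string search only sees values that were stored as strings.
    static List<String> stringSearch(Map<String, Object> data, String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Object> e : data.entrySet()) {
            if (e.getValue() instanceof String && e.getValue().equals(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // An integer range query only sees Integer values (bounds inclusive).
    static List<String> intRange(Map<String, Object> data, int lower, int upper) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Object> e : data.entrySet()) {
            if (e.getValue() instanceof Integer) {
                int v = (Integer) e.getValue();
                if (v >= lower && v <= upper) {
                    hits.add(e.getKey());
                }
            }
        }
        return hits;
    }

    // A float range query only sees Float values (bounds inclusive).
    static List<String> floatRange(Map<String, Object> data, float lower, float upper) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Object> e : data.entrySet()) {
            if (e.getValue() instanceof Float) {
                float v = (Float) e.getValue();
                if (v >= lower && v <= upper) {
                    hits.add(e.getKey());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Object> data = entries();
        System.out.println(stringSearch(data, "1111"));          // prints [object_1]
        System.out.println(intRange(data, 1110, 1112));          // prints [object_2]
        System.out.println(floatRange(data, 1111.0f, 1111.5f));  // prints [object_3]
    }
}
```

Each query style reaches only the entries whose SSN was stored with the matching type, which is why a search written for one type can silently miss entries of another.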