System Failure and Recovery
This section describes alerts for and appropriate responses to various kinds of system failures. It also helps you plan a strategy for data recovery.
If a system member withdraws from the distributed system involuntarily because the member, host, or network fails, the other members automatically adapt to the loss and continue to operate. The distributed system does not experience any disturbance such as timeouts.
Planning for Data Recovery
In planning a strategy for data recovery, consider these factors:
- Whether the region is configured for data redundancy—partitioned regions only.
- The region’s role-loss policy configuration, which controls how the region behaves after a crash or system failure—distributed regions only.
- Whether the region is configured for persistence to disk.
- The extent of the failure, whether multiple members or a network outage is involved.
- Your application’s specific needs, such as the difficulty of replacing the data and the risk of running with inconsistent data for your application.
- Whether an alert was generated due to a network partition or slow response, indicating that certain processes may, or will, fail.
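For example, the first and third factors above can be addressed together by configuring a partitioned region with both redundancy and disk persistence. The following is a minimal sketch using the Geode Java API; the region name `orders` is only an illustration, and the same configuration can be declared in cache.xml.

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class RecoverableRegionExample {
  public static void main(String[] args) {
    // Join the distributed system and create the cache.
    Cache cache = new CacheFactory().create();

    // PARTITION_REDUNDANT_PERSISTENT keeps a redundant copy of each bucket
    // on another member and writes region data to disk, so entries survive
    // both the loss of a single member and a full restart.
    Region<String, String> orders = cache
        .<String, String>createRegionFactory(RegionShortcut.PARTITION_REDUNDANT_PERSISTENT)
        .create("orders");

    orders.put("order-1", "pending");
    cache.close();
  }
}
```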
The rest of this section provides recovery instructions for various kinds of system failures.
Network Partitioning, Slow Response, and Member Removal Alerts
When network partitioning is detected or members respond slowly, these alerts are generated:
- Network Partitioning Detected
- Member Taking Too Long to Respond
- No Locators Can Be Found
- Warning Notifications Before Removal
- Member Is Forced Out
For information on configuring system members to help avoid a network partitioning condition in the presence of a network failure, or when members lose the ability to communicate with each other, refer to Understanding and Recovering from Network Outages.
Network Partitioning Detected
Alert:
Membership coordinator id has declared that a network partition has occurred.
Description:
This alert is issued when network partitioning occurs, followed by this alert on the individual member:
Alert:
Exiting due to possible network partition event due to loss of {0} cache processes: {1}
Response:
Check the network connectivity and health of the listed cache processes.
Member Taking Too Long to Respond
Alert:
15 sec have elapsed while waiting for replies: <ReplyProcessor21 6 waiting for 1 replies
from [ent(27130):60333/36743]> on ent(27134):60330/45855 whose current membership
list is: [[ent(27134):60330/45855, ent(27130):60333/36743]]
Description:
Member ent(27130):60333/36743 is in danger of being forced out of the distributed system because of a suspect-verification failure. This alert is issued at the warning level, after the ack-wait-threshold is reached.
Response:
The operator should examine the process to see if it is healthy. The process ID of the slow responder is 27130 on the machine named ent. The ports of the slow responder are 60333/36743. Look for the string, Starting distribution manager ent:60333/36743, and examine the process owning the log file containing this string.
Alert:
30 sec have elapsed while waiting for replies: <ReplyProcessor21 6 waiting for 1 replies
from [ent(27130):60333/36743]> on ent(27134):60330/45855 whose current membership
list is: [[ent(27134):60330/45855, ent(27130):60333/36743]]
Description:
Member ent(27130):60333/36743 is in danger of being forced out of the distributed system because of a suspect-verification failure. This alert is issued at the severe level, after the ack-wait-threshold is reached and after ack-severe-alert-threshold seconds have elapsed.
Response:
The operator should examine the process to see if it is healthy. The process ID of the slow responder is 27130 on the machine named ent. The ports of the slow responder are 60333/36743. Look for the string, Starting distribution manager ent:60333/36743, and examine the process owning the log file containing this string.
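The ack-wait-threshold and ack-severe-alert-threshold that drive these alerts are distributed system properties. As a minimal sketch (the 15-second values below are chosen only to match the example alerts, and the same properties can also be set in gemfire.properties), they can be supplied when the cache is created:

```java
import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class AckThresholdExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Warn after 15 seconds without an acknowledgement, then escalate to a
    // severe alert after a further 15 seconds (matching the 15 sec and
    // 30 sec alerts shown in this section).
    props.setProperty("ack-wait-threshold", "15");
    props.setProperty("ack-severe-alert-threshold", "15");

    Cache cache = new CacheFactory(props).create();
    // ... application work ...
    cache.close();
  }
}
```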
Alert:
15 sec have elapsed while waiting for replies: <DLockRequestProcessor 33636 waiting
for 1 replies from [ent(4592):33593/35174]> on ent(4592):33593/35174 whose current
membership list is: [[ent(4598):33610/37013, ent(4611):33599/60008,
ent(4592):33593/35174, ent(4600):33612/33183, ent(4593):33601/53393, ent(4605):33605/41831]]
Description:
This alert is issued by partitioned regions and regions with global scope at the warning level, when the lock grantor has not responded to a lock request within the ack-wait-threshold.
Response:
None.
Alert:
30 sec have elapsed while waiting for replies: <DLockRequestProcessor 23604 waiting
for 1 replies from [ent(4592):33593/35174]> on ent(4598):33610/37013 whose current
membership list is: [[ent(4598):33610/37013, ent(4611):33599/60008,
ent(4592):33593/35174, ent(4600):33612/33183, ent(4593):33601/53393, ent(4605):33605/41831]]
Description:
This alert is issued by partitioned regions and regions with global scope at the severe level, when the lock grantor has not responded to a lock request within the sum of the ack-wait-threshold and the ack-severe-alert-threshold.
Response:
None.
Alert:
30 sec have elapsed waiting for global region entry lock held by ent(4600):33612/33183
Description:
This alert is issued by regions with global scope at the severe level, when the lock holder has held the desired lock for ack-wait-threshold + ack-severe-alert-threshold seconds and may be unresponsive.
Response:
None.
Alert:
30 sec have elapsed waiting for partitioned region lock held by ent(4600):33612/33183
Description:
This alert is issued by partitioned regions at the severe level, when the lock holder has held the desired lock for ack-wait-threshold + ack-severe-alert-threshold seconds and may be unresponsive.
Response:
None.
No Locators Can Be Found
Note: It is likely that all processes using the locators will exit with the same message.
Alert:
Membership service failure: Channel closed: org.apache.geode.ForcedDisconnectException:
There are no processes eligible to be group membership coordinator
(last coordinator left view)
Description:
Network partition detection is enabled, and there are locator problems.
Response:
The operator should examine the locator processes and logs, and restart the locators.
Alert:
Membership service failure: Channel closed: org.apache.geode.ForcedDisconnectException:
There are no processes eligible to be group membership coordinator
(all eligible coordinators are suspect)
Description:
Network partition detection is enabled, and there are locator problems.
Response:
The operator should examine the locator processes and logs, and restart the locators.
Alert:
Membership service failure: Channel closed: org.apache.geode.ForcedDisconnectException:
Unable to contact any locators and network partition detection is enabled
Description:
Network partition detection is enabled, and there are locator problems.
Response:
The operator should examine the locator processes and logs, and restart the locators.
Alert:
Membership service failure: Channel closed: org.apache.geode.ForcedDisconnectException:
Disconnected as a slow-receiver
Description:
The member was not able to process messages fast enough and was forcibly disconnected by another process.
Response:
The operator should examine and restart the disconnected process.
Warning Notifications Before Removal
Alert:
Membership: requesting removal of ent(10344):21344/24922 Disconnected as a slow-receiver
Description:
This alert is generated only if the slow-receiver functionality is being used.
Response:
The operator should examine the locator processes and logs.
Alert:
Network partition detection is enabled and both membership coordinator and lead member
are on the same machine
Description:
This alert is issued if both the membership coordinator and the lead member are on the same machine.
Response:
The operator can turn this off by setting the system property gemfire.disable-same-machine-warnings to true. However, it is best to run locator processes, which act as membership coordinators when network partition detection is enabled, on separate machines from cache processes.
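As an illustrative sketch, a Java system property such as this one can be set with a -D JVM argument or programmatically before the cache is created:

```java
public class DisableSameMachineWarning {
  public static void main(String[] args) {
    // Must be set before the distributed system is created.
    // Equivalent JVM argument: -Dgemfire.disable-same-machine-warnings=true
    System.setProperty("gemfire.disable-same-machine-warnings", "true");
    // ... create the cache as usual ...
  }
}
```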
Member Is Forced Out
Alert:
Membership service failure: Channel closed: org.apache.geode.ForcedDisconnectException:
This member has been forced out of the Distributed System. Please consult GemFire logs to
find the reason.
Description:
The process discovered that it was not in the distributed system and cannot determine why it was removed. The membership coordinator removed the member after it failed to respond to an internal are-you-alive message.
Response:
The operator should examine the locator processes and logs.
Restart Fails Due To Out-of-Memory Error
This section describes a restart failure that can occur when the stopped system is one that was configured with persistent regions. Specifically:
- Some of the regions of the recovering system, when running, were configured as PERSISTENT regions, which means that they save their data to disk.
- At least one of the persistent regions was configured to evict least recently used (LRU) data by overflowing values to disk.
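For reference, the combination described above, a persistent region whose least recently used values overflow to disk, can be created with a region shortcut such as the following. This is a minimal sketch using the Geode Java API; the region name `data` is illustrative, and the equivalent configuration can be declared in cache.xml.

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

public class PersistentOverflowRegionExample {
  public static void main(String[] args) {
    Cache cache = new CacheFactory().create();

    // PARTITION_PERSISTENT_OVERFLOW persists region data to disk and, under
    // heap pressure, evicts least recently used entries by overflowing their
    // values to disk -- the combination that can trigger the restart failure
    // described in this section.
    Region<String, byte[]> data = cache
        .<String, byte[]>createRegionFactory(RegionShortcut.PARTITION_PERSISTENT_OVERFLOW)
        .create("data");

    cache.close();
  }
}
```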
How Data is Recovered From Persistent Regions
Data recovery, upon restart, always recovers keys. You can configure whether and how the system recovers the values associated with those keys to populate the system cache.
Value Recovery
- Recovering all values immediately during startup slows the startup time but results in consistent read performance after the startup on a "hot" cache.
- Recovering no values means quicker startup but a "cold" cache, so the first retrieval of each value will read from disk.
- Retrieving values asynchronously in a background thread allows a relatively quick startup on a "warm" cache that will eventually recover every value.
Retrieve or Ignore LRU Values
When a system with persistent LRU regions shuts down, the system does not record which of the values were recently used. On subsequent startup, if values are recovered into an LRU region, they may be the least recently used instead of the most recently used. Also, if LRU values are recovered into a heap or off-heap LRU region, it is possible that the LRU memory limit will be exceeded, resulting in an OutOfMemoryException during recovery. For these reasons, LRU value recovery can be treated differently from the recovery of non-LRU values.
Default Recovery Behavior for Persistent Regions
The default behavior is for the system to recover all keys, then asynchronously recover all data values, leaving LRU values unrecovered. This default strategy is best for most applications, because it strikes a balance between recovery speed and cache completeness.
Configuring Recovery of Persistent Regions
Three Java system properties allow the developer to control the recovery behavior for persistent regions:
gemfire.disk.recoverValues
Default = true: recover values. If false, recover only keys; do not recover values.
How used: When true, recovery of the values "warms up" the cache so data retrievals will find their values in the cache, without causing time-consuming disk accesses. When false, shortens recovery time so the system becomes available for use sooner, but the first retrieval on each key will require a disk read.

gemfire.disk.recoverLruValues
Default = false: do not recover LRU values. If true, recover LRU values. If gemfire.disk.recoverValues is false, then gemfire.disk.recoverLruValues is ignored, since no values are recovered.
How used: When false, shortens recovery time by ignoring LRU values. When true, restores more data values to the cache. Recovery of the LRU values increases heap memory usage and could cause an out-of-memory error, preventing the system from restarting.

gemfire.disk.recoverValuesSync
Default = false: recover values by an asynchronous background process. If true, values are recovered synchronously, and recovery is not complete until all values have been retrieved. If gemfire.disk.recoverValues is false, then gemfire.disk.recoverValuesSync is ignored, since no values are recovered.
How used: When false, allows the system to become available sooner, but some time must elapse before all values have been read from disk into cache memory. Some key retrievals will require disk access, and some will not. When true, prolongs restart time, but ensures that when available for use, the cache is fully populated and data retrieval times will be optimal.
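These are ordinary Java system properties, so they can be supplied as -D JVM arguments or set programmatically before the cache is created. A minimal sketch of a fully synchronous, non-LRU recovery configuration (the values shown are illustrative, not recommendations):

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class RecoveryTuningExample {
  public static void main(String[] args) {
    // Equivalent JVM arguments:
    //   -Dgemfire.disk.recoverValues=true
    //   -Dgemfire.disk.recoverLruValues=false
    //   -Dgemfire.disk.recoverValuesSync=true
    // System properties must be set before the cache and its persistent
    // regions are created.
    System.setProperty("gemfire.disk.recoverValues", "true");
    System.setProperty("gemfire.disk.recoverLruValues", "false");
    System.setProperty("gemfire.disk.recoverValuesSync", "true");

    // With these settings, persistent regions recover all non-LRU values
    // synchronously while they are recreated, so the cache is "hot" before
    // the application begins serving requests.
    Cache cache = new CacheFactory().create();
    cache.close();
  }
}
```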