This article provides practical suggestions for debugging Geode applications.
Geode Artifacts
- cache.xml - provided by the application developer to configure the caches and regions for Geode client and server processes.
- Properties files - provided by the application developer to configure Geode system properties and membership discovery.
- System logs - logs generated by Geode clients, servers, and locators. The logs contain information about membership, client connections, warnings about outstanding message requests, and errors. The log also displays both Geode and Java system properties.
- Statistics - archive files generated by Geode clients and/or servers containing information about the Geode application. GemFire VSD (Visual Statistics Display) is used to graph the Geode and system metrics recorded in these archives.
Geode Shell (gfsh)
Geode gfsh provides a command-line interface from which you can launch, manage, and monitor Geode processes, data, and applications. The shell provides commands that are useful in debugging and that bring all the information together in a single location for analysis. The following gfsh commands require first executing the gfsh connect command to establish a connection to the locator or JMX Manager of the distributed system. Please refer to the Geode documentation for more details.
- export logs
- export stack-traces
- show log
- show dead-locks
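As a sketch, a session that gathers debugging artifacts into one place might look like the following. The locator host and port are placeholders for your own, and option names vary slightly across Geode versions, so check `help <command>` in your gfsh.

```
gfsh> connect --locator=host1[10334]
gfsh> show dead-locks --file=deadlocks.txt
gfsh> export stack-traces --file=stacks.txt
gfsh> export logs --dir=/tmp/debug-logs
```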
Geode MergeLogFiles
Use Geode MergeLogFiles (com.gemstone.gemfire.internal.logging.MergeLogFiles) to merge your log files based on timestamps.
```shell
$ java -classpath $CLASSPATH -Xmx1200M com.gemstone.gemfire.internal.logging.MergeLogFiles
Usage: java MergeLogFiles [(directory | logFile)]+
  -dirCount n      Number of parent dirs to print
  -mergeFile file  File in which to put merged logs
  -pids            Search for PIDs in file names and use them to identify files
  -align           Align non-timestamped lines with others
  -noblanks        Suppress output of blank lines
  -threaded        Use multithreading to take advantage of multiple CPUs

Merges multiple Geode log files and sorts them by timestamp.
The merged log file is written to System.out (or a file).
If a directory is specified, all .log files in that directory are merged.
```
General Guidelines
- Check your environment - machine (e.g. ulimit settings), JDK, JVM properties (-Xmx, -Xms), GC parameters.
- Draw out a diagram of your system topology (servers, clients) and make a note of Listeners, Writers, and other plug-ins.
- Verify your cache and region configuration.
- Confirm your system properties (review the properties files and their display in the system log).
- On your system topology diagram, add notes on the initialization and processing being done in various members or classes of members.
- If you are debugging a specific interaction, draw a sequence diagram.
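The environment check above can be started with a couple of shell one-liners. This is a minimal sketch for bash on Linux; the limits and JVM flags your deployment needs are your own, and the flags shown in the comment are only examples.

```shell
# Quick environment sanity check (bash on Linux).
# Geode members typically need generous file-descriptor and process limits.
echo "open files:     $(ulimit -n)"
echo "max user procs: $(ulimit -u)"
# Also confirm the JVM flags your members actually start with, for example:
#   java -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError ...
```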
Specific search strings and patterns
If possible, bring all the system logs and stack dumps together into a single directory for inspection (use the gfsh commands above). Here is a simple script that searches the logs for specific strings and crash artifacts.
```shell
#!/bin/bash
find . -name hs_err\*.log -print
find . -name jrockit.\*.dump -print
find . -name core -print
find . -name \*.hprof -print
fgrep SerializationException */system.log
fgrep OutOfMemory *.log */*.log | grep -v HeapDumpOnOutOfMemoryError
fgrep NullPointerException *.log */*.log
fgrep ConcurrentModificationException *.log */*.log
fgrep Assert *.log */*.log
fgrep deadlock *.log */*.log
fgrep InternalGemFire *.log */*.log
fgrep "Too many" *.log */*.log
fgrep "Suspect notification" *.log */*.log
grep Exception *.log */*.log | grep -v continuing | grep -v HeapDumpOnOutOfMemoryError
```
Search the system logs for warning, error or severe messages
Search the system logs for any underlying Exceptions. For example: ConcurrentModificationException, NullPointerException, SerializationException.
Search the system logs for warnings about resources causing delays in statistics sampling. If found, use VSD to investigate further.
```
[warning 2015/03/29 04:47:23.028 PDT gemfire1_w1-gst-dev26_15651 <Thread-5 StatSampler> tid=0x55] Statistics sampling thread detected a wakeup delay of 8,310 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics.
```
Verify there are no HotSpot error log files (hs_err_pid*.log) indicating a HotSpot error in the JVM. Refer to the Oracle Troubleshooting Guide for more details.
Verify that there are no heap dump (*.hprof) files or OutOfMemoryErrors. Tools such as jhat, jmap, visualvm, and Eclipse Memory Analyzer can provide heap histograms and leak suspects.
Search the stack dumps for Java-level deadlock. Dumping the stacks using jstack or the Linux command kill -3 <pid> will highlight any Java-level deadlocks, including the threads involved in the deadlock as well as the stack dumps for each of those threads. When debugging, it is best to get stack dumps for all JVMs. To determine if progress is being made, execute multiple thread dumps several seconds apart for comparison.
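Comparing consecutive dumps can be as simple as a diff. The dump snippets below are fabricated for illustration; capture real ones with `jstack <pid>` (or `kill -3 <pid>`) a few seconds apart.

```shell
# Sketch: compare two consecutive thread dumps to see whether a BLOCKED
# thread is making progress. These snippets are fabricated samples.
cat > /tmp/dump1.txt <<'EOF'
"Pooled Waiting Message Processor 12" tid=0xdf040000
   java.lang.Thread.State: BLOCKED (on object monitor)
	- waiting to lock <0xe47a7fa8>
EOF
cat > /tmp/dump2.txt <<'EOF'
"Pooled Waiting Message Processor 12" tid=0xdf040000
   java.lang.Thread.State: BLOCKED (on object monitor)
	- waiting to lock <0xe47a7fa8>
EOF
# No diff output means the thread is stuck on the same monitor, not progressing.
diff /tmp/dump1.txt /tmp/dump2.txt && echo "no progress between dumps"
```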
You can also search for state=BLOCKED threads and trace "waiting to lock" entries (e.g. waiting to lock java.lang.Object@16ce6f90) to whichever thread locked the object (e.g. locked java.lang.Object@16ce6f90). Follow this pattern until you find the root thread.

Search the system logs for any "15 seconds have elapsed" messages which don't have corresponding "wait for replies completed" logs. You can match these log messages together via the thread id or native thread id. Note that these messages are only logged between peers in the Distributed System. See "Special considerations for Geode clients" for messages specific to Geode clients. In the following example, we can see that the request did complete, so while we should be concerned (and possibly check statistics in VSD to see what system resources are causing this delay), it will not be the cause of our hang.
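The tid-based matching can be scripted. This is a minimal sketch run against a fabricated three-line sample log (the real messages are much longer); point the same greps at your merged system logs.

```shell
# Pair "15 seconds have elapsed" warnings with their completion messages by
# thread id (tid). The sample log below is fabricated for illustration.
cat > /tmp/sample-system.log <<'EOF'
[warning 2014/07/26 02:02:54.384 PDT ds5 <Pooled Waiting Message Processor 11> tid=0xb5] 15 seconds have elapsed while waiting for replies
[warning 2014/07/26 02:02:55.100 PDT ds5 <Pooled Waiting Message Processor 12> tid=0xb7] 15 seconds have elapsed while waiting for replies
[info 2014/07/26 02:03:00.437 PDT ds5 <Pooled Waiting Message Processor 11> tid=0xb5] DeposePrimaryBucketMessage$DeposePrimaryBucketResponse wait for replies completed
EOF

# tids that issued an elapsed warning
grep '15 seconds have elapsed' /tmp/sample-system.log | grep -o 'tid=[^]]*' | sort -u > /tmp/waiting
# tids whose wait completed
grep 'wait for replies completed' /tmp/sample-system.log | grep -o 'tid=[^]]*' | sort -u > /tmp/completed
# tids still outstanding -- these are the ones to chase in the stack dumps
comm -23 /tmp/waiting /tmp/completed
```

Here the outstanding thread is tid=0xb7, since its warning has no matching completion.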
```
[warning 2014/07/26 02:02:54.384 PDT dataStoregemfire5_w1-gst-dev12_25904 <Pooled Waiting Message Processor 11> tid=0xb5] 15 seconds have elapsed while waiting for replies: <DeposePrimaryBucketMessage$DeposePrimaryBucketResponse 2308 waiting for 1 replies from [10.138.44.112(dataStoregemfire2_w1-gst-dev12_37336:37336)<v82>:52253]> on 10.138.44.112(dataStoregemfire5_w1-gst-dev12_25904:25904)<v82>:14125 whose current membership list is: [[10.138.44.112(dataStoregemfire4_w1-gst-dev12_14688:14688):31076, 10.138.44.112(accessorgemfire1_w1-gst-dev12_37800:37800):59828, 10.138.44.112(accessorgemfire3_w1-gst-dev12_6348:6348):16162, 10.138.44.112(dataStoregemfire2_w1-gst-dev12_37336:37336):52253, 10.138.44.112(accessorgemfire4_w1-gst-dev12_24644:24644):64839, 10.138.44.112(accessorgemfire2_w1-gst-dev12_40732:40732):54553, 10.138.44.112(dataStoregemfire3_w1-gst-dev12_3060:3060):43488, 10.138.44.112(accessorgemfire5_w1-gst-dev12_25924:25924):56180, 10.138.44.112(dataStoregemfire5_w1-gst-dev12_25904:25904):14125, 10.138.44.112(locatorgemfire1_w1-gst-dev12_50584:50584:locator):16639, 10.138.44.112(dataStoregemfire1_w1-gst-dev12_31492:31492):40311]]

[info 2014/07/26 02:03:00.437 PDT dataStoregemfire5_w1-gst-dev12_25904 <Pooled Waiting Message Processor 11> tid=0xb5] DeposePrimaryBucketMessage$DeposePrimaryBucketResponse wait for replies completed
```
If the request is never satisfied (there is no corresponding "wait for replies completed" message), look at the stack dumps for the non-responsive member. There could be a Java-level deadlock within that JVM.

There can also be distributed deadlocks between members. This requires following the "15 seconds have elapsed" warnings to the remote members and looking at the stack dumps. Searching for "waiting to lock" in the stack dumps can also help to identify the problematic JVM. Once found in a non-responsive member, find the thread in that JVM that holds the lock and determine what prevents it from releasing the lock.

This example shows the outstanding request from the system log and the relevant stack dumps from the non-responding JVM.
The system log shows that vm 12659 is still awaiting a response from vm 12706
```
dataStoregemfire1_12659/system.log:
[warning 2014/03/04 08:18:27.954 PST dataStoregemfire1_w1-gst-dev25_12659 <ResourceManagerRecoveryThread> tid=0x60] 15 seconds have elapsed while waiting for replies: <ManageBucketMessage$NodeResponse 6407 waiting for 1 replies from [10.138.44.125(dataStoregemfire3_w1-gst-dev25_12706:12706)<v315>:2225]> on 10.138.44.125(dataStoregemfire1_w1-gst-dev25_12659:12659)<v314>:55872 whose current membership list is: [[10.138.44.125(dataStoregemfire2_w1-gst-dev25_12480:12480):24218, 10.138.44.125(locatorgemfire1_w1-gst-dev25_22478:22478:locator):13070, 10.138.44.125(accessorgemfire3_w1-gst-dev25_12501:12501):18675, 10.138.44.125(dataStoregemfire1_w1-gst-dev25_12659:12659):55872, 10.138.44.125(dataStoregemfire5_w1-gst-dev25_12483:12483):27445, 10.138.44.125(accessorgemfire1_w1-gst-dev25_12607:12607):46692, 10.138.44.125(dataStoregemfire4_w1-gst-dev25_12500:12500):48232, 10.138.44.125(accessorgemfire2_w1-gst-dev25_12705:12705):11732, 10.138.44.125(dataStoregemfire3_w1-gst-dev25_12706:12706):2225]]
```
The stack dumps from 12706 show the "waiting to lock <monitor>" and "locked <monitor>" entries in the stack dumps.

```
"Pooled Waiting Message Processor 12" daemon prio=10 tid=0xdf040000 nid=0x383f waiting for monitor entry [0xd960b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:424)
	- waiting to lock <0xe47a7fa8> (a com.gemstone.gemfire.internal.cache.ProxyBucketRegion)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2940)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.handleManageBucketRequest(PartitionedRegionDataStore.java:1032)
	at com.gemstone.gemfire.internal.cache.partitioned.ManageBucketMessage.operateOnPartitionedRegion(ManageBucketMessage.java:123)
	at com.gemstone.gemfire.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:297)
	at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:357)
	at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:420)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:720)
	at com.gemstone.gemfire.distributed.internal.DistributionManager$6$1.run(DistributionManager.java:1057)
	at java.lang.Thread.run(Thread.java:662)

"Recovery thread for bucket _B__PR__1_46" daemon prio=10 tid=0xdc965400 nid=0x36dd in Object.wait() [0xdb3b7000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl$MembershipChangeListener.waitForChange(PersistenceAdvisorImpl.java:1047)
	- locked <0xe6223758> (a com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl$MembershipChangeListener)
	at com.gemstone.gemfire.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:820)
	at com.gemstone.gemfire.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:47)
	at com.gemstone.gemfire.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1377)
	at com.gemstone.gemfire.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1196)
	at com.gemstone.gemfire.internal.cache.BucketRegion.initialize(BucketRegion.java:264)
	at com.gemstone.gemfire.internal.cache.LocalRegion.createSubregion(LocalRegion.java:1229)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:742)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:451)
	- locked <0xe47a7fa8> (a com.gemstone.gemfire.internal.cache.ProxyBucketRegion)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabFreeBucketRecursively(PartitionedRegionDataStore.java:301)
	at com.gemstone.gemfire.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2956)
	at com.gemstone.gemfire.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:417)
	at com.gemstone.gemfire.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:386)
	at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$5.run2(PRHARedundancyProvider.java:1934)
	at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$RecoveryRunnable.run(PRHARedundancyProvider.java:2243)
	at com.gemstone.gemfire.internal.cache.PRHARedundancyProvider$5.run(PRHARedundancyProvider.java:1926)
	at java.lang.Thread.run(Thread.java:662)
```
Special considerations for Geode clients
Geode clients can fail with ServerConnectivityExceptions when the servers are too busy to handle client requests, for example, during large GC pauses or distributed deadlocks.
Look for "is being terminated because its client timeout" messages in the server system logs to determine whether or not this is occurring in your application. If so, review the server-side system logs, stack dumps, and statistics to determine the cause.
```
[warning 2015/03/14 06:00:15.062 PDT bridgegemfire_1_3_w1-gst-dev08_25525 <ClientHealthMonitor Thread> tid=0x54] Server connection from [identity(w1-gst-dev08(edgegemfire_1_1_w1-gst-dev08_25635:25635:loner):39198:65125d18:edgegemfire_1_1_w1-gst-dev08_25635,connection=1; port=49383] is being terminated because its client timeout of 30,000 has expired.
```
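A quick way to gauge how widespread the problem is counts these terminations per server log. The one-line sample log below is fabricated for illustration; run the same grep across your real server logs (e.g. `*/system.log`).

```shell
# Count client-timeout terminations in a server log (fabricated sample).
cat > /tmp/server-system.log <<'EOF'
[warning 2015/03/14 06:00:15.062 PDT bridge1 <ClientHealthMonitor Thread> tid=0x54] Server connection from [identity(...), connection=1; port=49383] is being terminated because its client timeout of 30,000 has expired.
EOF
grep -c 'is being terminated because its client timeout' /tmp/server-system.log
```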
Improving traceability during initial development
Function Execution
To trace function execution from the initiator to the member executing the function, pass the initiating thread id to the function using withArgs and log it in both places. Of course, this could easily be a string containing the pid, the DistributedMemberId, or any other identifying information for your application.
```java
ArrayList aList = new ArrayList();
aList.add("myFunctionOperation");
aList.add(Thread.currentThread().getId());
System.out.println("Executing " + aList.get(0) + " from " + aList.get(1));
ResultCollector drc = dataSet.withFilter(keySet)
                             .withArgs(aList)
                             .execute(myFunction.getId());
```
Log these values within the function to help with tracing during development.
```java
public void execute(FunctionContext context) {
  ArrayList arguments = (ArrayList)(context.getArguments());
  String operation = (String)arguments.get(0);
  Object initiatingThreadID = arguments.get(1);
  String aStr = "In execute with context " + context + " and operation " + operation
      + " initiated by thread thr_" + initiatingThreadID + "_";
  for (int i = 2; i < arguments.size(); i++) {
    aStr = aStr + " additional arg: " + arguments.get(i);
  }
  System.out.println(aStr);
  // ... perform the operation and return results via context.getResultSender()
}
```
Events
Since Geode supports multiple CacheListeners, consider adding a LogListener which simply logs the relevant portion of the events as they are processed. This provides another way to enable traceability in your application during the early stages of development. For client/server applications, it can help to identify the originating member of an operation and the server that forwarded that event to a specific client.
If your application is not already using the callbackArgument, use it to encode information about the caller or the data, which you can then log in your LogListener.
Events for operations initiated in the local JVM are logged by the calling thread, as shown below. In this case the thread is vm_10_thr_10_edge4_w1-gst-dev18_10648.
```
[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Calling remove with key Object_395 value null, containsKey true, containsKeyOnServer true

[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Invoked util.SilenceListener for key Object_395: afterDestroy in edge4
event=EntryEventImpl[op=DESTROY;key=Object_395;oldValue=null;newValue=null;callbackArg=null;originRemote=false;originMember=w1-gst-dev18(edgegemfire4_w1-gst-dev18_10648:10648:loner):9766:1aa5f04e:edgegemfire4_w1-gst-dev18_10648;callbacksInvoked;version={v3; rv5; mbr=b90d31569c3243e8-8a2bcb43babe154a; ds=1; time=1401483850674; remote};id=EventID[threadID=2;sequenceID=0]]
whereIWasRegistered: 10648
event.getKey(): Object_395
event.getOldValue(): null
event.getNewValue(): null
event.isLoad(): false
event.isLocalLoad(): false
event.isNetLoad(): false
event.isNetSearch(): false
event.isConcurrencyConflict(): false
event.getDistributedMember(): w1-gst-dev18(edgegemfire4_w1-gst-dev18_10648:10648:loner):9766:1aa5f04e:edgegemfire4_w1-gst-dev18_10648
event.getCallbackArgument(): null
event.getRegion(): /testRegion
event.isDistributed(): true
event.isExpiration(): false
event.isOriginRemote(): false
Operation: DESTROY
Operation.isLoad(): false
Operation.isLocalLoad(): false
Operation.isNetLoad(): false
Operation.isNetSearch(): false
Operation.isPutAll(): false
Operation.isDistributed(): true
Operation.isExpiration(): false

[info 2014/05/30 14:04:10.674 PDT <vm_10_thr_10_edge4_w1-gst-dev18_10648> tid=0x51] Done calling remove with key Object_395 value null, return value is true
```
Events fired in remote members are fired on asynchronous threads. In the case of clients, this asynchronous thread provides the identity of the server hosting the client's HARegionQueue. In this case bridgegemfire5_w1-gst-dev18_79056.
```
[info 2014/05/30 14:04:10.674 PDT <Cache Client Updater Thread on w1-gst-dev18(bridgegemfire5_w1-gst-dev18_79056:79056)<v60>:3080 port 27043> tid=0x26] Invoked util.SilenceListener for key Object_395: afterDestroy in edge3
event=EntryEventImpl[op=DESTROY;key=Object_395;oldValue=null;newValue=null;callbackArg=null;originRemote=true;originMember=w1-gst-dev18(:loner):9766:1aa5f04e;callbacksInvoked;version={v3; rv5; mbr=b90d31569c3243e8-8a2bcb43babe154a; ds=1; time=1401483850674; remote};id=EventID[threadID=2;sequenceID=0];isFromServer]
whereIWasRegistered: 53128
event.getKey(): Object_395
event.getOldValue(): null
event.getNewValue(): null
event.isLoad(): false
event.isLocalLoad(): false
event.isNetLoad(): false
event.isNetSearch(): false
event.isConcurrencyConflict(): false
event.getDistributedMember(): w1-gst-dev18(:loner):9766:1aa5f04e
event.getCallbackArgument(): null
event.getRegion(): /testRegion
event.isDistributed(): true
event.isExpiration(): false
event.isOriginRemote(): true
Operation: DESTROY
Operation.isLoad(): false
Operation.isLocalLoad(): false
Operation.isNetLoad(): false
Operation.isNetSearch(): false
Operation.isPutAll(): false
Operation.isDistributed(): true
Operation.isExpiration(): false
```