IntelliJ IDEA 2017.1 Help

Advanced debugging scenarios

Debugging Asynchronous flow

Reactive programming is increasingly popular and with the help of many frameworks and libraries out there, developers are writing a lot more asynchronous code.
Following the flow of an asynchronous application is a major challenge for debugging tools and the developers who use them. Execution jumps between threads and frames, which makes the code harder to understand and follow.
Stepping forward is the easier bit. We can insert breakpoints at different points in the code and regardless of the executing thread, see the progress from one code snippet to the next.
When doing that, pay attention to whether a breakpoint suspends just a thread or the entire application. The decision is based on the goal of the debugging session. If you want to check the state of all threads and see what thread has progressed and what thread might be unnecessarily waiting, you can freeze the entire system at this point and view call frames and stack traces of all threads.
If you're debugging a specific action, you can suspend just one thread and let the rest of the system keep working.

Async Stacktraces

The real pain with asynchronous debugging starts when we want to look back from a specific point in the code and understand how we got here. Consider the example of asynchronous code below (using JDK's CompletableFutures):

private void asyncExample() throws InterruptedException, ExecutionException {
    final CompletableFuture<String> future = supplyAsync(() -> "F")
            .thenApplyAsync(this::append_oo);
    System.out.println(future.get());
}

private String append_oo(String str) {
    return str + "oo";
}
When we stop inside the method append_oo, we can see that the stack trace gives us very little information.
Specifically, we can't see the future applied above it, nor can we see the asyncExample method that started it all.

In an asynchronous context, a stack trace shows only a very limited picture. What we really need is the flow of information between the threads, or the combined stack traces of all the threads that got us to this point (also known as the causality chain).

IntelliJ IDEA provides a way to view those Async stacktraces.
While in a debugging session, IntelliJ IDEA captures stacktraces and shows them later, when viewing the stacktrace of the next part of the asynchronous flow:


Or even clearer when we filter the external libraries:

The stacktraces to capture and the points at which to insert them need to be configured in the debugger preferences under Async Stacktraces.
IntelliJ IDEA needs to know the class name and the method that appear at the top of the stack trace we want to capture. In our example, this is java.util.concurrent.CompletableFuture thenApplyAsync. The debugger also needs to know the position in the other stack trace where the captured stack trace should be inserted. In our example, it is java.util.concurrent.CompletableFuture$AsyncApply exec.
In order to match the two stack traces, we also need two keys, one for each context, which will point to the same value when one stack trace is indeed the next logical step in our async chain. In our example, the key expression we can use is param_0 (parameter 0, the method reference we pass into thenApplyAsync). It will be matched with fn, the variable inside java.util.concurrent.CompletableFuture$AsyncApply that holds the function in the second call frame.

This only needs to be configured once in the debugger settings, and configurations for common async frameworks, such as the CompletableFuture used here, will be provided.

Debugging multithreaded applications

Multithreaded applications are the biggest challenge to debug. These applications are a lot less deterministic and much harder to control. The illusion of a sequential flow we get from stepping in a debug session does not help either and can be misleading.

In a multithreaded environment, and in particular when investigating issues that can be concurrency bugs, we need to try to step less and fine-tune our breakpoints more. This is because a lot of concurrency bugs depend on a specific interaction between different threads, and an intrusive debugging session will interfere with that. We'll show how using various Breakpoint properties allows us to limit the interference to a minimum.

The other important topic is controlling and switching between different threads in the application. We'll go through some examples of debugging different concurrency bugs to demonstrate how IntelliJ IDEA's features help with this.

Controlling a breakpoint

IntelliJ IDEA debugger properties allow us to control the actions taken when a breakpoint is triggered. Some of them define an action, and others are there to add further conditions on whether to take the action at all. This fine level of control of the breakpoints is critical for concurrency bugs, because most will only be reproduced when threads interact in a very specific way. Any interference of the breakpoints may prevent us from reproducing the bug.

Breakpoint actions

Deciding on the breakpoint action depends on what we want to achieve in the debugging session.
Suspend entire VM
If we want to view all threads, with their state and call frames, when one thread reaches a specific point in the code, or when a certain condition occurs, then suspending the entire VM is our best approach.
Suspend a thread
In many cases, suspending only one thread and not the whole VM is preferable. This is especially true when the application is part of a larger system and suspending the VM will cause either an overflow of messages waiting to be served, or request timeouts that end up breaking the entire system. When we have many worker threads, it is better to keep almost all working and focus on just one thread which is interesting to us.
When we deal with a concurrency bug, any suspension of execution may prevent us from reproducing the bug. We can opt to make the breakpoint not suspend anything, and just log either a message or the value of a particular expression to the console, then inspect the log. This works well when we have a strong theory about what exactly we are looking for.

Restrict breakpoint with conditions

Apart from being convenient, breakpoint conditions let us minimize the intrusive nature of the debugging session. They allow us to limit the breakpoint actions to only what we see as absolutely essential.

Conditional expression
Conditional expressions are the most widely used condition. They allow us to trigger the breakpoint only when our application reaches a specific state. Ideal if we can define an expression that captures when things start to go wrong.
Pass count
Pass count is useful in code that is run many times, such as an event handler or a loop, when the interesting scenario we're after only manifests itself after a specific number of passes.
Remove once hit
We use this when the code is being hit many times but only the first case is interesting. The Remove Once Hit option is especially useful in two scenarios:
  • When the breakpoint action is to log rather than suspend, which means we don't have the ability to remove or disable the breakpoint after it was hit.
  • When the code is executed by many threads and we only want to suspend one of them.
Dependency on other breakpoint
A very useful feature. Its obvious use is as a filter to triggering a breakpoint in a scenario where we're interested in a visit to a method, or a specific state in the code only after another state was reached. But as well as that, we can use it to reproduce a particular concurrency issue, as it can help us suspend threads and control which thread reaches which particular line in the code and in which order.
Class and instance filters
Allows us to filter the triggering by class or by a specific instance.
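To make the Conditional expression option concrete, here is a hedged sketch (the class, message format, and predicate are hypothetical, not from this guide) of a processing loop where a conditional breakpoint would suspend only on the interesting events:

```java
public class ConditionDemo {

    // Suspending on every message would be far too intrusive. Instead, a line
    // breakpoint inside the loop with the condition  msg.startsWith("ERR")
    // triggers only on the failing events and lets the rest run normally.
    static int process(String[] messages) {
        int errors = 0;
        for (String msg : messages) {
            // <- set the breakpoint here, condition: msg.startsWith("ERR")
            if (msg.startsWith("ERR")) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(process(new String[]{"ok", "ERR timeout", "ok", "ERR oom"})); // prints 2
    }
}
```

The same predicate works as a logging expression if suspending is too intrusive.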

Debugging long running scenarios

Method Breakpoint and Field Watchpoint slow down code execution considerably. When executing the same code a huge number of times, even conditional breakpoints slow down the processing enough for it to be noticeable. This is a real issue because the scenario of an event handler processing millions of events is fairly common (think of replaying a journal file or processing huge production log files), and evaluating a breakpoint condition inside that event handling code can slow down the system to an unusable state.

To overcome that, assuming we can modify the running code, we can improve the speed by employing a little trick we shall call "breakpoint in code". This trick is very useful when we debug the processing of millions of events where only one causes a problem and we have no idea in advance which is the problematic one, and it can save us a lot of waiting for a conditional breakpoint to be triggered.

The fastest code is the executed code that was compiled and optimized by the JVM. We want to use that fact, so instead of writing a condition on a breakpoint, we introduce it into our executed code in a way that we can manipulate later. We then debug without any breakpoint, thus running in the fastest way a debugging session can run, and introduce a breakpoint while the code is running only when we actually know we will hit it.

Breakpoint in code

  1. We introduce a loop to the code with our condition. This means we enter the loop only if the interesting state occurs. We then print something to the console so we will know when the code has entered the loop. Because the loop does not change any state, once we enter the loop we will stay inside it.
    while (bugCondition(msg)) {
        System.out.println("gotcha!");
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // ignore
        }
    }
  2. At this point we initiate the debug session, sit and wait for the "gotcha!" to appear. The console will show us we "hit" the "breakpoint".

    Console shows breakpoint hit
  3. When the "gotcha!" does appear, we introduce a real line breakpoint inside the loop. The breakpoint will be hit and suspend the VM or thread. Now we can inspect the event and its state. If inspecting is not enough, the last thing to do is to make our code exit the loop. There are two options to do that.

    • We can take advantage of Evaluate Expression's intrusive nature to evaluate a code fragment that will actually modify the loop condition to false. This is easily done if we use a field or variable as the condition of the loop, since we can then modify its value.

      boolean enterLoop = bugCondition(msg);
      while (enterLoop) {
          System.out.println("gotcha!");
          try {
              Thread.sleep(1_000);
          } catch (InterruptedException e) {
              // ignore
          }
      }
      Use Evaluate Expression to stop the loop
    • We can exit the loop by using another feature of a debugging session, HotSwap. This allows us to modify the running code during debugging, compile it and then IntelliJ IDEA will hot swap the debugged classes with the new version. All we need to do is change the loop condition to 'false'. By default, IntelliJ IDEA will detect that a class has a new version and will ask us whether to reload the class with the new version.

      Reload classes?

      Once the new version is loaded, the new loop condition will make the code exit the loop and we can continue debugging from that point. You can either put another breakpoint after the loop to suspend the execution again or just step through the 'false' loop condition.

Looking for a race condition

A race condition is a common issue in multithreaded applications. Multiple threads access and modify the same state, potentially corrupting it or causing undesired flow. A race condition can be a very subtle bug and is usually hard to reproduce. That is because it only occurs when the threads execute the code in a very specific order. Other execution orders will look fine and not cause any issues.

When looking for a race condition among threads, our debug run must start out as minimally intrusive as possible, so as not to interfere with the execution order. Once we gain some information, or have an assumption about the execution order that causes the bug, we can also use the debug features to reproduce it using a dependency between breakpoints.

Detecting race condition resulting in a corrupted state

Sometimes race conditions only occur once in every ten or a hundred runs of the system. If we suspect there is a race condition in our multithreaded code, we must always make sure that the intrusive nature of the debugging session does not make the issue non-reproducible. For example, here we've created a system of publishers and subscribers; however, all our subscribers share a primitive (and not thread-safe) counter to count the total number of consumed messages.

private class Subscriber implements Runnable {
    @Override
    public void run() {
        while (true) {
            String msg = messageQueue.poll();
            if (msg != null) {
                if (msg.equals(STOP)) {
                    break;
                } else {
                    // race condition right here!
                    counter++;
                }
            }
        }
    }
}

Once we've made sure the issue can be reproduced in the debug mode, we try setting a breakpoint with logging instead of suspending the program execution. Here again, just the fact we are logging from all threads to the same console may "synchronize" the threads in such a way that will "solve" the bug. We need to be sure we can still reproduce it even if now it might take more attempts. Logging the suspected state can narrow down our options and allow us to see that the problem is not with the number of calls to the method but with the counter field.

Avoiding debugger overhead

A race condition such as the one in our previous example will, on most machines, turn out to be a "subtle" race condition. By "subtle" we mean that any modification or change to the runtime environment can "fix" it. Remember that the origin of the bug is the fact that advancing a primitive counter is not an atomic operation.

// race condition right here!
counter++;

To reproduce the bug we need two threads, both reading the same value: the "second" thread must read the value before the "first" one updates it and flushes its CPU cache. Easy enough to create on multi-core machines during a normal run, but almost impossible to reproduce in a debugging session.
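Because counter++ is really a read followed by a write, the bad interleaving can be replayed by hand. The following minimal, deterministic simulation (hypothetical names, intentionally single-threaded) shows how one update gets lost:

```java
public class LostUpdateDemo {
    static int counter = 0;

    // Replay the racy interleaving step by step: both "threads" read the
    // counter before either one writes it back, so one increment is lost.
    static int simulateLostUpdate() {
        int readByFirst = counter;   // thread 1 reads 0
        int readBySecond = counter;  // thread 2 reads 0, before thread 1 writes
        counter = readByFirst + 1;   // thread 1 writes 1
        counter = readBySecond + 1;  // thread 2 overwrites with 1 -- update lost
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(simulateLostUpdate()); // prints 1, not 2
    }
}
```

In the real application the two reads happen on separate cores, which is exactly why the timing is so fragile.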

Logging, via a breakpoint, at that same point synchronizes the threads, as they all need to write to the same log. This also flushes the CPU caches of all threads, as writing to the log is atomic. In short, it prevents us from reproducing the bug. Suspending either the VM or the thread cannot help us here either, as we can't separate the two instructions (reading the counter value and incrementing it) to break between them. At this point, we need to make some assumptions then prove or refute them. Since we cannot use any breakpoints, our only hope is that we can change the actual executed code and introduce new code that will be compiled and therefore will interfere less.

This is very much a last resort option. A good pattern to help us here is a trace buffer.

Trace buffer

We can introduce an internal buffer and store the interesting values in this buffer. Think of it as a localized, very efficient in memory log. We must make sure that:
  • We have a buffer per thread, and those buffers are isolated so this does not introduce new concurrency issues.
  • Because the buffer is per thread, it does not need to be thread-safe; in fact, it must not be, because we want to avoid introducing any synchronization points.
  • The values we insert must not be references to a real state that can change, but copies or log messages.
  • The introduced code is as minimal as possible, to minimize its effect on the running code.
  • We print or log the contents of the buffers only after the execution has ended, to avoid making the logging action a synchronization mechanism between threads. Another option is to only store the values in the trace buffer, then inspect its contents by putting a breakpoint after the critical part of the code has finished executing.
private class Subscriber implements Runnable {
    private int index = 0;
    private final int[] traceBuffer =
            new int[NUMBER_OF_SUBSCRIBERS_AND_PUBLISHERS * 100];

    @Override
    public void run() {
        while (true) {
            String msg = messageQueue.poll();
            if (msg != null) {
                if (msg.equals(STOP)) {
                    break;
                } else {
                    traceBuffer[index++] = counter;
                    // race condition right here!
                    counter++;
                }
            }
        }
    }
}

For example, here we introduced a primitive int array large enough for all messages, and in order to prove our suspicion of a bug in the counter, we store just the counter values before advancing it. It may not be the exact value the counter is advanced to, but it will prove our assumption if the same counter value is reported by several threads. After all events are completed, we can inspect the trace buffers and find the duplicates.

Detecting a race condition resulting in unexpected flow control

private class Subscriber implements Runnable {
    @Override
    public void run() {
        String msg;
        while (true) {
            msg = messageQueue.poll();
            if (msg != null) {
                if (msg.equals(STOP)) {
                    break;
                }
                // else do something
            }
        }
        // Will NOT work with multiple subscribers, as main thread will
        // wake up when the first subscriber is done.
        // Using a CountDownLatch here is a much better approach.
        synchronized (messageQueue) {
            messageQueue.notify();
        }
    }
}

In this example we have another race condition, but the contended shared state is not directly visible and is only deduced by what seems like a wrong flow control: the first subscriber will wake up the main thread, which will exit even if the second thread is still processing a message. We can't inspect or print the waiting thread in a 'trace buffer' in this case, but we can inspect the position of various threads when we suspend the entire application.

  1. First, we can suspend the main thread after it wakes up. We can see that sometimes one of the subscriber threads is still marked running. This allows us to assume the problem's origin is in the fact that the notify() method is called too soon.
  2. We can suspend only one of the subscriber threads. That will cause the other thread to notify the main thread. This will prove to us the problem can happen in any subscriber and is in its logic.
  3. To be sure, we can suspend the entire VM earlier, just before a subscriber notifies the main thread. We can then inspect the status of the two subscriber threads and prove that one of them is still polling, while the other has already finished and is about to notify the main thread it's done.

    Suspending the application before main thread notified

    In this screen capture, we can see from inspecting the threads that both are marked 'RUNNING', which means that while the first is about to notify the main thread it is done (inside the synchronized block), the other can still be processing messages.

  4. To prove our assumption beyond any doubt, we can also put a breakpoint inside the polling loop of the subscribers. We make that breakpoint dependent on the previous breakpoint, just before we notify the main thread we're done. Hitting the dependent breakpoint (as shown below) proves our theory.

    Breakpoint inside the subscriber
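As the comment in the subscriber's code already notes, a CountDownLatch is the better tool here. A hedged sketch (names and structure are hypothetical, not the original project's code) of what that fix could look like:

```java
import java.util.concurrent.CountDownLatch;

public class LatchDemo {

    // Each subscriber counts the latch down exactly once when it finishes,
    // and the main thread's await() returns only after ALL of them are done,
    // unlike notify(), which wakes it as soon as the first subscriber completes.
    public static int runSubscribers(int subscriberCount) {
        CountDownLatch done = new CountDownLatch(subscriberCount);
        for (int i = 0; i < subscriberCount; i++) {
            new Thread(() -> {
                // ... poll messages until STOP, as in the subscriber above ...
                done.countDown(); // signal this subscriber's completion
            }).start();
        }
        try {
            done.await(); // returns only when the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return subscriberCount;
    }

    public static void main(String[] args) {
        System.out.println(runSubscribers(3)); // prints 3 after all subscribers finish
    }
}
```

Because the latch counts completions rather than signaling once, the "first subscriber wakes everyone" race disappears by construction.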

Detecting a deadlock

A deadlock occurs when two threads conflict in such a way that each prevents the other from doing any work at all. Once they occur, deadlocks are easy to spot by looking at the frames of all threads. We can do this by using Thread dump. This feature is also available when running normally, and if we know we're chasing a deadlock, the running mode is even preferable to debugging. This is because this way we will not interfere with the execution at all, and the snapshot will be the output of a Java thread dump of the application. A thread dump can detect deadlocks and warn about them. For example, in the dump below we can see the process found 1 deadlock between the PublisherThread (which is stuck at line 44) and the SubscriberThread (at line 78).

Thread dump
  1. In this example, we can see that both threads are stuck waiting for a lock, which means that another thread is not freeing those locks. We can also see that both are waiting for different locks, as the synchronizer id is different. Even more informative is the deadlock summary at the top that tells us what thread holds each lock. We can see that the two deadlocked threads are holding the lock the other thread is trying to obtain.
  2. This should already give us plenty of information on how the deadlock occurs. If it is still unclear how our code reached a deadlock, we can then try debugging with breakpoints just before we hit the lines provided by the thread dump. When we have a theory of what is wrong, we can try reproducing the scenario by using the dependency between breakpoints.
  3. We can now create a Suspend Thread breakpoint on one of those threads and verify, using another thread dump snapshot, that the other thread has reached its deadlock position.

    Breakpoint in Publisher thread
    Now we can inspect the state just before one of the threads gets deadlocked.
  4. Another option is to put suspend thread breakpoints on both threads and switch between them. Inspecting the states of the Publisher and Subscriber in this example will show us the confusion that caused the deadlock.

    Breakpoint in the publisher
    Breakpoint in the subscriber
  5. When we inspect our lock instances, we can see that the concurrent code was actually correct, but we confused the read lock and the write lock when we passed them to the two objects. Look at the lock instance ids in the image above.
  6. Indeed, when we then inspect the construction code (where we injected those locks), we can see the bug:
    ThreadGroup threadGroup = new ThreadGroup("Demo");
    new Thread(threadGroup,
            new Subscriber(messageQueue, readLock, writeLock),
            "SubscriberThread").start();
    // passing locks in the wrong order will cause deadlock between publisher and subscriber
    new Thread(threadGroup,
            new Publisher(messageQueue, writeLock, readLock),
            "PublisherThread").start();
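The deadlock summary a thread dump shows can also be obtained programmatically, via ThreadMXBean.findDeadlockedThreads(). A self-contained sketch (not this project's code) that provokes the same kind of lock-ordering deadlock between two threads and then detects it:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDetector {

    // Provoke a textbook lock-ordering deadlock, then ask the JVM for the
    // same information a thread dump's deadlock summary reports.
    public static int countDeadlockedThreads() {
        final Object lockA = new Object();
        final Object lockB = new Object();
        final CountDownLatch bothHeld = new CountDownLatch(2);

        Runnable grabAthenB = () -> {
            synchronized (lockA) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException ignored) { }
                synchronized (lockB) { } // blocks forever: the other thread holds lockB
            }
        };
        Runnable grabBthenA = () -> {
            synchronized (lockB) {
                bothHeld.countDown();
                try { bothHeld.await(); } catch (InterruptedException ignored) { }
                synchronized (lockA) { } // blocks forever: the other thread holds lockA
            }
        };

        Thread t1 = new Thread(grabAthenB, "PublisherThread");
        Thread t2 = new Thread(grabBthenA, "SubscriberThread");
        t1.setDaemon(true); // daemons, so the stuck threads don't block JVM exit
        t2.setDaemon(true);
        t1.start();
        t2.start();

        try { Thread.sleep(500); } catch (InterruptedException ignored) { }

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println(countDeadlockedThreads()); // expected: 2
    }
}
```

The latch guarantees both threads hold their first lock before either attempts the second, which makes the deadlock reproduce reliably rather than by luck.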

Detecting a livelock

A livelock is a scenario where threads are not blocked but are still unable to make progress. From the outside, a livelock behaves just like a deadlock, but because the threads are not blocked, the snapshot (thread dump) will not alert us to any deadlock.

  1. One strategy to try before starting to debug is to repeat the Thread Dump several times and then compare the stack traces of the various threads. This should give us a clear view of the problematic areas in the code, in most cases a loop that the code cannot escape from.

    Livelock snapshot
  2. If we're still unsure, we can use a Pass Count with a large number that we assume will only be reached in a livelock situation.

    Using pass count
    We can also Step through the code and verify exactly what area of code is being executed repeatedly but not progressing.
  3. At this point we can use a Conditional Expression to capture the point in the execution when we enter the livelock situation (i.e. when the application reaches a state that will prevent it from escaping the executed code block).

    In this example, we assume that the STOP message failed to break us out of the loop: either it was never sent, or it was not handled. So we'll introduce a breakpoint with a condition that looks for a STOP message.

    Using a conditional to capture livelock situation
  4. Our breakpoint is hit, meaning the STOP message was sent but not handled.

    Breakpoint hit
  5. We step in, inspect the state with Evaluate Expression and find the bug. The 'valid' method does not think STOP is a valid message and puts us in this livelock scenario.

    Using Evaluate Expression to test the theory
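For illustration, here is a hypothetical reconstruction of such a buggy 'valid' method (the real implementation is not shown in this guide). It filters out the STOP control message together with genuinely invalid input, so the handler that would break the loop never sees it:

```java
public class ValidBug {
    static final String STOP = "STOP";

    // Bug: STOP is treated like any other payload and rejected as invalid,
    // so the subscriber discards it instead of breaking out of its loop.
    static boolean valid(String msg) {
        return msg != null && !msg.isEmpty() && !msg.equals(STOP);
    }

    public static void main(String[] args) {
        System.out.println(valid("hello")); // prints true
        System.out.println(valid(STOP));    // prints false -- STOP never reaches the handler
    }
}
```

Once Evaluate Expression shows that valid(STOP) returns false, the livelock is fully explained: the loop keeps polling because its exit message is silently dropped.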
Last modified: 5 April 2017