Tutorial: Java Debugging Deep Dive - Help

Overview

Debugging is one of the most powerful tools in any developer's arsenal. It gives us a unique insight into how a program runs and allows us to gain a much deeper understanding of the piece of code we debug. It allows us to trace running code and inspect the state and the flow of the execution. As part of that, it gives us the illusion of a sequential flow. This is very intuitive and powerful but also may be misleading as most modern applications are multithreaded.

"Debugging" suggests we deal with bugs but this is actually a misnomer. The information we get from debugging is useful even when there is no problem with the code. Finding bugs just happens to be a very common use case for the knowledge we can get from a debug session.

The IntelliJ IDEA debugger offers a rich experience that helps us to easily debug anything from the simplest code to complex multithreaded applications.

Before we start, a word of caution: debugging is a very powerful tool but it does come with a cost. The debug process is part of the runtime and therefore affects it. Every evaluation of an expression happens using the same memory of the debugged application, and can modify and potentially corrupt the state. During this tutorial, bear in mind that debugging is an intrusive approach that may affect the outcome of the debugged application. We will explore a few ways to minimize its impact and sometimes even exploit it. The timing of execution is also very different when you debug code compared to running it. The minimal debug tracking overhead in itself may already be enough to change the timing of events and therefore the application behaviour. Every breakpoint or log is a possible synchronization point, and stepping obviously changes the timings significantly. As we are about to see, this becomes a critical issue in multithreaded environments, when sometimes reproducing a bug depends on a very specific sequence of events.

Last point to remember is that debugging is not a substitute for understanding the code. In fact, the only way to learn from a debug session is to constantly compare the information the debugger shows us with our expectations from the code and how we think it "should" behave. Before starting a debugging session we must have some knowledge of what we're trying to achieve by it. If we're looking for a bug, we need to roughly know what is incorrect, i.e. what is different from the expected behaviour or state. In most cases we will also have some initial assumption as to why things are wrong. This will dictate how our debugging session should be conducted. When we debug, we must always compare that information with our expectations, and pay close attention when the code deviates from these expectations.
This is the point where debugging is so effective.
This is the point where we learn.

In this tutorial we try to dive deeper into debugging techniques and assume you are already familiar with the basic concepts such as:

Line breakpoints to suspend the JVM or thread.
Stepping.
Classes configured to be skipped.
Force to step into "skipped code".
Evaluating Expressions
Using a watch
Defining a type renderer

Debugging code that was compiled without the debug flag

Code that was compiled without the debug flag cannot be debugged. There is no way to step into this code. When the debugger encounters such code during a debugging session, it will step over that part of code.

Line breakpoints are also not possible to define and hit. However, this is where the Method Breakpoint might save us, as we can still define in IntelliJ IDEA a breakpoint to stop before entry or exit from a specific method, even if the method itself was compiled without the debug flag.

When viewing the state, since the actual variables within the method cannot be inspected, we will see a warning message instead.

Debugging without source code

If we don't have the source to specific code, IntelliJ IDEA will still decompile the class and show our steps in the decompiled source. This is very helpful, but note that the generated decompiled class may look different from the original, and if the lines do not match, debugging in decompiled code may be confusing. Always try to obtain the source code of the classes you want to step into.

Detecting unexpected state or flow

This section covers what to do if we know where things have already gone wrong, but don't know why.

Exploring the call frames

A Line Breakpoint should be enough for most cases of detecting the cause behind an unexpected call or call with unexpected parameter values to a method. If we're not sure where it's being called from, we can put the breakpoint inside the method. When the VM is suspended, click on the previous call frames to view the call stack and inspect the state in each scope to see how we got here.

Drop frames

If we stepped too far and want to go back up the stack to then re-execute the code, we can use the Drop Frame feature. It's a useful feature, but also potentially dangerous: we must be aware that re-executing the code will execute the same instructions twice, and if those instructions modify state we might end up in a corrupted state, and certainly in a scenario that would not happen in a normal run under the same conditions. To make the impact of Drop Frame obvious, consider this simple program:

public class DropFrameDemo
{
    private static int state = 0;

    public static void main(final String[] args)
    {
        modifyStateBasedOnParameter(state);
        modifyStateBasedOnStaticField();
    }

    // dropping frame within this method,
    // and executing again will print state = 2
    private static void modifyStateBasedOnStaticField()
    {
        state++;
        System.out.println("state = " + state);
    }

    // dropping frame from within this method,
    // and executing again will print state = 1
    private static void modifyStateBasedOnParameter(final int parameter)
    {
        state = parameter + 1;
        System.out.println("state = " + state);
    }
}

Breaking inside modifyStateBasedOnParameter() will not impact the state because IntelliJ IDEA remembers the parameter values passed in to that frame and will not recalculate those. However, breaking inside modifyStateBasedOnStaticField() will make the state field equal '2'. A value which is impossible under a normal run of main().

Detecting unexpected flow by method

An alternative to having a line breakpoint defined within the problematic method is to define a Method Breakpoint. This type of breakpoint is not attached to a source code line number, but to the entry and exit of a call to a method. It is especially useful in two main cases:

When a method is defined by an interface and we want to breakpoint in all implementations of it.
When we don't have the source code, only a decompiled version, and we still want to inspect the ins and outs of a method call, without any confusing differences in line numbers between the compiled class and the decompiled source code.

Detecting unexpected object state

Sometimes its hard to figure out the exact flow that caused a field to get to some unexpected state. In those special cases we can use a breakpoint that will be hit anytime the program either read from or write to a specific field. see Field Watchpoint.

Detecting unexpected exception thrown

Although not strictly a debugging feature, when we want to investigate why an exception was thrown, we can analyse the exception stack trace and quickly get to the line of code that generated that exception. From there the combination of Line Breakpoint and Stepping is usually enough to figure out what is wrong.

Sometimes however, the exception is wrapped in another exception or caught and swallowed by the catch block. All we see are its side effects (perhaps a log) but not its stack trace. For that we can use an Exception Breakpoint.

Debugging Asynchronous flow

Reactive programming is increasingly popular and with the help of many frameworks and libraries out there, developers are writing a lot more asynchronous code.
The flow in an asynchronous application is a major challenge for debugging tools and the developers who use them. The execution jumps between frames and makes it harder to understand and follow the code.
Stepping forward is the easier bit. We can insert breakpoints at different points in the code and regardless of the executing thread, see the progress from one code snippet to the next.
When doing that, pay attention to whether a breakpoint suspends just a thread or the entire application. The decision is based on the goal of the debugging session. If you want to check the state of all threads and see what thread has progressed and what thread might be unnecessarily waiting, you can freeze the entire system at this point and view call frames and stack traces of all threads.
If you're debugging a specific action, you can suspend just one thread and let the rest of the system keep working.

Async Stacktraces

The real pain with asynchronous debugging starts when we want to look back from a specific point in the code and understand how we got here. Consider the example of asynchronous code below (using JDK's CompletableFutures):

private void asyncExample() throws InterruptedException, ExecutionException
{
    final CompletableFuture<String> future = supplyAsync(() -> "F").thenApplyAsync(this::append_oo);
    System.out.println(future.get());
}

private String append_oo(String str)
{
    return str +"oo";
}

When we stop inside the method append_oowe can see that the stack trace gives us very little information.
Specifically, we can't see the future applied above it nor can we see the asyncExample method that for us started it all.

In an asynchronous context, stack traces will only show us a very limited picture and what we really need is the flow of information between the threads or the combined stack traces of all threads that got us to this point (also known as the causality chain).

IntelliJ IDEA provides a way to view those Async stacktraces.
While in a debugging session, IntelliJ IDEA will capture stacktraces and will show them later when viewing the stacktrace of the next part of the asynchronous flow:

Or even clearer when we filter the external libraries:

ij debugging async causality chain filtered

The stacktraces to capture and the point to insert them needs to be configured in the debugger preferences under Async Stacktraces.
IntelliJ IDEA needs to know the class name and the method to appear at the top of the stack trace which we need to capture. In our example java.util.concurrent.CompletableFuture thenApplyAsync. The debugger also needs to know the position in the other stack trace where we want to insert the captured stack trace. In our example it is java.util.concurrent.CompletableFuture$AsyncApply exec.
In order to match the two stack traces we also needs two keys - one for each context which will point to the same value when one stack trace is indeed the next logical step in our async chain. In our example the key expression we can use is param_0 (parameter 0 - the method reference we pass into thenApplyAsync). It will be matched with fn - the variable inside java.util.concurrent.CompletableFuture$AsyncApply that holds the function in the second call frame.

This should be configured in the debugger settings once and configuration for common async frameworks such as CompletableFuture used here will be provided.

Debugging multithreaded applications

Multithreaded applications are the biggest challenge to debug. These applications are not deterministic and much harder to control. The illusion of a sequential flow we get from stepping in a debug session does not help either and can be misleading.
When investigating issues that can be concurrency bugs, we need to try to step less and fine-tune our breakpoints more. This is because a lot of the concurrency bugs depend on a specific interaction between different threads, and an intrusive debugging session will interfere with that. We'll show how using various Breakpoint properties allows us to limit the interference to a minimum. The other important topic is controlling and switching between different threads in the application. We'll go through some examples of debugging different concurrency bugs to demonstrate how IntelliJ IDEA's features help with this.

Controlling a breakpoint

IntelliJ IDEA debugger properties allow us to control the actions taken when a breakpoint is triggered. Some of them define an action, and others are there to add further conditions on whether to take the action at all. This fine level of control of the breakpoints is critical for concurrency bugs, because most will only be reproduced when threads interact in a very specific way. Any interference of the breakpoints may prevent us from reproducing the bug.

Breakpoint actions

Deciding on the breakpoint action depends on what we want to achieve in the debugging session.

If we can define the condition or point in the code where we can get more insight from viewing the entire system state, we should suspend the entire VM.

Sometimes, suspending only one thread and not the whole VM is preferable. This is especially true when the application is part of a larger system and suspending the VM will cause either an overflow of messages waiting to be served, or request timeouts that end up breaking the entire system. When we have many worker threads, it is better to keep almost all working and focus on just one thread which is interesting to us.

When we deal with a concurrency bug, any suspension of execution may prevent us from reproducing the bug. We can opt to make the breakpoint not suspend anything, just log either a message or a value of a particular expression to the console, then inspect the log. This works well when we have a strong theory about what exactly are we looking for.

Restrict breakpoint with conditions

Apart from being convenient, breakpoint conditions let us minimize the intrusive nature of the debugging session. They allow us to limit the breakpoint actions to only what we see as absolutely essential.

Conditional expressions are the most widely used condition. They allow us to trigger the breakpoint only when our application reaches a specific state. Ideal if we can define an expression that captures the exact point when things start to go wrong.

Pass count is useful in code that is being run many times, either an event handler or a loop and the interesting scenario we're after only manifests itself after a specific number of passes.

We use this when the code is being hit many times but only the first case is interesting. The Remove Once Hit option is especially useful in two scenarios:

When the breakpoint action is to log rather than suspend, which means we don't have the ability to remove or disable the breakpoint after it was hit.
When the code is executed by many threads and we only want to suspend one of them.

A very useful feature. Its obvious use is as a filter to triggering a breakpoint in a scenario where we're interested in a visit to a method, or a specific state in the code only after another state was reached. But as well as that, we can use it to reproduce a particular concurrency issue, as it can help us suspend threads and control which thread reaches which particular line in the code and in which order.

Allows us to filter triggering by class or specific instance.

Debugging long running scenarios

Method Breakpoint and Field Watchpoint slow down code execution considerably. When executing the same code a huge number of times, even conditional breakpoints slow down the processing enough for it to be noticeable. This is a real issue because the scenario of an event handler processing millions of events is fairly common (think of replaying a journal file or processing huge production log files) and evaluating a breakpoint condition inside that event handling code can slow down the system to an unusable state. To overcome that, assuming we can modify running code, we can improve the speed by employing a little trick we shall call "breakpoint in code". This trick is very useful when we debug processing of millions of events where only one causes a problem and we have no idea in advance which is the problematic one, and can save us a lot of waiting for a conditional breakpoint to be triggered. The fastest code is the executed code that was compiled and optimized by the JVM. We want to use that fact and so, instead of writing a condition on a breakpoint, we introduce it to our executed code in a way that we can manipulate later. We then debug without any breakpoint, thus running in the fastest way a debugging session can run, and introduce a breakpoint while the code is running only when we actually know we will hit it.

Breakpoint in code

We introduce a loop to the code with our condition. This means we enter the loop only if the interesting state occurs. We then print something to the console so we will know when the code has entered the loop. Because the loop does not change any state, once we enter the loop we will stay inside it.
while (bugCondition(msg)) { System.out.println("gotcha!"); try { Thread.sleep(1000); } catch (InterruptedException e) { //ignore } }

The sleep here is just to avoid bombarding the console with "gotcha!" messages.
At this point we initiate the debug session, sit and wait for the "gotcha!" to appear. The console will show us we "hit" the "breakpoint".
When the "gotcha!" does appear, we introduce a real line breakpoint inside the loop. The breakpoint will be hit and suspend the VM or thread. Now we can inspect the event and its state. If inspecting is not enough, the last thing to do is to make our code exit the loop. There are two options to do that.
- We can take advantage of the Evaluate Expression intrusive nature to evaluate a code fragment that will actually modify the loop condition to false. This is easily done if we use a field or variable as the condition of the loop, since we can then modify its value.
  
  boolean enterLoop = bugCondition(msg); while (enterLoop) { System.out.println("gotcha!"); try { Thread.sleep(1_000); } catch (InterruptedException e) { //ignore } }
- We can exit the loop by using another feature of a debugging session, HotSwap. This allows us to modify the running code during debugging, compile it and then IntelliJ IDEA will hot swap the debugged classes with the new version. All we need to do is change the loop condition to 'false'. By default, IntelliJ IDEA will detect that a class has a new version and will ask us whether to reload the class with the new version.
  
  Once the new version is loaded, the new loop condition will make the code exit the loop and we can continue debugging from that point. You can either put another breakpoint after the loop to suspend the execution again or just step through the 'false' loop condition.

Looking for a race condition

A race condition is a common issue in multithreaded applications. Multiple threads access and modify the same state, potentially corrupting it or causing undesired flow. A race condition can be a very subtle bug and is usually hard to reproduce. That is because it only occurs when the threads execute the code in a very specific order. Other execution orders will look fine and not cause any issues.

When looking for a race condition among threads, our debug run must start as minimally intrusively as possible to not interfere with the execution order. Once we gain some information or have an assumption on what the execution order is that will cause the bug, we can also use the debug features to reproduce it using a dependency between breakpoints.

Detecting race condition resulting in a corrupted state

Sometimes race conditions only occur once in every ten or a hundred runs of the system. If we suspect there is a race condition in our multithreaded code, we must always make sure that the intrusive nature of the debugging session does not make the issue non-reproducible. For example, here we've created a system of publishers and subscribers, however all our subscribers share a primitive (and not a thread-safe) counter to count the total number of consumed messages.

private class Subscriber implements Runnable
{
    @Override
    public void run()
    {
        while (true)
        {
            String msg = messageQueue.poll();
            if (msg != null)
            {
                if (msg.equals(STOP))
                {
                    break;
                }
                else
                {
                    // race condition right here!
                    counter++;
                }
            }
        }
    }
}

Once we've made sure the issue can be reproduced in the debug mode, we try setting a breakpoint with logging instead of suspending the program execution. Here again, just the fact we are logging from all threads to the same console may "synchronize" the threads in such a way that will "solve" the bug. We need to be sure we can still reproduce it even if now it might take more attempts. Logging the suspected state can narrow down our options and allow us to see that the problem is not with the number of calls to the method but with the counter field.

Avoiding debugger overhead

A race condition such as the one in our previous example will, on most machines, turn out to be a "subtle" race condition. By "subtle" we mean that any modification or change to the runtime environment can "fix" it. Remember that the origin of the bug is the fact that advancing a primitive counter is not an atomic operation.

// race condition right here!
counter++;

To reproduce the bug we need two threads, both reading the same value: the "second" thread must read the value before the "first" one updates it and flushes its CPU cache. Easy enough to create on multi-core machines during a normal run, but almost impossible to reproduce in a debugging session.

Logging, via a breakpoint, at that same point synchronizes the threads, as they all need to write to the same log. This also flushes the CPU caches of all threads, as writing to the log is atomic. In short, it prevents us from reproducing the bug. Suspending either the VM or the thread cannot help us here either, as we can't separate the two instructions (reading the counter value and incrementing it) to break between them. At this point, we need to make some assumptions then prove or refute them. Since we cannot use any breakpoints, our only hope is that we can change the actual executed code and introduce new code that will be compiled and therefore will interfere less.

This is very much a last resort option. A good pattern to help us here is a trace buffer.

Trace buffer

We can introduce an internal buffer and store the interesting values in this buffer. Think of it as a localized, very efficient in memory log. We must make sure that:

We have a buffer per thread, and those buffers are isolated so this does not introduce new concurrency issues.
Because the buffer is per thread, it does not need to be thread safe and must not be thread-safe. This is because we want to avoid introducing any synchronization points.
The values we insert must not be references to a real state that can change, but copies or log messages.
The introduced code is as minimal as possible, to minimize its effect on the running code.
We print or log the contents of the buffers only after the execution has ended, to avoid making the logging action a synchronization mechanism between threads. Another option is to only store the values in the trace buffer, then inspect its contents by putting a breakpoint after the critical part of the code has finished executing.

private class Subscriber implements Runnable
{
    private int index = 0;
    private final int[] traceBuffer = new int[NUMBER_OF_SUBSCRIBERS_AND_PUBLISHERS * 100];

    @Override
    public void run()
    {
        while (true)
        {
            String msg = messageQueue.poll();
            if (msg != null)
            {
                if (msg.equals(STOP))
                {
                    break;
                }
                else
                {
                    traceBuffer[index++] = counter;
                    // race condition right here!
                    counter++;
                }
            }
        }
    }
}

For example, here we introduced a primitive int array large enough for all messages, and in order to prove our suspicion of a bug in the counter, we store just the counter values before advancing it. Yes, it may not be the exact value advanced by the counter, but it will prove our assumption if the same counter value is reported by several threads. After all events are completed, we can inspect the trace buffers and find the duplicates.

Detecting a race condition resulting in unexpected flow control

private class Subscriber implements Runnable
{
    @Override
    public void run()
    {
        String msg;
        while (true)
        {
            msg = messageQueue.poll();
            if (msg != null)
            {
                if (msg.equals(STOP))
                {
                    break;
                }
                // else do something
            }
        }
        // Will NOT work with multiple subscribers, as main thread will
        // wake up when the first subscriber is done.
        // Using a CountDownLatch here is a much better approach.
        synchronized (messageQueue)
        {
            messageQueue.notify();
        }
    }
}

In this example we have another race condition, but the contended shared state is not directly visible and is only deduced by what seems like a wrong flow control: the first subscriber will wake up the main thread, which will exit even if the second thread is still processing a message. We can't inspect or print the waiting thread in a 'trace buffer' in this case, but we can inspect the position of various threads when we suspend the entire application.

First, we can suspend the main thread after it wakes up. We can see that sometimes one of the subscriber threads is still marked running. This allows us to assume the problem's origin is in the fact that the notify() method is called too soon.
We can suspend only one of the subscriber threads. That will cause the other thread to notify the main thread. This will prove to us the problem can happen in any subscriber and is in its logic.
To be sure, we can suspend the entire VM earlier, just before a subscriber notifies the main thread. We can then inspect the status of the two subscriber threads and prove that one of them is still polling, while the other has already finished and is about to notify the main thread it's done.

In this screen capture, we can see from inspecting the threads that both are marked 'RUNNING', which means that while the first is about to notify the main thread it is done (inside the synchronizaed block), the other can still be processing messages.
To prove our assumption beyond any doubt we can also put a breakpoint inside the polling loop of the subscribers. We make that breakpoint depended on the previous breakpoint just before we notify the main thread we're done. Hitting the dependant breakpoint (as shown below) proves our theory.

Detecting a deadlock

A deadlock occurs when two threads will conflict in such a way that both are preventing each other from working at all. Once they occur, deadlocks are easy to spot by looking at the frames of all threads. We can do this by using Thread dump. This feature is also available when running. If we know we're chasing a deadlock, the running mode is even preferable to debugging. This is because we will not interfere at all with the execution this way, and the snapshot will be the output of a Java thread dump of the application. A thread dump can detect deadlocks and warn about them. For example, in the dump below we can see the process found 1 deadlock between the PublisherThread (which is stuck in line 44) and SubscriberThread (in line 78).

In this example, we can see that both threads are stuck waiting for a lock, which means that another thread is not freeing those locks. We can also see that both are waiting for different locks, as the synchronizer id is different. Even more informative is the deadlock summary at the top that tells us what thread holds each lock. We can see that the two deadlocked threads are holding the lock the other thread is trying to obtain.
Always name your threads in multithreaded applications according to their function. Anyone needing to look at a thread dump later, will thank you.
This should already give us plenty of information on how the deadlock occurs. If it is still unclear how our code reached a deadlock, we can then try debugging with breakpoints just before we hit the lines provided by the thread dump. When we have a theory of what is wrong, we can try reproducing the scenario by using the dependency between breakpoints.
We can now create a Suspend Thread breakpoint on one of those threads and verify, using another thread dump snapshot, the other thread reached its deadlock position.

Now we can inspect the state just before one of the threads gets deadlocked.
Another option is to put suspend thread breakpoints on both threads and switch between them. Inspecting the states of the Publisher and Subscriber in this example will show us the confusion that caused the deadlock.
When we inspect our lock instances we can see that the concurrent code was actually correct, but we confused the read lock and write lock when we passed them to the two objects. Look at the lock instances ids in the image above.
Indeed, when we then inspect the constructions (where we injected those locks) we can see the bug:
ThreadGroup threadGroup = new ThreadGroup("Demo"); new Thread(threadGroup, new Subscriber(messageQueue, readLock, writeLock), "SubscriberThread").start(); //passing locks in the wrong order will cause deadlock between publisher and subscriber new Thread(threadGroup, new Publisher(messageQueue, writeLock, readLock), "PublisherThread").start();

Detecting a livelock

A livelock is a scenario where threads are not blocked, but still unable to make a progress. From the outside, a livelock should behave just like a deadlock, but because the threads are not blocked, the snapshot (thread dump) will not alert us on any deadlock.

One strategy to try before starting debugging, is to repeat the Thread Dump several times then compare the stack traces for various threads. This should give us a clear view of the problematic areas in the code, in most cases a loop that the code cannot escape from.
If we're still unsure, we can use a Pass Count with a large number that we assume will only be reached in a livelock situation.

We can also Step through the code and verify exactly what area of code is being executed repeatedly but not progressing.
At this point we can use a Conditional Expression to capture the point in the execution when we enter the livelock situation (i.e when the application reaches a state that will prevent it from escaping the executed code block).

In this example, we assume that the STOP message failed to break us from the loop, so either it was never sent or that it was not handled so we'll introduce a breakpoint with a condition that looks for a STOP message.
Our breakpoint is hit, meaning the STOP message was sent but not handled.
We step in, inspect the state with Evaluate Expression and find the bug. The 'valid' method does not think STOP is a valid message and puts us in this livelock scenario.