Allocation tests: unit testing for memory allocation in the JVM

Posted on Jul 27, 2023

Allocating memory is slow.

In his article “Java is Very Fast, If You Don’t Create Many Objects”, Peter Lawrey nicely explains the effect that allocating objects has on throughput: even if you ignore the GC cost, the allocation alone can reduce your throughput by 25% in the worst case. Add the latency jitter caused by GC, plus less predictable heap usage, and you can see that for applications that are not I/O bound¹, thinking about allocations more explicitly can be quite important.

I work(ed²) at a company with a solution in the fixed-income trading³ space - our system is quite large, built top-to-bottom in Java in a single monorepo using various open-source foundations - and we took a decision early on to be “fast”.

“Fast” is a bit of a loose term: how fast is fast enough? How much effort is it worth putting in to be “fast”? The answer is different for every project, and the word itself has many meanings - throughput, latency, jitter, resource overhead. However, there are some measures you can take when building a system that make it “fast by default”. Martin Thompson calls this “mechanical sympathy”⁴.

# Garbage-free in steady-state

One of the ways we decided to try to be fast by default is to be “garbage-free in steady-state”. Let’s break down what that means:

           garbage          -free        in steady-state
              ^               ^                 ^
       objects that are       |        once the application
       allocated and not      |     is warmed-up and processing
       re-used, therefore     |        messages / requests
       being collected by     |
           the GC             |
                        the absence of

Notice that “once the application is warmed-up” is quite vague - this can be quite hard to define for your project. We will discuss this more later.

We have a great internal description of this:

> The objective is to minimise the rate of allocation, i.e. less than x MB allocated per hour, to limit the need for garbage collection once the system is up and running. Keeping in mind the objective above, we can be more relaxed when handling infrequent events (e.g. a client library successfully logs in), but avoid any allocation when handling potentially frequent events (e.g. handling a price contribution message).
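If you want to check a budget like “less than x MB per hour” in a test, one option is to extrapolate from a measured steady-state run. The helper below is a hypothetical sketch: the class and method names are mine, it uses MiB (1024 × 1024 bytes), and where the byte count comes from (for example, the accumulated size values seen by a sampler) is deliberately left open.

```java
/** Hypothetical helper: extrapolates bytes allocated during a measured run to an hourly rate. */
final class AllocationBudget {
    private static final double BYTES_PER_MIB = 1024.0 * 1024.0;
    private static final double NANOS_PER_HOUR = 3_600.0 * 1_000_000_000.0;

    private AllocationBudget() {
    }

    /** Allocation rate in MiB per hour, assuming the measured run is representative. */
    static double mibPerHour(long bytesAllocated, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return (bytesAllocated / BYTES_PER_MIB) * (NANOS_PER_HOUR / elapsedNanos);
    }

    /** True if the extrapolated hourly rate is at or under the budget. */
    static boolean withinBudget(long bytesAllocated, long elapsedNanos, double budgetMibPerHour) {
        return mibPerHour(bytesAllocated, elapsedNanos) <= budgetMibPerHour;
    }
}
```

For example, 10 KiB allocated during a one-minute steady-state run extrapolates to 0.5859375 MiB (600 KiB) per hour. Extrapolating like this assumes the test traffic looks like production traffic, which is a big assumption.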

# How do you write code that is garbage free?

This is not a blog post on how to write garbage-free code in Java, but to summarise, there are two primary things that need to be done to avoid creating garbage in a Java application:

## Avoid primitive boxing

Integer, Long, Boolean, etc. are all “boxed” primitives, meaning they are regular ints, longs and booleans wrapped in objects - primarily so they can be used generically⁵, or in collections such as ArrayList. Agrona is a library that has various utilities and collections designed for working directly with primitives rather than relying on autoboxing (amongst other things).
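To make the difference concrete, here is a minimal sketch of a growable list that stores values in a primitive int[] rather than as Integer objects. IntList is a hypothetical name (Agrona ships a comparable IntArrayList, but this is not its API); the point is only that add and get never box, and that once the backing array is large enough, adding elements allocates nothing.

```java
import java.util.Arrays;

/** Hypothetical growable list of primitive ints: add/get never box to Integer. */
final class IntList {
    private int[] elements;
    private int size;

    IntList(int initialCapacity) {
        elements = new int[Math.max(1, initialCapacity)];
    }

    void add(int value) {
        if (size == elements.length) {
            // Growing is the only allocation; after warm-up the array is already big enough.
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return elements[index];
    }

    int size() {
        return size;
    }

    int capacity() {
        return elements.length;
    }

    /** Keeps the backing array, so the list can be re-used without reallocating. */
    void clear() {
        size = 0;
    }
}
```

By contrast, calling add(1000) on a List<Integer> autoboxes via Integer.valueOf, which creates a new Integer for values outside its small cached range (-128 to 127 by default).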

## Re-use objects / object pooling

Deserialising a payload into an object to process a message or a request? Keep one instance of the DTO and re-use it once processing is complete, or copy data out of it if you need to retain the data. Storing objects in a list that you are constantly adding to and removing from? Store those objects in a pool⁶ and re-use them. This does introduce risk (what if you forget to reset an object back to its initial state before re-using it?), so good discipline is needed when programming this way, although tooling such as good tests and code generation can alleviate most of the pain.
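A minimal sketch of both ideas, assuming a hypothetical single-threaded ObjectPool built on java.util.ArrayDeque and a hypothetical PriceMessage DTO. Passing the reset action to the pool means the “forgot to reset” risk is handled in one place rather than at every call site. This is an illustration, not how any particular pool library works.

```java
import java.util.ArrayDeque;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical single-threaded pool: acquire() hands back previously released instances. */
final class ObjectPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final Consumer<T> reset;
    private int created;

    ObjectPool(Supplier<T> factory, Consumer<T> reset) {
        this.factory = factory;
        this.reset = reset;
    }

    T acquire() {
        T instance = free.pollFirst();
        if (instance == null) {
            // Only happens while the pool is still warming up (or demand has grown).
            created++;
            instance = factory.get();
        }
        return instance;
    }

    void release(T instance) {
        reset.accept(instance); // always restore the initial state before re-use
        free.addFirst(instance); // ArrayDeque only allocates if its internal array must grow
    }

    int createdCount() {
        return created;
    }
}

/** Hypothetical mutable DTO for a price message, re-used rather than re-allocated. */
final class PriceMessage {
    long instrumentId;
    long priceMantissa; // hypothetical fixed-point price representation
    int quantity;

    void reset() {
        instrumentId = 0;
        priceMantissa = 0;
        quantity = 0;
    }
}
```

Usage is acquire, fill, process, release; after the first message, the same PriceMessage instance keeps coming back.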

# The problem: how do you know if you are actually garbage-free?

It’s all well and good having the desire to be garbage-free - but like in all things when building software, how can you be sure you’ve done it correctly? Well, tests, of course!

How do you test that you’re “garbage-free”?

1. Instantiate your system under test
2. Warm the system up by repeating a test scenario or mix of test scenarios a number of times
3. Repeat the scenarios again
4. Assert no memory allocations happened during step 3
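Those four steps fit into a small reusable helper. The sketch below is hypothetical: AllocationCounter stands in for whatever your harness uses to count allocations on the monitored threads (for example, a counter fed by an allocation-instrumenter sampler), and the iteration counts are parameters you would tune per scenario.

```java
/** Hypothetical view of "allocations seen on monitored threads since the last reset". */
interface AllocationCounter {
    void reset();

    long allocations();
}

/** Hypothetical helper for steps 2-4; step 1 (building the system) happens beforehand. */
final class GarbageFreeCheck {
    private GarbageFreeCheck() {
    }

    static void assertGarbageFree(Runnable scenario, int warmupIterations, int measuredIterations,
                                  AllocationCounter counter) {
        // Step 2: warm up, so classes load, pools fill and buffers reach their high-water mark.
        for (int i = 0; i < warmupIterations; i++) {
            scenario.run();
        }

        // Step 3: only start counting now, then repeat the same scenario.
        counter.reset();
        for (int i = 0; i < measuredIterations; i++) {
            scenario.run();
        }

        // Step 4: anything counted during step 3 is steady-state garbage.
        long allocations = counter.allocations();
        if (allocations != 0) {
            throw new AssertionError(allocations + " allocation(s) after warm-up");
        }
    }
}
```

Running a deliberately allocating scenario through the same helper, and checking that it fails, is a cheap way to falsify the harness itself.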

Easy, right? Let’s break it down.

# Instrumenting allocations in the JVM

Before you can think about writing tests for how much memory allocation is occurring in your system, you first need to understand how you can even see and hook into these memory allocations. Google have an amazing library called allocation-instrumenter that is loaded as a Java Agent⁷ and uses ASM to rewrite the bytecode of your application. Once the agent is loaded, you register a “sampler” with the library, and your sampler is called every time an object is allocated.

# Instantiating your system under test

This might be a single class, a single module, or, in this project’s case, a large portion of an event-sourced system with a few stubs at the I/O boundary - but you should be able to instantiate the system you want to test in-process. Threading is a big issue here. This sort of testing is a lot easier with a single-threaded system under test, or one where threading is managed outside the system itself; if your system under test creates a lot of threads internally, it can be hard to isolate its memory allocations.

The reason for this becomes clear when you look at what a sampler for the allocation instrumenter looks like:

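```java
import com.google.monitoring.runtime.instrumentation.AllocationRecorder;
import com.google.monitoring.runtime.instrumentation.Sampler;

// LOGGER is whatever logger your project already uses.
// Note that building these log strings allocates too (see the re-entrancy tip below).
AllocationRecorder.addSampler(new Sampler()
{
  @Override
  public void sampleAllocation(int count, String desc, Object newObj, long size)
  {
    LOGGER.info("I just allocated the object " + newObj + " of type " + desc + " whose size is " + size);

    // count is -1 unless an array is being allocated
    if (count != -1)
    {
      LOGGER.info("It's an array of size " + count);
    }
  }
});
```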

How do you know whether a given allocation happened in your production code (where less allocation is better) or in your stub / test harness code (where you would generally not try to reduce allocations, keeping the test code simple so that it is obviously correct)? One way is inspecting the stack trace, but that’s slow and finicky. However, because the instrumenter works by hooking into allocations, your sampler is always called on the same thread that made the allocation. You can therefore keep a list of the threads you are monitoring, and accumulate allocations only for those - a nice and neat solution.
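Here is a minimal sketch of that thread-filtering idea. To keep it self-contained, AllocationListener is a hypothetical interface with the same method shape as the instrumenter's Sampler; in a real harness you would implement com.google.monitoring.runtime.instrumentation.Sampler instead and register the counter with AllocationRecorder.addSampler. The counting path only does a set lookup and updates AtomicLong fields, which should not allocate, so it sidesteps the re-entrancy problem described under the tips further on.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in with the same shape as the instrumenter's Sampler callback. */
interface AllocationListener {
    void sampleAllocation(int count, String desc, Object newObj, long size);
}

/** Counts allocations only when they happen on threads we have chosen to monitor. */
final class ThreadFilteringCounter implements AllocationListener {
    // Concurrent set: the callback runs on every allocating thread, while the test thread edits it.
    private final Set<Thread> monitored = ConcurrentHashMap.newKeySet();
    // AtomicLong rather than LongAdder, since LongAdder may allocate cells under contention.
    private final AtomicLong allocations = new AtomicLong();
    private final AtomicLong bytes = new AtomicLong();
    private volatile String lastDescription;

    void monitor(Thread thread) {
        monitored.add(thread);
    }

    void stopMonitoring(Thread thread) {
        monitored.remove(thread);
    }

    @Override
    public void sampleAllocation(int count, String desc, Object newObj, long size) {
        // The callback runs on the allocating thread, so this tells us where the allocation came from.
        if (!monitored.contains(Thread.currentThread())) {
            return; // e.g. a stub, the test harness itself, or a JVM background thread
        }
        allocations.incrementAndGet();
        bytes.addAndGet(size);
        lastDescription = desc; // a reference copy, not an allocation
    }

    long allocations() {
        return allocations.get();
    }

    long bytes() {
        return bytes.get();
    }

    String lastDescription() {
        return lastDescription;
    }

    void reset() {
        allocations.set(0);
        bytes.set(0);
        lastDescription = null;
    }
}
```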

# Warming up

When attempting to reduce or eliminate steady-state allocations in a Java app, some things need “warming up”, and they will still cause allocations early in your application’s lifecycle. First of all, your application needs to boot, load classes, etc.; then it allocates arrays, buffers, HashMap nodes and all sorts of other things, depending on your code. In the same way that you need to warm up the JIT before running microbenchmarks (one of the things JMH helps you with), you need to warm up your application before it is ready to be allocation tested. In this project we still allow ourselves expandable array buffers, expandable hash maps, etc., so if your state grows, you will still allocate more. You could allocate all of this up front and have absolutely no allocation post-warmup if you were so inclined⁸, but that’s difficult and takes a lot of effort for minimal incremental gains (for most purposes, anyway).

Therefore, run the scenarios you would like to test a few hundred times against your application, until they are recycling all of their re-used objects, and only enable your allocation test harness once that warm-up is done. Then run the scenarios again, and you can start to make assertions about which allocations do or do not happen.

# An aside: the test harness

If you’ve seen my post on testing with internal DSLs, then you’ll understand what I think of when I hear “test harness”. I want to be able to write a specification (a test) for a unit or system in a way that isn’t directly interfacing with said system, and then have some sort of test harness directly interface with the system in a way that allows me to add pre-test setup, various invariants, etc. Even if you don’t go so far as to write a DSL in the spirit of that post, or you want to but haven’t got there yet, you’ll still want some sort of rudimentary bespoke harness for allocation testing - adding and removing samplers, isolating production allocations however it is you’d like to achieve that, running assertions on what allocations have happened - otherwise your tests will very quickly become difficult to manage.

# Running the allocation test, and asserting

Now that your system is warmed up, add a sampler and run the scenarios again. There are lots of tricks that make stamping out allocations easier: throwing an exception in your sampler to get a stack trace for each allocation, accumulating the class names being allocated and asserting on them at the end, ignoring certain kinds of allocation you find acceptable. Do whatever works best for your system.
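As one example of the accumulate-then-assert approach, here is a sketch of a hypothetical AllocationTally that counts allocations by type description and skips an explicit allow-list of types you have decided are acceptable. Its map bookkeeping allocates, so if you call it from a real sampler it needs the pause-and-resume guard described under “Samplers are re-entrant”, and it assumes it is only fed allocations from the monitored thread.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical tally of allocations by type, ignoring an allow-list of tolerated types. */
final class AllocationTally {
    private final Set<String> allowed;
    // TreeMap keeps the failure message sorted and stable between runs.
    private final Map<String, Long> countsByType = new TreeMap<>();

    AllocationTally(Set<String> allowed) {
        this.allowed = new HashSet<>(allowed);
    }

    /** Call from the sampler callback, behind a re-entrancy guard, since this allocates. */
    void record(String desc) {
        if (!allowed.contains(desc)) {
            countsByType.merge(desc, 1L, Long::sum);
        }
    }

    /** Fails with a per-type summary if anything outside the allow-list was allocated. */
    void assertNoUnexpectedAllocations() {
        if (!countsByType.isEmpty()) {
            throw new AssertionError("Unexpected allocations after warm-up: " + countsByType);
        }
    }

    Map<String, Long> snapshot() {
        return new TreeMap<>(countsByType);
    }
}
```

A summary such as {java/lang/String=2} at the end of a run is usually a quicker starting point than a single stack trace, and the stack-trace trick can then be pointed at the one type you are hunting.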

# A few helpful tips and tricks

## Samplers are re-entrant

What does “re-entrant” mean in this context? Essentially, if you allocate inside the sampler callback, your sampler callback is called again for that allocation. This can cause infinite loops, so you have to be careful. One solution is to “pause” your allocation sampler at the start of the callback handler and “resume” it at the end:

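```java
import java.util.concurrent.atomic.AtomicBoolean;

import com.google.monitoring.runtime.instrumentation.AllocationRecorder;
import com.google.monitoring.runtime.instrumentation.Sampler;

AtomicBoolean paused = new AtomicBoolean(false);

AllocationRecorder.addSampler(new Sampler()
{
  @Override
  public void sampleAllocation(int count, String desc, Object newObj, long size)
  {
    if (paused.get())
    {
      return; // early return! this allocation came from the sampler itself
    }

    // obviously be aware of your threading model here, as discussed earlier
    // generally it's easiest if each sampler is only used on a single thread
    // if (Thread.currentThread() != targetThread) { return; }

    paused.set(true); // pause the sampler
    try
    {
      new Object(); // do some allocations
    }
    finally
    {
      paused.set(false); // right at the end, resume it again (even if we threw)
    }
  }
});
```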

Good practice may be to have a wrapper class around these samplers that you use everywhere, so that you can enforce that pausing for re-entrancy always happens.
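The wrapper suggested above could look something like this sketch. AllocationListener is a hypothetical interface with the same shape as the instrumenter's Sampler, so the example runs without the agent; in real use the wrapper would implement Sampler and be passed to AllocationRecorder.addSampler. It assumes one monitored thread, which is why a plain boolean is enough (a ThreadLocal flag could itself allocate when first accessed).

```java
/** Hypothetical stand-in with the same shape as the instrumenter's Sampler callback. */
interface AllocationListener {
    void sampleAllocation(int count, String desc, Object newObj, long size);
}

/** Wraps a listener so allocations made while handling a callback do not re-enter it. */
final class ReentrancyGuardedListener implements AllocationListener {
    private final AllocationListener delegate;
    private final Thread targetThread;
    // Only read and written on targetThread, so it needs no synchronisation.
    private boolean inCallback;

    ReentrancyGuardedListener(AllocationListener delegate, Thread targetThread) {
        this.delegate = delegate;
        this.targetThread = targetThread;
    }

    @Override
    public void sampleAllocation(int count, String desc, Object newObj, long size) {
        // Ignore other threads first, then ignore allocations made by our own callback.
        if (Thread.currentThread() != targetThread || inCallback) {
            return;
        }
        inCallback = true; // pause
        try {
            delegate.sampleAllocation(count, desc, newObj, size);
        } finally {
            inCallback = false; // resume, even if the delegate throws to capture a stack trace
        }
    }
}
```

Because the guard lives in one class, individual samplers can allocate freely (logging, maps, exceptions) without each one re-implementing the pause and resume.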

## Have a sanity check to make sure the agent is actually loaded

If you don’t load the allocation instrumenter agent, nothing will complain: adding a sampler will not fail, but the sampler will never actually be called. One neat trick is, at the start of your test (or when you first trigger your allocation test harness), to do a quick check that you can actually see allocations happening:

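```java
import java.util.concurrent.atomic.AtomicBoolean;

import com.google.monitoring.runtime.instrumentation.AllocationRecorder;
import com.google.monitoring.runtime.instrumentation.Sampler;

AtomicBoolean works = new AtomicBoolean(false);

Sampler sampler = (count, desc, newObj, size) -> works.set(true);

AllocationRecorder.addSampler(sampler);

new Object(); // this allocation should reach the sampler

AllocationRecorder.removeSampler(sampler);

if (!works.get())
{
  throw new IllegalStateException("Allocation instrumenter agent does not appear to be loaded");
}
```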

Falsification is one of the most important aspects of writing tests (and one of the core reasons for TDD), and this check makes sure that every allocation test can actually fail.

## Don’t run allocation tests in your regular build

You probably want to do quite a lot in your allocation tests: run your scenario 1,000 times, turn on the allocation instrumenter, and then run the scenario another 1,000 times. This can be slow, and not just because of the size of the scenario. In my experiments, the agent slows tests down by 5-10% just by being loaded in the JVM, and depending on what your sampler is doing, that effect could be even stronger. This is to be expected, so run your allocation tests in a separate job, on a less frequent schedule. It’s likely not a big deal if an allocation slips through the cracks in a single commit / PR. If it really is that important to you, just don’t release a version of your system until it has passed the allocation tests.

# Summary

So, there you have it: a fairly niche testing technique laid bare. It’s not for everybody - most services are probably much better off with a thread-per-request, garbage-producing codebase such as the Spring Boot model, as single-threaded asynchronous garbage-free code is extremely tricky and time-consuming to write. If you do write code like that, though, it’s going to be blazing fast.

There are some resources I’ve seen on unit testing memory allocations - a fun one is compiling your C code with an allocator that simply fails - however, I’ve not seen anything like this before. I’d love to hear thoughts or questions on this, so please feel free to email me (blogthoughts at this domain).

1. This is a critical point - most applications today are I/O bound, usually network I/O bound but in many cases also disk I/O bound (such as databases). This is fine! Not everybody needs to work on systems that are CPU bound! This is where runtimes like Go and Node.js shine, when your application is mostly waiting on the network and the amount of raw computation is very low. This blog post is discussing a different type of application, one where we need to be very fast, to fully saturate a CPU and network line with useful logic. Lots of applications in finance and the financial services sector are like this.
2. Depending on when this is read, I may or may not work there - who knows what the future holds. As of writing, I’ve been working there for just over a year.
3. Fixed Income. Mostly medium-to-high frequency depending on the venue and whether it’s Dealer-to-Client or Dealer-to-Dealer - it’s not Jane Street, but it does need to be pretty quick.
4. This is also the name of his blog, which is great reading.
5. This is essentially an artifact of some painful backwards compatibility stuff in Java due to the way generics are implemented via type erasure. The second-order footnote is that allegedly it actually wasn’t due to backwards compatibility at all, but that type erasure was genuinely a better path due to various runtime characteristics, but it seems like there’s not quite consensus on that - but now we’re really getting off track. Ultimately we have to live with Java’s decision.
6. With Stormpot, for example.
7. “In general, a Java Agent is just a specially crafted JAR file. It utilizes the Instrumentation API that the JVM provides to alter existing byte-code that is loaded in a JVM.”
8. …and from what I hear, some people do actually do this for latency jitter purposes. I don’t.