Photo Credit: Victorgrigas / CC-BY-SA-3.0

In the early days of an application, deploying new features and functionality is at the top of the list.  As the number of users grows and traffic increases, other issues make their way to the forefront.  Whether because of complaints from users, monitoring alerts about overloaded machines, or even a whopping AWS bill, it becomes important to be able to answer the question, “What is our application truly spending our servers’ CPU time and resources doing?”

The first advice usually offered is to profile your application.  In the context of a JVM-based application, that would mean using a profiler like YourKit or VisualVM.  That sounds like a good plan, but can you really do that in production?  On already burdened systems, the heavy overhead of profiling is often too much to bear.  Even a lighter-weight method like aspect profiling can impose too much load for production.  And even where the hardware is capable, the scale can be too great for the profiling tools, which may be unable to handle the enormous amount of data collected.

A profiler could be run in the lab at a smaller scale, but the lab and production are different worlds.  Simulating the behavior of client applications in a way that is truly representative of production can be difficult or impossible.  This is particularly true where client requests involve complex state, for example when the client is built into a mobile device’s operating system.  Different devices can behave in surprisingly different ways, and it can be difficult to obtain an example of every kind of device and OS revision.  Added to this, the data processed in production may contain confidential information.  In order to protect your users’ privacy, it cannot be used for testing purposes; with highly sensitive data, it may even be illegal to do so.  An additional complication is that most CPU and memory profiling tools report resources spent by function or by object class, when, to answer production questions, what is often needed is to know what resources were spent processing each request: knowing what caused the resource usage is as important as knowing what resources were used.

Is the profile data obtained in the lab, under a limited load, using simulated clients, scripted requests, and synthetic data going to be able to answer questions about production performance? Unless there is a smoking gun, probably not.

In brief, you need information about the performance and resource usage of your application in the production environment, but you cannot run the profiling tools on production systems.  The profiler can be run in the lab, but the lab setting does not reveal what is truly happening in production.

In this article we will discuss an alternative technique of self-measurement which we have used with success to analyze tricky performance problems for our consulting clients.  It can offer insight into performance and resource consumption in large-scale services with very low overhead, and so is applicable in a production context.

Where to Begin

One of the first and simplest questions that always needs to be answered is, “How long did it take?”  The simplest way to answer it is to time a block of code (represented by the call to the function doSomething), then output the measured elapsed time to a log file:

long start = System.currentTimeMillis();
doSomething();
debug("doSomething() took %d ms", System.currentTimeMillis() - start);

DEBUG: doSomething() took 417ms

Admittedly, this is not a very exciting example, but it would be remiss not to mention that even this simple output answers an important question with negligible impact on performance.

But it cannot tell us many other things we would like to know:  How much CPU was used in that time?  How much memory?  How much of the elapsed time was it really doing work? (If this request is waiting on a condition to be fulfilled, it may be normal for it to take a long time when considered by the wall clock, and the more important metric is how much of that time was true work.)

Fortunately, the JVM offers lightweight ways to obtain CPU and memory resource information that will help us answer those questions.  The standardized java.lang.management.ThreadMXBean interface offers counters for CPU time by thread, measured in nanoseconds.  The Oracle/Sun family of JVMs extends the functionality via the similarly named com.sun.management.ThreadMXBean interface, with a counter for the bytes of memory allocated by the current thread.  Measuring CPU time is pretty self-explanatory, but bytes allocated, especially in a memory-managed language like Java, deserves additional comment.  The count provided is the (approximate) total of bytes allocated onto the heap by the indicated thread.  The count includes both bytes that have been released and those still in use.
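As a concrete sketch of how these counters are obtained: the standard interface is always present, while the allocation counter lives in the HotSpot-specific extension, so its availability should be checked at runtime.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CounterProbe {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        //per-thread CPU time is part of the standard interface
        if (bean.isThreadCpuTimeSupported() && !bean.isThreadCpuTimeEnabled()) {
            bean.setThreadCpuTimeEnabled(true);
        }
        System.out.println("CPU time (ns): " + bean.getCurrentThreadCpuTime());

        //per-thread allocated bytes is a HotSpot extension
        if (bean instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean sun = (com.sun.management.ThreadMXBean) bean;
            if (sun.isThreadAllocatedMemorySupported()) {
                if (!sun.isThreadAllocatedMemoryEnabled()) {
                    sun.setThreadAllocatedMemoryEnabled(true);
                }
                System.out.println("Allocated (bytes): "
                        + sun.getThreadAllocatedBytes(Thread.currentThread().getId()));
            }
        }
    }
}
```

On HotSpot JVMs both counters are typically enabled by default, so in practice this check-and-enable step costs almost nothing.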

The overall byte allocation count is an important metric, even when peak memory usage remains moderate, since allocated memory must still be garbage collected.  It is rather like incurring a debt which must be repaid later.  More allocations will mean more CPU time spent in the garbage collector; it may also cause more frequent stop-the-world events.  Memory allocation can also be used to obtain a rough estimate of the number of objects created.  The better known the type of object being created, the better the estimate that can be derived; but even when the kind is not known, a large byte allocation count can be a strong indication that something in the application is behaving unexpectedly: objects that could be reused are being repeatedly created and destroyed, arrays copied, B-trees rebalanced, etc., all of which take a toll on performance both upfront and again during garbage collection.
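To make that “debt” concrete: the two loops below build the same 5,000-character string and retain the same result, but the per-thread allocation counter exposes the churn of the naive version.  This is a sketch assuming a HotSpot JVM, where the com.sun.management extension is available; the class and field names are our own.

```java
import java.lang.management.ManagementFactory;

public class AllocationChurn {
    //HotSpot extension; assumed present (production code should check first)
    static final com.sun.management.ThreadMXBean BEAN =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    static volatile int sink;                   //keeps the work from being optimized away
    public static long naiveBytes, reusedBytes; //exposed for inspection

    static long allocatedBy(Runnable work) {
        long id = Thread.currentThread().getId();
        long before = BEAN.getThreadAllocatedBytes(id);
        work.run();
        return BEAN.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        //naive: each += copies the entire string built so far
        naiveBytes = allocatedBy(() -> {
            String s = "";
            for (int i = 0; i < 5000; i++) s += 'x';
            sink = s.length();
        });
        //reused buffer: the backing array grows in place instead
        reusedBytes = allocatedBy(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 5000; i++) sb.append('x');
            sink = sb.toString().length();
        });
        System.out.printf("naive: %d kB, reused: %d kB%n",
                naiveBytes / 1024, reusedBytes / 1024);
    }
}
```

The naive loop allocates megabytes to produce a few kilobytes of result, and every one of those intermediate strings becomes garbage-collector work later.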

By taking advantage of these JVM counters, we can instrument our application to obtain a much more detailed picture of its performance and use of system resources.

CPU time and memory allocation are typically tracked from the creation of the thread, so to get useful information about the resources used by a section of code, one cannot use the returned values directly.  Instead, one needs to treat them much as elapsed time was treated in the first example: stash the current values before executing whatever section of code needs to be measured, then compare the values before and after that section of code has finished.  For example:

//the cast assumes com.sun.management.ThreadMXBean is the imported type
ThreadMXBean bean = (ThreadMXBean) ManagementFactory.getThreadMXBean();
//enable the counters, in case they are not already enabled
bean.setThreadCpuTimeEnabled(true);
bean.setThreadAllocatedMemoryEnabled(true);

long threadId = Thread.currentThread().getId();
long start = System.currentTimeMillis();
long startCpu = bean.getCurrentThreadCpuTime();
long startMem = bean.getThreadAllocatedBytes(threadId);

doSomething();

debug("doSomething() took %d ms; CPU: %.3f ms; Mem: %.2f kB",
    System.currentTimeMillis() - start,
    (bean.getCurrentThreadCpuTime() - startCpu) / 1000000.0,
    (bean.getThreadAllocatedBytes(threadId) - startMem) / 1024.0);

DEBUG: doSomething() took 412ms; CPU: 386.187ms; Mem: 2728.33 kB

With negligible additional overhead we have gotten a good deal more information: we now know how much CPU time our thread used and how many bytes of memory it allocated in addition to the elapsed time.  It makes for pretty messy code though.  The lines of instrumentation heavily outnumber the lines of application code.  Instrumenting an entire application this way would lead to code that was nearly unreadable and difficult to maintain.

If we move the boilerplate for stashing the initial counter values, calculating the differences, and formatting the message into a separate class, our application code can stay tidy (full ResourceUsageCounter code):

//create a counter and start it counting
ResourceUsageCounter counter = new ResourceUsageCounter().start();

doSomething();

//stop the counter and log its results
debug("doSomething() took (%s)", counter.halt());

DEBUG: doSomething() took (Lifetime: 369 ms; Active: 369 ms; CPU: 346.172 ms; MemAlloc: 2783.20 kB)

In this output we also introduce the distinction alluded to earlier between ‘Active’ time (how long the counter was actually running) and ‘Lifetime’ (the age of the counter from when it was created until it was halted, or until now, if it has not yet been halted).  The two can differ because the counter no longer has to be started immediately; it can be paused and restarted many times.
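For reference, here is a stripped-down sketch of what such a counter class can look like.  The method names follow the article’s usage (start(), pause(), halt(), the Lifetime/Active distinction); the fields and internal structure are our own, and memory-allocation tracking, split(), and the event callbacks shown later are omitted for brevity, so the linked full version differs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ResourceUsageCounter {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();

    private final long created = System.currentTimeMillis(); //for Lifetime
    private long halted = -1;          //halt() timestamp, -1 while live
    private long activeMs, cpuNs;      //totals accumulated while running
    private long startMs, startCpu;    //values stashed at start()
    private boolean running;

    public ResourceUsageCounter start() {
        startMs = System.currentTimeMillis();
        startCpu = BEAN.getCurrentThreadCpuTime();
        running = true;
        return this;
    }

    public ResourceUsageCounter pause() {
        if (running) {
            activeMs += System.currentTimeMillis() - startMs;
            cpuNs += BEAN.getCurrentThreadCpuTime() - startCpu;
            running = false;
        }
        return this;
    }

    public ResourceUsageCounter halt() {
        pause();
        halted = System.currentTimeMillis();
        return this;
    }

    @Override
    public String toString() {
        long life = (halted < 0 ? System.currentTimeMillis() : halted) - created;
        return String.format("Lifetime: %d ms; Active: %d ms; CPU: %.3f ms",
                life, activeMs, cpuNs / 1_000_000.0);
    }
}
```

The key design point is that pause() folds the current run into the accumulated totals, so Active only grows while the counter is running, while Lifetime is simply the counter’s age.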

The ResourceUsageCounter class also allows for analysis that is finer-grained without needing to create additional counter objects for each section:

//create a counter and start it
ResourceUsageCounter counter = new ResourceUsageCounter().start();

doFirstPart();
//'split' the counter (save the last result but keep timing)
debug("doFirstPart() took (%s)", counter.split().last());

doSecondPart();
//'split' the counter again
debug("doSecondPart() took (%s)", counter.split().last());

//halt the counter and log the results
debug("Doing everything took (%s)", counter.halt());

DEBUG: doFirstPart() took (Active: 172 ms; CPU: 146.111 ms; MemAlloc: 98.21 kB)
DEBUG: doSecondPart() took (Active: 101 ms; CPU: 98.480 ms; MemAlloc: 256.88 kB)
DEBUG: Doing everything took (Lifetime: 273 ms; Active: 273 ms; CPU: 244.591 ms; MemAlloc: 355.09 kB)

Making use of additional language features in Java 8, this can be simplified further:

try (ResourceUsageCounter x = new ResourceUsageCounter().start()) {
    x.onPause((t) -> debug("Part %d took (%s)", t.numPauses(), t.last()));
    x.onHalt((t) -> debug("Total took (%s)", t));

    for (int i = 0; i < numParts; i++) {
        doPart(i);   //one part of the work
        x.pause();   //fires the onPause handler
        x.start();   //resume counting for the next part
    }
}

DEBUG: Part 1 took (Active: 134 ms; CPU: 116.910 ms; MemAlloc: 108.32 kB)
DEBUG: Part 2 took (Active: 141 ms; CPU: 103.112 ms; MemAlloc: 196.68 kB)
. . .
DEBUG: Part 347 took (Active: 160 ms; CPU: 143.121 ms; MemAlloc: 179.22 kB)
DEBUG: Total took (Lifetime: 31490 ms; Active: 38139 ms; CPU: 18784.890 ms; MemAlloc: 51770.98 kB)

Here we use the counter as an AutoCloseable resource and attach some simple lambdas to handle the logging.  One would want to be a bit cautious with this approach, as it could generate a landslide of log output, but it highlights the flexibility of the new counter objects, which can offer detailed resource information without adding a lot of code; hardly any more than in our simple System.currentTimeMillis() – start timer example.

More Complicated Scenarios

As mentioned in the introduction, when dealing with a real-world application in production, one of the most important things to know is what resources are spent servicing each request, from start to finish.  That is key information for capacity planning, establishing performance baselines, and singling out requests which are unusually resource intensive.

Up to now in our simple examples, everything has been executed synchronously.  In a real environment, our operation may be placed into a queue before being processed; it might be executed in part by thread A, put back into the queue, and later finished by thread B (or C, or D, or…).  In some cases, more than one thread may be processing parts of the operation at the same time.  While that work is underway, the operation could be canceled by the client or by an error in another part of the application.  Measuring application performance in such an environment poses challenges.

There are a couple of things that need to be addressed in this more complicated, but more realistic, scenario.  The first is to establish how we can accumulate resource usage over the lifetime of the operation; the second is how to deal with concurrency, since many of the underlying JVM counter values are thread-specific, but the operation needs to support concurrent access by multiple threads.

In this final set of examples (complete code here) we use Servlet API 3 for asynchronous request processing, and Java’s standard ExecutorService to provide a pool of workers for processing the requests.  The ExampleServlet class finds primes two ways: by applying a sieve and by testing random integers for primality.

$ wget -O - -q http://localhost:8080/example

2, 3, 5, 7, 1048601, 11, 1101021611, 13, 1572887, 1502108039, 102237719 . . .

The important takeaway for this example is not the kind of work that it does, but rather that the operation could be processed by multiple workers from the thread pool or closed by a timeout or error event from the servlet container, at any time.  Whatever happens, while the operation is underway the CPU and memory usage spent processing it should be tracked, and when the operation is closed the final results should be reported.

The operation is represented by the MyOperation class:

private class MyOperation implements AsyncListener {
    //our new concurrency-safe resource counter class
    final ConcurrentResourceUsageCounter perf = new ConcurrentResourceUsageCounter();
    //the Servlet API 3 async context
    final AsyncContext context;
    . . .
    void complete() {
        if (!closed.compareAndSet(false, true)) {
            return; //we can only be closed once
        . . .
    void onComplete(AsyncEvent event) {
    void onError(AsyncEvent event) {
        . . . handle error and close the context . . .
    void logResult(String tag) {
        log(String.format("%s: OP-%s :: [%s]", tag, id, perf));
    . . .

ConcurrentResourceUsageCounter is a wrapper around the earlier ResourceUsageCounter, accumulating individual ResourceUsageCounter results together, with a read/write lock synchronizing access to the combined results.  Making ResourceUsageCounter itself concurrency-safe might seem like the obvious choice, but in this instance it is cleaner and more flexible to move that up a layer rather than encumber ResourceUsageCounter with everything required for concurrent access.  It also simplifies the problem to one of mere synchronization; making ResourceUsageCounter safe for concurrent use, though possible, would be more complicated, since the underlying JVM metrics are thread-specific.  Here is part of the class, showing the key workings:

public class ConcurrentResourceUsageCounter {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Lock r = lock.readLock();
    private final Lock w = lock.writeLock();

    //this counter will not be started and stopped; it is the
    //destination for accumulating results from other counters
    private final ResourceUsageCounter main = new ResourceUsageCounter();

    public ResourceUsageCounter start() {
        return new ResourceUsageCounter().start();
    }

    public void add(ResourceUsageCounter c) {
        w.lock();
        try {
            main.add(c); //accumulate this counter's values into the main counter
        } finally {
            w.unlock();
        }
    }
    . . .

When a request arrives, the resulting operation is submitted to a minimum of two worker threads, which process the operation until it reports that enough primes have been found or the operation has been closed (note that most of the error handling has been removed for readability):

executor.submit(() -> {
    try {
        //start the resource counter
        ResourceUsageCounter t = op.perf.start();

        while (!op.isClosed() && !op.isFull()) {
            . . . //could be anything, but for us, find some primes

            //update the counter in the loop. we don't want
            //to wait until it is all over, else the main counter
            //may not have received anything from us before op.complete()
            //is called by another thread.
            op.perf.addLast(t.split());
        }

        //halt the timer
        op.perf.addLast(t.halt());
    } finally {
        //finish out the request
        op.complete();
    }
});
COMPLETE: OP-0d656dc3-3206-4052-aca2-9b468f690304 :: [Lifetime: 684 ms; Active: 369 ms; CPU: 151.669 ms; MemAlloc: 6166.78 kB]

Each worker obtains a ResourceUsageCounter instance and uses it to update the main counter after each loop via the addLast() method, which adds the values of the latest run of the counter rather than the total values.  This allows updating the ConcurrentResourceUsageCounter progressively using a single ResourceUsageCounter object without having to wait till everything is over and the counter is halted.

The full code in GitHub goes a little further and breaks down the resource usage by which prime-finding method was used: either by trial (tilted toward CPU usage) or by the sieve of Eratosthenes (tilted toward memory usage):

Total: [Lifetime: 7728 ms; Active: 13923 ms; CPU: 10333.819 ms; MemAlloc: 47030.79 kB]
    Sieve: [Hits: 220117; Misses: 2829629 :: (Lifetime: 7729 ms; Active: 237 ms; CPU: 148.205 ms; MemAlloc: 18243.85 kB)]
    Trial: [Hits: 80074; Misses: 3174220 :: (Lifetime: 7729 ms; Active: 9334 ms; CPU: 6326.153 ms; MemAlloc: 7509.11 kB)]

Next Steps

Once the basic tools and techniques are in place, all sorts of application-specific improvements might be added.  One might consider extending the ResourceUsageCounter to log automatically on close, making resource measurement for a code block as simple as surrounding it with a try {}:

try (MyBlockCounter x = new MyBlockCounter("mycodeblock")) {
    . . . code in here . . .
}

mycodeblock :: [Lifetime: 987 ms; Active: 21 ms; CPU: 11.223 ms; MemAlloc: 2122.22 kB]
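A minimal version of such a self-logging block counter might look like the following.  This sketch measures only elapsed and CPU time directly rather than extending ResourceUsageCounter, and the class and field names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class MyBlockCounter implements AutoCloseable {
    private static final ThreadMXBean BEAN = ManagementFactory.getThreadMXBean();

    public static volatile long lastActiveMs; //exposed for inspection

    private final String tag;
    private final long startMs = System.currentTimeMillis();
    private final long startCpu = BEAN.getCurrentThreadCpuTime();

    public MyBlockCounter(String tag) {
        this.tag = tag;
    }

    @Override
    public void close() {
        //log automatically when the try block exits, even on exception
        long activeMs = System.currentTimeMillis() - startMs;
        lastActiveMs = activeMs;
        System.out.printf("%s :: [Active: %d ms; CPU: %.3f ms]%n", tag,
                activeMs,
                (BEAN.getCurrentThreadCpuTime() - startCpu) / 1_000_000.0);
    }
}
```

Because close() runs on any exit from the try block, the measurement is reported even when the measured code throws, which is exactly the behavior wanted for production instrumentation.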

A related improvement, which automatically adds its results to a parent ConcurrentResourceUsageCounter, is used in the ExampleServlet to maintain the prime statistics.

//perf is the parent ConcurrentResourceUsageCounter
try (ResourceUsageCounter x = perf.start()) {
    . . . run the sieve . . .
}

All our examples have logged the counter data.  One might consider sending certain statistics to a graphing platform like Grafana.  In some cases, it might be worthwhile to collect performance data into an analytics database like Apache Impala or Amazon Redshift.  If the application is distributed, counter objects could be serialized and passed across application boundaries as part of an end-to-end performance tracing mechanism.  For additional insight, application-level measurements might be combined with low-level JVM information gathered from USDT probes.

Once you start combining performance data from multiple sources, there are few limits on what might be done, including creating an entire performance analysis system like Facebook’s Canopy.  But no one achieves a system like that in one step.  However far you decide to take the instrumentation and analysis of your application, hopefully these examples will help you get off to a good beginning.


The JVM makes available valuable information for understanding performance and resource usage which, used effectively, can give insight into your application without requiring special tools, agent libraries, or byte-code manipulation, and which performs well in environments where profiling and heap analysis are completely impractical.

Adding even a little of this kind of instrumentation at a high level often exposes issues which would previously have gone unnoticed on heavily loaded production systems.  Extending that instrumentation lower and lower, one can begin to answer questions like, “Why did that seemingly identical request use ten times the CPU of the previous ones?” and “How in the world did that simple operation end up allocating 12GB of memory?”, questions that, without a window into the inner workings of your application, are not only impossible to answer but impossible even to ask.

