Benchmark Database Access with Java 21 Virtual Threads

Virtual threads became available in Java 21. They are a new kind of thread that is not bound to a platform thread (also called an OS or kernel thread) but is instead managed by the JVM itself. Virtual threads are very similar to Kotlin coroutines or Go goroutines, and they were created to solve the same problem, but there are some Java specifics.

Under the hood, the work is still done by platform threads, which are managed in a FIFO work-stealing ForkJoinPool. The virtual thread implementation is another layer of abstraction on top of platform threads. It aims to improve performance through more efficient use of resources, without changing how the code is written (and especially how easy it is to read and maintain), as is the case with reactive or Future-based code.
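To make the idea concrete, here is a minimal sketch (not taken from the benchmark) that runs many blocking tasks, each on its own virtual thread, and counts how many of them really ran on a virtual thread. The class name, the method name and the task count are assumptions made for illustration; it needs Java 21.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class VirtualThreadsSketch {

    // Runs `tasks` blocking tasks, one virtual thread per task, and returns
    // how many of them actually ran on a virtual thread.
    static int runBlockingTasks(final int tasks) {
        final var ranOnVirtualThread = new AtomicInteger();
        // close() at the end of the try block waits for all submitted tasks
        try (final var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.submit(() -> {
                    if (Thread.currentThread().isVirtual()) {
                        ranOnVirtualThread.incrementAndGet();
                    }
                    // While sleeping, the virtual thread is parked and its
                    // carrier platform thread is free to run other tasks
                    Thread.sleep(Duration.ofMillis(10));
                    return null;
                });
            }
        }
        return ranOnVirtualThread.get();
    }

    public static void main(final String[] args) {
        System.out.println(runBlockingTasks(10_000) + " tasks ran on virtual threads");
    }
}
```

The code reads like ordinary blocking code: there are no callbacks and no special return types, which is the point made above.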

I was curious to see how much improvement to database access we can get by using Java 21 virtual threads, if any.

Interesting Findings

I’ve read this interesting benchmark article on the MariaDB blog, which shows an almost unbelievable improvement in throughput: ×5 for a single-row select, for example, and even ×9 for a SELECT 1 query … well, that’s huge.

However, there are some aspects that can greatly influence the results. For example, the use of Executors.newCachedThreadPool for testing the platform threads. A cached thread pool spawns new threads as needed. This means that if the execution is fast enough, the thread pool will reuse previous threads, which is great. However, if the execution is not that fast, the thread pool will create a lot of new threads all fighting for a connection. Performance problems and out-of-memory crashes could also occur if too many threads are created, because each platform thread allocates a substantial amount of memory.
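The growth of a cached thread pool is easy to observe in isolation. The sketch below is not part of the benchmark: a latch stands in for a slow query, and the class and method names are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

class CachedPoolGrowthSketch {

    // Submits `tasks` tasks that all block at the same time and returns the
    // largest number of platform threads the cached pool had to create.
    static int largestPoolSize(final int tasks) {
        // In the JDK, newCachedThreadPool() is backed by a ThreadPoolExecutor
        final var pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
        final var allStarted = new CountDownLatch(tasks);
        final var release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    allStarted.countDown();
                    try {
                        release.await(); // stands in for a slow query
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allStarted.await(); // every task is now holding its own thread
            return pool.getLargestPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
    }

    public static void main(final String[] args) {
        System.out.println("threads created: " + largestPoolSize(200));
    }
}
```

Because no thread is ever idle while the tasks are blocked, the pool cannot reuse anything and creates one platform thread per task.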

The network latency between the client and the database server can also affect the results.

Another aspect is that the benchmark executes just 100 queries which is… not much.

I’ve decided to run a benchmark to explore a bit more in depth and see what I get. Because we are using PostgreSQL, I was also interested in how this would work with the PostgreSQL JDBC driver.

Benchmark Setup

The benchmark uses JMH to run iterations, and each iteration executes a high number of queries in parallel using either platform or virtual threads.

I have gathered a minimal data set that reflects a somewhat realistic use case. It contains a small set of bird records with their scientific classification.

The benchmark initializes a database on startup in a Docker container, using Testcontainers, on the same machine. This has a disadvantage: the client workload itself can influence the results. However, it is much easier to automate. We can verify the reliability of the results from the JMH output (for example, the standard deviation must be low enough when running multiple forks).

The benchmark includes tests for both MariaDB and PostgreSQL.

A Simple Query

The first test is a simple find-by-code query. This is the implementation:

public Optional<Organism> findByCode(final String code) {
    try (final var connection = dataSource.getConnection()) {
        try (final var statement = connection
                .prepareStatement(
                        """
                        SELECT *
                        FROM organism
                        WHERE code = ?
                        """
                )
        ) {
            statement.setString(1, code);
            final var resultSet = statement.executeQuery();
            if (resultSet.next()) {
                final var organism = Organism.fromResultSet(resultSet);
                return Optional.of(organism);
            } else {
                return Optional.empty();
            }
        }
    } catch (SQLException e) {
        throw new RuntimeException("Could not execute query for findByCode", e);
    }
}

It executes the select query matching by code, which has a unique index in the database, and then maps the result to a Java object.

Then, to execute the query in parallel:

// [1]
public List<Optional<Organism>> parallelFindByCode(final List<String> codes) {
    final var callables = codes.stream()
            .map(code ->
                    (Callable<Optional<Organism>>) () -> database.findByCode(code)
            )
            .toList();
    return invokeAllAndGet(callables);
}

// [2]
private <T> List<T> invokeAllAndGet(final List<Callable<T>> callables) {
    try {
        final var futures = threadExecutor.invokeAll(callables); // [3]
        return futures.stream()
                .map(future -> {
                    try {
                        return future.get();
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                })
                .toList();
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
}

I implemented these in a class named ManyQueriesExecutor which executes multiple queries in parallel using different thread pools. For example, the method parallelFindByCode [1] receives a list of codes and executes in parallel a query for each code, gathering and returning the results. The threadExecutor [3] can be either a CachedThreadPool or a VirtualThreadPerTaskExecutor. By calling invokeAll [3] all the tasks are executed in parallel and the results are gathered in a list of completed futures.

As a side note, even in micro-benchmarks, I do not like to use data that is too abstract, or an implementation that is too far from a realistic use case. In the above code, the parallel implementation actually returns the entire list of objects mapped from the database. This way, it is less likely that something gets messed up in the benchmark code. The code can also be unit tested, which reveals issues faster. If it crashes when running with JMH, it is much harder to debug, because that happens in a separate process. Also, in a real-world scenario, we would need a mapped object anyway. Maybe the approach is not that pure, but I think the results are more relevant.

Each JMH iteration executes 5,000 queries (called operations in JMH).

First, let’s test MariaDB with two different thread executors:

With platform threads (CachedThreadPool)
With virtual threads (VirtualThreadPerTaskExecutor)

The initialization code looks like this:

public ManyQueriesExecutor mariaCached;
public ManyQueriesExecutor mariaVirtual;

@Setup(Level.Trial)
public void setupTrial() {
    Containers.MARIADB_CONTAINER.start();
    mariaDataSource = Containers.getDataSourceOfContainer(
            Containers.MARIADB_CONTAINER
    );
    mariadb = new Database(mariaDataSource);
}

@Setup(Level.Iteration)
public void setupIteration() {
    mariaCached = new ManyQueriesExecutor(
            mariadb, 
            Executors.newCachedThreadPool()
    );
    mariaVirtual = new ManyQueriesExecutor(
            mariadb, 
            Executors.newVirtualThreadPerTaskExecutor()
    );
}

Then the actual benchmark code is very simple:

@Benchmark
@OperationsPerInvocation(OPERATIONS_PER_INVOCATION)
public void findByCode_with_Mariadb_and_CachedThreadPool(
        final Blackhole blackhole,
        final BenchmarkState state
) {
    blackhole.consume(state.mariaCached.parallelFindByCode(state.codes));
}

@Benchmark
@OperationsPerInvocation(OPERATIONS_PER_INVOCATION)
public void findByCode_with_Mariadb_and_VirtualThreads(
        final Blackhole blackhole,
        final BenchmarkState state
) {
    blackhole.consume(state.mariaVirtual.parallelFindByCode(state.codes));
}

The Blackhole is used to ensure the result is not optimized away by the compiler. The parallelFindByCode method is returning the mapped objects for all our test codes. The BenchmarkState is useful to set up the test and ensure the initialization code, which in our case also includes launching the database, creating the thread pools, the list of codes etc., is not included in the benchmark. The annotation @OperationsPerInvocation is used to let JMH know how many operations (queries in our case) are executed in a single invocation and show correct results.

The dataset is very small with just 277 entries and the code column also has an index, so the query itself executes very fast.

For MariaDB on my MacBook (8-Core Intel Core i9, 2.3 GHz) the JMH results are:

Benchmark	Score	Error	Units	Improvement
Find by Code
CachedThreadPool	1867.489	± 78.924	ops/s	×1
VirtualThreadPerTaskExecutor	4073.376	± 180.103	ops/s	×2.18

Find by Code - MariaDB

That looks very nice: a +118% improvement, though still nowhere near the results mentioned above. For PostgreSQL the results are similar, with even better out-of-the-box performance than MariaDB:

Benchmark	Score	Error	Units	Improvement
Find by Code
CachedThreadPool	1997.990	± 41.610	ops/s	×1
VirtualThreadPerTaskExecutor	5094.044	± 290.077	ops/s	×2.54

Find by Code - MariaDB & PostgreSQL

The better performance of PostgreSQL may be because the database itself or the JDBC driver is a little faster.
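The Improvement column in the tables above is the ratio between a score and the CachedThreadPool baseline. A minimal sketch of that arithmetic, using the MariaDB scores from the first table (the class and method names are illustrative, not from the benchmark):

```java
class ImprovementSketch {

    // Ratio between a score and its baseline: 2.0 means twice the throughput
    static double factor(final double baselineOpsPerSec, final double scoreOpsPerSec) {
        return scoreOpsPerSec / baselineOpsPerSec;
    }

    // The same ratio expressed as a percentage gain over the baseline
    static double percentGain(final double baselineOpsPerSec, final double scoreOpsPerSec) {
        return (factor(baselineOpsPerSec, scoreOpsPerSec) - 1.0) * 100.0;
    }

    public static void main(final String[] args) {
        final double cached = 1867.489;  // CachedThreadPool, MariaDB
        final double virtual = 4073.376; // VirtualThreadPerTaskExecutor, MariaDB
        System.out.printf("x%.2f (+%.0f%%)%n",
                factor(cached, virtual), percentGain(cached, virtual));
    }
}
```

Note that the ratio ignores the error margins reported by JMH, so small differences between two factors should not be over-interpreted.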

I also tried a trivial SELECT 1 query, like in the above-mentioned benchmark, and the results are similar.

Let’s also try a more complex (read: slower) query.

Slower Query

Now, making a query slower on such a small dataset is not that easy, so I had to get creative. I did the opposite of what one would normally do: distinct, groupings with aggregated conditions, suffix search, etc. 🙈. I also used updates and time-based conditions to avoid caching. The performance drop was fabulous 😁.

The “Update and Grouping Complex Query” first updates a record and then performs a somewhat complex query to get the recently updated records, grouped by a column.

I will only show the PostgreSQL results from now on; the results for MariaDB are similar, just a bit slower.

So here are the results with a more complex/slower query:

Benchmark	Score	Error	Units	Improvement
Find by Code
CachedThreadPool	1997.990	± 41.610	ops/s	×1
VirtualThreadPerTaskExecutor	5094.044	± 290.077	ops/s	×2.54
Update and Grouping Complex Query
CachedThreadPool	724.340	± 15.948	ops/s	×1
VirtualThreadPerTaskExecutor	1298.718	± 50.907	ops/s	×1.79

Update and Grouping Complex Query - PostgreSQL

As expected, the performance of the execution for the more complex query is much worse compared with the simple one. There is a relative improvement using virtual threads compared to the cached thread pools, but not as big as for the simpler query. Frankly, this was a bit surprising, but it can be explained by the fact that we also have a database connection pool that can be a limiting factor.

Skipping the Middle Man

As mentioned above, an unbounded cached thread pool can result in out-of-memory errors, which I actually experienced when running the tests with a huge number of parallel queries. Using an unlimited number of threads is a terrible idea in general: when things get tough, it will just crash the server.

Using a thread pool with a fixed number of threads can solve this problem. But there is another aspect: using too many threads is also not good, because they compete for the same resources, in our case the connections in the same database connection pool.

A better approach would be to use a thread pool where the number of threads is small. Even with this approach, the tasks’ execution times are not exactly equal, so with massive parallelism, there will be inefficiencies.
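A minimal sketch of such a bounded pool, with sleeping tasks standing in for queries. The pool size of 8, the task count and the helper name maxConcurrentTasks are assumptions for illustration; this variant is not among the results shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class FixedPoolSketch {

    // Runs `tasks` short blocking tasks on a fixed pool and returns the highest
    // number of tasks that were ever running at the same time.
    static int maxConcurrentTasks(final int poolSize, final int tasks) {
        final var running = new AtomicInteger();
        final var maxRunning = new AtomicInteger();
        final List<Callable<Void>> work = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            work.add(() -> {
                maxRunning.accumulateAndGet(running.incrementAndGet(), Math::max);
                try {
                    Thread.sleep(5); // stands in for a query
                } finally {
                    running.decrementAndGet();
                }
                return null;
            });
        }
        final var pool = Executors.newFixedThreadPool(poolSize);
        try {
            pool.invokeAll(work); // tasks beyond poolSize wait in the pool's queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return maxRunning.get();
    }

    public static void main(final String[] args) {
        System.out.println("max concurrent tasks: " + maxConcurrentTasks(8, 1_000));
    }
}
```

However many tasks are submitted, no more than poolSize of them run at once, so neither memory use nor the pressure on the connection pool can grow without limit.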

In the beginning, I mentioned that virtual threads run on platform threads managed by a ForkJoinPool. A ForkJoinPool uses the available platform threads efficiently in an uneven situation, which will always be the case with more than one thread, by implementing a work-stealing algorithm with a queue for each thread.

Let’s see what happens when we skip the middle man and run the benchmark using a ForkJoinPool directly.

The initialization is very similar to the above:

public ManyQueriesExecutor postgreFork;

@Setup(Level.Iteration)
public void setupIteration() {
    // ...
    postgreFork = new ManyQueriesExecutor(postgresql, ForkJoinPool.commonPool());
}

Here, commonPool is a shared ForkJoinPool instance whose number of threads is, by default, the number of cores available to the JVM minus one.
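The size of the common pool on a given machine can be checked with a few lines. This is a small illustrative sketch, and the class and method names are made up:

```java
import java.util.concurrent.ForkJoinPool;

class CommonPoolSketch {

    // Target parallelism of the shared common pool
    static int commonPoolParallelism() {
        return ForkJoinPool.commonPool().getParallelism();
    }

    public static void main(final String[] args) {
        System.out.println("available processors:    "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism: " + commonPoolParallelism());
    }
}
```

The default can be overridden with the java.util.concurrent.ForkJoinPool.common.parallelism system property, and a dedicated pool of any size can be created with new ForkJoinPool(n).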

And here are the results:

Benchmark	Score	Error	Units	Improvement
Find by Code
CachedThreadPool	1997.990	± 41.610	ops/s	×1
VirtualThreadPerTaskExecutor	5094.044	± 290.077	ops/s	×2.54
ForkJoinPool	5283.633	± 172.193	ops/s	×2.64
Update and Grouping Complex Query
CachedThreadPool	724.340	± 15.948	ops/s	×1
VirtualThreadPerTaskExecutor	1298.718	± 50.907	ops/s	×1.79
ForkJoinPool	1317.077	± 33.615	ops/s	×1.82

With ForkJoinPool - PostgreSQL

The results are even better with ForkJoinPool, although the difference is within the reported error margins. This is because virtual threads, even if very lightweight, still have some implementation overhead. For CPU-bound tasks, which our benchmark becomes in this extreme case, platform threads with a ForkJoinPool are still the better choice. On the other hand, virtual threads perform best when they are doing nothing, and for IO-bound tasks they are the better choice.

How About the Latency?

Let’s now simulate a latency between the client and the database server, which should be more realistic in a real-world scenario. This hopefully would also transform our tasks from CPU-bound to IO-bound. With a simulated latency of +10ms on each request, the results are:

Benchmark	Score	Units	Improvement
CachedThreadPool	2205.905	ops/s	×1
VirtualThreadPerTaskExecutor	5004.089	ops/s	×2.27
ForkJoinPool	1254.107	ops/s	×0.57

With Simulated Call Latency - PostgreSQL

In this case ForkJoinPool is not that good anymore, because the number of threads is limited, and at some point they are all blocked. Until a thread becomes free, the remaining tasks wait. On the other hand, the CachedThreadPool performs about the same as before because it can scale up as needed (or at least until it runs out of memory and crashes because it cannot create new threads). Theoretically, the VirtualThreadPerTaskExecutor is the best choice because it is not blocked by latency: the virtual thread is just parked, instead of blocking a platform thread as with ForkJoinPool. It can also create many more threads than the cached pool without crashing.
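The difference between a small bounded pool and virtual threads under latency can be reproduced without a database. In this sketch, Thread.sleep stands in for the round trip, and a dedicated ForkJoinPool with 4 threads (an arbitrary size chosen for the example, not the common pool used in the benchmark) plays the bounded pool. The class and method names are illustrative; it needs Java 21.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

class LatencySketch {

    // Runs `tasks` tasks that only wait for `latency` and returns the elapsed time.
    static Duration timeBlockingTasks(final ExecutorService executor,
                                      final int tasks,
                                      final Duration latency) {
        final Callable<Void> blockingTask = () -> {
            Thread.sleep(latency); // simulated round trip, no real I/O
            return null;
        };
        final List<Callable<Void>> work = Collections.nCopies(tasks, blockingTask);
        final long start = System.nanoTime();
        try {
            executor.invokeAll(work); // returns when every task has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(final String[] args) {
        final var latency = Duration.ofMillis(10);
        try (final var bounded = new ForkJoinPool(4);
             final var virtual = Executors.newVirtualThreadPerTaskExecutor()) {
            System.out.println("ForkJoinPool(4): "
                    + timeBlockingTasks(bounded, 200, latency).toMillis() + " ms");
            System.out.println("virtual threads: "
                    + timeBlockingTasks(virtual, 200, latency).toMillis() + " ms");
        }
    }
}
```

With the bounded pool, the tasks are processed a few at a time, so the total time grows with the number of tasks. With virtual threads, all tasks wait at the same time, so the total stays close to a single delay.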

However, there is a catch with database access. When I said I simulated a latency, I cheated and just added a Thread.sleep delay outside the connection open/close block, at [1] in the code below.

If the delay happens while a connection is open, at [2] in the code below, which is the case when executing a query, the connection pool will run out of connections. In the real case, it is likely that we are still capped, because the thread will be pinned until the connection is released.

public Optional<Organism> findByCode(final String code) {
    try {
        // <-- [1] "simulated" delay here
        Thread.sleep(executeQueryCheatingDelay.toMillis()); 
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException(e);
    }
    try (final var connection = dataSource.getConnection()) {
        try (final var statement = connection
                .prepareStatement(
                        // ...
                )
        ) {
            // <-- [2] in reality, the delay is somewhere here
            // ...
        }
    }
}

Let’s see what is happening when we simulate a real latency. To avoid all doubt that the latency is happening on the actual connection to the database, I’ve used a Toxiproxy testcontainer to simulate the latency. The new results with the same added latency of 10ms are:

Benchmark	Score	Units
CachedThreadPool	626.894	ops/s
VirtualThreadPerTaskExecutor	762.063	ops/s
ForkJoinPool	757.304	ops/s

With Real Connection Latency - PostgreSQL

The results are much worse because we are now limited by the pinned threads which are limited in turn by the connection pool. If the queries are not fast enough, the connection pool will run out of connections and the threads will be blocked waiting for a free connection.

Removing the Bottleneck

So now that we’ve seen that the connection pool can still cause our threads to block, let’s try to increase the number of connections in the pool. Of course, this could scale up to a point; it will still be limited by the number of connections that the database can handle.

Let’s see what happens if we increase the connection pool size from 10 to 1,000:

Benchmark	Score	Units	Improvement
Find by Code
CachedThreadPool	6126.189	ops/s	×1
ForkJoinPool	6534.681	ops/s	×1.06
VirtualThreadPerTaskExecutor	19063.620	ops/s	×3.11
Update and Grouping Complex Query
CachedThreadPool	1659.567	ops/s	×1
ForkJoinPool	1629.866	ops/s	×0.98
VirtualThreadPerTaskExecutor	2453.289	ops/s	×1.47

With Big Connection Pool - PostgreSQL

This is quite interesting. The raw performance with more connections in the pool is much better in all cases. The relative improvement of virtual threads is also better than with the smaller connection pool. The ForkJoinPool is not that good anymore compared with the cached thread pool: because it has fewer threads, they are more likely to be blocked than in the other alternatives, which can benefit from the greater number of connections in the pool.

However, for the more complex and slow query, the performance improvement is still not that great. This is because we likely hit another bottleneck, the database itself. There is nothing we can do about it in the client code.
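A toy model shows why the pool size matters so much once latency is involved. If each query holds its connection for at least the 10 ms of added latency, a pool of 10 connections cannot complete more than 10 / 0.010 s = 1,000 queries per second, no matter how many threads are waiting. In the sketch below, a Semaphore plays the connection pool and Thread.sleep plays the round trip. Both are assumptions of this model, not how the benchmark or a real connection pool is implemented, and the names are illustrative.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class ConnectionPoolSketch {

    // Runs `queries` fake queries on virtual threads. Each one must hold one of
    // `poolSize` permits (the "connections") while it waits for `latency`.
    static Duration timeQueries(final int poolSize, final int queries, final Duration latency) {
        final var connections = new Semaphore(poolSize);
        final long start = System.nanoTime();
        try (final var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < queries; i++) {
                executor.submit(() -> {
                    connections.acquire(); // like dataSource.getConnection()
                    try {
                        Thread.sleep(latency); // round trip while holding the connection
                    } finally {
                        connections.release(); // like connection.close()
                    }
                    return null;
                });
            }
        } // close() waits for all queries to finish
        return Duration.ofNanos(System.nanoTime() - start);
    }

    public static void main(final String[] args) {
        final var latency = Duration.ofMillis(10);
        System.out.println("pool of 10:   "
                + timeQueries(10, 1_000, latency).toMillis() + " ms");
        System.out.println("pool of 1000: "
                + timeQueries(1_000, 1_000, latency).toMillis() + " ms");
    }
}
```

In this model the virtual threads are cheap to create, but most of them just queue for a permit, which is the same cap described above. The model leaves out the database itself, which becomes the next limit once the pool is large enough.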

Hitting the Next Wall

Benchmarking, like most software development tasks, is never really “done”. There will always be something else to tweak, test, or improve. If it is not in our components or code, it will be in a dependency, the infrastructure, a config value somewhere, or even the benchmark itself.

Even if we try to isolate the benchmark as much as possible, the performance numbers we get might not tell the whole story. We can assume a certain usage pattern, but in reality the pattern can be different and change our results completely.

There will always be a bottleneck. When we remove one, we will hit the next one. The only sane way is to find the right balance and optimize for our use cases.

Conclusion

For database access alone, virtual threads are a bit tricky because there are a couple of variables to take into consideration, like the database latency and the connection pool size. Still, the results are very good, and even in the worst-case scenario they perform as well as the alternatives. Also, when database access is not the only thing the application is doing, which is … most of the time, virtual threads can also be a better choice. One of the nice things about virtual threads is that you do not have to think about them too much. You just write the code as you would normally do, and it will work about the same as, or better than, with a classic thread pool.

Now, if we had virtual database connections, that would be spectacular.

The code for this benchmark is available on GitHub.
