Running Java in Production: A SRE’s Perspective

Running Java in Production: A SRE’s Perspective - JVM Advent

Specific to Java, there are metrics that help you understand the health and performance of the application, and that guide future decisions on how to scale and optimise it. Garbage collection time, heap size, thread count, and JIT time are all important and Java-specific.
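
As a rough illustration of the metrics named above, the standard `java.lang.management` MXBeans expose heap usage, thread count, cumulative GC time, and cumulative JIT compilation time. The class name `JvmMetricsSnapshot` and the metric key names are assumptions made for this sketch, not something the article prescribes.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes one snapshot of the Java-specific metrics.
final class JvmMetricsSnapshot {

    static Map<String, Long> collect() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        // Heap size: used, committed, and configured maximum (-1 if undefined).
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        metrics.put("heap_used_bytes", heap.getUsed());
        metrics.put("heap_committed_bytes", heap.getCommitted());
        metrics.put("heap_max_bytes", heap.getMax());

        // Live thread count, including daemon threads.
        metrics.put("thread_count",
                (long) ManagementFactory.getThreadMXBean().getThreadCount());

        // GC count and time, summed over all collectors; -1 means "undefined".
        long gcCount = 0;
        long gcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMs += Math.max(0, gc.getCollectionTime());
        }
        metrics.put("gc_count_total", gcCount);
        metrics.put("gc_time_ms_total", gcTimeMs);

        // JIT time: the bean is null when the JVM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit != null && jit.isCompilationTimeMonitoringSupported()) {
            metrics.put("jit_time_ms_total", jit.getTotalCompilationTime());
        }
        return metrics;
    }

    public static void main(String[] args) {
        collect().forEach((name, value) -> System.out.println(name + " " + value));
    }
}
```

In a real service these values would be sampled periodically and handed to whatever metrics exporter is already in use, rather than printed.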

Looking at the worst-case latency is more meaningful, and more reflective of the user-perceived performance.
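
A small sketch of why the tail matters: with made-up sample data (98 requests at 10 ms and 2 at 1000 ms), the mean and the median both look healthy while the 99th percentile and the maximum expose the slow requests. The class name `LatencyPercentiles` and the nearest-rank percentile method are choices made for this illustration.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentiles over recorded latencies.
final class LatencyPercentiles {

    // Returns the smallest sample such that at least p percent of samples are <= it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Illustrative data only: 98 fast requests and 2 slow ones.
        long[] samplesMs = new long[100];
        Arrays.fill(samplesMs, 10);
        samplesMs[98] = 1000;
        samplesMs[99] = 1000;

        System.out.println("mean = " + mean(samplesMs) + " ms");           // about 29.8
        System.out.println("p50  = " + percentile(samplesMs, 50) + " ms"); // 10
        System.out.println("p99  = " + percentile(samplesMs, 99) + " ms"); // 1000
        System.out.println("max  = " + percentile(samplesMs, 100) + " ms"); // 1000
    }
}
```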

When something breaks, troubleshooting the issue should be possible from the collected metrics alone. You should not have to depend on log files, or read code, to deal with an outage.

There are many features of the JVM that have a fixed cost per running JVM, such as JIT and garbage collection. Your application may also have fixed overheads, such as resource polling (backend database connections). If you run fewer, but larger (in terms of CPU and RAM) instances, you can reduce this fixed cost, getting an economy of scale....

Make sure the most important and useful metrics are exported from the Java application, collected, and easily graphed.

In a multi-server deployment, it is best practice to slowly ramp up traffic to a newly started task, giving it time to warm up without harming the overall performance of the service. You may be tempted to warm up new tasks by sending them artificial traffic before they are placed into the user-serving path. Artificial traffic can be problematic if it...
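
One possible way to express a slow ramp-up is a load-balancing weight that grows linearly with the task's uptime. This is only a sketch under assumptions: the class `WarmupWeight`, the linear shape, and the `minWeight` floor are all hypothetical, and a real load balancer may offer its own slow-start mechanism.

```java
import java.time.Duration;

// Hypothetical helper: share of "normal" traffic a newly started task should receive.
final class WarmupWeight {

    // Grows linearly from minWeight to 1.0 over the warm-up period.
    static double weight(Duration uptime, Duration warmup, double minWeight) {
        if (warmup.isZero() || warmup.isNegative() || uptime.compareTo(warmup) >= 0) {
            return 1.0; // warm-up disabled or already finished
        }
        if (uptime.isNegative()) {
            return minWeight;
        }
        double fraction = (double) uptime.toMillis() / warmup.toMillis();
        // The floor keeps some real traffic flowing so the JIT sees real code paths.
        return Math.max(minWeight, fraction);
    }

    public static void main(String[] args) {
        Duration warmup = Duration.ofSeconds(60);
        for (long seconds : new long[] {0, 15, 30, 60, 90}) {
            System.out.println(seconds + "s -> "
                    + weight(Duration.ofSeconds(seconds), warmup, 0.1));
        }
    }
}
```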

When starting up, the Java Virtual Machine (JVM) reserves a large chunk of OS memory and splits it into heap and non-heap. The non-heap contains areas such as Metaspace (formerly called PermGen), and stack space. Metaspace is for class definitions, and stack space is for each thread’s stack. The heap is used for the objects that are created, which...
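
To see the heap and non-heap split on a running JVM, the memory pools reported by `java.lang.management` can be grouped by type. This is a minimal sketch; the class name `MemoryAreas` is an assumption, pool names vary by JVM and garbage collector, and thread stacks are not reported as a memory pool by this API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: lists the JVM's memory pools, grouped into heap and non-heap.
final class MemoryAreas {

    static Map<MemoryType, List<String>> poolsByType() {
        Map<MemoryType, List<String>> result = new EnumMap<>(MemoryType.class);
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            result.computeIfAbsent(pool.getType(), type -> new ArrayList<>())
                  .add(pool.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        // On HotSpot, Metaspace typically appears under NON_HEAP and the
        // young/old generation pools under HEAP.
        poolsByType().forEach((type, names) -> System.out.println(type + ": " + names));
    }
}
```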

Sharat Buddhavarapu

You can and probably should set the size of your heap. Finding a good size is a function of balancing three things: minimizing memory usage, optimizing the young space on heap (so it is large enough not to spill objects into old space), and optimizing the old space (so it is large enough to not cause GC thrashing).

As mentioned earlier, GC is a source of long-tail latency, so it should be monitored closely. The time taken for each phase of the GC should be recorded, as well as the fullness of each heap space (broken down by young, old, and so on) before and after each GC run. This provides all the hints needed to either tune the GC or improve the application to get GC under control.
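
One way to record GC duration and per-pool usage before and after each collection is to subscribe to GC notifications. The sketch below assumes a HotSpot-based JDK, since it relies on the `com.sun.management` extension classes, and the class name `GcEventLogger` is hypothetical. It reports per-collection totals, not a breakdown of individual GC phases, which usually comes from the GC log instead.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Map;
import javax.management.Notification;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

// Hypothetical helper: logs every GC with its duration and per-pool heap usage.
final class GcEventLogger {

    static String format(GarbageCollectionNotificationInfo info) {
        GcInfo gc = info.getGcInfo();
        StringBuilder sb = new StringBuilder();
        sb.append(info.getGcName()).append(" (").append(info.getGcAction())
          .append(", cause: ").append(info.getGcCause()).append(") took ")
          .append(gc.getDuration()).append(" ms");
        Map<String, MemoryUsage> before = gc.getMemoryUsageBeforeGc();
        for (Map.Entry<String, MemoryUsage> after : gc.getMemoryUsageAfterGc().entrySet()) {
            MemoryUsage b = before.get(after.getKey());
            long usedBefore = (b == null) ? -1 : b.getUsed();
            sb.append(String.format("%n  %s: %d -> %d bytes",
                    after.getKey(), usedBefore, after.getValue().getUsed()));
        }
        return sb.toString();
    }

    // Registers a listener on every collector that emits notifications; returns how many.
    static int install() {
        NotificationListener listener = (Notification n, Object handback) -> {
            if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(n.getType())) {
                CompositeData data = (CompositeData) n.getUserData();
                System.out.println(format(GarbageCollectionNotificationInfo.from(data)));
            }
        };
        int registered = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (bean instanceof NotificationEmitter) {
                ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
                registered++;
            }
        }
        return registered;
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        System.gc();        // request a collection so there is something to report
        Thread.sleep(500);  // notifications are delivered asynchronously
    }
}
```

In production the formatted line would more likely be turned into metrics (for example a duration histogram and per-pool gauges) than printed to standard output.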

Keep it simple, and make the application as quick and easy to version as you can, using a single fat JAR or executable where possible.