Performance Analysis in Java Applications: A Flink Deep Dive
Recently, I’ve been working on a Flink Java application experiencing significant performance issues. This led me down a path of discovering the rich set of diagnostic tools available in the Java ecosystem for identifying and resolving backpressure and performance bottlenecks.
The Flink dashboard provides multiple complementary analysis tools, each offering specific insights into application state and behavior.
Performance Analysis Toolkit
Available Tools in Flink
- Thread Dump - Point-in-time snapshot of all thread states
- CPU Profiling - Flame graph analysis of CPU consumption
- Allocation Profiling - Flame graph analysis of memory allocation patterns
- Metrics & Dashboards - Prometheus + Grafana for time-series data
- Exception History - Historical view of errors and exceptions
- Memory Metrics - Built-in Flink dashboard metrics
- Logs - TaskExecutor and JobManager logs for detailed analysis
Each tool serves a distinct purpose in the diagnostic workflow. Let’s dive deeper into one of the most fundamental tools: the thread dump.
Thread Dump Analysis
What Is a Thread Dump?
A thread dump is a point-in-time snapshot of all threads in the JVM. It captures:
- Thread states (RUNNABLE, WAITING, BLOCKED, TIMED_WAITING)
- Stack traces (what method each thread is currently executing)
- Lock information (held locks and waiting threads)
- Thread metadata (priority, daemon status, thread ID)
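All of this information is exposed by the JDK's ThreadMXBean; in fact, the example dump later in this post shows Flink's JvmUtils.createThreadDump calling into sun.management.ThreadImpl.dumpAllThreads. As a minimal sketch (class name mine, for illustration), you could capture the same data programmatically:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper class, for illustration only
public class ThreadDumpSketch {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // (true, true) = also collect locked monitors and ownable synchronizers
        ThreadInfo[] infos = threads.dumpAllThreads(true, true);
        for (ThreadInfo info : infos) {
            // ThreadInfo.toString() prints name, ID, state, lock info,
            // and up to 8 stack frames per thread
            System.out.print(info);
        }
    }
}
```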
Characteristics
| Property | Description |
|---|---|
| Capture Type | Point-in-time snapshot |
| Overhead | Very lightweight |
| Scope | Current execution state only |
| Limitations | No historical data or resource consumption patterns |
Common Use Cases
- ✅ Diagnosing deadlocks
- ✅ Identifying blocked threads
- ✅ Understanding current code execution
- ✅ Troubleshooting application hangs
- ✅ Analyzing concurrency bottlenecks
Understanding Thread States
Thread states are critical for interpreting dumps correctly:
RUNNABLE
The thread is actively executing or ready to execute.
- May be using CPU or waiting for OS resources (I/O, network)
- Performance signal: Many threads in RUNNABLE with identical stack traces indicate a CPU bottleneck in that code path
BLOCKED
The thread is waiting to acquire a monitor lock held by another thread.
- Performance signal: Multiple threads BLOCKED on the same lock indicate a concurrency bottleneck
- Common cause of reduced throughput
WAITING
The thread is waiting indefinitely for another thread’s action.
- Triggered by: Object.wait(), Thread.join(), LockSupport.park()
- Common pattern: Idle thread pool workers waiting for tasks
TIMED_WAITING
The thread is waiting with a specified timeout.
- Triggered by: Thread.sleep(), Object.wait(timeout), LockSupport.parkNanos()
- Similar to WAITING, but the thread automatically resumes after the timeout expires
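To make the four states concrete, here is a small self-contained demo (all class and thread names are mine, for illustration) that parks one thread in each state and prints what a dump would report for it:

```java
import java.util.concurrent.locks.LockSupport;

// Hypothetical demo class, for illustration only
public class ThreadStatesDemo {
    public static void main(String[] args) throws InterruptedException {
        Object lock = new Object();

        // Busy-spins forever -> RUNNABLE
        Thread runnable = new Thread(() -> { while (true) { } }, "runnable-demo");
        // Parks with no timeout -> WAITING
        Thread waiting = new Thread(LockSupport::park, "waiting-demo");
        // Sleeps with a timeout -> TIMED_WAITING
        Thread timedWaiting = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        }, "timed-waiting-demo");
        // Contends for a monitor held by main -> BLOCKED
        Thread blocked = new Thread(() -> { synchronized (lock) { } }, "blocked-demo");

        synchronized (lock) { // hold the monitor so blocked-demo stays BLOCKED
            runnable.start();
            waiting.start();
            timedWaiting.start();
            blocked.start();
            Thread.sleep(500); // give the threads time to reach their target states
            for (Thread t : new Thread[] { runnable, waiting, timedWaiting, blocked }) {
                System.out.println(t.getName() + " -> " + t.getState());
            }
        }
        System.exit(0); // terminate the demo threads
    }
}
```

Running it prints something like runnable-demo -> RUNNABLE, waiting-demo -> WAITING, and so on for each thread.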
Example Thread Dump (Flink Application)
```json
{
  "threadInfos": [
    {
      "threadName": "main",
      "stringifiedThreadInfo": "\"main\" Id=1 WAITING on java.util.concurrent.CompletableFuture$Signaller@2993e0bc
        at java.base@17.0.12/jdk.internal.misc.Unsafe.park(Native Method)
        - waiting on java.util.concurrent.CompletableFuture$Signaller@2993e0bc
        at java.base@17.0.12/java.util.concurrent.locks.LockSupport.park(Unknown Source)
        ...
        at app//org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:475)"
    },
    {
      "threadName": "flink-pekko.actor.default-dispatcher-4",
      "stringifiedThreadInfo": "\"flink-pekko.actor.default-dispatcher-4\" Id=35 RUNNABLE
        at java.management@17.0.12/sun.management.ThreadImpl.dumpThreads0(Native Method)
        at java.management@17.0.12/sun.management.ThreadImpl.dumpAllThreads(Unknown Source)
        at app//org.apache.flink.runtime.util.JvmUtils.createThreadDump(JvmUtils.java:50)
        ...
        at java.base@17.0.12/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)"
    },
    {
      "threadName": "Source: Kafka Source - CounterUnifiedEventSlimV2 -> Filter (25/128)#15",
      "stringifiedThreadInfo": "\"Source: Kafka Source - CounterUnifiedEventSlimV2 -> Filter (25/128)#15\" Id=118752 TIMED_WAITING
        at java.base@17.0.12/jdk.internal.misc.Unsafe.park(Native Method)
        at app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
        at app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)"
    },
    {
      "threadName": "Aggregate - Sliding Window -> Update - Lifelong Counter (25/128)#15",
      "stringifiedThreadInfo": "\"Aggregate - Sliding Window -> Update - Lifelong Counter (25/128)#15\" Id=118778 RUNNABLE
        at app//org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.isPrefixWith(RocksDBCachingPriorityQueueSet.java:334)
        at app//org.apache.flink.contrib.streaming.state.RocksDBCachingPriorityQueueSet.peek(RocksDBCachingPriorityQueueSet.java:134)
        at app//org.apache.flink.streaming.api.operators.InternalTimerServiceImpl.onProcessingTime(InternalTimerServiceImpl.java:293)"
    }
  ]
}
```
How to Read & Analyze Thread Dumps
Step 1: Identify the Problem Pattern
| Symptom | What to Look For |
|---|---|
| Application Hang | Deadlocks (circular lock dependencies) or all threads waiting on external resources (DB, I/O) |
| High CPU Usage | Many threads in RUNNABLE state with similar stack traces |
| Slow Performance | Many threads in BLOCKED state waiting on the same lock |
| Backpressure | Task threads in WAITING/TIMED_WAITING while data accumulates |
Step 2: Analyze Thread States Distribution
```bash
# Quick command to count thread states (if you have a text dump)
grep "java.lang.Thread.State" threaddump.txt | sort | uniq -c
```
- Healthy application: Balanced distribution with most threads WAITING (idle pool workers)
- CPU-bound problem: High percentage of RUNNABLE threads
- Lock contention: High percentage of BLOCKED threads
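If you are inspecting a live JVM rather than a text file, the same distribution can be computed in-process. A minimal sketch (class name mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical class name, for illustration only
public class ThreadStateHistogram {
    public static void main(String[] args) {
        ThreadInfo[] infos =
                ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        // Group threads by state and count; TreeMap keeps the output ordered
        Map<Thread.State, Long> histogram = Arrays.stream(infos)
                .collect(Collectors.groupingBy(
                        ThreadInfo::getThreadState, TreeMap::new, Collectors.counting()));
        histogram.forEach((state, count) -> System.out.println(state + ": " + count));
    }
}
```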
Step 3: Look for Repetitive Stack Traces
Multiple threads with identical stack traces indicate:
- CPU hotspot: If threads are RUNNABLE
- Lock bottleneck: If threads are BLOCKED
- Design issue: Possible need for async processing or better concurrency design
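A quick way to surface these groups from inside a JVM is to bucket threads by their stack traces. A rough sketch (class name mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical class name, for illustration only
public class HotStackFinder {
    public static void main(String[] args) {
        ThreadInfo[] infos =
                ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        // Bucket thread names by their full stack trace;
        // StackTraceElement has value-based equals(), so identical traces collide
        Map<List<StackTraceElement>, List<String>> byStack = Arrays.stream(infos)
                .collect(Collectors.groupingBy(
                        info -> List.of(info.getStackTrace()),
                        Collectors.mapping(ThreadInfo::getThreadName, Collectors.toList())));
        byStack.forEach((stack, names) -> {
            if (names.size() > 1 && !stack.isEmpty()) {
                System.out.println(names.size() + " threads at " + stack.get(0) + ": " + names);
            }
        });
    }
}
```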
Step 4: Check for Deadlocks
Modern JVM thread dumps automatically detect and report deadlocks:
```
Found one Java-level deadlock:
=============================
"Thread-1":
  waiting to lock monitor 0x00007f8a1c004e00 (object 0x00000000d5f78a20, a java.lang.Object),
  which is held by "Thread-2"
"Thread-2":
  waiting to lock monitor 0x00007f8a1c007360 (object 0x00000000d5f78a30, a java.lang.Object),
  which is held by "Thread-1"
```
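The same check is available programmatically through ThreadMXBean.findDeadlockedThreads(), which is handy for automated health checks. A minimal sketch (class name mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical class name, for illustration only
public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns the IDs of threads deadlocked on monitors
        // or ownable synchronizers, or null if there are none
        long[] deadlocked = mx.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("No deadlocks detected.");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(deadlocked, true, true)) {
            System.out.printf("%s is waiting for %s, held by %s%n",
                    info.getThreadName(), info.getLockName(), info.getLockOwnerName());
        }
    }
}
```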
Flink-Specific Insights
Kafka Source Thread (TIMED_WAITING)
"Source: Kafka Source - CounterUnifiedEventSlimV2 -> Filter (25/128)#15"
TIMED_WAITING at TaskMailboxImpl.take()
Interpretation: Source thread is idle, waiting for messages from Kafka. This is normal behavior when there’s no data to process.
Aggregate Thread (RUNNABLE)
"Aggregate - Sliding Window -> Update - Lifelong Counter (25/128)#15"
RUNNABLE at RocksDBCachingPriorityQueueSet.isPrefixWith()
Interpretation: Thread is actively processing window aggregations using RocksDB state backend. If many threads show this, it could indicate RocksDB performance issues.
Best Practices
1. Take Multiple Dumps
A single thread dump is only a snapshot. Take 3-5 dumps at 5-10 second intervals to separate persistent patterns (e.g. a thread stuck BLOCKED on the same lock in every dump) from transient states.
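As an in-process sketch of this practice (class name and file names mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical class name and file names, for illustration only
public class PeriodicDumper {
    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= 5; i++) {
            ThreadInfo[] infos =
                    ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
            String dump = Arrays.stream(infos)
                    .map(ThreadInfo::toString) // note: toString() caps each stack at 8 frames
                    .collect(Collectors.joining());
            Files.writeString(Path.of("threaddump-" + i + ".txt"), dump);
            Thread.sleep(5_000); // 5-second interval between snapshots
        }
    }
}
```

Against a separate process, you would typically loop jstack <pid> from a shell instead; this shows the in-process equivalent.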
2. Correlate with Metrics
Combine thread dumps with:
- CPU usage metrics
- GC logs
- Application-specific metrics (throughput, latency)
3. Use Proper Tools
- jstack: Command-line tool for thread dumps
- VisualVM: GUI with thread dump analysis
- FastThread: Online thread dump analyzer (fastthread.io)
- Flink Dashboard: Built-in thread dump feature
4. Document Baseline Behavior
Take thread dumps during normal operation to understand healthy patterns.