Morphium 6.2.4 — When Messaging Dies Quietly
Sometimes the interesting story behind a release is not what made it into the CHANGELOG. It is why the entries are there. For 6.2.4, the reason was a bug I would rather not have lived through. It forced a hardening pass that makes Morphium messaging more resilient in several places. Here is the story and what came out of it.
The Symptom
A production application. Two services exchange requests and replies through Morphium messaging. Runs for hours. Then nothing. Requests still land cleanly in the messaging collection, but the answers do not come back. Or only after what feels like forever.
The log says nothing. No stacktrace, no warning, no "messaging stopped". MongoDB is reachable. The change stream endpoint answers. The application does not crash. It is just half-alive.
A closer look reveals a pattern. Roughly every 10 seconds, one wave of backlog gets processed. That was the decisive hint. SingleCollectionMessaging has a fallback poll that runs every 10 seconds and checks for messages the change stream missed. That part still worked. Nothing else did.
A restart of the application fixed the problem. Until next time.
The Wrong Trail: a Stuck Cursor
First hypothesis, and the most likely one when you hear "change stream broken, polling fine": the change stream cursor is stuck. MongoDB change streams are built on tailable cursors. Under load, a cursor can stay formally connected and stop delivering events. The reason is often a flood of traffic the client does not care about, or a resume-needed state that never produces an invalidate event.
Two of the six commits in 6.2.4 came out of this hypothesis. Both are useful even if they did not fix the actual bug.
Pipeline Filter (server-side $match)
The change stream watches an entire messaging collection. It delivers every insert and update to every listening client. With many recipients and many topics, that turns into a lot of wire traffic in a hurry. Until now, the client filtered locally. It decoded messages just to throw them away.
The new pipeline filter adds a $match stage as the second step in the aggregation pipeline of the watch() call. Only events whose recipient is the local sender id (or null for broadcast) and whose sender is not the local id reach the client. The server filters. The wire traffic shrinks.
There is a subtle dependency. $match runs against the actual MongoDB field names (snake_case for camelCase Java properties). That is why this commit went hand in hand with the field-name translation fix in the aggregator stages, which is also part of 6.2.4.
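For a feel of the shape, here is a minimal sketch of such a filter at the pipeline level. It is not Morphium's literal code: the field names and the `localId` variable are illustrative, and the real pipeline uses the translated MongoDB names mentioned above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch only: field names and localId are illustrative.
// Both $match stages run on the server, so events that are not
// for this node never leave MongoDB.
String localId = "this-node-sender-id";
List<Map<String, Object>> pipeline = List.of(
    // first stage: only the operation types messaging cares about
    Map.of("$match",
        Map.of("operationType", Map.of("$in", List.of("insert", "update")))),
    // second stage: addressed to us or broadcast (recipient == null),
    // and never our own outgoing messages
    Map.of("$match", Map.of(
        "fullDocument.recipient", Map.of("$in", Arrays.asList(localId, null)),
        "fullDocument.sender", Map.of("$ne", localId))));
```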
CS Stall Watchdog
The second commit takes the original hypothesis at face value. What if the change stream really does sit silent while there is backlog to deliver? There is a passive watchdog now that combines two signals. A liveness marker the change stream listener updates on every event, and the knowledge from the fallback poll: is there backlog in the collection that the cursor should have seen?
If the marker is older than 2 × FALLBACK_POLL_INTERVAL × pause and the poll picks up unprocessed messages at the same time, the watchdog restarts the cursor. With a log entry. With a counter:
```
WARN ... change stream silent for ...ms while polling found backlog — restarting (restart #3)
```
The counter is exposed as messaging.getCsStallRestarts(), a public health metric. Ideal value: 0. If it stays above 0, load is too high even after filtering, and setBatchSize(1) plus pause tuning is the next thing to look at.
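If you want that number in your monitoring, the wiring can be as simple as this sketch. Only `getCsStallRestarts()` is Morphium API here; the scheduling scaffold is generic.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Generic health-check scaffold; only getCsStallRestarts() is from 6.2.4.
ScheduledExecutorService health = Executors.newSingleThreadScheduledExecutor();
health.scheduleAtFixedRate(() -> {
    int restarts = messaging.getCsStallRestarts();
    if (restarts > 0) {
        // cursor had to be restarted: filter more, setBatchSize(1), tune pause
        log.warn("change stream stall restarts: {}", restarts);
    }
}, 1, 1, TimeUnit.MINUTES);
```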
Both commits are good and right. They were not the actual bug.
The Thread Dump
While the first commits sat in the branch, a new incident came in. With it, a thread dump. 4 hours 22 minutes uptime, 279 threads. I searched for Msg, the name SingleCollectionMessaging gives its main thread inside run() via setName(...).
Nothing. Zero matches.
But two threads named changeStream-... were alive and well. 49 worker threads (msg-thr-N) sat idle, waiting for work. 100 decouple_thr-N threads, the decoupling pool workers, also idle.
That was the moment it became clear what we were looking at. Half-paralysis. The change stream lives, the workers live, and only the main thread, the one mediating between the two, is gone. And nobody noticed.
Producer and Consumer, and Why a Dead Main Thread is so Invisible
A first-time reader of SingleCollectionMessaging might assume the change stream callback processes messages. It does not.
handleChangeStreamEvent() (lines 441 to 568) is a pure producer. It does two things:
- Writes the new event into the `processing` queue, a `BlockingQueue<ProcessingElement>`.
- Marks the message id in the `idsInProgress` set.
That is it. No listener call, no processMessage, no lockAndProcess. The architecture is intentional. The change stream is a push source. It must never get blocked by long-running listeners. Otherwise the cursor falls behind, the oplog rolls past the resume token, and in the worst case you get ChangeStreamHistoryLost.
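In shape, the producer side looks roughly like this. A sketch from the description above, not the literal code; `extractId(...)` and the `ProcessingElement` constructor shown are assumptions, while the queue and set are the fields named in the text.

```java
// Sketch of the producer side (assumed shape, not the literal code).
void handleChangeStreamEvent(Map<String, Object> evt) {
    MorphiumId id = extractId(evt);   // hypothetical helper: id from the event
    idsInProgress.add(id);            // mark as in flight
    processing.offer(new ProcessingElement(evt, id)); // hand off to run()
    // deliberately no listener call here; a blocked callback stalls the cursor
}
```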
The consumer of the queue is the main thread in its run() loop, and only the main thread:
```java
// abridged from the run() loop: prEl was just polled off the processing queue
if (prEl != null) {
    // re-fetch the current version
    var msg = q.get();
    if (msg == null) continue; // deleted in the meantime
    // idempotency: already processed?
    if (recentlyCompleted.contains(msg.getMsgId())) continue;
    // lock + dispatch
    if (msg.isExclusive()) lockAndProcess(msg);
    else processMessage(msg);
}
```
That is where the work happens. Re-fetching the current message version. Idempotency checks. Lock handling for exclusive messages. The queueOrRun call that drops listener lambdas into the msg-thr-N pool.
When this thread is dead:
- The change stream keeps writing events into the `processing` queue.
- Nobody pulls them out. The queue grows. The `idsInProgress` set turns into a graveyard.
- Listeners are never called.
- The `msg-thr-N` workers stay idle, because `queueOrRun` runs in the main thread, and the main thread is gone.
From the outside, it looks like an application that is "still doing something". The fallback poll twitches every 10 seconds, but it works on its own poll cursor, not on the dead queue. That is where the seemingly random waves of activity come from.
The Second Line of Defense: Die Loud, Not Quiet
When a bug is this invisible, you want it to make noise the next time. Two more commits.
catch (Throwable) in the run() Loop
The original code had a catch (Exception e) around the main loop. Sounds careful. It is not enough. Errors, including OutOfMemoryError, StackOverflowError, and LinkageError, are not Exceptions. They flew straight past the catch, and the thread died silently with them.
The fix:
```java
try {
    // ... main loop body ...
} catch (Throwable t) {
    // catch Throwable (not just Exception) so an Error
    // does not silently kill the main messaging thread.
    log.error("Unhandled throwable in messaging main loop "
            + "for '{}' — keeping thread alive", coll, t);
}
```
Throwable catches anything coming up the stack. Logs it with the full trace. The loop runs on. Next time an OutOfMemoryError slips through, we see it. That alone tells us what to do.
Main Thread Liveness Check
Belt and suspenders. If the main thread somehow still dies despite catch (Throwable), an unhandled terminate() race, a JVM bug, whatever it might be, a liveness check now runs every 30 seconds. It reacts when running == true and mainThread.isAlive() == false:
```
FATAL: main messaging thread ... no longer alive but running=true. Listeners will not be called. Application restart recommended.
```
It is not an attempt at resuscitation. A dead thread stays dead. A clean application restart is the right answer. But the message is in the log now. Visible. Monitorable.
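The check itself is small. A sketch of the idea, assuming a scheduled executor and the fields named above; not the literal code.

```java
// Liveness check sketch (assumed shape). A dead thread cannot be
// restarted, so the check only makes the failure loud and monitorable.
watchdog.scheduleAtFixedRate(() -> {
    if (running && !mainThread.isAlive()) {
        log.error("FATAL: main messaging thread for '{}' no longer alive "
                + "but running=true. Listeners will not be called. "
                + "Application restart recommended.", coll);
    }
}, 30, 30, TimeUnit.SECONDS);
```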
Smoking Gun: 20 MB BSON
Four commits in, branch green, diagnostics sharpened. The next incident, this time with the new logging in place, turned up this:
```
at ... SingleMongoConnection.readSingleAnswer(...)
at ... InsertMongoCommand.execute(...)
at ... MorphiumWriterImpl.insert(...)
at ... Msg.sendAnswer(...)
at ... [listener lambda]
at ... ThreadPoolExecutor$Worker.run(...)
```
Twenty megabytes. That is past the hard 16 MB BSON limit. A server constraint, not a tuning knob. MongoDB rejects the document. The driver throws.
The stack tells the story. A listener processes an incoming request and ends with Msg.sendAnswer(...). The reply contains a much larger result set than we ever expected. The insert into the messaging collection fails.
Now the subtle part. Why does the receiver thread die when sending a reply?
sendAnswer() is not a sender in the classical sense. It is called from a listener that is busy answering a request. The listener does not run in the main thread. It runs in a msg-thr-N worker. A thrown MorphiumDriverException (now MorphiumDocumentTooLargeException) would be caught there. The worker would log it and keep going.
But the 20 MB document has to be built first. The BSON encoder allocates the document fully in memory before the insert command goes out on the wire. Under heap pressure, with 20 MB plus Java object overhead and several parallel replies in flight, that allocation step can throw an OutOfMemoryError. And OutOfMemoryError is a nasty traveler in the JVM. It can land in a completely different place than the code that triggered the allocation. For example, in the main thread that happens to be doing its own processing.poll(...) next door.
That matches the thread dump. No main thread. Change stream and workers alive. The msg-thr-N worker survived. Under heap pressure the JVM throws the OutOfMemoryError in whichever thread's allocation fails first, and that can be any thread. The main thread became collateral damage of the heap crisis.
The Driver-Level Fix: MorphiumDocumentTooLargeException
The application-side hardening (catch (Throwable), liveness check) makes sure these cases are no longer silent. For the diagnosis itself, a second, much more general improvement helps.
Until now, every server-side BSON size error came back as a generic MorphiumDriverException with the message "Error: 10334 — BSONObj size: ... is invalid". Catching it programmatically meant parsing the errmsg. Ugly.
6.2.4 has a dedicated subclass:
```java
public class MorphiumDocumentTooLargeException
        extends MorphiumDriverException { ... }
```
It is mapped centrally in SingleMongoConnection.readSingleAnswer(), the choke point every server response goes through. That means the new exception applies automatically to insert, update, replace, save, findAndModify, and any future write path. No per-call instrumentation needed. It recognizes both error 10334 (legacy) and error 17280 (newer storage paths), and falls back to substring detection on the errmsg if needed.
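In shape, the check at that choke point is roughly the following. The constructor and the access to the answer map are assumptions; the two error codes and the "BSONObj size" substring are from the actual server errors.

```java
// Sketch of the central mapping (assumed shape, not the literal code).
Object c = answer.get("code");
int code = (c instanceof Number) ? ((Number) c).intValue() : 0;
String errmsg = String.valueOf(answer.getOrDefault("errmsg", ""));
if (code == 10334 || code == 17280 || errmsg.contains("BSONObj size")) {
    throw new MorphiumDocumentTooLargeException(errmsg); // assumed ctor
}
```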
Code that used to have generic exception handling can now catch deliberately:
```java
try {
    morphium.store(hugeDoc);
} catch (MorphiumDocumentTooLargeException e) {
    // split at the root, compress, offload to GridFS, ...
} catch (MorphiumDriverException e) {
    // everything else
}
```
Four New Log Lines to Tell Apart
After the upgrade, these are the four log lines you can now see, each with its own meaning.
| Log line | What happened | What to do |
|---|---|---|
| `WARN ... change stream silent for ...ms ... — restarting` | The watchdog restarted a silent cursor | Watch it. If the restart counter stays above 0, reduce load. |
| `ERROR ... Unhandled throwable in messaging main loop ... keeping thread alive` | An Error or Exception that would have killed the thread got caught | Read the stack, fix the cause |
| `ERROR ... FATAL: main messaging thread ... no longer alive but running=true` | The main thread died anyway | Restart the application, check heap and logs |
| `MorphiumDocumentTooLargeException` | A document above the 16 MB limit was about to be written | Catch it, split it, or refuse it |
What is Still Open in MultiCollectionMessaging
Time to be honest. The hardening above sits entirely in SingleCollectionMessaging. The reason is architectural. SingleCollectionMessaging extends Thread and has the one run() loop that can die. We fixed that one.
MultiCollectionMessaging is built differently. It is not a thread. It uses a ScheduledExecutor with scheduleWithFixedDelay(...) for polling and runs listeners directly from the change stream callbacks. The "silent main thread death" bug does not exist in that form there.
But there are analogous risks we did not address in 6.2.4.
- The polling path in `MultiCollectionMessaging` already has `catch (Throwable)` around the scheduled body (see the sketch after this list). Good.
- Listeners run inline from the change stream callback. If a listener throws an `Error` there, it can take the change stream monitor down with it. There is no watchdog for that.
- The pipeline filter for change stream traffic is not in Multi.
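The first point deserves a sketch, because scheduleWithFixedDelay has a nasty default: any throwable that escapes the task body silently cancels all future executions. The generic pattern, with pollOnce() as a hypothetical stand-in for the scheduled body:

```java
// Generic pattern; pollOnce() is a hypothetical stand-in.
// Without the catch, one escaped Throwable would silently cancel
// every future run of this task.
scheduler.scheduleWithFixedDelay(() -> {
    try {
        pollOnce();
    } catch (Throwable t) {
        log.error("poll failed, keeping the schedule alive", t);
    }
}, 0, pollIntervalMs, TimeUnit.MILLISECONDS);
```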
What both implementations get equally is the driver level fix with MorphiumDocumentTooLargeException. It applies anywhere. SingleCollectionMessaging. MultiCollectionMessaging. Plain morphium.store(...) calls. Aggregation output. Whoever sends a 16 MB document to Mongo gets a specific, catchable exception.
That MultiCollectionMessaging needs analogous hardening is on the TODO list, and will likely land in 6.2.5 or 6.3.
Other Things in 6.2.4
So the rest of the CHANGELOG does not go unmentioned, briefly:
- Aggregator field name translation in pipeline stages. Java camelCase fields now translate correctly to MongoDB snake_case inside aggregation stages too. Practical side effect: this was a prerequisite for the messaging pipeline filter.
- `Query.distinct()` and `explainDistinct()`. Same field-name translation, same effect. Distinct queries on camelCase properties work correctly now.
What Came Between 6.2.2 and 6.2.4 (a.k.a. 6.2.3)
If you missed 6.2.3, the highlights:
- `defaultQueryTimeoutMS`, decoupled from the connection pool's `maxWaitTime`. You can finally configure "wait long for a connection, time out queries quickly".
- `storeList(..., continueOnError)`. Batch stores that keep going on individual failures. `ordered: false` for Morphium.
- Batched versioned-entity updates. Fewer round-trips for lists with `@Version` fields.
- Auto-retry on transient `WriteConflict` (error 112) outside transactions.
- CosmosDB compatibility: no more `collation: null` in write commands.
- Structured `writeErrors` in insert and write commands. Finally matches the MongoDB wire format.
- A pile of PoppyDB fixes.
How to Get It
Maven Central as always:
```xml
<dependency>
    <groupId>de.caluga</groupId>
    <artifactId>morphium</artifactId>
    <version>6.2.4</version>
</dependency>
```
Source, CHANGELOG, issues: github.com/sboesebeck/morphium
Lessons Learned
Three things I take from this session:
- `catch (Exception)` is no defense against `Error`s. Main loops that carry a thread should be wrapped in `catch (Throwable)`, no matter how defensive the rest of the code looks.
- Producer/consumer architectures need a liveness watch. When the producer and consumer live in different threads and only talk through a queue, the death symptom is asymmetric. A liveness check on the consumer thread costs little and tells a lot.
- Write-side limits often surface in the wrong thread. An `OutOfMemoryError` can hit a component that has nothing to do with the write that triggered it. A stack pointing at "the writer" does not mean the writer dies.
Until the next release.