David Hallberg

MQ Assumptions Might Send the Wrong Message

Although not the official motto of the U.S. Postal Service, the following ancient words by the Greek historian Herodotus have formed my expectation for mail delivery, and quite possibly yours:

Neither snow, nor rain, nor heat, nor gloom of night, stays these couriers from the swift completion of their appointed rounds.

What if my assumption is wrong? What if my letter doesn’t make it to its destination? In recent years, a couple of postal carriers have been convicted of stealing packages. The people who mailed those packages assumed their couriers would swiftly complete their appointed rounds. They operated on a false assumption. Although the vast majority of mail arrives without a hitch, theirs didn’t.

So it is with Millennium’s delivery system. Be careful not to assume that all transactions are going through.

One of my recent clients had been told that Millennium cannot lose transactions because it uses IBM’s WebSphere MQ for communication. That’s true to an extent: If a transaction is put into a persistent queue (Millennium RDM queue), and if the receiving MQ queue exists and does not have InhibitPut enabled, the data in that transaction will make it. That’s two “ifs,” and here’s another one: If the filesystems or disk drives that MQ uses for logs fill up, MQ on that node will stop. When that happens, no messages get from their source to their destination.
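The "ifs" above can be pictured as a checklist. The following Python sketch is illustrative only: `QueueState`, its fields, `delivery_risks`, and the 5 percent free-space threshold are hypothetical names and values made up for this example, not part of Millennium or any MQ interface. It shows the habit of listing every reason a message might not arrive instead of assuming it will.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueState:
    """Hypothetical snapshot of one delivery path; field names are illustrative."""
    persistent: bool          # the message was put to a persistent queue
    target_exists: bool       # the receiving queue is defined
    inhibit_put: bool         # puts are inhibited on the receiving queue
    log_disk_free_pct: float  # free space on the filesystem holding the MQ logs


def delivery_risks(state: QueueState, min_free_pct: float = 5.0) -> List[str]:
    """Return every reason delivery is at risk; an empty list means none were found."""
    risks = []
    if not state.persistent:
        risks.append("queue is not persistent")
    if not state.target_exists:
        risks.append("receiving queue does not exist")
    if state.inhibit_put:
        risks.append("InhibitPut is enabled")
    if state.log_disk_free_pct < min_free_pct:  # threshold is an assumption
        risks.append("log filesystem is nearly full")
    return risks
```

An empty result only means none of these particular conditions was detected; it is not proof that the message arrived.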

What about non-persistent queues (SSREP)? They have limitations too: The transaction will go through if the requesting process is connected to the SSREP queue, if the Shared Service Queue Administrator (SCP Entry ID 36) does not have InhibitPut enabled, if the request processed by the server (like CPM Script) completes its work successfully, if the SSREP queue is not full, and if the Exception queue is not full. Whew! Any of these issues could cause MQ to fail to deliver something or to deliver what is called a null return, a return that contains no data.
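One way to avoid assuming that a reply arrived is to make the caller distinguish three outcomes: a real reply, no reply before a timeout, and a null return. The sketch below uses Python's standard `queue.Queue` as a stand-in for a reply queue. It does not call any MQ or Millennium interface, and `ReplyStatus` and `await_reply` are hypothetical names.

```python
import queue
from enum import Enum
from typing import Optional, Tuple


class ReplyStatus(Enum):
    OK = "ok"
    NO_REPLY = "no_reply"        # nothing arrived before the timeout
    NULL_RETURN = "null_return"  # a reply arrived but contained no data


def await_reply(reply_queue: "queue.Queue", timeout_s: float) -> Tuple[ReplyStatus, Optional[bytes]]:
    """Wait for one reply and classify it instead of assuming success."""
    try:
        payload = reply_queue.get(timeout=timeout_s)
    except queue.Empty:
        # The request may have been lost, or the server never finished its work.
        return ReplyStatus.NO_REPLY, None
    if not payload:
        # An empty payload is treated as a null return, not as a valid answer.
        return ReplyStatus.NULL_RETURN, None
    return ReplyStatus.OK, payload
```

A caller that checks the returned status can report the failure or retry, rather than carrying on with missing data.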

This might sound to you like a transient issue with a specific request or server that only happens on occasion. Unfortunately for my client, this particular problem caused more than one production outage. How could one request cause a production outage? The problem started to spiral when the Java server that made the request to CPM Script did not get a response back. The following data shows what happened next with the Java server:

The generic Java driver sees that the request from the Java service clinical_event never received a response from CPM Script (illustrated by the first row of data: “… There was no message that met the selection criteria”). Since the request from the user BOBSURUNCLE did not get the data needed from CPM Script, the transaction “msvc_svr_get_medication_administrations” failed. I would hope that the clinician was told about the problem and resubmitted the request. But I do not know if that happened. I do know that the client had their production environment hang for a while. Once they called support, they were told to cycle all of the CPM Script servers. After cycling, everything started working again.

So I urge you to be careful with your assumptions. Sometimes the mail carrier fails to deliver your package or delivers a package with nothing in it. Sometimes WebSphere MQ fails to complete a transaction, sends it to the wrong place, or completes it without the right data. The impact might be small, affecting just one clinician, or large, affecting the entire organization. Keep an eye on the message log event “QUE_Get.” It will be a very good indicator that something is going wrong, especially with any of the Java services.
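Here is a minimal sketch of that kind of watch, assuming the message log can be read or exported as plain text lines. The actual log format is not shown in this article, so the code only matches on the event name. `count_event` and `should_alert` are hypothetical helper names, and the threshold is an arbitrary example value.

```python
from typing import Iterable


def count_event(lines: Iterable[str], event: str = "QUE_Get") -> int:
    """Count log lines that mention the given event name."""
    return sum(1 for line in lines if event in line)


def should_alert(lines: Iterable[str], threshold: int = 1) -> bool:
    """Return True once the QUE_Get count reaches the threshold (an assumed value)."""
    return count_event(lines) >= threshold
```

Run against a recent slice of the log on a schedule, a check like this could flag trouble before a single failed request grows into an outage.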

Prognosis: Having a reliable messaging system is important; having messages properly handled by specific executables is priceless.