Let’s Put Clothes on the Emperor

When clients are looking for improved Millennium® performance – which they almost always are – our conversations often go like this:

Client: “Can you please help me parse and manage the RTMS timer files?”

Me: “Certainly. But I thought we wanted to improve Millennium’s performance for your clinicians and business staff. The RTMS timers will do nothing to resolve performance problems for the point-of-care and business staff who use Millennium.”

Client (with great indignation): “What are you talking about? The RTMS timers are THE performance metric for Millennium.”

With a name that stands for Real Time Monitoring System, you would assume that statement to be true. Its truth, however, depends on how you define the word performance. To clinicians and the business staff, performance relates to how quickly they can get their job accomplished. They are concerned with Millennium’s speed. The data center staff, especially those at hosting vendors, define performance as uptime. They are concerned with the amount of time that the backend systems are available. Some may boast that they offer a performance guarantee of 99.8 percent. Such assurances mean nothing to clinicians, who assume that the system should always be up. Their standard is a system that runs, in the words of a friend of mine, at blink speed.

And while RTMS timers once did measure speed, Millennium architecture changes have altered the meaning of the data the timers deliver. The result is that end-users might complain that the system is slow but the metrics look fine inside the data center. A bit of history will help explain the discrepancies.

Millennium did not originally have RTMS timers in its code. Instead, the middleware had a setting to allow the Citrix server or fat client to log all of the transactions being sent from that device to the application node. In the days when the applications and database resided on the same node, this log was a great metric. However, the setting had some drawbacks. There would be an observable performance degradation in Millennium applications with this logging turned on (degradation of 10 percent or more was not unheard of), and the approach did not show any grouping of applications and functions. So it was a great tool for the middleware team but not so helpful for the application engineering teams.

A grouping feature was eventually deployed. Initially, the functions being measured were closely tied to the traffic between the Citrix server or fat client and the application node, which proved helpful in seeing backend performance issues. Over time, though, we have seen that Windows tab changes and other functions actually only measure how fast the Citrix server or fat client hardware is running rather than how long it takes the application and database nodes to process a transaction for the end-user. I call this “RTMS timer death.” You now have lots of functions being timed, but you can no longer be assured that the function being measured is actually providing a roundtrip view of the transaction time.

Several other changes are also obscuring the quality of the RTMS timer data.

  1. A greater and greater reliance on asynchronous work leads the interactive user to believe the system is more responsive, but it is in fact a bit of a performance cheat. For instance, I may do a function in PowerChart in which a request is sent to CPM Script and CPM Script responds when the function is complete. In fact, even though CPM Script reports that it is done with the function, it has actually sent the request on to CPM Process, and CPM Process has exploded my request into numerous other requests that it then handed to other Millennium executables. Work is still under way after the response comes back.
  2. If the Citrix or fat client application (e.g., PowerChart) crashes or is X’d out, you will see a DBAPI_Exception event for the Millennium application executables that were processing the requests from this client. The Data or Record information starts with “DbDestroyTimer: invalid object handle: xxx,” where xxx is the handle being used to update the RTMS timer data. What does this mean? It’s incomplete data. It’s not clear that each of these events represents an RTMS timer that started but never finished, nor is it obvious whether there is any relevant impact to consider, much less an intervention.
  3. Two files in $cer_mgr – slareporting.cerner and timer.cerner – control whether a timer is active, what functions are used, whether the timer meets the hit/miss count, and the timer’s average or mean time. Though it wasn’t originally the case, the CRM Timer information is now configured in the timer.cerner file. The slareporting.cerner file defines the 54 application timers, and the timer.cerner file defines the 1,022 functions. This would seem pretty good – until you remember that there are more than 35 million lines of code for Millennium, more than 4 GB of Citrix and fat client files, and more than 5,000 Oracle tables. In the slareporting.cerner file, by definition, you have to have each of the 54 applications run 500 times in two hours in order for the data to be counted. You also have to have each application run at least 10,000 times a month or the data does not count.
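
To make the first point concrete, here is a minimal Python sketch of why a timer that stops at the reply under-reports asynchronous work. It is only an illustration: the Step class, the step durations and the assumption that the downstream steps run one after another are all hypothetical, and none of this is Millennium code.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    seconds: float       # hypothetical duration, for illustration only
    caller_waits: bool   # True if the client blocks until this step replies


# Hypothetical breakdown of one PowerChart function.
steps = [
    Step("PowerChart -> CPM Script (request and reply)", 0.25, True),
    Step("CPM Script -> CPM Process (hand-off)", 0.5, False),
    Step("CPM Process -> other executables (fan-out)", 2.0, False),
]


def timed_seconds(steps):
    """Time a client-side timer sees: only the steps the caller waits on."""
    return sum(s.seconds for s in steps if s.caller_waits)


def total_seconds(steps):
    """Time until all of the work is finished, assuming sequential steps."""
    return sum(s.seconds for s in steps)


print(f"timer reports:       {timed_seconds(steps):.2f} s")   # 0.25 s
print(f"work actually takes: {total_seconds(steps):.2f} s")   # 2.75 s
```

With these made-up numbers the timer would report a quarter of a second for work that takes nearly three seconds to finish.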

Let me use PowerChart to explain the potential for problems. Let’s assume that it’s 7 a.m. and physicians are starting to round, nurses are changing shifts and many other clinicians are getting down to their day’s work. In all, 499 clinicians log into PowerChart from 7-9 a.m. The IT department gets reports from these users that Millennium is slow. IT checks the slareporting.cerner file and sees nothing counted for the day’s performance. Why? The slareporting.cerner file must have a minimum of 500 functions of an application – like logging into PowerChart – in a two-hour window, or they do not count. Over the course of a month, if your site did 9,999 PowerChart logins, you would be told that you did not do enough work to be statistically significant. For the purposes of a Lights On Network™ evaluation, this data would not be considered substantial enough for valid comparison. Therefore, if you logged a Service Request stating you are having performance issues, you are likely to be told that based on the data you do not have any issues.
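
The cutoff logic in that example can be sketched in a few lines of Python. This is only my illustration of the two thresholds quoted above (500 runs in a two-hour window, 10,000 runs in a month); the constant and function names are hypothetical, and the real rules live in the slareporting.cerner file, whose format is not shown here.

```python
# Thresholds as quoted in the text; the names are hypothetical.
MIN_RUNS_PER_TWO_HOUR_WINDOW = 500
MIN_RUNS_PER_MONTH = 10_000


def window_is_counted(runs_in_window: int) -> bool:
    """True if a two-hour window has enough runs for its data to count."""
    return runs_in_window >= MIN_RUNS_PER_TWO_HOUR_WINDOW


def month_is_counted(runs_in_month: int) -> bool:
    """True if a month has enough runs for its data to count."""
    return runs_in_month >= MIN_RUNS_PER_MONTH


# 499 logins between 7 and 9 a.m.: the whole window is discarded.
print(window_is_counted(499))    # False
print(window_is_counted(500))    # True

# 9,999 logins in a month: not considered significant.
print(month_is_counted(9_999))   # False
print(month_is_counted(10_000))  # True
```

A single missing login is the difference between a window full of slow transactions being reported and being ignored.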

The initial intent of the RTMS timers was right on the money: to create a mechanism to allow Millennium sites to look at a clinical function and understand where the issue is. However, the execution and changes over time have made the timers unusable. Every client I have worked with in processing their RTMS data has logged Service Requests and been told either 1) there is no problem or 2) a code-level update will make the problem go away.

This approach to analytics showcases the different definitions of performance that I mentioned earlier. The clinical and business staffs are telling you that Millennium is slow and unpredictable, but the data shows no performance issues. The impenetrability of the data leads many organizations to conclude that the environment is unmanageable and that they need to outsource the support for Millennium Citrix servers, application nodes, Oracle databases, etc. After all, the outsource vendors guarantee “performance” to be 99.8 percent.

I’ve already discussed how meaningless that statistic is to everyone outside the data center. A 99.8 percent uptime from the clinicians’ perspective makes no claims to speed and probably means that the 0.2 percent downtime will occur when they need the system the most. Here’s one more caveat: The statistic refers only to unplanned downtime. The two to four hours of scheduled downtime for weekly preventative maintenance are not included in the number. I have been in healthcare IT since 1989. I have been employed by four different hospital systems, have been an HIT consultant and have worked for Cerner, and I have never been allowed to have that much downtime in a month, let alone a week.
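
To put rough numbers on that caveat, here is a small Python sketch of what the guarantee looks like once the scheduled maintenance is added back in. It rests on assumptions of my own: a 30-day month, the 99.8 percent applied to every hour in that month, and maintenance taken every week. The function name is hypothetical, and actual contract terms will vary.

```python
def effective_availability(guaranteed: float,
                           weekly_maintenance_hours: float,
                           days: int = 30) -> float:
    """Fraction of the month the system is up once planned downtime counts.

    Assumes the guarantee covers unplanned downtime only and is measured
    against every hour in the month.
    """
    total_hours = days * 24
    unplanned_hours = total_hours * (1 - guaranteed)
    planned_hours = weekly_maintenance_hours * days / 7
    return (total_hours - unplanned_hours - planned_hours) / total_hours


# Unplanned downtime allowed by a 99.8 percent guarantee in a 30-day month.
print(f"{30 * 24 * (1 - 0.998):.2f} hours")               # 1.44 hours

# Availability with two and with four hours of weekly maintenance.
print(f"{effective_availability(0.998, 2) * 100:.2f}%")   # 98.61%
print(f"{effective_availability(0.998, 4) * 100:.2f}%")   # 97.42%
```

Under these assumptions, the scheduled maintenance adds roughly 8.6 to 17.1 hours of downtime a month on top of the 1.44 hours the guarantee allows.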

Nor was I allowed to ignore the clinicians’ desire for blink speed. When I did performance benchmarking at a facility running a different clinical system, we had the system tuned so we could do the screen changes in 0.6 seconds. Seriously. On one occasion we had a drive problem that spiked the time to 0.9 seconds. The screen flips were 0.3 seconds slower, about the time it takes to blink your eyes, and our help desk phones were on fire from the outraged clinicians complaining of performance problems.

So what can you do to help resolve your performance problems? The first step is probably the hardest. Stop believing the propaganda that your system is fine. Anyone telling you that you don’t have a performance issue that can be tuned is like the “tailor” telling the emperor that his new clothes are exquisite. They’re trying to convince you of something your eyes and brain know is not true. Listen instead to the people who interactively use your system every day. The clinicians and business staff will be the first to notice a difference in performance. The next step is trying to understand what you can do to monitor the performance of all your Millennium systems. The last step is to determine what staff and budget changes you need to make to address the issues.

There is a lot you can do with the tools you already have. You’ll find lots of tips in previous blog postings. I also encourage you to respond to this blog with your own questions and insights. I welcome a lively dialogue.