After our last post, “Improving Interface Failovers,” a few people wondered why some Cerner Millennium® servers — most often Ops Job (Millennium batch process), interface and MDI servers — would not start during the failover. Because this situation can have disastrous consequences, it’s an issue worth addressing.
Imagine the following scenario: It’s 3 a.m. on Memorial Day, and your primary application node crashes. The failover scripts run so that users can log into Millennium, but none of the Ops Jobs are running. That means the nursing staff is missing care plans for its shift change, new patients and their lab results are not visible in Millennium, and the Emergency Department director is calling the CIO, COO and CEO complaining that his staff has to go on diversion because they cannot get the patients into Millennium. A storm just ruined your holiday plans.
Had you been aware of a few simple steps that would make your servers fail over successfully, your day could have been clear sailing.
Why didn’t the Ops Job servers start? If the server on a specific Server Control Panel (SCP) Entry ID has not been started on this node before, the Security Master server has to be running to create the service with the correct permissions. If no Security Master server is running after the failover process completes, any of the servers or services that have not gotten to a running state on the failover node prior to the failover cannot be started.
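The rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not anything that ships with Millennium: the `ScpEntry` record, its fields and the `servers_blocked` helper are hypothetical names, and you would have to build the inventory of which servers have already run on the failover node yourself.

```python
from dataclasses import dataclass


@dataclass
class ScpEntry:
    """Hypothetical record describing one server on the failover node."""
    entry_id: int
    name: str
    # True if this server has reached a running state on this node before.
    has_run_on_node: bool


def servers_blocked(entries, security_master_running):
    """Return the entries that cannot start on the failover node.

    A server that has never run on this node needs the Security Master
    to create its service. Without a running Security Master, those
    servers are stuck; servers that ran here before are unaffected.
    """
    if security_master_running:
        return []
    return [e for e in entries if not e.has_run_on_node]
```

Running a check like this against your own server inventory before a failover shows which Ops Job, interface and MDI servers would be stranded if the Security Master did not come up.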
A little analysis will explain why this scenario happens so frequently. Way back when, about 90 days before you first brought Millennium up, consultants from the vendor swooped in to set up failover. They ran through a couple of tests and showed you how, should the primary node go down, you could still log into Millennium. They flew back home, and you turned your attention back to what seemed like more critical work that had to be done before go-live.
Now, many years later, you have added and deleted Ops Jobs, your old registration system has been upgraded, the inbound feed changed names to allow testing into Millennium production and to minimize the cutover time, you added about a dozen MDI servers, you moved on to a different Millennium client, or you were promoted, and your project staff is scattered around various clinical areas. Go-live and pre-go-live are remembered for the number of service packs you had to install and test and the huge amount of training you had to do. Who recalls the little detail about the need for the Security Master to be running to allow a new Millennium server to start on a node?
What, in fact, is the Security Master? For organizations using LDAP to access and maintain their user directory, the Security Master server updates the sec_user.dat file with local accounts, for example SYSTEM, SYSTEMOE and CERNER. For those not using LDAP, the Security Master also is used to add, modify and delete all user IDs. So when a new person joins your organization and you run the tool HNAUser, a new row is added to sec_user.dat as well as to some Oracle tables. Additionally, the Security Master creates a service key for the Millennium processes as they start up. These service keys exist only in the sec_user.dat file; they are not kept in LDAP. The key is used to make sure the processes running on the application nodes are legitimate processes. During normal operations, no one really cares about the security services. They quietly do their work when you stop and start Millennium servers on the application nodes, and everyone, especially your security manager, is happy. Until a failover.
Let’s go back to our Memorial Day crisis, where your family headed to the lake without you. You, like most Millennium clients, have Node 1 of SCP Entry ID 32 configured as the Security Master and Node 2 as the Security Slave. With this common configuration, however, you do not fail over the Security Master, so all of the Ops Jobs you added after go-live, along with new interfaces and MDI servers, didn’t seem to start. Yet nothing shows in any of the message logs. You called your vendor’s Immediate Response Center, but the response has not been immediate enough. After calling you back in 15 to 45 minutes to verify you are still having the problem, IRC might take four more hours to contact you again. In the meantime, the ED director is calling for your resignation and the ICU, CCU and NNICU staff are looking for a throat to choke.
How could you have prevented the chaos and been fishing for something other than answers? The single biggest thing you can do is NOT put a Security Slave on SCP Entry ID 32. Use any other SCP Entry ID — 31 and 19 are often open. After you have the Security Slave configured to run on a different SCP Entry ID, copy the Security Master from the primary node to the secondary nodes. Verify that the Instance is set to 0, which will ensure that SCP Entry ID 32 does not start up and that CPM HNAM Agent does not start it. Finally, put 32 in the cerner.vars file as a single-instance server. That way, when the failover scripts run, you will be able to start the Ops Jobs, interfaces and MDIs as needed to minimize your downtime.
Here’s one last detail for all the system administrators who like such specifics: Create a cron job to automatically copy the security.jnl file in $cer_config from the application node where the Security Master is running to the other application nodes. Then verify that the permissions are “-rw-rw----” and that the owner and group (assuming your domain name is prod) are both d_prod.
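If you would rather not eyeball the permissions after every copy, the verification step can be scripted. The following is a minimal sketch in Python, assuming a Unix-like application node; the `check_journal` helper is a hypothetical name, and the path to security.jnl and the d_prod owner and group are values you would pass in for your own domain.

```python
import grp
import os
import pwd
import stat

# From the post: read/write for owner and group, nothing for others.
EXPECTED_MODE = "-rw-rw----"


def check_journal(path, expected_owner=None, expected_group=None):
    """Return a list of problems found with the file at `path`.

    An empty list means the mode (and the owner and group, when they
    are supplied) match what is expected.
    """
    problems = []
    st = os.stat(path)

    # stat.filemode renders the mode the same way `ls -l` does.
    actual_mode = stat.filemode(st.st_mode)
    if actual_mode != EXPECTED_MODE:
        problems.append(f"mode is {actual_mode}, expected {EXPECTED_MODE}")

    if expected_owner is not None:
        owner = pwd.getpwuid(st.st_uid).pw_name
        if owner != expected_owner:
            problems.append(f"owner is {owner}, expected {expected_owner}")

    if expected_group is not None:
        group = grp.getgrgid(st.st_gid).gr_name
        if group != expected_group:
            problems.append(f"group is {group}, expected {expected_group}")

    return problems
```

For a domain named prod, you might call `check_journal(journal_path, "d_prod", "d_prod")` on each secondary node after the copy, where `journal_path` is wherever security.jnl lives under your $cer_config, and raise an alert if the returned list is not empty.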
Prognosis: A little preparation can dramatically improve the failover process and make your night, weekend and holiday events much shorter and quieter.