Magic Numbers…

Actually, I had reserved this particular Friday especially for an OpenShift4 Bare-Metal installation, but you know Murphy: Firstly, anything that can go wrong will go wrong and secondly, Murphy was an optimist. So I will report on my experiences with RedHat another time.

On the morning of the Friday in question, a colleague called with the usual phrase that technicians have learned to dread: “Please take a look.” The dreaded “5 minutes”. But unfortunately, it was important, because a critical application on the production system was at a standstill and nothing worked. So I went looking for the problem and tried to find the cause. Since the error occurred from one day to the next without any changes to the program, I suspected a data constellation that was problematic for the program as the reason for the crash.

Maybe we should have installed a fancy blue screen for something like that, as Windows users have been used to that for decades and then everyone would know what was going on. Well, it wasn’t the data – everything looked fine. The software had not been modified. We were also able to rule out user error this time. I call in a second colleague who was also involved in the project at the time. Neither of us can make any sense of it and bafflement sets in.

Oh, did I mention that it was a Friday? And what’s more, it was a long weekend – which I was really entertaining because everyone else was on holiday. And that was exactly the problem, because we could by now rule out the possibility that it was neither the software nor the data. So we started looking for environmental changes – but it seemed that no Windows updates had been run. And even if there had been, in the short term we couldn’t possibly reverse it for several thousand users.

Then comes the notorious moment when the first managers are called in and a Teams session is briefly scheduled. Then, as we think about alternatives and workarounds, another inkling comes to me, but it is only vague. Just to elucidate: modern software has maybe 20% original code and 80% that is invoked from program libraries. For example, no one would build the secure data exchange via SSL and HTTP for an office application themselves – special libraries are used for this. So we opted to update to a more recent module in one of these libraries. This might have undesired side effects, but when nothing works any more, every straw counts.

OK, as a result version 11 became version 15. Celebrations all around: it worked. In the post-analysis we came to the conclusion that the interface between Windows and Java (used by us for the program) had apparently been changed by an automatic operating system update. Our learning: you can automate as much as you like, but hosting and IT infrastructure management are complex and many new fields of activity are emerging. It is by no means the case that more automation always costs jobs. My first articles for the Harlequin already tackled this topic, and I will probably come back to it in the next weeks/months. Until then, allow time for updates!

Original text: MHA
English translation: BCO


Leave a Reply

Your email address will not be published. Required fields are marked *