For Crises: Organizing Data Center Operations Remotely
Interview with Dr. Rainer Weidmann, Partner and expert for data centers at Detecon
Rainer, what must companies do to ensure the secure, uninterrupted operation of their data centers (DCs) even during the coronavirus pandemic?
Rainer Weidmann: Obviously, an uninterrupted power supply is required at all times to ensure the operation of the data center infrastructure and IT operations. This is the responsibility of the technical facility department at the building level. Technical precautions include in particular emergency power supply systems, UPS (uninterruptible power supply), air conditioning systems, and switchgear. Of course, it helps enormously if system infrastructures are redundant and designed without any so-called SPOFs (single points of failure). Otherwise, there is no guarantee that the very first malfunction will not immediately bring down the entire system, especially if no personnel who can respond immediately are on site. Ideally, the IT components are served by two separate active supply paths (A/B).
A specific example: as a rule, a short circuit in the LVS (low-voltage main distribution board) shuts down the entire data center unless a redundant LVS immediately takes over and keeps the power supply uninterrupted. If one is in place, the center can “survive” this first malfunction. A power outage of just 12 milliseconds at the IT components will cause them to fail. If STS (static transfer switches) are installed in the IT power supply downstream of the UPS, they must therefore switch in less than 12 ms.
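The 12 ms rule can be written down as a simple check. The following is only a minimal sketch: the function name and the sample transfer times are hypothetical, and only the 12 ms threshold comes from the interview.

```python
# Outage duration at which the IT components fail, as stated in the interview.
IT_OUTAGE_LIMIT_MS = 12.0

def sts_switches_in_time(transfer_time_ms: float,
                         limit_ms: float = IT_OUTAGE_LIMIT_MS) -> bool:
    """Return True if the STS transfer completes before the IT load fails.

    The comparison is strict: a transfer that takes exactly as long as
    the limit is treated as too slow.
    """
    return transfer_time_ms < limit_ms

# Hypothetical transfer times, chosen only for illustration.
for t in (4.0, 12.0, 20.0):
    print(f"{t} ms -> {'ok' if sts_switches_in_time(t) else 'too slow'}")
```

In practice, the transfer time would have to come from the switch manufacturer's data or from on-site measurement.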
In 2018, an incident at Hamburg Airport showed that a short circuit can paralyze flight operations completely for almost 24 hours. A redundant power source also gives personnel more time to determine the root cause of a malfunction.
What role can remote monitoring play?
Rainer Weidmann: Normally, the necessary monitoring for the technical building equipment is performed by a local facility management system that is visualized in control centers. However, the building control system should also be operable by remote access and should also have the capability to send text messages or emails to a specific group of recipients. If such functions are available, on-site personnel in the control centers can be reduced to the absolute minimum. The entire remote strategy and the related authorizations (read or write) must be carefully defined because sensitive data are involved. After all, in the worst case, malicious hackers could even shut down the data center from the outside.
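A rough sketch of how read and write authorizations and a notification group could be modeled is shown below. All names (RemoteUser, is_allowed, alarm_recipients) and the sample users are hypothetical and do not refer to any specific building control product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteUser:
    name: str
    can_write: bool  # write = may change setpoints or switch equipment
    notify: bool     # member of the alarm recipient group (text message or email)

def is_allowed(user: RemoteUser, action: str) -> bool:
    """Every remote user in this sketch may read; only authorized users may write."""
    if action == "read":
        return True
    if action == "write":
        return user.can_write
    return False  # unknown actions are denied by default

def alarm_recipients(users: list[RemoteUser]) -> list[str]:
    """Return the names of all users who should receive alarm messages."""
    return [u.name for u in users if u.notify]

# Hypothetical users, for illustration only.
users = [
    RemoteUser("on_call_technician", can_write=True, notify=True),
    RemoteUser("facility_manager", can_write=False, notify=True),
    RemoteUser("auditor", can_write=False, notify=False),
]
print(alarm_recipients(users))
```

Denying anything that is not explicitly permitted keeps the write path, which in the worst case could shut down the data center, as narrow as possible.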
It is also possible to relocate or centralize the control stations for several data centers at an external location. In this case, however, extremely tight security precautions must be implemented to guard against possible cyberattacks. If this has been done, a dark data center that does not require permanent on-site operating personnel and can be maintained completely from a remote location would also be conceivable. Nevertheless, it must be equipped with effective redundancies on site so that a technician needs to be dispatched only in the event of a malfunction.
How can IT systems and data be protected?
Rainer Weidmann: The principle is the same here: through redundancy and even more redundancy! Of course, the use of mirrored systems or the use of cloud technology is recommended because all involved parties are less vulnerable in the event of damage to the local physical server. The application and database can then be moved to another mirrored or virtual machine. Apart from the one-time setup, the cloud requires no on-site presence and can be operated remotely. Another important aspect here is the network, which should also be redundant.
Unfortunately, small and medium-sized companies in particular often forgo this redundancy, at least in part, for cost reasons. Yet medium-sized companies often employ only one or two IT experts to secure operations, and the risk of failure rises dramatically if these people contract COVID-19.
What stress tests do you recommend to companies?
Rainer Weidmann: Companies that do not set down in detail the procedures to be followed in the event of a catastrophe and that do not thoroughly test the process live, end to end, at least once are guilty of negligence. High-level concepts and papers have generally been prepared, but do these scenarios really work in an emergency? Can a complete data center be restored and become operational again at a different location within 48 hours? A business impact analysis must also determine what effects a failure of the data center would have on the business as a whole so that the situation can be assessed in an emergency.
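The 48-hour question can be evaluated after a live test by adding up the measured duration of each recovery step. The sketch below assumes the steps run one after another; the step names and durations are invented for illustration and are not results from any real test.

```python
RECOVERY_TARGET_HOURS = 48.0  # target mentioned in the interview

def total_recovery_hours(steps: list[tuple[str, float]]) -> float:
    """Sum the measured durations of sequential recovery steps."""
    return sum(hours for _, hours in steps)

def meets_target(steps: list[tuple[str, float]],
                 target_hours: float = RECOVERY_TARGET_HOURS) -> bool:
    """Return True if the full recovery finished within the target time."""
    return total_recovery_hours(steps) <= target_hours

# Hypothetical measurements from a single end-to-end test.
measured_steps = [
    ("declare emergency and assemble team", 6.0),
    ("provision infrastructure at the alternate site", 18.0),
    ("restore data from backups", 12.0),
    ("verify applications and switch over users", 8.0),
]
print(total_recovery_hours(measured_steps), meets_target(measured_steps))
```

If some steps can run in parallel, a plain sum overstates the recovery time, and the longest dependent chain would be the relevant figure instead.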
Regarding the coronavirus, companies should perform a general review of their activities to determine which ones can be performed remotely at all. Do employees working from home have the required resources in terms of bandwidth and connections? Are additional company lines required? Ideally, tests will be carried out in advance that also clarify whether sufficient software licenses and infrastructure such as digital signatures are available. These steps will reveal whether capacities and resources are adequate to run the operation externally if necessary.