Franek Sodzawiczny, founder and CEO at Zenium, explains why we need to be wary of operational fiction in the data centre…
In a bid to maximise uptime, many organisations try to reduce risk by taking a ‘tick box’ approach to the management of their data centre operations. Long lists of essential requirements may very well suggest that a certain level of reliability or robustness can be guaranteed. Requesting that the latest technological innovations are incorporated into the design of data halls could indeed contribute towards continuous service delivery, but I stress the word ‘could’. The key to operational professionalism and, more importantly, performance is not what can be done on paper but what can be put into practice.
Experience over the years has shown that over-complicated design and specification can, in fact, increase the possibility of downtime, because unnecessarily complex systems are harder to maintain and harder to fix when they fail. If reports are true that 75% of all downtime in the data centre is the result of human error, the most sensible approach is to resist the urge to be an early adopter of new technologies that are not yet understood, avoid complex designs and simply implement systems that are easier to operate and maintain.
Engineers who draw on what they have learned in the past when they design, commission and operate a data centre are better able to avoid what has failed and repeat what has worked. A data centre that is both operations-led and experience-led is able to reinvest that knowledge in the design and construction of every hall.
Logic tells us that one of the most effective ways to manage risk of any kind is to avoid a single point of failure wherever possible. The aim is to contain individual, relatively small problems at a local level, preventing them from escalating into a major problem across the entire facility. Unfortunately, not all single points of failure are that obvious.
For example, if you use a building management system (BMS) for remote enable/disable of critical equipment, a simple software failure could turn off perfectly healthy pieces of equipment with subsequent loss of services to the tenant. As a result, good practice suggests that it’s best to keep it simple and test for every conceivable eventuality before a client takes occupancy of the space. Of course, the simpler the solution, the more likely it will be that you will be able to test for every possible scenario during fully loaded integrated systems tests (ISTs).
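The hidden single point of failure described above can be put into numbers. The following is a minimal sketch only: the availability figures are hypothetical, failures are assumed to be independent, and a BMS fault that can disable the equipment is treated, as a simplification, as a series element in front of two redundant units.

```python
def series(*availabilities):
    """Availability when every component must work for service to continue."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability when any one of the redundant components is enough.

    Assumes the components fail independently of each other.
    """
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


# Hypothetical figures, chosen purely for illustration.
CHILLER = 0.99   # availability of one cooling unit
BMS = 0.999      # availability of the shared remote enable/disable path

redundant_chillers = parallel(CHILLER, CHILLER)
with_shared_bms = series(redundant_chillers, BMS)

print(f"Two redundant units alone:    {redundant_chillers:.6f}")
print(f"Same units behind shared BMS: {with_shared_bms:.6f}")
```

Under these assumptions the shared control path, not the redundant plant, sets the ceiling on availability, which is the argument for keeping that path simple and testing it exhaustively.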
Data centre operators are ultimately judged on what they achieve, and failure to hit SLAs only results in operational targets being missed and penalty clauses being invoked. The best way to ensure that SLAs are practical, feasible and achievable is to take an engineering-led approach to operations management. In other words, think of SLAs as a numerical function of the engineering, not a negotiating element of the contract. Projections should not be unrealistic and cutting-edge technology should not be specified simply to impress. It is imperative that downtime and mean time between failures (MTBF) are fully understood by clients. The impact that design and commissioning decisions can have on SLAs and operational efficiencies must also be taken on board.
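To illustrate what treating an SLA as a numerical function of the engineering might look like, here is a minimal sketch using the standard steady-state relationship availability = MTBF / (MTBF + MTTR), where MTTR is mean time to repair. The function names and all figures are hypothetical and are not drawn from any real facility.

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored for simplicity


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


def sla_is_supported(mtbf_hours, mttr_hours, sla_target):
    """True if the engineering figures meet or exceed the proposed SLA target."""
    return availability(mtbf_hours, mttr_hours) >= sla_target


# Hypothetical system: fails on average every 50,000 hours, takes 4 hours to repair.
a = availability(50_000, 4)
print(f"Availability: {a:.6f}")
print(f"Expected downtime: {annual_downtime_hours(a) * 60:.0f} minutes per year")
print("Supports a 99.99% SLA: ", sla_is_supported(50_000, 4, 0.9999))
print("Supports a 99.999% SLA:", sla_is_supported(50_000, 4, 0.99999))
```

The point of the sketch is the direction of the calculation: the achievable SLA falls out of the failure and repair figures, rather than being agreed first and engineered towards afterwards.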
For example, reducing the cost and time needed to install pipework might be regarded as the best way to achieve an earlier completion date. But while plastic pipes might be cost-effective, they are also more prone to cracking, which could ultimately disrupt the supply of water for the cooling system. Experience shows that plastic pipes are not worth the risk, but it is up to the data centre operator to explain why it is worth the extra time and money to stick with heavyweight steel pipes in most circumstances.
Clearly, data centre operators need to be more transparent about the pros and cons of different systems, approaches and accreditations. They need to listen to what the client wants and then explain and justify their recommendations, even if that means suggesting an alternative course of action that is in the client’s best interest. They should regard an SLA as a promise to deliver and be prepared to renegotiate an SLA if it is not technically or operationally viable.
After all, the only way to reduce any potential for infrastructural weakness or operational complications that could hinder resilience or efficiency is for data centre operators to harness their experience of design, commissioning and management and ultimately stand by their advice as professionals.
This post originated at Data Centre Management magazine, from the same publisher as The Stack.