BSAC Web Services : What Went Wrong


Keith Lawrence(BSAC)
15-11-2004, 10:18
I would like to explain a little about what happened, what we did about it, and what actions we can take to stop it happening again.

Background : We own our web server, it's our responsibility, it was purchased about three years ago. It is co-located via our ISP (Easily) at Redbus Interhouse in London's Docklands. Redbus is a state-of-the-art data centre, high security, multiple back-ups, standby generators, the type of thing one would expect for reliability...

Tuesday 9th November : In the early hours two floors of Redbus suffered a major power failure. It shouldn't have happened but it did, the event made all of the computer press! Hundreds of servers and several ISPs were badly affected, the BSAC web server was one of those which lost power. We were trying all day Tuesday to find out what had happened, it was about 18:00 Tuesday evening we finally found out that our web server had failed to restart, it was dead.

Wednesday 10th November : An engineer was on site trying to get us going again, it still wouldn't restart. The fault was that the power failure had corrupted the disk drives. I took the decision late Wednesday afternoon that the server software should be reloaded. This process went on late into Wednesday evening.

Thursday 11th November : The server was running, our old data was saved. At around 16:00 a completely blank web server was given back to us. Mike and myself set about configuring it and reloading the web sites from our backups.

Friday 12th November Onwards : The process of reloading our content continued. Over the weekend myself and Vic Watson spent a considerable amount of time trying to get our (complex) forum system running again, it was Vic who finally succeeded in the early hours of Monday morning!

OK, some answers to some perfectly valid questions :

Why didn't we have a backup system? : It was my decision some time ago that one was not required. Major failures such as we've just had are very rare, this is the first time that I've seen a web server do this. A backup system would probably double our costs, I would rather spend members' money elsewhere. I do not deem the web as "business critical", whilst we have backup and standby systems for our essential HQ services such as email there is not one for web services. Our lack of a backup system has just cost us about 60 hours of web downtime, but it has saved us several thousand pounds over the last three years. It is still my personal view, despite recent happenings, that my decision on this was correct.
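
To put the 60 hours in context, here is a rough availability calculation. It is only a sketch: it assumes exactly three 365-day years and that this outage was the only downtime in that period, neither of which is stated above. The function name is just for illustration.

```python
# Rough availability figure for the period described above.
# Assumptions (not from the post): exactly three 365-day years, and the
# 60-hour outage was the only downtime in that period.
HOURS_PER_YEAR = 365 * 24


def availability(downtime_hours: float, years: float) -> float:
    """Return the fraction of the period during which the service was up."""
    total_hours = years * HOURS_PER_YEAR
    return 1 - downtime_hours / total_hours


if __name__ == "__main__":
    # 60 hours down out of 26,280 hours in three years
    print(f"{availability(60, 3):.2%}")  # prints 99.77%
```

On those assumptions the web server was up for about 99.77% of the three years, which is the figure to weigh against the cost of a standby system.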

Why couldn't you tell us what was going on? : Because we did not have the "fine control" over our DNS (the thing that says where www.bsac.org is) that we would like, the internet DNS system was pointing our web sites at a dead server! This was something that we knew about, discussions were actually going on within the IT Team about it when our web server failed.

What are you doing about it? : We had planned to replace the web server during 2005 anyway, subject to final Council approval of the overall 2005 budgets this will be going ahead. Discussions about web servers, hosting, support and everything connected with this area are already underway within the IT Team. So:
1. We are going to take control of our DNS. This will allow us to very quickly configure a "sorry, we've got problems" emergency service elsewhere should this happen again.
2. We want to replace the web server during 2005, more modern servers are less prone to this type of error anyway.
3. We will be looking again at backup systems, we may be able to use our old web server as an emergency backup once the new one is commissioned.
4. We need to review our procedures and put a formal contingency plan in place rather than "sort it out when it happens" as we did this time.
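
As an illustration of point 1, the check behind an emergency "sorry" service could look something like the sketch below. It is not a description of what the IT Team will actually deploy. Both IP addresses are hypothetical placeholders taken from the ranges reserved for documentation, the function names are made up for the example, and the step that actually changes the DNS record is left out because it depends entirely on the DNS provider.

```python
import socket

# Hypothetical addresses (documentation ranges), not real BSAC servers.
PRIMARY_IP = "192.0.2.10"      # the main web server
FALLBACK_IP = "198.51.100.20"  # a host serving a "sorry, we've got problems" page


def is_reachable(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def choose_target(primary_up: bool,
                  primary: str = PRIMARY_IP,
                  fallback: str = FALLBACK_IP) -> str:
    """Pick the address the www record should point at."""
    return primary if primary_up else fallback
```

Calling choose_target(is_reachable(PRIMARY_IP)) gives the address the www record should carry at that moment. A change like this only takes effect quickly if the record has a short TTL, because resolvers keep serving the old cached answer until the TTL expires.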


So on behalf of the BSAC I would like to apologise to our members for the failure of our web services. Although the initial events were outside of our control it is still my direct responsibility to you and to Council for the running of our IT systems, it is to me that you should address any comments and criticisms. I am sorry that it happened, I am sorry that we couldn't do more to keep you informed, I will be doing all that I can to improve our systems for the future.

Kind Regards

Keith Lawrence
BSAC Council Member
BSAC IT Team Leader

Gary Cameron
15-11-2004, 16:26
Keith

Been there and know what it is like. I think you really ought to congratulate yourself(s) rather than take blame. I also don't think you need a full system backup for a WEB server. It is very easy to condemn in hindsight.

But I am curious about the power supply. Was the server fed by a UPS (Uninterruptible Power Supply) as part of the site supply system, or did we have a separate one?

Regards

Gary Cameron
York

Keith Lawrence(BSAC)
15-11-2004, 17:58
But I am curious about the power supply. Was the server fed by a UPS (Uninterruptible Power Supply) as part of the site supply system, or did we have a separate one?

You're curious, I'm curious, I know of several very curious ISPs as well! We relied on the UPS provided by our supplier who in turn relied on Redbus. There's no point in spending money on UPSs when you're in a ruddy great data centre with their own massive UPS systems and backup generators.

What follows is the Redbus explanation, it's s*d's law of course but it seems that the testing and commissioning of a brand new UPS actually caused the problem. Reminds me a bit of a friend of mine years ago... they lost their server room to a fire... the power supply to one of their UPS units (designed of course to protect them) burst into flames and took the server room with it! There is no such thing as 100% reliability in this business.

Keith L

From Redbus -

Further to your request for more information with regards the incident on Tuesday 9th of November 2004 at our Harbour Exchange facility, I attach below a brief outline of the events, as they are known at this time.

On Tuesday 9th of November 2004, we were informed of a fault on our systems at around 00.20Hrs

The fault was fully rectified 12 minutes later at approximately 00.32Hrs

At this stage an inspection of the floors commenced.

As you have been made aware Redbus is currently in the process of carrying out a £1.8 million upgrade to its Harbour Exchange data centre. The upgrade includes a new redundant power supply and cooling system which is being built on top of and in addition to our existing infrastructure. Once complete the original system will in part be dismantled and removed. In preparation of this upgrade various commissioning procedures are required.

During one of the routine switching schedules a critical component failure on one of the new UPS Systems was encountered which sent the module into fault. This resulted in the system not going through its usual static bypass mode which is designed to protect the critical load.

At this stage the specialists on hand were able to manually transfer the load back onto the existing system.

The fault was an isolated incident, which affected only part of the Harbour Exchange data centre. It can only at this stage be surmised that the component failure was due either to a hardware fault on the new equipment (this is however unlikely, as the unit was rigorously tested prior to commissioning) or to a large surge in the incoming mains supply which hit the UPS while the switches were momentarily opened. Redbus is seeking clarification on this point from its electricity supplier as various other anomalies had been encountered throughout the previous day.

The upgrade team have been clearly briefed and will not attempt to reinstate any switching schedules until the cause of the failure has been clearly identified.

I can only express our sincere apologies for the loss of service which this incident represents.

Paul Leyland
15-11-2004, 18:20
Just to say thank you for giving up your time and for all the effort you put in - and unpaid too! Well done to you and your team.

Gary Cameron
16-11-2004, 09:55
You're curious, I'm curious, I know of several very curious ISPs as well! We relied on the UPS provided by our supplier who in turn relied on Redbus. There's no point in spending money on UPSs when you're in a ruddy great data centre with their own massive UPS systems and backup generators.

As far as I am aware, it is not good practice to cascade UPS systems, therefore I would say that you have done everything you could and more. Even with the use of hindsight.
Thanks
Gary Cameron
DO YSAC

DAVID__MILLS
16-11-2004, 11:14
I would like to explain a little about what happened,
Keith Lawrence
BSAC Council Member
BSAC IT Team Leader

I think it is to your credit to get the server back up and running. Nice work, thanks.

David

alunharford
17-11-2004, 21:12
What follows is the Redbus explanation, it's s*d's law of course but it seems that the testing and commissioning of a brand new UPS actually caused the problem. Reminds me a bit of a friend of mine years ago... they lost their server room to a fire... the power supply to one of their UPS units (designed of course to protect them) burst into flames and took the server room with it! There is no such thing as 100% reliability in this business.

Not surprising. They cause more problems than they solve.
If you use redundant PSUs with UPSs then they're useful, but even then you need somebody on site 24/7 because they tend to want to burst into flames all the time.