Keith Lawrence(BSAC)
15-11-2004, 10:18
I would like to explain a little about what happened, what we did about it, and what actions we can take to stop it happening again.
Background : We own our web server; it's our responsibility, and it was purchased about three years ago. It is co-located via our ISP (Easily) at Redbus Interhouse in London's Docklands. Redbus is a state-of-the-art data centre with high security, multiple back-ups and standby generators, the type of thing one would expect for reliability...
Tuesday 9th November : In the early hours two floors of Redbus suffered a major power failure. It shouldn't have happened but it did, and the event made all of the computer press! Hundreds of servers and several ISPs were badly affected, and the BSAC web server was one of those which lost power. We were trying all day Tuesday to find out what had happened; it was about 18:00 on Tuesday evening that we finally found out that our web server had failed to restart. It was dead.
Wednesday 10th November : An engineer was on site trying to get us going again, but the server still wouldn't restart. The fault was that the power failure had corrupted the disk drives. I took the decision late Wednesday afternoon that the server software should be reloaded. This process went on late into Wednesday evening.
Thursday 11th November : The server was running and our old data was saved. At around 16:00 a completely blank web server was given back to us. Mike and I set about configuring it and reloading the web sites from our backups.
Friday 12th November Onwards : The process of reloading our content continued. Over the weekend Vic Watson and I spent a considerable amount of time trying to get our (complex) forum system running again; it was Vic who finally succeeded in the early hours of Monday morning!
OK, some answers to some perfectly valid questions :
Why didn't we have a backup system? : It was my decision some time ago that one was not required. Major failures such as we've just had are very rare; this is the first time that I've seen a web server do this. A backup system would probably double our costs, and I would rather spend members' money elsewhere. I do not deem the web as "business critical": whilst we have backup and standby systems for our essential HQ services such as email, there is not one for web services. Our lack of a backup system has just cost us about 60 hours of web downtime, but it has saved us several thousand pounds over the last three years. It is still my personal view, despite recent happenings, that my decision on this was correct.
Why couldn't you tell us what was going on? : Because we did not have the "fine control" over our DNS (the thing that says where www.bsac.org is) that we would like, the internet DNS system was pointing our web sites at a dead server! This was something that we knew about; discussions were actually going on within the IT Team about it when our web server failed.
What are you doing about it? : We had planned to replace the web server during 2005 anyway; subject to final Council approval of the overall 2005 budgets, this will be going ahead. Discussions about web servers, hosting, support and everything connected with this area are already underway within the IT Team. So...
1. We are going to take control of our DNS. This will allow us to very quickly configure a "sorry, we've got problems" emergency service elsewhere should this happen again.
2. We want to replace the web server during 2005; more modern servers are less prone to this type of error anyway.
3. We will be looking again at backup systems; we may be able to use our old web server as an emergency backup once the new one is commissioned.
4. We need to review our procedures and put a formal contingency plan in place rather than "sort it out when it happens" as we did this time.
So on behalf of the BSAC I would like to apologise to our members for the failure of our web services. Although the initial events were outside of our control, it is still my direct responsibility to you and to Council for the running of our IT systems, and it is to me that you should address any comments and criticisms. I am sorry that it happened, I am sorry that we couldn't do more to keep you informed, and I will be doing all that I can to improve our systems for the future.
Kind Regards
Keith Lawrence
BSAC Council Member
BSAC IT Team Leader