Always available

This means that authorized users have timely and reliable access to information services. IT resources and infrastructure should remain robust and fully functional at all times, even during adverse conditions such as database failures or fail-overs. Availability also involves protecting against malicious code, hackers and other threats that could block access to the information system.

Availability ensures that a system’s authorized users have timely and uninterrupted access to the information in the system and to the network. Common methods of achieving availability include the following:

Distributive allocation. Commonly known as load balancing, distributive allocation allows for distributing the load (file requests, data routing and so on) so that no device is overly burdened.
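In its simplest form, distributive allocation can be sketched as a round-robin dispatcher that hands each incoming request to the next server in the pool. The server names below are placeholders for illustration, not part of the original text:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distribute incoming requests evenly across a pool of servers."""

    def __init__(self, servers):
        self._pool = cycle(servers)  # endless round-robin iterator

    def next_server(self):
        return next(self._pool)

lb = RoundRobinBalancer(["app1", "app2", "app3"])
assignments = [lb.next_server() for _ in range(6)]
print(assignments)  # each server receives two of the six requests
```

Real load balancers weigh servers by capacity and current load rather than rotating blindly, but the goal is the same: no single device is overly burdened.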

 

High availability refers to measures that are used to keep services and information systems operational during an outage. The goal of HA is often to have key services available 99.999 per cent of the time (known as “five nines” availability). HA strategies include redundancy and failover, which are discussed below.
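To make "five nines" concrete, the arithmetic is worth doing once: 99.999 per cent availability permits only a few minutes of downtime per year. A quick sketch:

```python
# Allowed downtime per year at common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(availability_pct):
    """Minutes of permissible downtime per year at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("five nines", 99.999)]:
    print(f"{pct}% ({label}): {downtime_minutes(pct):.2f} minutes/year")
```

At five nines, that works out to roughly 5.26 minutes of downtime per year, which is why the strategies below (redundancy, failover, clustering) are usually required to reach it.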

 

Redundancy refers to systems that either are duplicated or fail over to other systems in the event of a malfunction. Failover refers to the process of reconstructing a system or switching over to other systems when a failure is detected. In the case of a server, the server switches to a redundant server when a fault is detected. This strategy allows the service to continue uninterrupted until the primary server can be restored. In the case of a network, this means processing switches to another network path in the event of a network failure in the primary path.

Failover systems can be expensive to implement. In a large corporate network or e-commerce environment, a failover might entail switching all processing to a remote location until your primary facility is operational. The primary site and the remote site would synchronize data to ensure that information is as up-to-date as possible.
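The detect-and-switch behaviour described above can be sketched as a simple client-side failover loop. The server names and the `fake_fetch` stand-in are illustrative assumptions, not from the original text:

```python
def fetch_with_failover(servers, fetch):
    """Try each server in priority order; fail over to the next on error."""
    last_err = None
    for server in servers:
        try:
            return fetch(server)
        except ConnectionError as err:
            last_err = err  # this server is down; try the next redundant one
    raise RuntimeError("all servers unavailable") from last_err

def fake_fetch(server):
    """Stand-in for a real network call: the primary is down."""
    if server == "primary":
        raise ConnectionError("primary down")
    return f"response from {server}"

result = fetch_with_failover(["primary", "secondary"], fake_fetch)
print(result)  # the redundant server answers; service continues uninterrupted
```

Production failover usually happens server-side (heartbeats, virtual IPs, DNS changes) rather than in the client, but the logic is the same: detect the fault, switch to the redundant system, keep the service running.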

Many operating systems, such as Linux, Windows Server and Novell Open Enterprise Server, support clustering to provide failover capabilities. Clustering involves connecting multiple systems cooperatively (which provides load balancing) and networking them in such a way that if any system fails, the others take up the slack and continue to operate. The overall capacity of the cluster may decrease, but the network or service remains operational. To appreciate the power of clustering, consider that this is the technology on which Google is built. Clustering gives you not only redundancy but also the ability to scale as demand increases.
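As a rough illustration of how a cluster absorbs a node failure (node names and load figures are made up for the example):

```python
def rebalance(nodes, load_units):
    """Spread the total load over healthy nodes; a failed node's share
    is absorbed by the survivors, so the service stays up."""
    healthy = [name for name, is_up in nodes.items() if is_up]
    if not healthy:
        raise RuntimeError("cluster down: no healthy nodes")
    per_node = load_units / len(healthy)
    return {name: per_node for name in healthy}

cluster = {"node1": True, "node2": True, "node3": True}
print(rebalance(cluster, 300))  # 100 units each

cluster["node2"] = False        # one node fails
print(rebalance(cluster, 300))  # survivors carry 150 units each
```

This is the trade-off the text describes: after a failure each remaining node works harder and overall capacity drops, but the service itself never goes down.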

Most ISPs and network providers have extensive internal failover capability to provide high availability to clients. Business clients and employees who are unable to access information or services tend to lose confidence.

The trade-off for reliability and trustworthiness, of course, is cost: failover systems can become prohibitively expensive. You’ll need to study your needs carefully to determine whether your system requires this capability. For example, if your environment requires a high level of availability, your servers should be clustered, so that the other servers in the network can take up the load if one server in the cluster fails.

 

Fault tolerance is the ability of a system to sustain operations in the event of a component failure. Fault-tolerant systems can continue operation even though a critical component, such as a disk drive, has failed. This capability involves over-engineering systems by adding redundant components and subsystems to reduce the risk of downtime. For instance, fault tolerance can be built into a server by adding a second power supply, a second CPU and other key components. Most manufacturers (such as HP, Sun and IBM) offer fault-tolerant servers; they typically have multiple processors that automatically fail over if a malfunction occurs.

There are two key components of fault tolerance that you should never overlook: spare parts and electrical power. Spare parts should always be readily available to repair any system-critical component if it should fail. The redundancy strategy “N+1” means that you have the number of components you need, plus one to plug into any system should it be needed.

Since computer systems cannot operate in the absence of electrical power, it is imperative that fault tolerance be built into your electrical infrastructure as well. At a bare minimum, an uninterruptible power supply (UPS) with surge protection should accompany every server and workstation. That UPS should be rated for the load it is expected to carry in the event of a power failure (factoring in the computer, monitor and any other devices connected to it) and be checked periodically as part of your preventive maintenance routine to make sure that the battery is operational. You will need to replace the battery every few years to keep the UPS operational.
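Rating a UPS for its expected load comes down to summing the attached equipment's draw and converting watts to the VA figure UPSs are sold by. The power factor and the 25 per cent headroom below are illustrative assumptions, as are the wattage figures; consult the manufacturer's sizing guidance for real equipment:

```python
def required_ups_va(load_watts, power_factor=0.9, headroom=1.25):
    """Estimate the minimum UPS VA rating for a given load in watts.

    power_factor (typical modern PSUs are ~0.9) and the 25% headroom
    are assumed values for the sketch, not universal constants.
    """
    return load_watts / power_factor * headroom

# Hypothetical workstation: computer + monitor + network gear, in watts.
total_load = 300 + 40 + 60
print(f"size the UPS for at least {required_ups_va(total_load):.0f} VA")
```

The same sum tells you how long a given battery will carry the load, which is the number to verify during the periodic preventive-maintenance checks mentioned above.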

A UPS will allow you to continue to function in the absence of power for only a short duration. For fault tolerance in situations of longer duration, you will need a backup generator. Backup generators run on gasoline, propane, natural gas or diesel and generate the electricity needed to provide steady power. Although some backup generators can come on instantly in the event of a power outage, most take a short time to warm up before they can provide consistent power. Therefore, you will find that you still need to implement UPSs in your organization.

 

RAID (redundant array of independent disks) is a technology that uses multiple disks to provide fault tolerance. There are several RAID levels: RAID 0 (striped disks), RAID 1 (mirrored disks), RAID 3 or 4 (striped disks with dedicated parity), RAID 5 (striped disks with distributed parity), RAID 6 (striped disks with dual parity), and the nested levels RAID 1+0 (or 10) and RAID 0+1.
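The practical difference between the levels is how much raw capacity is given up for redundancy. A sketch for equal-size disks (the four-disk, 2 TB figures are just an example):

```python
def usable_capacity(level, n_disks, disk_tb):
    """Usable capacity in TB for common RAID levels with equal-size disks."""
    if level == 0:
        return n_disks * disk_tb          # striping only: no redundancy
    if level == 1:
        return disk_tb                    # mirroring: one disk's worth, rest are copies
    if level == 5:
        return (n_disks - 1) * disk_tb    # one disk's worth of distributed parity
    if level == 6:
        return (n_disks - 2) * disk_tb    # two disks' worth of parity
    if level == 10:
        return n_disks * disk_tb / 2      # mirrored stripes: half the raw capacity
    raise ValueError(f"unsupported RAID level: {level}")

for level in (0, 1, 5, 6, 10):
    print(f"RAID {level}: {usable_capacity(level, 4, 2.0)} TB usable of 8.0 TB raw")
```

RAID 0 keeps all the capacity but tolerates no failures; RAID 6 and RAID 10 give up more space in exchange for surviving multiple disk faults.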

 

A disaster recovery plan helps an organization respond effectively when a disaster occurs. Disasters include system failures, network failures, infrastructure failures, and natural disasters like hurricanes and earthquakes. A DR plan defines methods for restoring services as quickly as possible and protecting the organization from unacceptable losses in the event of a disaster.

In a smaller organization, a disaster recovery plan can be relatively simple and straightforward. In a larger organization, it could involve multiple facilities, corporate strategic plans and entire departments.

A disaster recovery plan should address access to and storage of information. Your backup plan for sensitive data is an integral part of this process.
