Abstract - This post is based on the Guide for Cybersecurity Event Recovery [1], published in December 2016 by the National Institute of Standards and Technology, which explains how to develop a recovery plan before a cyber event occurs. Planning enables organizations to explore system dependencies and crisis scenarios and to prepare a customized playbook. Statements in this post can be discussed in the comments at the bottom of this page.

               Recovery from a cybersecurity event needs to be prepared before the event occurs, through planning and documentation. To do so, organizations should follow the Cybersecurity Framework [2] (CSF), a document built around identifying, protecting, detecting and responding, in order to understand and improve their security posture.

Preparing for cyber event recovery starts with identifying and documenting the key personnel who will be responsible for defining the recovery criteria and associated plans, and with ensuring that all of them understand their roles and responsibilities.

The second step in this preparation is to list all of the organization’s assets and prioritize them according to their impact and their relative importance to meeting the organization’s mission. This helps determine the sequence and timeline of restoration activities during and after a cyber event. To optimize resilience, the dependencies between assets should be categorized by organizational value. Critical services, such as an LDAP directory that the mail service relies on, must be the first priority in order to protect enterprise email exchanges and external communication.
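
The prioritization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the asset names, the numeric priorities (1 is most critical) and the function `restoration_order` are hypothetical and do not come from the NIST guide. The sketch orders assets so that every dependency is restored before the services that rely on it, and picks the most critical asset whenever several are ready.

```python
import heapq

# Hypothetical inventory: priority 1 is the most critical for the mission.
ASSETS = {
    "ldap": {"priority": 1, "depends_on": []},
    "mail": {"priority": 1, "depends_on": ["ldap"]},
    "database": {"priority": 2, "depends_on": []},
    "crm": {"priority": 2, "depends_on": ["ldap", "database"]},
    "intranet_wiki": {"priority": 3, "depends_on": ["ldap"]},
}


def restoration_order(assets):
    """Return asset names in an order that respects dependencies and priority."""
    remaining = {name: set(info["depends_on"]) for name, info in assets.items()}
    dependents = {name: [] for name in assets}
    for name, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(name)

    # Assets without unmet dependencies can be restored right away.
    ready = [(assets[n]["priority"], n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)  # most critical ready asset first
        order.append(name)
        for child in dependents[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                heapq.heappush(ready, (assets[child]["priority"], child))

    if len(order) != len(assets):
        raise ValueError("circular dependency between assets")
    return order


print(restoration_order(ASSETS))
```

With this sample inventory the LDAP directory comes first, followed by mail, the database, the CRM and finally the wiki. A real inventory would carry far more attributes, but the same ordering logic applies.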

Organizations also need to meet other prerequisites, such as identifying the system boundaries and identities in their environment and maintaining access control over them. Furthermore, it is essential to preserve data integrity, which builds confidence in the data. Validating, backing up, replicating and monitoring data on a regular basis is key to protecting mission-critical data. Last but not least, the recovery plan itself should be protected because it includes sensitive information; for the same reason, it is important to isolate key business resources and restrict access to them for assets that do not need them.
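
One possible way to validate backups regularly is to record a cryptographic hash of every backup file and compare it later. The sketch below is a minimal illustration, not a method prescribed by the guide: the function names and the idea of keeping a manifest dictionary are assumptions, and it only uses Python's standard `hashlib` and `pathlib` modules.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large backups do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(backup_dir):
    """Map each file (relative path) in the backup directory to its SHA-256."""
    root = Path(backup_dir)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_backup(backup_dir, manifest):
    """Return the files that were changed, removed or added since the manifest."""
    current = build_manifest(backup_dir)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

The manifest would be built right after each backup and stored separately from the backup itself; an empty list from `verify_backup` means that nothing has changed since then.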

               A recovery plan usually takes the form of guidance and playbooks, which include the procedures and processes required to restore systems and all impacted assets in a minimal amount of time. The plan must cover and document topics such as service level agreements, the staff and recovery team to contact, restoration procedures, backup storage details, internal and external communication plans, as well as the hardware and software specifications of the systems used by the organization.

Based on the preparation described above, the recovery team should identify requirements, which lead to possible strategic business and technical options. For example, automation can be used to simplify and accelerate the recovery process.

Recovery planning has to include the resource costs of each technical recovery option. This topic should be discussed with the stakeholders. It is important that they are involved in this cybersecurity modeling, as it reminds business system owners of the realistic threats and their potential impact. Another part of the recovery plan should set up the organization’s privacy team, which will be responsible for identifying potential risks to individuals after a cyber event.

The main section of the recovery plan should detail the specific recovery procedures: restoring from backups, rebuilding systems, replacing compromised files… The procedures to follow strictly depend on the services impacted by the attack. They can also include non-technical actions, such as adapting personnel behavior until the recovery process is complete.

               It is critically important to initiate the recovery process only after the incident response team has completed its investigation of the cyber event. Understanding the adversary’s footprint and objectives is key to preventing a failed recovery. Moreover, the organization’s visibility into the impacted resources, and the fact that it has detected the infiltration, can be kept hidden from the adversary. Finding the right balance between forensic investigation and business service restoration is a decision unique to each organization, and it helps restore services and systems to an operational status. The organization should define and document the conditions under which the recovery plan has to be initiated.

The investigation should achieve two key objectives before recovery is executed. First, the organization must understand the intruders’ objectives: access, disruption or obstruction of data availability. Second, the investigation team has to determine how the attackers gained access to the environment and whether they are still present in, or in control of, the IT resources, based on the reports of the containment mechanisms.
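
These preconditions can be written down as an explicit checklist in the playbook. The sketch below is purely illustrative: the class `InvestigationStatus`, its field names and the wording of the messages are assumptions made for this post, not terms defined by the guide.

```python
from dataclasses import dataclass


@dataclass
class InvestigationStatus:
    """Hypothetical summary of what the incident response team has established."""
    objectives_understood: bool   # access, disruption or availability obstruction
    entry_point_identified: bool  # how the attackers got into the environment
    adversary_contained: bool     # attackers no longer present or in control


def blocking_conditions(status):
    """List the investigation objectives that are still open."""
    checks = [
        (status.objectives_understood, "intruders' objectives not yet understood"),
        (status.entry_point_identified, "entry point not yet identified"),
        (status.adversary_contained, "attackers may still be present or in control"),
    ]
    return [message for ok, message in checks if not ok]


status = InvestigationStatus(True, True, False)
for reason in blocking_conditions(status):
    print("Recovery on hold:", reason)
```

An empty list means that the documented conditions for starting recovery are met; a non-empty list tells the recovery team exactly what the investigation still has to establish.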

Although recovery may be ineffective or inefficient, and may incur additional costs for the organization, if the investigation is not complete, there are some instances where recovery can be initiated in parallel. In fact, elimination and containment failures might help in detecting system weaknesses and in isolating compromised assets from recovered or rebuilt ones. Deploying protection, detection and response processes to the other systems interconnected with the targeted resource will minimize propagation across the infrastructure.

A 100% recovery may not be necessary immediately in most cases. Different levels of recovery can be applied, with certain services running at diminished capacity. For example, a denial-of-service attack impacts the accessibility of your services, and full recovery will not be possible until the attack stops. Meanwhile, the recovery team may move some services outside the scope of the attack.

As it is not possible to fully achieve recovery manually in a short time, automated systems might help, for example by determining which offline virtual machine images have been compromised or by running other actions that can be pre-programmed ahead of a cyber event.

                Effective recovery communications are also a critical success factor in achieving organizational resilience. Following the CSF, they need to be planned and implemented so that they cover non-technical aspects of resilience, such as the management of public relations issues and the organization’s reputation.

A cyber event has legal implications and brings regulatory compliance obligations that the organization needs to handle by knowing what to say to whom. This requires planning specific requirements and timing communication while investigations are still ongoing. Communication with stakeholders is also needed to ensure that they understand their responsibilities during the recovery process and that they are confident in the recovery team’s capabilities. However, giving too much information, or inaccurate information, may further harm the organization’s reputation. Last but not least, internal communication is also essential and needs to be planned accordingly. Each team has to know to whom it reports and has to be informed of the progress of other teams toward their objectives.

The organization should note in its playbook that some methods of communication may not be available during the recovery. For example, if the network has been compromised, mail exchanges or VoIP communications may not be secure, or may even be corrupted by the intruder. Planning and improving processes for different scenarios should be part of continuous improvement.

Finally, the organization should consider sharing actionable information with other organizations when it succeeds in recovering from a major new threat. Such sharing in both directions is mutually beneficial: it helps detect cyber attacks more quickly and, in some cases, prevent them. However, organizations should not share recovery information until the recovery is fully complete.

                Planning a cyber event recovery is not a one-time activity. The plan should be continuously improved using the knowledge acquired during previous attacks and periodic verifications of the organization’s capabilities and evolution. Constant improvement of the recovery plan and of the organization’s security posture is required to ensure that its pre-defined long-term goals are achieved.

Validation of recovery capabilities is necessary to ensure continuous improvement. It implies checking every technology, process and person involved in the recovery efforts. The main method is to run a survey with all individuals who are part of the recovery plan in order to get input on the recovery plans, policies and procedures. Moreover, the survey should be customized for each participant depending on their responsibilities. For example, personnel should be asked how realistic the expected delay for a specific recovery task is. Validation can also take the form of exercises or tests that measure the organization’s recovery capabilities. Even if tests, such as shutting down a critical system to ensure that controlled failover occurs gracefully, may have an impact on activities, it is better to find issues during testing than during a cyber event. Exercises that deliberately introduce failures into systems may be good training to ensure that participants are always aware of and ready for a potential attack. Asking a member of the team to play the role of an adversary, or engaging an ethical hacker, is a way to proactively strengthen the organization’s defenses by discovering new weaknesses and then optimizing and patching the recovery plan.
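
The question of how realistic an expected delay is can be checked against exercise results. The following sketch assumes that the playbook records an expected duration per recovery task and that an exercise produces measured durations; the task names, the 20% tolerance and the function `unrealistic_estimates` are hypothetical choices for illustration.

```python
from datetime import timedelta


def unrealistic_estimates(expected, measured, tolerance=1.2):
    """Return tasks whose measured time exceeds the expected time by more
    than the tolerance factor (1.2 means 20% over the estimate)."""
    return {
        task: measured[task]
        for task in expected
        if task in measured and measured[task] > expected[task] * tolerance
    }


# Hypothetical figures from a playbook and from one failover exercise.
expected = {
    "restore_mail": timedelta(hours=2),
    "rebuild_database": timedelta(hours=4),
}
measured = {
    "restore_mail": timedelta(hours=3),
    "rebuild_database": timedelta(hours=4, minutes=30),
}

for task, duration in unrealistic_estimates(expected, measured).items():
    print(f"{task}: took {duration}, expected {expected[task]}")
```

In this sample only the mail restoration is flagged, since three hours is well beyond the two-hour estimate, while the database rebuild stays within the tolerance. Flagged tasks are candidates for either a revised estimate or an improved procedure.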

Recovery and security capabilities can also be improved by applying the lessons learned during a real cyber event recovery. Implementing continuous feedback helps to enhance not only the recovery plan but also the organization’s security operations and policies. Following these modifications, the organization can adapt its approaches and handle more attack scenarios. Two types of actions can be applied. Short-term improvements are changes to the security posture, such as patching newly discovered security issues. Long-term improvements, such as the acquisition of new security technologies or the redesign of operational processes, are projects that need more input and risk assessment to be achieved.

It is essential that all individuals taking part in the recovery actions are aware that they need to find the right balance between quickly restoring the organization’s systems to normal operations and properly documenting the issues they encounter during that process. This documentation will help resolve future cyber events or may even prevent one from occurring. The more time passes before issues are documented, the less likely the documentation is to be accurate and complete.

                Recovery metrics are used to improve the quality of recovery actions within the organization, for example by improving specific aspects or by performing a cost/benefit analysis of a particular approach. It is worth planning in advance what should be measured, based on whether it is relevant to the continuous improvement program.

Organizations need to decide whether the metrics gathered are valuable feedback, depending on the repeatability and the commonality of the cyber event measured. In fact, these metrics can be either a benefit or a hindrance. Different areas can be measured, but it is important to remember that resilience is a highly subjective area of cybersecurity and that comparing recovery metrics across organizations, or even within a single entity, may produce misleading results.

Three major recovery areas can be measured. Incident damage and cost covers the cost of leaked sensitive data, the hardware used to execute the recovery plan, as well as damage to the organization’s reputation. Organizational risk assessment includes the frequency of recovery tests, the number of issues found during these activities and the number of IT incidents that were not identified. Finally, the quality of recovery activities consists of the satisfaction of business stakeholders, the uptime of IT services, the recovery objectives that have been achieved on time and the number of business disruptions due to IT service incidents.
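
Two of the quality metrics mentioned above are simple to compute once the raw figures are collected. The sketch below is an illustration only: the function names and the sample figures are assumptions, and an organization would define its own measurement windows and recovery objectives.

```python
def on_time_ratio(objectives):
    """Share of recovery objectives achieved on time.

    `objectives` is a list of (target_hours, actual_hours) pairs.
    """
    if not objectives:
        return 0.0
    met = sum(1 for target, actual in objectives if actual <= target)
    return met / len(objectives)


def uptime_percent(total_hours, downtime_hours):
    """Uptime of an IT service over a measurement window, in percent."""
    return 100 * (total_hours - downtime_hours) / total_hours


# Hypothetical figures for one 30-day window (720 hours).
objectives = [(2, 1.5), (4, 4), (8, 10), (1, 0.5)]
print(f"Objectives met on time: {on_time_ratio(objectives):.0%}")
print(f"Service uptime: {uptime_percent(720, 18):.1f}%")
```

With these sample figures, three of the four objectives are met on time and the service is up 97.5% of the window. As noted above, such numbers are mostly useful for tracking the same organization over time rather than for comparison with others.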

               To sum up, cybersecurity event recovery is a major topic for an organization’s resilience. It is necessary to meet the preconditions required for an effective recovery and to start recovery planning by listing and documenting the organization’s resources. Recovery planning should cover every detail of the scenarios, processes and procedures that have to be applied during execution, including non-IT actions such as communication and the sharing of insights once all recovery activities are completed. Metrics are key to improving the recovery plan; all measurements and reviews help to increase the resilience of the organization.

Sources:

  • [1] Guide for Cybersecurity Event Recovery, National Institute of Standards and Technology, December 2016
  • [2] Framework for Improving Critical Infrastructure Cybersecurity, National Institute of Standards and Technology, February 12, 2014, Version 1.0

Made at Tallinn University of Technology in 2017. Photo by Patryk Grądys on Unsplash