SAP Knowledge Base Article - Public

3030278 - CPQ Disaster Recovery Plan

Symptom

You want to know the procedures and relevant terms of the CPQ Disaster Recovery Plan. More detailed procedure documents are referenced within this plan, but are not included herein due to information security requirements. The objective of this plan is to fully reestablish network and services to our customers as quickly as possible within the stated recovery time and recovery point objectives.

Environment

 SAP CPQ

Cause

1. Definition

Disaster recovery is the process of regaining access to the data, software, hardware, and network necessary to resume critical business operations for Callidus CPQ customers after a natural or human-caused disaster. The glossary below defines the acronyms and terms used in this plan.

Term Definition / Examples
DRP Disaster Recovery Plan
RTO Recovery Time Objective
RPO Recovery Point Objective
Disaster

A "disaster" is an event that causes a partial (or complete) disruption of our data center operations, making parts (or the whole) of it unusable for a defined period of time.

System and/or equipment failures

Electrical power failure, Loss of water supply, Communications services breakdown, Power supply or
distribution disruption, and Support utility equipment failures.

Organized or deliberate disruptions

Terrorism, Sabotage, War, Theft, Arson, Civil unrest and Transportation disruption.

Computer based attacks

Data loss, Cybercrime, Network disruption, Hacking, Virus outbreak, and Systems failure.

Natural Disasters

Cyclone, Flood, Tsunami, Drought, Earthquake, Fire, Subsidence and Landslides, Contamination and
Environmental Hazards, Epidemic, Lightning strikes and Pandemic.

Civil unrest

Civil disorder, also known as civil unrest, is a broad term that is typically used by law enforcement to describe unrest that is caused by a group of people.

2. Disaster Recovery Including Events

There are many events that can impact Hosted Services Operations including:

  • System and/or equipment failures;
  • Human error;
  • Organized and deliberate disruptions or Terrorist attacks;
  • Theft;
  • Fire;
  • Power Failure;
  • Computer based attacks;
  • Natural Disasters;
  • Civil Unrest;
  • Transportation disruption.

3. Data Protection

The best recovery plan is to prevent the loss in the first place, and a key element of that is preventing data loss. The following preventive measures are in place:

  • Backups are regularly scheduled, stored on-site in a fire-resistant data safe, and sent off-site at regular intervals.
  • Storage devices utilize RAID technology to prevent single drive failure from causing file system damage
  • Conditioned power is used to reduce power fluctuation induced hardware failures
  • Physical chassis servers are fully redundant and capable of maintaining operations when failures occur
  • Uninterruptible power supplies are used for all equipment, and backed up by onsite power generation
  • Fire prevention and suppression systems are used
  • Multiple tiers of physical security are in place
  • Network and system security is in place
  • Data center is physically housed in a hardened site

4. Recovery Time Objective (RTO)

The nature of the disaster, the extent of the damage, and the Customers’ business continuity objectives for their CPQ Services all influence the recovery time objectives. Although any outage is to be avoided, fallback business continuity procedures usually involve implementing a draw if an extended outage is anticipated. We have segregated the scale of the disaster into three Categories for the purposes of recovery time objectives (RTO) as follows:

  • Category 1 - 4 hour RTO

This would be characterized by single or multiple hardware failures within the primary data center, or the intentional destruction of data within the data center. This would include disk, server, network firewall, or other equipment failure(s), or human induced nonphysical destruction of data or environment. This damage would require the repair of the affected components, and the recovery of data.

  • Category 2 - 8 hour RTO

This would be characterized by the destruction of one or more hardware components within the primary data center. This describes serious and irreparable damage to the hardware or cage infrastructure. This damage would likely require relocation within the data center facility, extensive equipment and software replacement, followed by data recovery.

  • Category 3 - 12 hour RTO

The primary data center is destroyed or seriously compromised. Recovery requires replacing the hosted services infrastructure in a different data center, replacement of all equipment and software, and recovery of data.
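
The three categories and their recovery time objectives can be expressed as a simple lookup. The following is a minimal Python sketch for illustration only and is not part of any CPQ tooling. The names `RTO_HOURS` and `recovery_deadline` are hypothetical, and the assumption that the clock starts when the disaster category is assigned is ours; the plan does not state when the RTO clock starts.

```python
from datetime import datetime, timedelta

# RTO in hours per disaster category, as stated in this plan.
RTO_HOURS = {1: 4, 2: 8, 3: 12}


def recovery_deadline(category: int, declared_at: datetime) -> datetime:
    """Return the latest time by which service should be restored.

    `declared_at` is assumed to be the moment the disaster category
    was assigned (an assumption, not defined by the plan).
    """
    if category not in RTO_HOURS:
        raise ValueError(f"unknown disaster category: {category}")
    return declared_at + timedelta(hours=RTO_HOURS[category])


if __name__ == "__main__":
    declared = datetime(2021, 3, 1, 9, 0)
    for cat in sorted(RTO_HOURS):
        print(cat, recovery_deadline(cat, declared).isoformat())
```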

Recovery Point Objective (RPO)

The recovery point objective is the acceptable data rollback established for the recovery plan. In general, if the primary data has been damaged, data recovery is done from either online or tape data backups.
Online backups are likely to have a complete image of your data that is within 24 hours of current. Automated backup jobs are performed for customer production data. CallidusCloud maintains a local backup of the production database for five (5) days, offsite storage for two (2) weeks and monthly backups for a year. Backup of CallidusCloud’s production data helps ensure critical systems and data are available for restoration in the event of a system failure or disruption. The automated backup system is configured to notify operations personnel regarding backup status. The primary data center facility is located in Chicago, Illinois, with a disaster recovery data center located in Ashburn, Virginia, which contains an instance of the production data housed in Chicago. Replicated data is transmitted at scheduled intervals over a dedicated VPN connection.
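
The retention periods described above (five days locally, two weeks offsite, monthly backups for a year) can be checked with a few lines of date arithmetic. This is an illustrative Python sketch under stated assumptions: the tier names and the function `is_retained` are hypothetical, and "a year" is approximated as 365 days.

```python
from datetime import date, timedelta

# Retention periods as stated in this plan. Treating "a year" as
# 365 days is an assumption made for this sketch.
RETENTION = {
    "local": timedelta(days=5),
    "offsite": timedelta(weeks=2),
    "monthly": timedelta(days=365),
}


def is_retained(tier: str, taken_on: date, today: date) -> bool:
    """Return True if a backup of the given tier is still within retention."""
    age = today - taken_on
    return timedelta(0) <= age <= RETENTION[tier]
```

For example, `is_retained("local", date(2021, 3, 1), date(2021, 3, 7))` is false, because the backup is six days old and the local retention period is five days.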

SAP CPQ has contracted Rackspace data centers for automatic rerouting of data and services to protect against disruptions caused by an unexpected event. Rackspace is used for the primary site, located in Chicago, and for a secondary site located in Virginia. A disaster recovery plan is in place and is tested on an annual basis to help ensure that CallidusCloud meets its business obligations.

  • Category 1 disaster – 4 hours RPO at primary data center
  • Category 2 disaster – 4 hours RPO at primary data center
  • Category 3 disaster – 4 hours RPO at primary data center
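
Because the RPO is the same for every category, checking whether a given recovery would meet it reduces to comparing the age of the most recent usable copy of the data against four hours. A minimal Python sketch follows; the function names are hypothetical, and how the timestamp of the last good copy is obtained is outside the scope of this plan.

```python
from datetime import datetime, timedelta

# The plan states a 4 hour RPO for every disaster category.
RPO = timedelta(hours=4)


def data_loss_window(last_good_copy: datetime, failure_at: datetime) -> timedelta:
    """Return the time span of changes that would be lost on restore."""
    if last_good_copy > failure_at:
        raise ValueError("last good copy is later than the failure time")
    return failure_at - last_good_copy


def meets_rpo(last_good_copy: datetime, failure_at: datetime) -> bool:
    """Return True if the data loss window is within the 4 hour RPO."""
    return data_loss_window(last_good_copy, failure_at) <= RPO
```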

5. Communication on Disaster or Incident

  • CPQ will follow the standard Incident Management procedure which includes required Customer and management notifications.
  • CPQ will follow the standard support process until it becomes apparent that a disaster has occurred. As soon as a disaster has been identified, the support escalation contacts should be contacted immediately.

6. Disaster Assessment

The scope and nature of the destruction requires rapid assessment to invoke the appropriate response plans. In general the hardware, software, and data recovery response plans should be implemented in parallel. If the destruction results from a physical or electronic assault, the information security response plan should also be invoked. Once the initial disaster assessment is complete, a change control board meeting should be called, all information reviewed, a disaster category assigned to the event, the response plans invoked, and an appropriate Customer communication plan put in place focusing on rapid communication of status, recovery steps underway, and frequency of status updates until the service is fully recovered.

6.1 Order of Recovery and Response Plan Overview

6.1.1 Order of Recovery

The following order of recovery has been established:

  • Virtual Network and Security infrastructure
  • SAN Infrastructure
  • Virtual Server Infrastructure
  • Software provisioning infrastructure
  • Customer production instances
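
The order above implies that each layer is recovered only after the layer beneath it is available. The sketch below illustrates that sequencing in Python. It is not the actual recovery procedure: `run_recovery` and the per-stage callables are hypothetical placeholders for the detailed procedures that this plan references but does not include.

```python
from typing import Callable, Dict, List

# Recovery stages in the order listed in this plan.
RECOVERY_ORDER: List[str] = [
    "Virtual Network and Security infrastructure",
    "SAN Infrastructure",
    "Virtual Server Infrastructure",
    "Software provisioning infrastructure",
    "Customer production instances",
]


def run_recovery(steps: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each stage in order and stop at the first one that fails.

    `steps` maps a stage name to a hypothetical callable that returns
    True on success. Returns the list of stages that completed.
    """
    completed: List[str] = []
    for stage in RECOVERY_ORDER:
        if not steps[stage]():
            break
        completed.append(stage)
    return completed
```

Stopping at the first failed stage reflects the dependency between layers: customer instances cannot be restored onto a virtual server infrastructure that is not yet available.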

6.1.2 Response Plan Overviews

Information Security

  • Isolate “area of compromise” based on positive confirmation and nature of alert
  • Validate current administrative access and perform lockdown where needed
  • Preserve audit trail
  • Investigate cause of incident with emphasis on possible information security policy violations
    • Determine point of compromise
    • Determine window of compromise
    • Determine impact of compromise
      • User account / “Chain of Trust” integrity checks
      • Implementation integrity checks (physical, network, system, application, data)
    • Incident notification based on nature of incident
      • Callidus internal
      • Customer
      • Law enforcement (if required)
    • Restore security services as needed
      • User and administrative account cleanup
      • Recover to last known good state
      • Reinstallation / patches / hot fixes / reconfiguration as needed

Hardware

  • Validate data center viability or institute alternate site build

If alternate site:

  • Expedite production footprint device order
  • Expedite cage build-out
  • Identify damaged or failed devices
  • Callout to device service if no physical damage
  • Expedite replacement device order if physical damage
  • Bring device online
  • Establish monitoring agents
  • Move device into production
  • Build or validate zone infrastructure

Software

  • Establish or validate software provisioning infrastructure
  • Install or validate the production instances’ software stack
    • Any application updates will be on hold until the root cause is established

Data

  • Establish customer production date recovery options based upon backup inventory
  • Have the customer select the desired date recovery option from those available
  • Recover production and turn over to customer

7. Annual Test/Rehearsal/Review Requirements

Recovery plans are tested and validated by the operations team to verify their effectiveness. The following processes are followed:

7.1 Exercise and Testing Process

DRP tests or rehearsals will be done no less frequently than annually. Category 1 full restoration of a customer environment will be fully tested no less frequently than annually. Conference room tests of Category 2 and Category 3 recoveries will be conducted no less frequently than annually. This document will be reviewed annually and the review will be documented by updating the revision history.

7.2 Purpose of this Process

The purpose of this process is to summarize the actions taken and their results during tabletop/functional test execution and review, and the corrective actions to be defined for the CPQ Disaster Recovery (DR) drill.

This document is to be considered a “baseline” throughout the phases of the recovery process, independent of the type of exercise being performed.

7.3 Pre-Test

7.3.1 Test Planning Background

This test is in support of the CPQ infrastructure.

This testing is done primarily in a “tabletop” fashion, with a sampling of the customer migrations performed as part of the migration to the disaster recovery infrastructure being implemented as “functional” tests. The migration process of customer instances from the production platform to another infrastructure is the same process used during a major outage event.

If any actual changes need to be made to work products in Production during this testing activity, they will be vetted through the proper/required Change Control processes and all necessary approvals will be obtained and documented for audit purposes.

7.3.2 Test Design

For the previous year’s CPQ DR test, CallidusCloud performed a full restore of a production environment in an alternate data center, redirected network traffic to the restored environment, and decommissioned the original production environment after testing and validation. The original production environment was hosted in the Callidus data center. The Operations team used IT DR provisioning procedures to build an identical production environment within its alternate-region data center, restored the production data, and then tested the restored environment to ensure that it was fully functional and that no data had been lost.

7.3.3 Pre-Test Planning Meeting(s)

Specific coordination and simulation meetings will be held, in which the required steps related to the migration of data are discussed and rehearsed. In these discussions and through careful observation of ongoing infrastructure activities, it will be confirmed that we can effect successful migration of the infrastructure and/or application with a minimal level of residual risk related to the customer instances from existing facilities to alternative facilities.

All input and observations made during this process will be considered for updates/enhancements to be made within any associated documentation and/or defined processes.

Selected members of various internal organizations will be included in the planning and testing meetings for this tabletop/functional DRP test scenario.

This group, and the organizations reporting to them, is responsible for various aspects of the recovery of services during an actual emergency situation where normal operational capabilities have been impacted unexpectedly for unknown reasons. Senior management agrees that these internal organizations are adequate and appropriate to perform the necessary steps to regain operational state to satisfy both customer requirements as well as internal expectations.

7.3.4 Test Post Mortem Findings

This process is performed after the testing exercise. The management team will review the outcome of the test, including how the team performed according to schedule and whether there were any errors. Suggestions are collected, including documentation updates and a recommendation that future drills be performed by less experienced engineers to identify any additional documentation detail that might be needed.

7.4 DR Test Process

7.4.1 DR Test Plan

This is an overview of the tasks performed to restore the services quickly during a disaster situation.

DR Task Test/Validation
System Administrators
  • Start application VM
  • Start db VM
  • Verify servers are accessible and online
  • Verify logins and permissions are correct
  • Verify SQL server is started
  • Verify all application configuration files are set appropriately
  • Hand over database server to DBA.
Database Administrator
  • Verify database node is active.
  • Check database connectivity from app server to database.
  • Troubleshoot if needed.
Network Operations
  • Preconfigured beforehand. CallidusCloud Operations will verify.
  • Modify the IP in DNS to point to the new DR app servers.
Security Infrastructure
  • Verify IDS is running
  • Verify SSL certificates are correct
Application Support
  • Verify application is functional
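
The validation tasks above lend themselves to a scripted checklist that records a pass or fail result per team and task. The following Python sketch shows one possible shape for such a script. It is not the tooling used by the operations team: `run_checklist`, `tcp_port_open`, and the `Check` tuple are hypothetical names, and each probe is a placeholder for the real verification step.

```python
import socket
from typing import Callable, List, Tuple

# (team, task description, probe returning True on success)
Check = Tuple[str, str, Callable[[], bool]]


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checklist(checks: List[Check]) -> List[Tuple[str, str, bool]]:
    """Run every probe and record pass/fail.

    A probe that raises an exception is recorded as failed, so one
    broken check does not stop the remaining checks from running.
    """
    results = []
    for team, description, probe in checks:
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        results.append((team, description, ok))
    return results
```

As a usage example, the database connectivity task could be approximated by a probe such as `lambda: tcp_port_open("dr-db.example.internal", 1433)`, where the host name is made up for illustration and 1433 is the default SQL Server port. A successful TCP connection only shows that the port is reachable; it does not replace the DBA's verification that the database node is active.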

7.5 Retrospective Review Requirements

All DRP events, tests or rehearsals shall be followed up with a retrospective analysis within 30 days of the completion of the test or rehearsal with particular focus on what can be improved.

Resolution

With the above information you will be able to understand the processes and procedures of the CPQ Disaster Recovery Plan.

Keywords

CPQ, Disaster Recovery Plan, Recovery, DRP, KBA, CEC-SAL-CPQ, Sales Cloud CPQ, How To

Product

SAP CPQ 2021