Surviving Ransomware
- Martin Bally
- Jan 6
- 5 min read

The Day the Infrastructure Turned: A CISO's Post-Mortem of the Cuba Siege
In the world of cyber resilience, there is a distinct difference between a "security event" and a "material crisis." As a CISO, you live with the quiet knowledge that it isn't a matter of if, but when.
My first major encounter with a material ransomware event involved the Cuba ransomware variant (linked to the Russian-aligned Tropical Scorpius group). The incident didn't just test our technical controls; it tested the very foundation of organizational governance and my personal professional liability. It was a masterclass in adversary persistence, the failure of traditional perimeter security, and the necessity of personal leadership under fire.
Here is the post-mortem of that siege.
1. The Technical Anatomy: Stumbling Upon the Zero-Day
We operated under what we believed was the "Gold Standard" of identity security: Username + Password + Machine Certificates. Yet, the adversary walked right through the front door, their access unimpeded by our multi-factor authentication (MFA).
The discovery of how they got in was a stroke of luck born from diligent forensic work. While attempting to reproduce the attacker's entry method in a controlled environment, our internal red team stumbled across a critical flaw. They found that by presenting a specifically malformed session request, the VPN gateway's logic completely bypassed the certificate check.
It was a Zero-Day vulnerability in our VPN concentrator stack. Our state-of-the-art MFA had become single-factor authentication, and because it was a logic bypass, not a brute-force attack, not a single alert ever fired.
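To make the class of flaw concrete, here is a deliberately simplified Python sketch of a "fail-open" certificate check. It illustrates the bug class only, assuming a gateway that treats an unparseable certificate as "not applicable" rather than rejecting the session; it is not the vendor's actual code, and every name in it is hypothetical.

```python
# Illustrative only: the *class* of logic bypass we found, not the vendor's code.
from dataclasses import dataclass


@dataclass
class SessionRequest:
    username: str
    password: str
    client_cert: bytes | None   # the attacker sends a malformed blob here


def parse_certificate(blob: bytes) -> dict:
    # Stand-in for real PEM/DER parsing; malformed input raises ValueError.
    if not blob.startswith(b"-----BEGIN CERTIFICATE-----"):
        raise ValueError("malformed certificate")
    return {"subject": "cn=example"}


def password_ok(username: str, password: str) -> bool:
    return True   # stand-in for the real credential check


def flawed_gateway_auth(req: SessionRequest) -> bool:
    # BUG: a malformed certificate is swallowed instead of failing closed,
    # so "MFA" silently degrades to password-only authentication.
    if req.client_cert is not None:
        try:
            parse_certificate(req.client_cert)
        except ValueError:
            pass   # the bypass: parsing error ignored, check skipped
    return password_ok(req.username, req.password)


def fixed_gateway_auth(req: SessionRequest) -> bool:
    # Fail closed: no valid, parseable certificate means no session.
    if req.client_cert is None:
        return False
    try:
        parse_certificate(req.client_cert)
    except ValueError:
        return False
    return password_ok(req.username, req.password)


attack = SessionRequest("victim", "stolen-password", b"\x00garbage")
print(flawed_gateway_auth(attack))   # True -- certificate factor bypassed
print(fixed_gateway_auth(attack))    # False
```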
Living Off the Land (LotL): Once inside, the attackers didn't immediately deploy exotic malware that trips EDR alarms. They weaponized legitimate administrative utilities like PSExec.exe. While PSExec wasn't a tool we actively used for software deployment, the underlying services (such as Admin shares) were enabled across the environment. The adversary exploited this availability to deploy their payloads rapidly. Coupled with a traditional flat VPN architecture, this gave them the "run of the network," allowing them to move laterally across nearly 100 manufacturing sites in a matter of hours.
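Hunting this kind of tradecraft after the fact is mostly log work. Below is a minimal detection sketch, assuming Windows hosts where the Service Control Manager records new service installs as Event ID 7045 in the System log (PsExec registers a service named PSEXESVC). The pattern list is illustrative, not the detection content we actually deployed.

```python
"""Minimal sketch: flag PsExec-style service installs on a Windows host."""
import re
import subprocess

# Service names commonly created by remote-execution utilities (illustrative list).
SUSPECT = re.compile(r"PSEXESVC|PAExec|RemCom", re.IGNORECASE)


def recent_service_installs(count: int = 200) -> str:
    # wevtutil ships with Windows; pull the newest service-install events (ID 7045).
    result = subprocess.run(
        ["wevtutil", "qe", "System",
         "/q:*[System[(EventID=7045)]]",
         "/f:text", f"/c:{count}", "/rd:true"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def flag_suspects(events_text: str) -> list[str]:
    # Each event renders as a text block; keep the ones naming a suspect service.
    blocks = events_text.split("Event[")
    return [b for b in blocks if SUSPECT.search(b)]


if __name__ == "__main__":
    hits = flag_suspects(recent_service_installs())
    print(f"{len(hits)} suspicious service installs found")
    for hit in hits[:5]:
        print("---\n" + hit.strip()[:400])
```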
2. The Operational Bottleneck: Brain vs. Limbs
This attack highlighted a critical distinction in manufacturing resilience. A common misconception is that attackers always aim for the "crown jewels," such as the ERP.
In this incident, our On-Prem ERP system, the company's financial "brain" and system of record, remained intact.
However, the attackers successfully crippled the manufacturing services middleware and pockets of floor workstations across the globe. We knew what needed to be built (the data was safe in the ERP), but we had no way to translate those orders into physical production. The "limbs" were paralyzed. The business was technically solvent on paper, but operationally frozen.
3. The Strategic Response: The Kill Switch and Zero Trust
The myth in many boardrooms is that massive outages are always the result of uniquely targeted, high-precision strikes. In our case, the attack was opportunistic; what made the outcome unique was our response.
The Pre-Authorized "Kill Switch": The only way to contain the rapid lateral spread was a "scorched earth" decision: shut down the primary data centers immediately. This only worked because of previous Tabletop Exercises (TTX) where I had negotiated pre-authorized authority to pull the plug at 1:00 AM without waiting for a committee vote. This bias for action stopped the ransomware from propagating to the remaining 80% of our infrastructure.
Recovery Triage based on Revenue & Penalties: Once contained, we faced the reality that we couldn't restore 100 sites simultaneously. We moved from a "first-in, first-out" IT mindset to a business-driven triage model. We prioritized recovery based on three strict criteria (a short code sketch of the ordering follows this list):
Revenue Impact: Which sites produce the highest revenue?
Contractual Exposure: Which sites were facing immediate Service Level Agreement (SLA) penalties for missed shipments?
Technical Severity: What was the extent of the damage (encryption vs. corruption)?
Engineering Collaboration & BCP Integration: We didn't dictate recovery from an ivory tower. We embedded IT leads with local plant engineering teams to accelerate the restoration of specific industrial controllers. Crucially, we mapped our recovery timeline to the manual Business Continuity Processes (BCP) each facility was using. If a plant had a robust paper-based process to get product out the door, we de-prioritized it in favor of plants that were dead in the water.
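The triage model itself is simple enough to express in a few lines. The sketch below is a conceptual illustration of the ordering logic described above; the site names, dollar figures, and weighting are invented for the example, not our real data.

```python
"""Conceptual sketch of the recovery-triage ordering (illustrative data only)."""
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    daily_revenue: float        # revenue at risk per day offline
    daily_sla_penalty: float    # contractual penalties accruing per day
    damage: int                 # 0 = intact, 1 = corruption, 2 = full encryption
    manual_bcp_viable: bool     # can the plant ship product on paper meanwhile?


def triage_key(site: Site) -> tuple:
    # Plants with a workable manual BCP drop down the queue; the rest are
    # ordered by money at risk per day, then by how badly they were hit.
    money_at_risk = site.daily_revenue + site.daily_sla_penalty
    return (site.manual_bcp_viable, -money_at_risk, -site.damage)


sites = [
    Site("Plant A", 1_200_000, 250_000, damage=2, manual_bcp_viable=False),
    Site("Plant B",   400_000,  50_000, damage=1, manual_bcp_viable=True),
    Site("Plant C",   900_000, 600_000, damage=2, manual_bcp_viable=False),
]

for s in sorted(sites, key=triage_key):
    print(f"Restore next: {s.name}")
```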
The Architectural Pivot: True MFA and Killing the VPN
The incident made it clear that our "Gold Standard" was brittle. We implemented two immediate architectural changes to ensure this specific vector could never be exploited again.
True Out-of-Band MFA (Push Notifications): We abandoned the reliance on passive certificate checks and moved to a "True MFA" solution built on mobile push notifications, which decouples authentication from the network session (a conceptual sketch follows this list). Even if a gateway is tricked by a malformed packet, the adversary cannot proceed without the physical, out-of-band approval from the user's mobile device. It adds a human "check" that software bugs cannot bypass.
Zero Trust Network Access (ZTNA): We initiated a total transition to SASE (Secure Access Service Edge) tools like Zscaler. By shifting identity to the perimeter and connecting users only to specific applications, rather than placing them "on the network" via a VPN, we eliminated the attack surface that tools like PSExec exploit for lateral movement.
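For the first change, the flow below sketches what "out-of-band" means in practice: the gateway can only proceed once the user approves a challenge delivered over a separate channel, so a bug in session parsing cannot manufacture that approval. The provider calls are hypothetical placeholders, not any specific vendor's API.

```python
"""Conceptual sketch of out-of-band push approval (hypothetical provider API)."""
import secrets
import time


def send_push(user: str, challenge: str) -> None:
    # Placeholder for a push-MFA provider call; the real request travels over
    # a channel the VPN session never touches.
    print(f"[push] asking {user} to approve challenge {challenge}")


def poll_decision(challenge: str):
    # Placeholder: a real integration would query the provider for the
    # user's approve/deny decision. Returns None while still pending.
    return True


def wait_for_approval(challenge: str, timeout_s: int = 60) -> bool:
    # Deny by default if the user never responds.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        decision = poll_decision(challenge)
        if decision is not None:
            return decision
        time.sleep(2)
    return False


def authenticate(user: str, password_ok: bool) -> bool:
    # Even if the gateway's certificate logic is tricked (as our zero-day
    # proved it could be), the session cannot proceed without this approval.
    if not password_ok:
        return False
    challenge = secrets.token_hex(8)
    send_push(user, challenge)
    return wait_for_approval(challenge)


print(authenticate("jdoe", password_ok=True))
```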
The CISO’s Leadership Playbook
Beyond the technical recovery, this incident fundamentally redefined how I approach leadership during a live crisis.
1. The 70/80 Rule of Decision Making
In a crisis, information is fluid. You will never have 100% of the facts. Hesitation is the adversary's greatest ally. You must be comfortable making "material" decisions with 70% to 80% of the information. Waiting for certainty is a luxury you do not have.
2. The IR Retainer as a Strategic Force Multiplier
We often view Incident Response retainers as a simple insurance policy, a number to call when the house is on fire. This incident taught me that a modern retainer is vital for extending your team's capabilities and skill sets during the fog of war.
Extending the Red/Blue Team: We didn't just use the IR firm for cleanup; we integrated them to augment our internal defenses. They brought specialized forensic and reverse-engineering skills that our internal team simply didn't possess at that scale.
Identifying the Zero-Day: This partnership was the key to our survival. By working alongside our internal team to "red team" the environment and attempting to reproduce the attack, the IR firm helped us identify the obscure zero-day logic flaw in our VPN concentrator. Their external perspective and specialized tooling found what our internal scans missed, proving that a retainer is not just about capacity, it’s about capability.
3. Operational Coordination Over Technical Restoration
Recovery fails in silos. One of the most critical roles I played was not "Head of Technology," but "Head of Logistics." We established a centralized coordination cell that bridged the gap between IT, local Plant Engineering, and Corporate Supply Chain.
Aligning with Local Reality: We didn't guess which servers mattered; we asked local plant leadership which lines were critical to avoiding penalties.
Respecting BCP: We assessed the viability of local Business Continuity Plans. If a site could ship product manually using paper and pencil, it was moved down the queue. Resources were surged only to sites that were operationally dead.
The Outcome: This prevented the chaotic "fighting for resources" that usually plagues recovery efforts and ensured that every restored server translated directly to recognized revenue.
4. Boardroom Transparency
Maintain radical honesty. By letting our Sales teams manage the B2B customer relationships and keeping internal communications truthful, even as the facts kept shifting, we maintained trust. I focused on translating technical "outages" into business "impacts" for the board, avoiding the trap of overstating our success or understating the remaining risk.
