Linux Incident Response Playbook: Practical Troubleshooting and Containment


Security incidents on Linux servers usually don’t start with a dramatic movie-style breach.

Most incidents begin with small signals that are easy to ignore:

  • repeated SSH failures,
  • a suspicious process that keeps coming back,
  • unknown cron entries,
  • odd outbound network connections,
  • or a sudden CPU spike at unusual hours.

When this happens, many teams panic and jump straight to random fixes. That is risky. You can destroy evidence, break production, and still fail to remove the root cause.

This post gives you a practical, production-safe playbook for basic incident response workflows on Linux. The goal is simple: detect quickly, contain safely, recover cleanly, and prevent recurrence.


Why a playbook matters more than heroics

A repeatable playbook beats improvisation every time. During incidents, stress is high and context is fragmented. A clear workflow helps you:

  1. Reduce mean time to detection (MTTD)
  2. Reduce mean time to response (MTTR)
  3. Preserve forensic evidence
  4. Keep stakeholders informed
  5. Avoid accidental downtime caused by rushed commands

For small teams, this is even more important. You probably don’t have a dedicated SOC. Your best defense is a lightweight but disciplined system.

Incident severity model (keep it simple)

Use a 3-level severity model so everyone aligns fast:

  • SEV-1 (Critical): active compromise, data exfiltration risk, privilege escalation confirmed, service unavailable.
  • SEV-2 (High): suspicious persistence, repeated auth attacks, malware indicators, partial service impact.
  • SEV-3 (Medium): early anomalies, suspicious logs without confirmed compromise.

Define actions by severity. Example: SEV-1 requires immediate containment and management notification. SEV-3 may start with observation and tighter monitoring.

Phase 1 — Detect: verify signals before acting

Start with signal validation. Not every anomaly is an attack.

Quick triage commands

# Failed/successful SSH logins (unit is "ssh" on Debian/Ubuntu, "sshd" on RHEL-family)
sudo journalctl -u ssh --since "-2 hours" --no-pager

# Recent auth events from auth.log (if available)
sudo tail -n 200 /var/log/auth.log

# Listening TCP/UDP sockets and their owning processes
sudo ss -tulpen

# Active (established) TCP connections
sudo ss -tnp state established

# High CPU / memory processes
ps aux --sort=-%cpu | head -n 15
ps aux --sort=-%mem | head -n 15

# Recent cron modifications
sudo find /etc/cron* -type f -mtime -2 -ls

Questions you must answer

  • Is the behavior new or normal for this server?
  • Which account, host, and time window are involved?
  • Is there a clear blast radius (single service or multiple)?
  • Is customer data potentially impacted?

Avoid deleting anything at this stage. Detection is about building confidence in facts.
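To keep triage consistent and to preserve outputs before anything changes, the quick checks above can be bundled into a small snapshot script. This is a minimal sketch assuming a Debian/Ubuntu host; the output paths and the `ssh` unit name are illustrative:

```shell
#!/usr/bin/env bash
# Minimal triage snapshot: capture volatile state into a timestamped
# directory before anything is changed. Paths and unit names are
# illustrative; adjust for your distro.
set -eu

# UTC timestamp makes cross-system correlation easier later.
outdir="/tmp/ir-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$outdir"

# Volatile process and socket state first.
ps aux --sort=-%cpu > "$outdir/processes-by-cpu.txt"
ss -tulpen          > "$outdir/sockets.txt" 2>&1 || true

# Recent SSH auth events (Debian/Ubuntu unit name; use sshd elsewhere).
journalctl -u ssh --since "-2 hours" --no-pager \
  > "$outdir/ssh-journal.txt" 2>&1 || true

echo "Triage snapshot written to $outdir"
```

Run it once at the start of every incident; each run creates a fresh directory, so earlier snapshots are never overwritten.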

Phase 2 — Contain: stop the bleed safely

Containment should be surgical, not destructive. Keep services alive where possible.

Fast containment checklist

  1. Restrict ingress with firewall rules
  2. Temporarily disable suspicious account access
  3. Isolate affected service host if needed
  4. Rotate exposed credentials/tokens
  5. Increase logging verbosity for affected services

If you already run a UFW + Fail2Ban baseline, containment becomes much easier. See the related setup in UFW vs Fail2Ban vs SSH Hardening: Kombinasi Wajib Keamanan Server Linux.

Example: temporary network containment (UFW)

# Deny a suspicious source IP quickly. Insert at position 1 so the deny
# precedes any existing allow rules (UFW matches rules in order).
sudo ufw insert 1 deny from 198.51.100.24

# Restrict SSH to a known office/VPN range. Add the narrow allow rule
# first, then remove the broad one, so you don't cut off your own session.
sudo ufw allow from 203.0.113.0/24 to any port 22 proto tcp
sudo ufw delete allow 22/tcp

sudo ufw status verbose

Example: temporarily lock risky user account

# Lock account (do not delete yet)
sudo usermod -L suspicious_user

# Kill the user's existing processes/sessions if required
sudo pkill -KILL -u suspicious_user

Containment is not cleanup. You are buying time and reducing blast radius.

Phase 3 — Investigate: collect evidence with minimal footprint

Now gather artifacts to understand root cause.

Core artifacts to collect

  • Auth logs (journalctl, /var/log/auth.log)
  • Process tree and binary paths
  • Crontab entries (system + user)
  • Bash history (with caution)
  • New/modified files in sensitive paths
  • Startup persistence points (systemd, cron, rc files)

Evidence collection commands

# Process tree snapshot
ps -eo pid,ppid,user,cmd --sort=ppid > /tmp/ir-process-tree.txt

# Systemd units recently changed
sudo find /etc/systemd/system -type f -mtime -7 -ls > /tmp/ir-systemd-changes.txt

# Suspicious SUID files audit
sudo find / -perm -4000 -type f 2>/dev/null > /tmp/ir-suid-files.txt

# Recently modified binaries/scripts in common paths
sudo find /usr/local/bin /usr/bin /tmp /var/tmp -type f -mtime -2 2>/dev/null > /tmp/ir-recent-files.txt

If you rely on shell automation during incidents, make sure scripts are safe to rerun. This pattern helps avoid broken rollback attempts: Idempotent Shell Script: Jalankan Berkali-kali Tanpa Bikin Berantakan.

Evidence hygiene rules

  • Record command history in incident notes
  • Use UTC timestamps for cross-system correlation
  • Never “clean first, investigate later”
  • Hash critical collected files if you need integrity tracking
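The last hygiene rule is easy to automate: a short manifest script makes later integrity checks a one-liner. This sketch assumes artifacts were written as /tmp/ir-*.txt, the naming used in the collection examples above; adjust to your layout.

```shell
#!/usr/bin/env bash
# Sketch: build a SHA-256 manifest of collected artifacts so any later
# tampering is detectable. Assumes artifacts live in one directory as
# ir-*.txt files (illustrative naming).
set -eu

evidence_dir="${1:-/tmp}"
manifest="$evidence_dir/ir-manifest.sha256"

{
  # UTC timestamp, per the hygiene rules above.
  echo "# manifest generated $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  find "$evidence_dir" -maxdepth 1 -type f -name 'ir-*.txt' -exec sha256sum {} +
} > "$manifest"

echo "Wrote $manifest"
# Verify later with: sha256sum -c "$manifest"
```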

Phase 4 — Eradicate: remove root cause, not only symptoms

Common mistake: kill process, restart service, done. That only hides symptoms.

Eradication should cover:

  1. Remove malicious binaries/scripts
  2. Delete unauthorized keys/users/tokens
  3. Fix vulnerable configuration
  4. Patch vulnerable packages
  5. Rebuild compromised workloads if trust is low

Practical eradication tasks

# Update packages quickly (example for Debian/Ubuntu)
sudo apt update && sudo apt upgrade -y

# Review authorized keys for privileged users
sudo ls -la /root/.ssh/
sudo cat /root/.ssh/authorized_keys

# Validate SSH hardening status
sudo grep -E "^(PermitRootLogin|PasswordAuthentication|PubkeyAuthentication|AllowUsers)" /etc/ssh/sshd_config
sudo sshd -t
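The authorized_keys check above covers root only, and planted keys often hide in regular users' homes. A minimal sketch that walks /etc/passwd (treating UID >= 1000 as a regular user is a Debian/Ubuntu convention; run as root to read other users' files):

```shell
#!/usr/bin/env bash
# Sketch: list authorized_keys entries for root and regular users so an
# unauthorized key stands out during eradication.
set -eu

list_authorized_keys() {
  # /etc/passwd fields: user:passwd:uid:gid:gecos:home:shell
  while IFS=: read -r user _ uid _ _ home _; do
    # Skip system accounts.
    [ "$uid" -eq 0 ] || [ "$uid" -ge 1000 ] || continue
    keyfile="$home/.ssh/authorized_keys"
    if [ -f "$keyfile" ] && [ -r "$keyfile" ]; then
      echo "== $user ($keyfile) =="
      # Print key type and comment only; the base64 blob is noise here.
      awk '{print $1, $NF}' "$keyfile"
    fi
  done < /etc/passwd
}

list_authorized_keys
```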

If host trust is severely compromised, rebuilding from known-good image is often faster and safer than deep manual cleanup.

Phase 5 — Recover: restore service with guardrails

Recovery means restoring business function with controls enabled, not simply turning things back on.

Recovery checklist

  • Validate backups before restore
  • Restore least-privilege access model
  • Re-enable traffic gradually
  • Monitor logs/metrics for at least 24 hours
  • Announce closure only after stability criteria are met

Post-recovery validation commands

# Failed services
systemctl --failed

# Security-relevant service health
systemctl status ssh --no-pager
systemctl status fail2ban --no-pager

# Verify listening ports
sudo ss -tulpen
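During the watch window, one-off checks benefit from a drift comparison. This sketch diffs the current listening TCP ports against a known-good baseline; the baseline path and `IR_BASELINE` variable are illustrative, and the first run simply records the baseline.

```shell
#!/usr/bin/env bash
# Sketch: compare current listening TCP ports against a known-good
# baseline so a reopened backdoor port is caught early.
set -eu

baseline="${IR_BASELINE:-/tmp/ir-ports-baseline.txt}"
current="$(mktemp)"

# -t TCP, -l listening, -n numeric, -H no header; keep the local addr:port column.
ss -tlnH | awk '{print $4}' | sort -u > "$current"

if [ ! -f "$baseline" ]; then
  cp "$current" "$baseline"
  echo "Baseline recorded at $baseline"
elif diff -u "$baseline" "$current" > /dev/null; then
  echo "Listening ports match baseline"
else
  echo "WARNING: listening ports changed since baseline" >&2
  diff -u "$baseline" "$current" || true
fi
rm -f "$current"
```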

Phase 6 — Lessons learned: convert incident into hardening

A good response ends with better baseline security.

Create a short post-incident note with:

  • timeline,
  • root cause,
  • impact,
  • what worked,
  • what failed,
  • and concrete action items.

Then update your server hardening checklist:

  • Enforce key-only SSH access
  • Review firewall rules monthly
  • Enable Fail2Ban jails for exposed auth services
  • Add alerting for suspicious login spikes
  • Run periodic account and sudo privilege audits
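The last checklist item is simple to script. A quick audit sketch (the "sudo" group name is the Debian/Ubuntu convention; RHEL-family systems typically use "wheel"):

```shell
# Who is in the sudo group?
getent group sudo || true

# Any UID-0 accounts besides root? (There should be none.)
awk -F: '$3 == 0 {print $1}' /etc/passwd

# Drop-in sudoers rules worth reviewing
ls /etc/sudoers.d/ 2>/dev/null || true
```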

Useful CLI habits for fast diagnostics are covered in Linux Shell Command yang Sering Dipakai Developer Modern.

A minimal IR runbook template (copy this)

You can store this in your repo as RUNBOOK_INCIDENT_RESPONSE.md:

  1. Trigger: What alert or symptom started the response?
  2. Scope: Which hosts/services are affected?
  3. Severity: SEV-1 / SEV-2 / SEV-3
  4. Containment actions: exact commands + timestamps
  5. Evidence collected: logs/files/outputs
  6. Eradication actions: what was removed/fixed
  7. Recovery validation: health checks passed?
  8. Owner + handoff: who closes and who follows up?
  9. Prevention tasks: hardening and automation backlog

Keep it short. During incidents, concise runbooks outperform long documentation.
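To bootstrap the file, a sketch that writes the skeleton into the current repo (filename from above; section headings mirror the template):

```shell
# Sketch: drop the runbook skeleton into the repo root. Edit freely.
cat > RUNBOOK_INCIDENT_RESPONSE.md <<'EOF'
# Incident Response Runbook

1. Trigger:
2. Scope:
3. Severity: SEV-1 / SEV-2 / SEV-3
4. Containment actions (exact commands + UTC timestamps):
5. Evidence collected:
6. Eradication actions:
7. Recovery validation:
8. Owner + handoff:
9. Prevention tasks:
EOF
```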

Common mistakes in Linux incident response

1) Restarting too early

Immediate restarts can erase useful traces. Snapshot evidence first.

2) Overblocking in panic mode

Blocking all ingress without coordination can create unnecessary downtime.

3) No communication owner

Even a small incident needs one incident commander to avoid conflicting actions.

4) Skipping root cause analysis

Without RCA, the same issue returns next week.

5) Ignoring automation quality

Unsafe emergency scripts can become a second incident.

Final takeaway

A reliable basic Linux incident response process is not about expensive tooling first. It is about repeatable steps:

  • detect with evidence,
  • contain with precision,
  • eradicate root cause,
  • recover with controls,
  • and harden continuously.

For small engineering teams, this approach is realistic, affordable, and significantly improves resilience. Start with a lightweight playbook, test it in drills, and refine after every real incident.


FAQ

What is the first action when I suspect a Linux security incident?

Start with evidence-preserving triage: verify suspicious logs, process behavior, and network activity before making destructive changes.

Should I disconnect the server immediately from the network?

Only if active compromise is severe (SEV-1). Prefer targeted containment first (restrict SSH source, block attacker IPs, isolate service) to avoid unnecessary outage.

How long should I keep incident logs and artifacts?

At minimum, keep them through remediation and post-incident review. For compliance-sensitive systems, follow your retention policy and legal requirements.

Is Fail2Ban enough for Linux server protection?

No. It is one layer. You still need firewall policy, SSH hardening, patching, and access review.

How often should we run incident response drills?

Quarterly is a strong baseline for small teams. Monthly tabletop exercises are even better if your environment changes frequently.
