Linux Backup Automation with Shell Script: Troubleshooting Failed Cron, Locking, and Retention
Primary keyword: linux backup automation shell script
Monthly keyword cluster: linux shell scripting, bash scripting linux, task automation linux
Weekly intent rotation: Problem-solving (Troubleshooting and Practical Fixes)
If your backup job says “success” but restore fails, that is not a backup system. That is a timer with false confidence.
In small teams, backup automation usually starts simple: one cron line, one .sh file, done. It works for weeks, then you suddenly find old files that were never rotated, overlapping jobs, half-written database dumps, or a full disk at 3 AM.
This guide is made for that exact moment.
We will build and troubleshoot a production-safe Linux backup automation shell script with practical patterns you can apply today:
- reliable cron execution,
- lock to prevent overlap,
- atomic file handling,
- retention policy that does not eat disk,
- fast health checks for backup quality,
- and a restore-first mindset.
No overengineering, just patterns that survive real production behavior.
Why backup scripts fail in production (even when code looks fine)
Most failures are not syntax problems. They are environment and workflow problems:
- Different PATH in cron: your script runs fine manually but fails in cron because command paths differ.
- Overlapping jobs: a slow backup run is still running when the next scheduled run starts.
- Non-atomic writes: the script writes directly to the final filename; if interrupted, the file is corrupted but still looks valid.
- No retention guardrail: old backups are never deleted, the disk fills up, and the next backup fails silently.
- No restore checks: the team validates "file exists," not "file is restorable."
If you fix these five, your backup reliability usually jumps fast.
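The PATH class of failure is cheap to catch before scheduling anything. A minimal sketch, assuming a Linux box with `env -i` available: run your script (the path below is a hypothetical example) under a stripped, cron-like environment so missing-command errors surface now instead of at 3 AM.

```shell
#!/bin/sh
# Simulate cron's sparse environment before scheduling a script.
# 'env -i' clears the inherited environment; only the variables you pass
# in survive, which is how PATH-dependent failures surface early.
# /usr/local/bin/myapp-backup.sh is a hypothetical script path.
env -i HOME="$HOME" PATH=/usr/bin:/bin \
  /bin/sh -c 'echo "PATH under cron-like shell: $PATH"'
```

If a command your script relies on lives outside `/usr/bin:/bin`, this run fails the same way the cron run would.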
Baseline script structure you should adopt
Use this minimal structure:
#!/usr/bin/env bash
set -euo pipefail
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
BACKUP_DIR="/srv/backups/myapp"
TMP_DIR="${BACKUP_DIR}/tmp"
LOG_FILE="/var/log/myapp-backup.log"
LOCK_FILE="/var/lock/myapp-backup.lock"
RETENTION_DAYS=14
mkdir -p "$BACKUP_DIR" "$TMP_DIR"
log() {
echo "[$(date -Iseconds)] $*" | tee -a "$LOG_FILE"
}
Why this matters:
- `set -euo pipefail` catches hidden failures early.
- An explicit `PATH` avoids cron surprises.
- A dedicated temp directory enables atomic rename.
- A single log file keeps incident review simple.
For deeper shell hardening, you can also read:
- Bash Strict Mode and Safe Automation Checklist for Linux Servers
- Shell Script Preflight Checks Linux Automation Reliability Guide
Step 1: Prevent overlapping runs with lock
This is the #1 production issue in scheduled backups.
exec 200>"$LOCK_FILE"
if ! flock -n 200; then
log "SKIP: backup already running"
exit 0
fi
When a job's duration exceeds the schedule interval, flock saves you from:
- race conditions,
- high disk I/O spikes,
- inconsistent incremental backup state.
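You can verify the lock actually blocks a second run without waiting for a real overlap. A small self-contained check, using a throwaway lock file in /tmp: while the shell holds the lock on FD 9, a second non-blocking `flock` on the same file must be refused.

```shell
#!/usr/bin/env bash
# Sanity-test the lock: hold it on FD 9, then confirm a second
# non-blocking flock on the same file is refused. The /tmp path is a
# throwaway demo location, not the production lock file.
set -u
LOCK=/tmp/myapp-backup-demo.lock

exec 9>"$LOCK"
flock -n 9 || { echo "unexpected: first lock failed"; exit 1; }

# flock(1) in command mode opens its own descriptor, so this simulates
# a second scheduled run arriving while the first still holds the lock.
if flock -n "$LOCK" true; then
  echo "second lock acquired: overlap protection is NOT working"
else
  echo "second lock refused: overlap protection works"
fi
```

Seeing "second lock refused" confirms the skip branch in your script will actually fire on overlap.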
If you want a complete explanation and variations, read:
Preventing Overlapping Cron Jobs in Linux with flock
Step 2: Write backup atomically (tmp first, move later)
Never write final backup directly.
TS="$(date +%Y%m%d-%H%M%S)"
TMP_FILE="${TMP_DIR}/db-${TS}.sql.gz.part"
FINAL_FILE="${BACKUP_DIR}/db-${TS}.sql.gz"
log "START: creating dump"
mysqldump --single-transaction mydb | gzip -c > "$TMP_FILE"
# quick integrity check
if ! gzip -t "$TMP_FILE"; then
log "ERROR: gzip test failed"
rm -f "$TMP_FILE"
exit 1
fi
mv "$TMP_FILE" "$FINAL_FILE"
log "OK: backup saved $FINAL_FILE"
This pattern ensures partial files never appear as successful backups.
Step 3: Add retention that is safe and predictable
Retention should be boring and deterministic.
log "INFO: apply retention > ${RETENTION_DAYS} days"
find "$BACKUP_DIR" -maxdepth 1 -type f -name "db-*.sql.gz" -mtime +"$RETENTION_DAYS" -print -delete
Tips:
- always filter by strict filename pattern,
- print before delete for easy audit,
- do not recurse unless you really need it.
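The "print before delete" tip is easiest to trust after a rehearsal in a scratch directory. A runnable sketch (demo paths only, GNU `touch -d` assumed): create one expired file and one fresh file, list what the expression would delete, then delete and confirm only the expired file is gone.

```shell
#!/usr/bin/env bash
# Rehearse the retention rule in a scratch directory before pointing it
# at real backups: run the expression with -print alone first, then with
# -delete appended. Demo files only; GNU touch -d assumed.
set -euo pipefail
DEMO="$(mktemp -d)"
touch -d "20 days ago" "$DEMO/db-old.sql.gz"   # simulates an expired backup
touch "$DEMO/db-new.sql.gz"                     # fresh backup, must survive

echo "--- would delete ---"
find "$DEMO" -maxdepth 1 -type f -name "db-*.sql.gz" -mtime +14 -print

# same expression with -delete: only the expired file goes
find "$DEMO" -maxdepth 1 -type f -name "db-*.sql.gz" -mtime +14 -delete
echo "--- remaining ---"
ls "$DEMO"
rm -rf "$DEMO"
```

If the "would delete" list ever surprises you here, it would have surprised you in production too.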
If you are managing many shell jobs, this is related too:
Shell Script Retry, Backoff, Timeout Patterns Linux Automation
Step 4: Build cron entry that is production-safe
Use full paths and append logs explicitly:
17 2 * * * /usr/local/bin/myapp-backup.sh >> /var/log/myapp-backup.log 2>&1
Pick one locking layer. The script from Step 1 already takes the lock, so do not also wrap the cron entry in /usr/bin/flock -n on the same lock file: the inner non-blocking flock sees the outer lock as held and the script skips every run. Use a cron-level flock wrapper only for scripts that do not lock internally.
Three common fixes when cron “does nothing”:
- Use the full command path (`/usr/local/bin/myapp-backup.sh`).
- Ensure the executable bit is set (`chmod +x`).
- Add explicit output redirection for visibility.
Step 5: Add quick restore validation (the missing piece)
Backup without restore test is incomplete.
At minimum, add one of these checks:
- `gzip -t` for archive integrity,
- spin up a disposable DB and run a small restore sample weekly,
- verify expected table signatures/checksums.
Example weekly smoke-restore idea:
# pseudo-flow
# 1) pick latest backup
# 2) restore to temporary db/container
# 3) run sanity query (table count, key records)
# 4) report status
This immediately catches broken dumps, encoding issues, and permission surprises.
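The pseudo-flow above can be made runnable. A minimal sketch that uses a plain-text stand-in for the dump so it runs anywhere; in production you would replace steps 2 and 3 with a restore into a disposable database or container plus a real sanity query.

```shell
#!/usr/bin/env bash
# Runnable sketch of the weekly smoke-restore flow. The "dump" is a
# plain-text stand-in; swap steps 2-3 for a real restore into a
# disposable DB/container in production.
set -euo pipefail
WORK="$(mktemp -d)"

# 0) fake "latest backup" (real life: ls -1t "$BACKUP_DIR"/db-*.sql.gz | head -n1)
printf 'CREATE TABLE users (id INT);\nINSERT INTO users VALUES (1);\n' \
  | gzip -c > "$WORK/db-latest.sql.gz"

# 1) archive integrity
gzip -t "$WORK/db-latest.sql.gz"

# 2) "restore" to a temporary location
gunzip -c "$WORK/db-latest.sql.gz" > "$WORK/restored.sql"

# 3) sanity check: do the expected objects exist in the restore?
if grep -q 'CREATE TABLE users' "$WORK/restored.sql"; then
  echo "SMOKE-RESTORE OK"
else
  echo "SMOKE-RESTORE FAILED"
fi
rm -rf "$WORK"
```

The structure is the point: pick latest, verify, restore somewhere disposable, assert on content, report.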
Practical troubleshooting playbook
Case A: Backup file size is suddenly tiny
Likely causes:
- dump command failed early,
- credentials expired,
- network DB read interrupted.
Fast actions:
- check exit code and stderr logs,
- run command manually with same user as cron,
- verify auth source file and permissions.
Case B: Retention deletes wrong files
Likely causes:
- an overly broad `find` pattern,
- a wrong directory variable,
- an accidental recursive delete.
Fast actions:
- run `find` with `-print` only first,
- review one full command in the logs,
- enforce a strict naming convention (e.g. `db-*.sql.gz`).
Case C: Jobs overlap after deployment
Likely causes:
- lock file path changed,
- wrapper script bypasses flock,
- cron entry duplicated.
Fast actions:
- inspect crontab for duplicates,
- confirm lock line is executed before heavy commands,
- add a `ps` check in the logs for the active PID.
Case D: Disk full, backup stops
Likely causes:
- retention not executed,
- old format files excluded from cleanup,
- backup growth exceeded estimate.
Fast actions:
- `du -sh` the backup directories,
- enforce retention for legacy patterns too,
- add a free-space preflight check before the dump.
Production checklist (copy this to your runbook)
- Script uses `set -euo pipefail`
- PATH is explicit for cron
- Locking (`flock`) exists and is tested
- Temporary write + atomic move implemented
- Retention policy logs every deletion
- Backup integrity check runs every job
- Restore smoke test runs at least weekly
- Alerting exists when backup fails
- Team has one-page restore runbook
A related incident-readiness perspective you can adopt:
Linux Incident Response Playbook: Practical Troubleshooting and Containment
Common architecture decision: local vs remote backup target
There is no universal winner. Use this practical default:
- Local first, remote second for speed + safety.
- Write dump locally (fast and atomic), then sync to object storage/remote host.
Why this works well:
- local write reduces partial upload complexity,
- restore is still possible even if remote endpoint fails temporarily,
- retries for remote sync can be isolated from dump logic.
If bandwidth is unstable, split stages:
- Create and verify local artifact.
- Push artifact with retry/backoff.
- Mark backup “complete” only after remote verification.
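The three stages above fit a short skeleton. In this sketch `push` is a local `cp` standing in for your real rsync/rclone/object-storage upload, so the retry loop itself is what the example demonstrates.

```shell
#!/usr/bin/env bash
# Stage-2 sketch: push a verified artifact with simple retry and backoff.
# 'push' is a local cp standing in for the real transfer command
# (rsync/rclone/s3 upload); only the retry structure is the point here.
set -euo pipefail
SRC="$(mktemp)"; DEST_DIR="$(mktemp -d)"
echo "verified artifact" > "$SRC"

push() { cp "$SRC" "$DEST_DIR/"; }   # replace with the real transfer

ok=0
for attempt in 1 2 3; do
  if push; then
    echo "PUSH OK on attempt $attempt"
    ok=1
    break
  fi
  sleep $((attempt * 5))   # linear backoff; exponential works too
done
[ "$ok" -eq 1 ] || { echo "PUSH FAILED after retries"; exit 1; }
# only after remote verification would you mark the backup "complete"
rm -rf "$SRC" "$DEST_DIR"
```

Keeping the retry loop outside the dump logic means a flaky network retries the cheap stage, not the expensive one.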
What “good” looks like after 2–4 weeks
You should be able to answer these quickly:
- How many successful backups in last 7 days?
- How many failed runs and why?
- Can we restore last valid backup in under X minutes?
- Are retention and free space stable?
If your team cannot answer these in 5 minutes, improve observability before adding more features.
FAQ
1) Is cron still okay, or should I migrate to systemd timer?
Cron is still fine for many teams. If your workflow needs dependency ordering and better service integration, systemd timer can be cleaner. Start with the tooling your team can operate confidently.
2) How often should I run backup jobs?
Depends on RPO target. For many apps: daily full + more frequent incremental/log backup is enough. Define schedule from business recovery requirements, not habit.
3) Is file existence enough to mark backup success?
No. At least validate archive integrity and run periodic restore tests. “File exists” is an operational signal, not a recovery guarantee.
4) What is the fastest first improvement for weak backup automation?
Add three things first: flock, atomic temp-to-final write, and retention with logged delete output. Those fixes eliminate a lot of silent failures.
FAQ Schema (JSON-LD, ready to use)
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Is cron still okay, or should I migrate to systemd timer?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Cron is still fine for many teams. If you need dependency ordering and tighter service integration, systemd timer may be cleaner. Choose what your team can operate confidently."
}
},
{
"@type": "Question",
"name": "How often should I run backup jobs?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Use your business RPO target as the primary input. A common baseline is daily full backup plus more frequent incremental or log backup."
}
},
{
"@type": "Question",
"name": "Is file existence enough to mark backup success?",
"acceptedAnswer": {
"@type": "Answer",
"text": "No. File existence alone is not a recovery guarantee. Validate archive integrity and run periodic restore tests."
}
},
{
"@type": "Question",
"name": "What is the fastest first improvement for weak backup automation?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Start with flock locking, atomic temp-to-final writes, and retention cleanup with clear logging. These three reduce many silent production failures."
}
}
]
}
</script>
Closing
A reliable backup system is not about fancy tools. It is about repeatable behavior under stress.
If you implement lock, atomic writes, retention guardrails, and restore checks, your Linux backup automation shell script will already be better than most “it works on my server” setups. Start from one critical service, prove restore time, then scale the pattern across the rest of your stack.