diagnose and resolve Oracle CDB server hung

When an Oracle database server stops responding, the cause is rarely a single failure. In most cases it is a chain of pressure points — memory over-commitment, undersized redo logs, and poorly tuned kernel parameters — that compound until the system cannot recover on its own. This guide walks through the full diagnostic and remediation process for a hung Oracle CDB server on Oracle Linux 9, using the commands and queries that actually matter.

1. Initial Triage

When a server becomes unresponsive, start with the basics before touching the database.

Check the last reboot time

who -b

If the server rebooted unexpectedly, note the timestamp. OS-level logs do not persist across a reboot by default — you need to check the VM console or a centralised log aggregator for pre-crash evidence.

Check current memory state

free -tm

Key values to assess:

available — this is what matters, not free. It includes reclaimable buffer/cache.
Swap used — any significant swap usage on a dedicated Oracle server is a red flag. Oracle SGA pages have no business being on disk.

Example output showing a concerning state:

               total        used        free      shared  buff/cache   available
Mem:           31964       13893         425          46       10101       18070
Swap:          16382        1454       14928

Available memory looks healthy at 18 GB, but 1.4 GB of swap usage on a database server warrants investigation.

Check for OOM evidence in logs

grep -i 'out of memory' /var/log/messages
grep -i 'oom' /var/log/messages
dmesg -T | grep -i oom

Empty output does not mean OOM did not occur — if the server was rebooted, these logs have been cleared. Check your VM console history or monitoring platform for OOM messages that appeared before the reboot.

2. Memory Pressure Diagnosis

Analyse historical memory trends with SAR

sar -r

Look for these patterns in the output:

Indicator	What to look for
`%memused`	Sustained values above 85–90% are concerning
`kbcommit`	Growing committed memory that does not release suggests a memory leak or runaway session
`kbdirty`	Spikes in dirty pages indicate heavy write activity that may be stressing the I/O subsystem

A pattern of kbcommit growing steadily from a post-reboot baseline (e.g., 3 GB → 5.5 GB over 10 hours) with no release is a strong indicator of PGA growth from long-running sessions or a memory leak in an application schema.

Identify the top memory consumers

ps aux --sort=-%mem | head -20

On an Oracle server, you will typically see Oracle dedicated server processes (oracle<SID> (LOCAL=NO)) each consuming a share of the SGA virtual address space. The VSZ value appears very large (reflecting the full SGA mapped into each process) — focus on RSS for actual physical memory usage per session.

3. Oracle Alert Log Analysis

The Oracle alert log is the most important source of evidence for a database hang. On a typical Oracle Linux installation:

find /u01 /opt/oracle -name "alert_*.log" 2>/dev/null
# Or use the Oracle environment:
find \(ORACLE_BASE -name "alert_\){ORACLE_SID}.log" 2>/dev/null

The smoking gun: redo log archiving errors

Search for the following patterns in the alert log:

grep -E "cannot allocate new log|Private strand flush|ORA-00257|checkpoint not complete" alert_<SID>.log

"Thread 1 cannot allocate new log, sequence XXXX" means the Log Writer (LGWR) tried to switch to the next redo log group but that group has not yet been archived and freed. The database stalls until the archiver catches up.

"Private strand flush not complete" appears alongside the above and means in-memory redo strands could not be flushed before the log switch. This indicates the database is generating redo faster than the archiver can handle it.

"checkpoint not complete" means the database writer (DBWn) has not finished writing dirty buffers before LGWR needs to reuse a log group. Both conditions (archiving and checkpoint) must be clear before a log group can be reused.

When a log switch stalls, check how long it persisted by comparing the timestamp of the "cannot allocate" message against the next successful log switch entry. A stall of more than a few seconds indicates a genuine problem; a stall of minutes or hours means the database was effectively frozen.

4. Redo Log Investigation

Check current log group status

Connect as SYSDBA from CDB$ROOT on a CDB:

SHOW CON_NAME;
-- Must return CDB$ROOT

SELECT group#, members, bytes/1024/1024 AS size_mb, status 
FROM v$log;

Status meanings:

Status	Meaning
`CURRENT`	Actively being written to by LGWR
`ACTIVE`	Log switch occurred but checkpoint has not yet completed — this group cannot be reused yet
`INACTIVE`	Fully archived and checkpointed — safe to reuse or drop
`UNUSED`	Never been written to

Having multiple groups in ACTIVE status simultaneously is a warning sign — it means the checkpoint process is lagging behind log switches.

Analyse log switch frequency

This query shows how long each log group was in use. Ideal is 15–30 minutes between switches under normal load:

SELECT sequence#, first_time,
       ROUND((LEAD(first_time) OVER (ORDER BY sequence#) - first_time) * 24 * 60, 2) AS duration_mins
FROM v$log_history
WHERE first_time >= SYSDATE - 1
ORDER BY sequence#;

Switches completing in under 5 minutes indicate the redo logs are undersized for the workload. Switches completing in under 1 minute indicate a severe mismatch.

Check current log file sizes and paths

ls -lh $ORACLE_BASE/oradata/<DB_NAME>/onlinelog/
ls -lh $ORACLE_BASE/fast_recovery_area/<DB_NAME>/onlinelog/

5. Identifying High-Redo Workloads in a CDB

In a CDB, all PDBs share the same online redo log stream. A single busy PDB can generate enough redo to overwhelm the archiver for the entire CDB.

Find which PDB is generating the most redo

SELECT con_id, name, value AS redo_size_bytes
FROM v$con_sysstat
WHERE name = 'redo size'
ORDER BY value DESC;

A PDB generating an order of magnitude more redo than others is the primary workload driver.

Identify which PDB maps to which CON_ID

SELECT con_id, name, open_mode FROM v$pdbs ORDER BY con_id;

Find scheduled jobs in the high-redo PDB

Switch into the identified PDB and query both the scheduler and legacy DBMS_JOB:

ALTER SESSION SET CONTAINER = <pdb_name>;

-- DBMS_Scheduler jobs running on an hourly frequency
SELECT job_name, owner, state, last_start_date, repeat_interval
FROM dba_scheduler_jobs
WHERE repeat_interval LIKE '%HOUR%'
   OR repeat_interval LIKE '%hour%'
ORDER BY last_start_date DESC;

-- Legacy DBMS_JOB definitions (wrapped as DBMS_JOB$_ in scheduler history)
SELECT job, what, interval, next_date, broken
FROM dba_jobs
ORDER BY next_date;

-- Recent job execution history
SELECT job_name, owner, status, actual_start_date, additional_info
FROM dba_scheduler_job_run_details
WHERE actual_start_date >= SYSTIMESTAMP - INTERVAL '1' DAY
ORDER BY actual_start_date DESC;

High-frequency jobs (running every few seconds or minutes) owned by application schemas are a common source of sustained redo generation. On-the-hour burst jobs that generate a large volume of DML in a short window are the typical trigger for "cannot allocate new log" conditions.

6. Linux Kernel Tuning for Oracle Servers

The Linux kernel defaults are not optimised for dedicated Oracle database servers. Three parameters should always be reviewed and set explicitly.

Check current values

sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio

Recommended values for Oracle DB servers

# Apply immediately (live, no reboot required)
sysctl -w vm.swappiness=1
sysctl -w vm.dirty_ratio=15
sysctl -w vm.dirty_background_ratio=3

Persist across reboots

vi /etc/sysctl.conf
# Add the following lines:
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=3

Verify the file contains the entries:

grep -E 'swappiness|dirty' /etc/sysctl.conf

Why these values matter for Oracle

vm.swappiness=1 The default of 60 tells the kernel to aggressively page memory to swap even when RAM is available. Oracle SGA pages swapped to disk cause severe query latency and can cascade into log writer waits and database hangs. Setting this to 1 (not 0 — on RHEL 9 / OracleLinux 9, 0 can trigger more aggressive OOM kills) tells the kernel to exhaust all other reclaim options before touching swap.

vm.dirty_ratio=15 Caps the percentage of total memory that can be held as dirty (unwritten) pages before the kernel forces synchronous write-back. The default of 20 can allow large dirty page accumulations that then flush all at once, creating I/O storms that compete with Oracle's own I/O.

vm.dirty_background_ratio=3 Starts background write-back earlier (when 3% of memory is dirty vs. the default 10%), preventing dirty pages from building up in the first place.

7. Resizing Online Redo Logs

Oracle does not support in-place resizing of redo log groups. The process is: add new larger groups, cycle the old ones out, then drop them.

All commands run from CDB$ROOT as SYSDBA. The resize applies to the entire CDB.

Step 1 — Confirm current state

SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;

Step 2 — Add new larger log groups

Replace the paths with your actual oradata and fast_recovery_area locations. Multiplexing across two separate storage volumes is strongly recommended:

ALTER DATABASE ADD LOGFILE GROUP 4 
  ('/u01/oradata/MYDB/onlinelog/redo04a.log',
   '/u02/fast_recovery_area/MYDB/onlinelog/redo04b.log') SIZE 2048M;

ALTER DATABASE ADD LOGFILE GROUP 5
  ('/u01/oradata/MYDB/onlinelog/redo05a.log',
   '/u02/fast_recovery_area/MYDB/onlinelog/redo05b.log') SIZE 2048M;

ALTER DATABASE ADD LOGFILE GROUP 6
  ('/u01/oradata/MYDB/onlinelog/redo06a.log',
   '/u02/fast_recovery_area/MYDB/onlinelog/redo06b.log') SIZE 2048M;

Step 3 — Verify new groups were created

SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
-- Groups 4, 5, 6 should show UNUSED

Step 4 — Force log switches to cycle old groups out

ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;

Step 5 — Wait for old groups to go INACTIVE

SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;

All old groups must show INACTIVE before you proceed. If any are still ACTIVE, run ALTER SYSTEM CHECKPOINT again and recheck. Never drop a CURRENT or ACTIVE group.

Step 6 — Drop the old groups

ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;

Step 7 — Remove the physical files

Oracle will often remove the OS files automatically when dropping a log group. Verify first:

ls -lh /u01/oradata/MYDB/onlinelog/
ls -lh /u02/fast_recovery_area/MYDB/onlinelog/

Remove any remaining old files manually if present.

Step 8 — Final verification

SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
-- Should show only the new groups at the new size

Sizing guidance

Use the log switch frequency query from Section 4 as your guide. Target 15–30 minutes between switches under your peak workload. If peak bursts fill a log in under a minute, the new size should be at least 10–15x the old size.

8. Verifying Oracle Memory Configuration

Check current SGA and PGA parameters

SELECT name, value 
FROM v$parameter
WHERE name IN ('sga_max_size','sga_target','pga_aggregate_target','memory_target');

Key considerations:

memory_target = 0 means Automatic Memory Management (AMM) is disabled. Oracle manages SGA and PGA separately (ASMM mode). This is the recommended configuration on Oracle Linux.
sga_target and pga_aggregate_target together should not exceed 75–80% of physical RAM, leaving headroom for the OS, filesystem cache, and PGA growth beyond the target (which is a soft limit, not a hard cap).
Swap usage above zero on a dedicated Oracle server is a sign these limits are being exceeded.

Example: checking FRA disk space

Before any redo log work, confirm the FRA has sufficient space:

df -h /u02/fast_recovery_area/

Also check from SQL:

SELECT dest_name, status, error 
FROM v$archive_dest 
WHERE status != 'INACTIVE';

9. Preventative Monitoring

Redo log switch frequency (run weekly or via AWR)

SELECT 
  TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
  COUNT(*) AS switches_per_hour
FROM v$log_history
WHERE first_time >= SYSDATE - 7
GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER BY 1;

More than 4–6 switches per hour consistently indicates the redo logs need to be larger.

Current memory commit ratio (run anytime)

sar -r 1 3

Watch kbcommit — if it consistently exceeds physical RAM + 50% of swap, the system is over-committed.

Swap usage alert threshold

Set monitoring alerts for swap usage exceeding 25% of total swap. By the time an OOM event occurs, it is too late to intervene — early swap pressure is the warning window.

Check for ACTIVE log group accumulation

SELECT COUNT(*) AS active_groups 
FROM v$log 
WHERE status = 'ACTIVE';

More than one ACTIVE group simultaneously is a checkpoint lag warning. Two or more ACTIVE groups on a system with only three total groups means the next log switch could stall.

Summary

Problem	Diagnosis	Fix
Redo logs too small	Log switches < 5 mins; "cannot allocate new log" in alert log	Add larger log groups, drop old ones
Checkpoint lag	Multiple ACTIVE groups in v$log	Larger logs; tune DB_WRITER_PROCESSES if needed
Memory exhaustion	kbcommit growth in SAR; swap usage	Right-size SGA/PGA; investigate PGA growth
Kernel swap pressure	vm.swappiness=60 (default)	Set vm.swappiness=1 in /etc/sysctl.conf
Dirty page I/O storms	kbdirty spikes in SAR	Set vm.dirty_ratio=15, vm.dirty_background_ratio=3
High-redo application jobs	v$con_sysstat redo size; dba_scheduler_job_run_details	Tune batch job frequency and DML volume

Command Palette