Diagnosing and Resolving an Oracle CDB Server Hung State on Oracle Linux 9

When an Oracle database server stops responding, the cause is rarely a single failure. In most cases it is a chain of pressure points — memory over-commitment, undersized redo logs, and poorly tuned kernel parameters — that compound until the system cannot recover on its own. This guide walks through the full diagnostic and remediation process for a hung Oracle CDB server on Oracle Linux 9, using the commands and queries that actually matter.
1. Initial Triage
When a server becomes unresponsive, start with the basics before touching the database.
Check the last reboot time
who -b
If the server rebooted unexpectedly, note the timestamp. OS-level logs do not persist across a reboot by default — you need to check the VM console or a centralised log aggregator for pre-crash evidence.
Check current memory state
free -tm
Key values to assess:
- available — this is what matters, not
free. It includes reclaimable buffer/cache. - Swap used — any significant swap usage on a dedicated Oracle server is a red flag. Oracle SGA pages have no business being on disk.
Example output showing a concerning state:
total used free shared buff/cache available
Mem: 31964 13893 425 46 10101 18070
Swap: 16382 1454 14928
Available memory looks healthy at 18 GB, but 1.4 GB of swap usage on a database server warrants investigation.
Check for OOM evidence in logs
grep -i 'out of memory' /var/log/messages
grep -i 'oom' /var/log/messages
dmesg -T | grep -i oom
Empty output does not mean OOM did not occur — if the server was rebooted, these logs have been cleared. Check your VM console history or monitoring platform for OOM messages that appeared before the reboot.
2. Memory Pressure Diagnosis
Analyse historical memory trends with SAR
sar -r
Look for these patterns in the output:
| Indicator | What to look for |
|---|---|
%memused |
Sustained values above 85–90% are concerning |
kbcommit |
Growing committed memory that does not release suggests a memory leak or runaway session |
kbdirty |
Spikes in dirty pages indicate heavy write activity that may be stressing the I/O subsystem |
A pattern of kbcommit growing steadily from a post-reboot baseline (e.g., 3 GB → 5.5 GB over 10 hours) with no release is a strong indicator of PGA growth from long-running sessions or a memory leak in an application schema.
Identify the top memory consumers
ps aux --sort=-%mem | head -20
On an Oracle server, you will typically see Oracle dedicated server processes (oracle<SID> (LOCAL=NO)) each consuming a share of the SGA virtual address space. The VSZ value appears very large (reflecting the full SGA mapped into each process) — focus on RSS for actual physical memory usage per session.
3. Oracle Alert Log Analysis
The Oracle alert log is the most important source of evidence for a database hang. On a typical Oracle Linux installation:
find /u01 /opt/oracle -name "alert_*.log" 2>/dev/null
# Or use the Oracle environment:
find \(ORACLE_BASE -name "alert_\){ORACLE_SID}.log" 2>/dev/null
The smoking gun: redo log archiving errors
Search for the following patterns in the alert log:
grep -E "cannot allocate new log|Private strand flush|ORA-00257|checkpoint not complete" alert_<SID>.log
"Thread 1 cannot allocate new log, sequence XXXX" means the Log Writer (LGWR) tried to switch to the next redo log group but that group has not yet been archived and freed. The database stalls until the archiver catches up.
"Private strand flush not complete" appears alongside the above and means in-memory redo strands could not be flushed before the log switch. This indicates the database is generating redo faster than the archiver can handle it.
"checkpoint not complete" means the database writer (DBWn) has not finished writing dirty buffers before LGWR needs to reuse a log group. Both conditions (archiving and checkpoint) must be clear before a log group can be reused.
When a log switch stalls, check how long it persisted by comparing the timestamp of the "cannot allocate" message against the next successful log switch entry. A stall of more than a few seconds indicates a genuine problem; a stall of minutes or hours means the database was effectively frozen.
4. Redo Log Investigation
Check current log group status
Connect as SYSDBA from CDB$ROOT on a CDB:
SHOW CON_NAME;
-- Must return CDB$ROOT
SELECT group#, members, bytes/1024/1024 AS size_mb, status
FROM v$log;
Status meanings:
| Status | Meaning |
|---|---|
CURRENT |
Actively being written to by LGWR |
ACTIVE |
Log switch occurred but checkpoint has not yet completed — this group cannot be reused yet |
INACTIVE |
Fully archived and checkpointed — safe to reuse or drop |
UNUSED |
Never been written to |
Having multiple groups in ACTIVE status simultaneously is a warning sign — it means the checkpoint process is lagging behind log switches.
Analyse log switch frequency
This query shows how long each log group was in use. Ideal is 15–30 minutes between switches under normal load:
SELECT sequence#, first_time,
ROUND((LEAD(first_time) OVER (ORDER BY sequence#) - first_time) * 24 * 60, 2) AS duration_mins
FROM v$log_history
WHERE first_time >= SYSDATE - 1
ORDER BY sequence#;
Switches completing in under 5 minutes indicate the redo logs are undersized for the workload. Switches completing in under 1 minute indicate a severe mismatch.
Check current log file sizes and paths
ls -lh $ORACLE_BASE/oradata/<DB_NAME>/onlinelog/
ls -lh $ORACLE_BASE/fast_recovery_area/<DB_NAME>/onlinelog/
5. Identifying High-Redo Workloads in a CDB
In a CDB, all PDBs share the same online redo log stream. A single busy PDB can generate enough redo to overwhelm the archiver for the entire CDB.
Find which PDB is generating the most redo
SELECT con_id, name, value AS redo_size_bytes
FROM v$con_sysstat
WHERE name = 'redo size'
ORDER BY value DESC;
A PDB generating an order of magnitude more redo than others is the primary workload driver.
Identify which PDB maps to which CON_ID
SELECT con_id, name, open_mode FROM v$pdbs ORDER BY con_id;
Find scheduled jobs in the high-redo PDB
Switch into the identified PDB and query both the scheduler and legacy DBMS_JOB:
ALTER SESSION SET CONTAINER = <pdb_name>;
-- DBMS_Scheduler jobs running on an hourly frequency
SELECT job_name, owner, state, last_start_date, repeat_interval
FROM dba_scheduler_jobs
WHERE repeat_interval LIKE '%HOUR%'
OR repeat_interval LIKE '%hour%'
ORDER BY last_start_date DESC;
-- Legacy DBMS_JOB definitions (wrapped as DBMS_JOB$_ in scheduler history)
SELECT job, what, interval, next_date, broken
FROM dba_jobs
ORDER BY next_date;
-- Recent job execution history
SELECT job_name, owner, status, actual_start_date, additional_info
FROM dba_scheduler_job_run_details
WHERE actual_start_date >= SYSTIMESTAMP - INTERVAL '1' DAY
ORDER BY actual_start_date DESC;
High-frequency jobs (running every few seconds or minutes) owned by application schemas are a common source of sustained redo generation. On-the-hour burst jobs that generate a large volume of DML in a short window are the typical trigger for "cannot allocate new log" conditions.
6. Linux Kernel Tuning for Oracle Servers
The Linux kernel defaults are not optimised for dedicated Oracle database servers. Three parameters should always be reviewed and set explicitly.
Check current values
sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio
Recommended values for Oracle DB servers
# Apply immediately (live, no reboot required)
sysctl -w vm.swappiness=1
sysctl -w vm.dirty_ratio=15
sysctl -w vm.dirty_background_ratio=3
Persist across reboots
vi /etc/sysctl.conf
# Add the following lines:
vm.swappiness=1
vm.dirty_ratio=15
vm.dirty_background_ratio=3
Verify the file contains the entries:
grep -E 'swappiness|dirty' /etc/sysctl.conf
Why these values matter for Oracle
vm.swappiness=1
The default of 60 tells the kernel to aggressively page memory to swap even when RAM is available. Oracle SGA pages swapped to disk cause severe query latency and can cascade into log writer waits and database hangs. Setting this to 1 (not 0 — on RHEL 9 / OracleLinux 9, 0 can trigger more aggressive OOM kills) tells the kernel to exhaust all other reclaim options before touching swap.
vm.dirty_ratio=15
Caps the percentage of total memory that can be held as dirty (unwritten) pages before the kernel forces synchronous write-back. The default of 20 can allow large dirty page accumulations that then flush all at once, creating I/O storms that compete with Oracle's own I/O.
vm.dirty_background_ratio=3
Starts background write-back earlier (when 3% of memory is dirty vs. the default 10%), preventing dirty pages from building up in the first place.
7. Resizing Online Redo Logs
Oracle does not support in-place resizing of redo log groups. The process is: add new larger groups, cycle the old ones out, then drop them.
All commands run from CDB$ROOT as SYSDBA. The resize applies to the entire CDB.
Step 1 — Confirm current state
SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
Step 2 — Add new larger log groups
Replace the paths with your actual oradata and fast_recovery_area locations. Multiplexing across two separate storage volumes is strongly recommended:
ALTER DATABASE ADD LOGFILE GROUP 4
('/u01/oradata/MYDB/onlinelog/redo04a.log',
'/u02/fast_recovery_area/MYDB/onlinelog/redo04b.log') SIZE 2048M;
ALTER DATABASE ADD LOGFILE GROUP 5
('/u01/oradata/MYDB/onlinelog/redo05a.log',
'/u02/fast_recovery_area/MYDB/onlinelog/redo05b.log') SIZE 2048M;
ALTER DATABASE ADD LOGFILE GROUP 6
('/u01/oradata/MYDB/onlinelog/redo06a.log',
'/u02/fast_recovery_area/MYDB/onlinelog/redo06b.log') SIZE 2048M;
Step 3 — Verify new groups were created
SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
-- Groups 4, 5, 6 should show UNUSED
Step 4 — Force log switches to cycle old groups out
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM SWITCH LOGFILE;
ALTER SYSTEM CHECKPOINT;
Step 5 — Wait for old groups to go INACTIVE
SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
All old groups must show INACTIVE before you proceed. If any are still ACTIVE, run ALTER SYSTEM CHECKPOINT again and recheck. Never drop a CURRENT or ACTIVE group.
Step 6 — Drop the old groups
ALTER DATABASE DROP LOGFILE GROUP 1;
ALTER DATABASE DROP LOGFILE GROUP 2;
ALTER DATABASE DROP LOGFILE GROUP 3;
Step 7 — Remove the physical files
Oracle will often remove the OS files automatically when dropping a log group. Verify first:
ls -lh /u01/oradata/MYDB/onlinelog/
ls -lh /u02/fast_recovery_area/MYDB/onlinelog/
Remove any remaining old files manually if present.
Step 8 — Final verification
SELECT group#, members, bytes/1024/1024 AS size_mb, status FROM v$log;
-- Should show only the new groups at the new size
Sizing guidance
Use the log switch frequency query from Section 4 as your guide. Target 15–30 minutes between switches under your peak workload. If peak bursts fill a log in under a minute, the new size should be at least 10–15x the old size.
8. Verifying Oracle Memory Configuration
Check current SGA and PGA parameters
SELECT name, value
FROM v$parameter
WHERE name IN ('sga_max_size','sga_target','pga_aggregate_target','memory_target');
Key considerations:
memory_target = 0means Automatic Memory Management (AMM) is disabled. Oracle manages SGA and PGA separately (ASMM mode). This is the recommended configuration on Oracle Linux.sga_targetandpga_aggregate_targettogether should not exceed 75–80% of physical RAM, leaving headroom for the OS, filesystem cache, and PGA growth beyond the target (which is a soft limit, not a hard cap).- Swap usage above zero on a dedicated Oracle server is a sign these limits are being exceeded.
Example: checking FRA disk space
Before any redo log work, confirm the FRA has sufficient space:
df -h /u02/fast_recovery_area/
Also check from SQL:
SELECT dest_name, status, error
FROM v$archive_dest
WHERE status != 'INACTIVE';
9. Preventative Monitoring
Redo log switch frequency (run weekly or via AWR)
SELECT
TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
COUNT(*) AS switches_per_hour
FROM v$log_history
WHERE first_time >= SYSDATE - 7
GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
ORDER BY 1;
More than 4–6 switches per hour consistently indicates the redo logs need to be larger.
Current memory commit ratio (run anytime)
sar -r 1 3
Watch kbcommit — if it consistently exceeds physical RAM + 50% of swap, the system is over-committed.
Swap usage alert threshold
Set monitoring alerts for swap usage exceeding 25% of total swap. By the time an OOM event occurs, it is too late to intervene — early swap pressure is the warning window.
Check for ACTIVE log group accumulation
SELECT COUNT(*) AS active_groups
FROM v$log
WHERE status = 'ACTIVE';
More than one ACTIVE group simultaneously is a checkpoint lag warning. Two or more ACTIVE groups on a system with only three total groups means the next log switch could stall.
Summary
| Problem | Diagnosis | Fix |
|---|---|---|
| Redo logs too small | Log switches < 5 mins; "cannot allocate new log" in alert log | Add larger log groups, drop old ones |
| Checkpoint lag | Multiple ACTIVE groups in v$log | Larger logs; tune DB_WRITER_PROCESSES if needed |
| Memory exhaustion | kbcommit growth in SAR; swap usage | Right-size SGA/PGA; investigate PGA growth |
| Kernel swap pressure | vm.swappiness=60 (default) | Set vm.swappiness=1 in /etc/sysctl.conf |
| Dirty page I/O storms | kbdirty spikes in SAR | Set vm.dirty_ratio=15, vm.dirty_background_ratio=3 |
| High-redo application jobs | v$con_sysstat redo size; dba_scheduler_job_run_details | Tune batch job frequency and DML volume |





