
The server wasn't out of memory. I just thought it was.

A WooCommerce store kept crashing under trivial traffic. Two wrong diagnoses (out of memory, hypervisor freeze), then the real cause: an Apache config reserving 19 GB of thread stacks at boot, pushing the kernel past its commit limit so that fork() started failing — making the VM look dead from outside while it still had 5 GB of free RAM. Written with Claude as narrator.

A WooCommerce outage I couldn't crack alone. I asked Claude to investigate. I'm going to let Claude tell most of it, because the part I learned from is where I was wrong and Claude was wrong and we kept going.

A small WooCommerce store I help with — Cloudways, DigitalOcean droplet, 4 vCPU, 8 GB RAM, $99/month — went down twice in four days. Each time the public site stopped responding and SSH refused new connections. Each time the owner clicked Restart in the panel and the server came back fine in minutes.

I started looking and quickly realised I was out of my depth on the kernel side. So I gave Claude SSH access (read-only paths) and the question. What follows is its account of how the investigation went, with my pushback in line. The investigation took about four hours and produced two wrong answers before the right one. The wrong answers are the interesting part.


Claude, narrating:

The dev pointed me at two outage windows, gave me read access to /var/log (no sudo — Cloudways doesn't grant it to "master" users), and told me the symptom: site down, SSH down, owner clicks Restart, recovery in minutes. The owner's framing was specific and a little incredulous — "why is a $99 server with 4 CPUs and 8 GB going down when only 5 or 6 customers are browsing?"

That sentence ended up being load-bearing for the whole investigation. I should have anchored on it earlier than I did.

First answer: out of memory

The natural place to start on Cloudways is the atop archive in /var/log/atop/. They keep ten-minute snapshots of full system state, readable without sudo. I pulled the snapshot covering the seven minutes before the first outage:

MEM | tot 7.8G | free 284.6M | cache 3.7G | buff 322.4M | slab 980.7M
SWP | tot 1.9G | free 1.2G | vmcom 9.0G | vmlim 5.8G

284 MB free. Committed virtual memory three gigabytes over the kernel's commit limit. On most days that's an open-and-shut memory pressure diagnosis. I wrote up: the server is undersized for the workload, upgrade RAM or shed background workers.

The dev came back firmly:

"I don't think 4 CPU and 8 GB is too little for 5–6 customers. I can't accept that a website can't handle 5 users."

He was right. And I should have caught the mistake myself, because Cloudways' Sensu monitoring agent had been writing the contradicting data into /var/log/sensu/sensu-client.log the entire time.

The metric I should have trusted

Sensu samples real metrics every five minutes. One of its checks is called freeWOBuffersCaches: the memory that is free once reclaimable buffers and cache are counted as available rather than used. That's the number atop's "free" column doesn't show. Linux deliberately fills RAM with page cache because empty RAM is wasted RAM; clean cache pages can be dropped almost immediately when something needs the memory. So "free" in atop is almost always misleadingly small.
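To make the gap concrete, here is a minimal sketch that adds buffers and page cache back onto the "free" figure, using the numbers from the atop snapshot above. It is an approximation under a stated assumption: all buffers and cache are treated as reclaimable, which overstates things somewhat, since shared memory and dirty pages are counted as cache too. The function name is illustrative and is not Sensu's actual check.

```python
MIB_PER_GIB = 1024


def free_without_buffers_cache_mib(free_mib, buffers_mib, cached_mib):
    """Approximate memory available to applications, in MiB.

    Assumption: buffers and page cache are treated as fully reclaimable.
    """
    return free_mib + buffers_mib + cached_mib


# Figures from the atop snapshot taken before the first outage.
free_mib = 284.6
buffers_mib = 322.4
cached_mib = 3.7 * MIB_PER_GIB  # atop printed "cache 3.7G"

available_mib = free_without_buffers_cache_mib(free_mib, buffers_mib, cached_mib)
print(f"atop 'free':            {free_mib / MIB_PER_GIB:.2f} GiB")
print(f"free + buffers + cache: {available_mib / MIB_PER_GIB:.2f} GiB")
```

Even the snapshot that looked like memory pressure works out to roughly 4.3 GiB once cache and buffers are counted back in.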

Four minutes before the second outage, Sensu had logged:

memory.free       5,364,084,736   →  5.0 GB actually available
cpu.idle          90.87 %
mysql.active_conn 0
phpfpm.workers    2

Five gigabytes of actually-available memory. Two PHP workers. Zero active database connections. 91% idle CPU. And then total silence for sixty-three minutes.

That isn't a server out of memory. That's a server doing almost nothing, that suddenly couldn't do anything at all. Different shape of failure.

I had to throw out the first answer and start over.

Second answer: hypervisor freeze

The two outages had different shapes. The first one ended with Sensu logging a clean SIGTERM — something gracefully told it to stop:

received signal  "signal":"TERM"
stopping
stopping reactor

The second one just stopped logging mid-sentence. No SIGTERM. No partial snapshot. Total silence for an hour, then post-boot reload.

I had a story: the first outage was a Cloudways or DigitalOcean maintenance reboot (graceful, hence the SIGTERM), and the second was a hypervisor-level freeze at DigitalOcean's end (no graceful shutdown, hence the silence). I wrote up: escalate the second one to DO support.

The dev came back again:

"I was the one who initiated the reboot both times from the panel. The site was down and SSH was unreachable, so I clicked Restart."

That single sentence collapsed the entire second model. The SIGTERM I'd attributed to "maintenance" was the dev's click. The "hypervisor freeze" wasn't a freeze of the VM by the host — it was something inside the VM, severe enough that the dev escalated from "graceful restart" to "hard restart" because the graceful one didn't work this time.

Which meant both outages had been going on for hours before the reboot. The reboot was the recovery, not the failure. The actual question was: what kept happening inside the VM that made it look dead from outside?

When the site actually broke

I went back to the access log. The last request Apache served on the first night was at 20:07:18 — an iPhone customer finishing a product page load. Then nothing in the access log until the reboot the next morning. Apache was broken from 20:07 until 05:08. Nine hours. Not the four or five the dev had assumed from when he noticed.

What triggered it wasn't the iPhone customer who finished. It was the customer just before — an Android phone that rapid-browsed ten product pages in twenty seconds. WooCommerce pages fan out: the page itself, plus a PixelYourSite Facebook event, plus a wc-ajax=ppc-simulate-cart call, plus a cart-fragments refresh. One fast browser was easily producing thirty or forty concurrent server-side requests over a few seconds.

The atop snapshot for that window shows the spike:

CPU | sys 21% | user 80% | irq 1% | idle 296%
CPL | avg1 7.53 | avg5 7.43 | avg15 6.96
MEM | free 189.5M | vmcom 9.5G | vmlim 5.8G
NET | tcpi 51860 | tcpo 99212

Load average 7.5 on a 4-core box. Committed virtual memory pushed from 9.0 GB to 9.5 GB. Then Apache silently stopped serving and never resumed.

Within twenty-three minutes, the first apachectl process appeared stuck in D-state — uninterruptible kernel sleep. Something (monit, sensu, the Cloudways agent — they all probe Apache health by shelling out to apachectl) had tried to check on Apache and the call hung. By the next hour, seven D-state apachectls. Three hours in, thirty-two. By the time the dev reset the server, 103 of them had accumulated, at one every five minutes — exactly the cadence of the monitoring loop that kept invoking apachectl and kept getting stuck.

That pile-up of D-state processes was the fingerprint. Whatever was wrong with the server, it wasn't killing Apache — it was preventing the system from completing a perfectly normal call to apachectl status. Which meant the failure wasn't in Apache. It was upstream of anything starting a new process.
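For anyone wanting to watch for the same fingerprint, the sketch below counts processes in uninterruptible sleep by reading /proc/<pid>/stat, which needs no sudo. It is only an illustration of the idea, not the tooling used during the investigation; the function names are hypothetical.

```python
import glob
from collections import Counter


def parse_stat(stat_line):
    """Return (comm, state) from the contents of one /proc/<pid>/stat file.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def count_d_state(stat_lines):
    """Count processes in state 'D' (uninterruptible sleep) per command name."""
    counts = Counter()
    for line in stat_lines:
        comm, state = parse_stat(line)
        if state == "D":
            counts[comm] += 1
    return counts


def read_stat_lines():
    """Read every readable /proc/<pid>/stat; skip processes that vanish."""
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as fh:
                lines.append(fh.read())
        except OSError:
            continue  # process exited between listing and reading
    return lines


if __name__ == "__main__":
    for comm, n in count_d_state(read_stat_lines()).most_common():
        print(f"{n:4d}  {comm}")
```

A count for one command that grows by one per monitoring interval is the pattern described above.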

The mechanism

Linux tracks two memory numbers most people never look at:

  • CommitLimit: the kernel's bookkeeping ceiling on promised memory. By default it's swap + RAM × overcommit_ratio / 100. On this server: 1.9 GB + 7.8 GB × 50% ≈ 5.8 GB. Verified in /proc/meminfo as CommitLimit: 6,098,396 kB.
  • Committed_AS: the running total of memory the kernel has promised to running processes. Not what they're using, but what they could touch if they tried. Verified as 9,165,708 kB. That's 8.74 GB. Roughly three gigabytes over the limit, while the server was idle.
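The arithmetic behind those two numbers can be sketched as follows. The parser reads text in /proc/meminfo's format; the formula ignores huge pages, which the kernel subtracts from RAM before applying the ratio, and the helper names are mine rather than anything from the kernel or Cloudways.

```python
KIB_PER_GIB = 1024 * 1024


def commit_limit_kib(ram_kib, swap_kib, overcommit_ratio=50):
    """CommitLimit as swap + RAM * overcommit_ratio / 100 (huge pages ignored)."""
    return swap_kib + ram_kib * overcommit_ratio // 100


def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


# The two values quoted above, in /proc/meminfo's own format.
meminfo = parse_meminfo("CommitLimit:     6098396 kB\nCommitted_AS:    9165708 kB\n")
over_kib = meminfo["Committed_AS"] - meminfo["CommitLimit"]
print(f"Committed_AS exceeds CommitLimit by {over_kib / KIB_PER_GIB:.2f} GiB")

# Cross-check the limit against the atop totals (RAM 7.8G, swap 1.9G).
estimate_kib = commit_limit_kib(int(7.8 * KIB_PER_GIB), int(1.9 * KIB_PER_GIB))
print(f"estimated CommitLimit: {estimate_kib / KIB_PER_GIB:.2f} GiB")
```

On a live box, passing the contents of /proc/meminfo to parse_meminfo gives the same comparison without any rounding from atop.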

With vm.overcommit_memory = 0 (the Linux default), CommitLimit is not a hard ceiling: the kernel allows Committed_AS to exceed it as long as its heuristic thinks each individual allocation looks reasonable (strict enforcement of the limit is mode 2). But when a new process tries to fork() or a running process tries to mmap() and the heuristic doesn't like it, the kernel returns ENOMEM. Allocation refused. New processes can't be created.

When fork() starts being refused systemwide, here's what stops working:

  • Apache can't spawn new worker processes.
  • sshd can't fork a session for a new SSH connection → SSH appears down.
  • atop's sampling helper can't run → atop log goes silent.
  • Sensu can't shell out to its Ruby check scripts → monitoring goes silent.
  • apachectl status (run by monitoring loops) can't fork its child → hangs in D-state, accumulating.
  • Cron jobs fail silently.

From the outside this is indistinguishable from a frozen VM. From the inside the kernel is alive, existing processes can briefly continue, and the only thing broken is the birth of new processes. This is the single counterintuitive insight from the whole investigation: a Linux box can have five gigabytes of free RAM and still refuse a fifty-megabyte allocation, because the kernel is doing bookkeeping against CommitLimit, not against actual physical memory.
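Because the failure is specifically "no new processes", one way to detect it is a probe that tries to start a trivial child and reports why it could not. The sketch below is an assumption-laden illustration, not something that ran on this server: it only helps if it is called from a process that is already running (a long-lived daemon, say), since launching a fresh interpreter to run it would itself need a fork.

```python
import errno
import subprocess
import sys


def can_spawn_process():
    """Try to start a trivial child process; return (ok, reason)."""
    try:
        subprocess.run([sys.executable, "-c", "pass"], check=True, timeout=10)
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            return False, "spawn refused: ENOMEM (commit accounting or real exhaustion)"
        if exc.errno == errno.EAGAIN:
            return False, "spawn refused: EAGAIN (process or thread limit reached)"
        return False, f"spawn failed: {exc}"
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        return False, f"child did not complete cleanly: {exc}"
    return True, "ok"


if __name__ == "__main__":
    ok, reason = can_spawn_process()
    print("can spawn new processes:", ok, "-", reason)
```

An ENOMEM result while the memory graphs still show gigabytes available is the combination this investigation eventually landed on.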

That's the failure mode. Now: where did the committed memory go?

The Apache config

Cloudways ships this droplet with /etc/apache2/mods-enabled/mpm_event.conf reading:

<IfModule mpm_event_module>
    StartServers             20
    ServerLimit              300
    MinSpareThreads          150
    MaxSpareThreads          300
    ThreadLimit              200
    ThreadsPerChild          120
    MaxRequestWorkers        36000
    MaxConnectionsPerChild   7500
</IfModule>

MaxRequestWorkers 36000 is the line that draws the eye. Thirty-six thousand concurrent workers on an 8 GB server is absurd. But it's not what breaks things — it's a ceiling, only relevant under traffic the server will never see.

What breaks things is StartServers 20 × ThreadsPerChild 120. Apache's event MPM pre-creates a thread pool per process, and each thread reserves an 8 MB stack (verified via ulimit -s = 8192 KB). So at boot:

20 processes × 120 threads × 8 MB stack  =  19,200 MB  ≈  19 GB

Nineteen gigabytes of virtual memory reserved at boot, before a single visitor has arrived. Confirmed empirically: idle Apache worker processes each showed VSIZE of 3.2 GB (RSS only ~175 MB — so barely used, but fully committed). Add MariaDB (5.7 GB VSIZE), Varnish (2.4 GB file-backed mapping), the full Imunify360 security stack (~5 GB combined), and Committed_AS sits permanently above CommitLimit.

The server "works" only while nothing forces the kernel to actually honor those promises. One fast browser making thirty parallel WooCommerce requests is enough. So is two simultaneous checkouts. So is a Facebook ad landing three visitors at once. The fast-browsing customer didn't do anything wrong — any normal busy moment would have done the same. The config was the failure; the customer just exposed it first.

This is also why upgrading RAM wouldn't fix the problem. You can't out-buy a config that explicitly reserves nineteen gigabytes of stacks before serving its first byte. The plan was correctly sized. The config was sized for a server tier that doesn't exist on this plan.

The fix

Same MPM, numbers that match the box:

<IfModule mpm_event_module>
    StartServers             2
    ServerLimit              8
    MinSpareThreads          25
    MaxSpareThreads          75
    ThreadLimit              64
    ThreadsPerChild          25
    MaxRequestWorkers        200
    MaxConnectionsPerChild   10000
</IfModule>

Stack reservation drops from 19 GB to about 400 MB on paper. The 200-concurrent-request ceiling is still 33× the peak this site has ever recorded; if you ever exceed it, Nginx in front queues the overflow for a fraction of a second rather than rejecting it.
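The on-paper numbers for both configs can be reproduced with a few lines. This sketch assumes the 8 MB per-thread stack from ulimit -s applies to every Apache thread, and it checks two relationships between the directives: ThreadsPerChild must not exceed ThreadLimit, and MaxRequestWorkers must not exceed ServerLimit × ThreadsPerChild. The class and method names are illustrative.

```python
from dataclasses import dataclass

STACK_MIB = 8  # per-thread stack size, from `ulimit -s` = 8192 KB


@dataclass(frozen=True)
class EventMpm:
    start_servers: int
    server_limit: int
    thread_limit: int
    threads_per_child: int
    max_request_workers: int

    def boot_stack_mib(self):
        """Stack address space reserved at startup: processes x threads x stack."""
        return self.start_servers * self.threads_per_child * STACK_MIB

    def problems(self):
        """Return a list of inconsistencies between the directives."""
        found = []
        if self.threads_per_child > self.thread_limit:
            found.append("ThreadsPerChild exceeds ThreadLimit")
        if self.max_request_workers > self.server_limit * self.threads_per_child:
            found.append("MaxRequestWorkers exceeds ServerLimit x ThreadsPerChild")
        return found


shipped = EventMpm(start_servers=20, server_limit=300, thread_limit=200,
                   threads_per_child=120, max_request_workers=36000)
tuned = EventMpm(start_servers=2, server_limit=8, thread_limit=64,
                 threads_per_child=25, max_request_workers=200)

for name, cfg in (("shipped", shipped), ("tuned", tuned)):
    status = cfg.problems() or "directives consistent"
    print(f"{name}: {cfg.boot_stack_mib()} MiB of stacks at boot; {status}")
```

Both configs are internally consistent; the difference is purely in how much they reserve before the first request arrives.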

The bigger change is the kernel mode:

vm.overcommit_memory = 1

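With overcommit_memory = 1, the kernel stops refusing allocations on accounting grounds, so fork decisions are made against real memory pressure instead of CommitLimit. The trade-off is that genuine exhaustion is then handled by the OOM killer rather than by a failed allocation. The MPM tightening brings Committed_AS down meaningfully (how meaningfully depends on glibc arena behavior; more on that in the epilogue), but the fork() cascade is gone either way.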

What I missed

Twice I produced a confident wrong answer. Both times the dev was the corrective signal.

The first wrong answer ("memory pressure, upgrade RAM") was a calibration error. I weighted atop's "free" column too heavily. The Sensu logs were already showing 5 GB of actually-available memory; I should have cross-referenced them before writing up the diagnosis. The lesson is mechanical and reusable: when one metric says "low memory" and another says "plenty of memory" about the same instant, the one that counts reclaimable cache as available is the one to trust.

The second wrong answer ("hypervisor freeze") was a different kind of mistake. I was modeling automated events — Cloudways maintenance, DigitalOcean freezes — without first asking the obvious question of whether a human had pushed any buttons. The dev's single sentence about clicking Restart collapsed the whole theory in a way no log analysis could have. Establish what humans did before modeling what computers did.


Epilogue: the fix landed (mostly as predicted)

The dev opened a support ticket with the config changes. Cloudways applied both within a few hours.

Verification on the box:

  • Apache MPM: file on disk matches the new values, processes have been up since the reload, each worker now reports 27 threads (25 + 2 dispatcher) instead of 122.
  • vm.overcommit_memory: set to 1 live, and persisted in /etc/sysctl.conf so it survives reboot.
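Both checks can be done without sudo by reading /proc. The sketch below shows one way to do it, assuming the standard Linux layout: the Threads: field in /proc/<pid>/status and the single integer in /proc/sys/vm/overcommit_memory. The helper names are hypothetical, and finding the Apache worker PIDs is left out; the demo at the bottom inspects the current process instead.

```python
import os

OVERCOMMIT_MODES = {
    0: "heuristic (kernel default)",
    1: "always allow",
    2: "strict accounting against CommitLimit",
}


def thread_count(status_text):
    """Return the Threads: value from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no Threads: field found")


def overcommit_mode(sysctl_text):
    """Describe the mode stored in /proc/sys/vm/overcommit_memory."""
    return OVERCOMMIT_MODES[int(sysctl_text.strip())]


def read_text(path):
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__" and os.path.exists("/proc/sys/vm/overcommit_memory"):
    print("overcommit mode:", overcommit_mode(read_text("/proc/sys/vm/overcommit_memory")))
    # Swap "self" for an Apache worker PID to check its thread count.
    print("threads in this process:", thread_count(read_text("/proc/self/status")))
```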

Measured outcomes:

                      Before     After
Committed_AS          9.16 GB    7.51 GB
Apache worker VSIZE   3.2 GB     2.17 GB
overcommit_memory     0          1
Available RAM         —          ~4.4 GB free, swap barely touched

The crash mechanism is gone. In overcommit_memory = 1 mode the kernel no longer refuses allocations against the commit limit, so the fork() cascade that took down Apache, SSH, and the monitoring agents can't recur.

One honest correction: I predicted Committed_AS would drop to ~3–4 GB. It dropped to ~7.5 GB instead, still above CommitLimit. The reason is that the Apache workers' virtual size turns out to be dominated by glibc's per-thread memory arenas (~2 GB per process) rather than the 8 MB thread stacks my arithmetic had counted. Cutting the thread count from 120 to 25 helped less than expected on the virtual number.

But the actual fix works for a different reason than the virtual size: with overcommit_memory = 1 the kernel ignores CommitLimit when deciding whether to allow a fork(). The bookkeeping ceiling that was the failure mode is no longer consulted. So the Committed_AS number that looks like a problem is now correctly classified as harmless reserved virtual memory that the processes never touch — and physical memory use is low, swap is barely warm.

The prediction was directionally right, numerically optimistic. The fix works because the mode change matters more than the size change.


Back to me:

A few things I took from sitting on the other side of this:

The first answer was the most popular answer. "Memory pressure" is the default diagnosis on the internet for these symptoms. An AI trained on the internet's debugging stories will reach for it first. The corrective came from me pushing back on a real-world fact ("a website can't crash from five users") that the AI had no way of weighing without prompting.

The "wrong answers" weren't wasted work. Both of them required digging through logs and ruling out plausible causes; that ruling-out is most of what investigation is. The shape of the wrong answers also narrowed the space of right ones: by the time we'd ruled out "undersized" and "hypervisor," the only category left was "userspace failure on a healthy VM," which is a small space to search.

The fix wouldn't have been doable without the AI. I would have ended up at "upgrade the plan." The actual fix — that the Apache MPM is reserving 19 GB of thread stacks at boot, that the kernel commit limit is being blown by bookkeeping rather than usage, that overcommit mode matters — is well outside what I'd reach for. AI is genuinely good at the specific shape of work where you need to pull a thread through layers you don't personally maintain.

Pushing back is the part you can't outsource. The investigation got unstuck twice, both times because I told the AI it was wrong. That part — knowing when something doesn't pass the smell test, even if you can't yet name why — is still yours to do.

The fix itself is a five-minute config change on Cloudways' end. (The owner of the box can't apply it himself; Cloudways doesn't grant sudo to "master" SSH users and the Apache MPM controls aren't exposed in their panel. It needs a support ticket.) The investigation that landed on the fix took four hours, two wrong turns, and one dev who refused to accept that a website should crash from five customers.

He was right about that, too.