cfm kernsec — Kernel Attack Surface Reduction for Linux

Part of the CFM server security platform. This post covers what kernsec does, how it works, how to use it, and how it relates to the Linux kernel’s proposed Killswitch primitive.


The problem it solves

A stock AlmaLinux or CloudLinux install ships with a running kernel that supports hundreds of subsystems almost no hosting server will ever use: amateur radio protocols, 1980s network stacks, optical disc filesystems, infrared communication, virtual video devices. These subsystems exist in kernel memory, can be autoloaded on first use, and represent real attack surface — many CVEs in the past few years trace directly to one of them.

At the same time, the kernel’s own hardening knobs — a sprawling set of sysctls, boot arguments, and compile-time options documented by the Kernel Self-Protection Project (KSPP) — go untouched on most production systems because applying them correctly without breaking real workloads takes more time than most operators have.

cfm kernsec solves both problems with a single managed component.


Philosophy: audit first, apply explicitly

kernsec never silently mutates a host. The default command, cfm kernsec, opens an interactive TUI (or falls back to text output on non-TTY) that shows the full audit without changing anything. Mutation only happens when you explicitly run cfm kernsec apply.

This is intentional. On a shared hosting server managing hundreds of customer vhosts, a misconfigured sysctl or a blacklisted module that a customer’s process depends on can escalate from “security improvement” to “support ticket” very fast. kernsec shows you what it intends to do first, tells you exactly what each rule affects, and requires explicit confirmation before writing anything.

The same philosophy applies to /etc/fstab: kernsec audits mount options and surfaces what is missing, but it never edits fstab. Mount changes on a live hosting server require operator judgment.


The tier system

Rules are grouped into two tiers.

Tier 1 is safe-everywhere for shared hosting, KVM, and cPanel/CloudLinux stacks. These are the defaults. Every rule in Tier 1 has been validated against real hosting workloads: CloudLinux LVE, CageFS, cPanel jails, PHP-FPM, MySQL, Dovecot, Exim, Postfix.

Tier 2 is server-aggressive and opt-in. These rules either have meaningful blast radius on some workloads or require more operator context. Examples: disabling unprivileged user namespaces (breaks rootless containers), and oops=panic together with its sysctl counterpart kernel.panic_on_oops=1 (the kernel panics on an oops instead of continuing). You enable Tier 2 by setting tier = 2 in /etc/cfm/kernsec.conf.

The tier system is not about whether a hardening measure is correct — all rules are correct. It is about whether the blast radius is bounded without operator context.


What kernsec manages

Sysctls — /etc/sysctl.d/99-cfm-kernsec.conf

kernsec owns and enforces the KSPP sysctl baseline plus a set of CFM extensions. Rules fall into groups:

kspp.kernel — Core KSPP sysctls: kernel.kptr_restrict=2 (hide kernel pointers from userspace), kernel.dmesg_restrict=1 (restrict kernel log to root), kernel.unprivileged_bpf_disabled=1, kernel.randomize_va_space=2, kernel.yama.ptrace_scope=1 (restrict ptrace to direct children), kernel.perf_event_paranoid=3.

kspp.fs — Filesystem hardening: fs.protected_hardlinks=1, fs.protected_symlinks=1, fs.protected_fifos=2, fs.protected_regular=2.

sysctl.mem.exploit — Exploit technique mitigations: vm.mmap_rnd_bits=32 and vm.mmap_rnd_compat_bits=16 (ASLR entropy), kernel.warn_limit=10 and kernel.oops_limit=10 (rate-limit warning/oops spray), fs.suid_dumpable=0 (no core dumps from setuid binaries).

sysctl.kernel.surface — Surface reductions: dev.tty.ldisc_autoload=0 (no automatic TTY line-discipline module loading — closes the n_hdlc-style attack path), kernel.sysrq=0.

sysctl.net.harden — Network hardening: source-route rejection, martian logging, RFC1337 TIME_WAIT fix, broadcast ICMP ignore.

sysctl.net (EXT) — Keys owned by cfm-sysctl-tweaks (the daemon’s imperative TCP tuning): rp_filter, accept_redirects, send_redirects, tcp_syncookies. kernsec audits these but never writes them — ownership is tracked in CFM’s cross-component registry. The audit surfaces EXT with the live value so you can see whether the other component’s intent is actually active.

Tier 2 adds user.max_user_namespaces=0, kernel.unprivileged_userns_clone=0 (kills a major LPE primitive class, but breaks rootless podman, bwrap, and some hosting isolation — skipped automatically when containers or CloudLinux/CageFS are detected), kernel.panic_on_oops=1, and kernel.panic=10.


Boot arguments — bootloader-backend managed

kernsec manages a defined set of cmdline keys across three bootloader backends (BLS/grubby, legacy GRUB, Proxmox). It strips stale instances of its own managed keys before appending the desired set, and never touches operator-provided arguments outside its managed set.

kspp.boot (Tier 1):

  • slab_nomerge — prevents slab cache merging, hardens against type-confusion UAF exploits
  • init_on_alloc=1 — zero pages on allocation, eliminates uninitialized-memory leaks kernel-wide
  • page_alloc.shuffle=1 — randomize buddy-allocator freelists, mild ASLR boost for kernel allocations
  • randomize_kstack_offset=on — per-syscall kernel-stack randomization, makes ROP/stack-spray harder
  • initcall_blacklist=algif_aead_init — runtime killswitch for Copy Fail / CVE-2026-31431: blocks the AEAD AF_ALG initialization at boot, preventing the exploit’s send-path primitive from being reachable

boot.bug-detection (Tier 1): kfence.sample_interval=100 — enables low-overhead KFENCE heap corruption detection sampling.

boot.dma (Tier 1): efi=disable_early_pci_dma — disables early pre-IOMMU EFI PCI DMA; skipped automatically on non-EFI hosts.

boot.sidechannel (Tier 1): tsx=off — disables Intel TSX, eliminating the TAA/TSX side-channel surface entirely.

tier2.oops (Tier 2): oops=panic — paired with kernel.panic_on_oops=1, causes the kernel to reboot instead of continuing after memory corruption.


Module blacklists — /etc/modprobe.d/cfm-kernsec.conf

This is where kernsec does the most work per line of config. For each rule, apply writes both a blacklist <module> line (prevents alias-based autoloading) and an install <module> /bin/false line (prevents direct modprobe loads). Both lines are needed because blacklist alone only suppresses alias-based autoloading: an explicit modprobe <module>, or a dependency pulled in by another module, still loads it.
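A minimal sketch of how the two lines per module could be rendered. The function name renderModprobe is hypothetical and not taken from the kernsec source; only the two-line pattern itself comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// renderModprobe (hypothetical name) emits the two modprobe.d lines described
// above for each module: "blacklist" suppresses alias-based autoloading, and
// "install <module> /bin/false" makes a load request for that name fail.
func renderModprobe(modules []string) string {
	var b strings.Builder
	for _, m := range modules {
		fmt.Fprintf(&b, "blacklist %s\n", m)
		fmt.Fprintf(&b, "install %s /bin/false\n", m)
	}
	return b.String()
}

func main() {
	// Two modules from the modules.recent_cves group.
	fmt.Print(renderModprobe([]string{"ksmbd", "n_hdlc"}))
}
```

Keeping both lines adjacent per module makes the managed file easy to diff when a rule is added or skipped.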

The current catalog covers 82 modules across 9 groups:

modules.recent_cves — Modules that have been directly exploited or have no conceivable server use case: ksmbd (in-kernel SMB server, multiple LPE CVEs 2023–2025), n_hdlc (the n_hdlc LPE class), vivid (virtual video driver, frequent CVE/CTF target), watch_queue (Dirty Cred vector), binfmt_aout, nfc, nfcsim, pn533, pn533_usb.

modules.net.legacy — 26 dead network protocols: dccp (multiple LPEs), tipc (cluster IPC LPEs), rds (Oracle-internal reliable datagram), rxrpc (AFS RPC), ax25/netrom/x25/rose (amateur radio protocols, explicitly named in the Linux Killswitch proposal), decnet, econet, ipx (Novell), appletalk, various LLC/SNAP encapsulations, pptp, gtp, can, atm, irda, phonet, caif, caif_socket, hsr.

modules.net.virt — vsock (VMware/QEMU guest↔host virtual sockets; also explicitly named in the Killswitch proposal). Skipped when IsKVMHost is detected, since KVM hypervisor hosts need it for guest communication.

modules.net.iot — IoT/embedded protocols with no server purpose: 6lowpan, ieee802154, ieee802154_6lowpan.

modules.fs.unused — 17 filesystems no hosting server mounts: cramfs, freevxfs, jffs2, hfs, hfsplus, udf, qnx4, qnx6, omfs, befs, ufs, affs, sysv, nilfs2, gfs2, ocfs2, coda.

modules.fs.container — overlay is blacklisted where containers are not detected. The host profile probe checks for runc/containerd/LXC/podman and auto-skips this rule if any are running.

modules.bus.bluetooth — bluetooth, btusb, bnep, hci_uart. Auto-skipped when Bluetooth hardware is detected via /sys/class/bluetooth.

modules.bus.firewire — firewire-core, firewire-ohci, firewire-net, firewire-sbp2. DMA attack surface; no server use.

modules.bus.thunderbolt — thunderbolt. Auto-skipped when Thunderbolt devices are detected.

modules.bus.misc — joydev, pcspkr, floppy.

modules.sidechannel — intel_rapl_common, intel_rapl_msr. Removes Intel RAPL/Platypus power telemetry side-channel surface (CVE-2020-8694).

modules.crypto_userapi — algif_hash, algif_skcipher, algif_rng, algif_akcipher. Extends the AF_ALG hardening beyond what initcall_blacklist=algif_aead_init covers at boot. Together with the boot arg, this eliminates the entire AF_ALG userspace crypto API surface: all five algif_* socket families.

NFS, CIFS/SMB clients, io_uring, and wifi modules are intentionally not blacklisted. These have legitimate use on real fleet nodes.


Mount audit — report only

kernsec audits four mount points and surfaces missing options, but never edits /etc/fstab:

  • /tmp — recommend nodev,nosuid,noexec
  • /var/tmp — same
  • /dev/shm — recommend nodev,nosuid,noexec
  • /home — recommend nodev,nosuid (not noexec — hosting panels need exec on home)
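The report-only check above boils down to a set difference between recommended and live mount options. The sketch below assumes the options come from the fourth field of a /proc/mounts line; the names recommended and missingOptions are illustrative, not kernsec's own.

```go
package main

import (
	"fmt"
	"strings"
)

// recommended mirrors the four mount points and options listed above.
var recommended = map[string][]string{
	"/tmp":     {"nodev", "nosuid", "noexec"},
	"/var/tmp": {"nodev", "nosuid", "noexec"},
	"/dev/shm": {"nodev", "nosuid", "noexec"},
	"/home":    {"nodev", "nosuid"},
}

// missingOptions returns the entries of want that do not appear in the
// comma-separated option string have.
func missingOptions(have string, want []string) []string {
	set := make(map[string]bool)
	for _, o := range strings.Split(have, ",") {
		set[o] = true
	}
	var missing []string
	for _, w := range want {
		if !set[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Sample line in /proc/mounts layout:
	// device, mount point, type, options, dump, pass.
	line := "tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0"
	f := strings.Fields(line)
	if want, ok := recommended[f[1]]; ok {
		fmt.Println(f[1], "missing:", missingOptions(f[3], want))
	}
}
```

The function only reports; nothing in this sketch writes to /etc/fstab, in line with the report-only rule.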

Host profile detection

Before resolving any rule, kernsec probes the host and builds a profile that gates rules which would break real workloads. This runs automatically — operators don’t configure it. Detected signals include:

  • IsKVMHost — kvm_intel/kvm_amd loaded → don’t blacklist vsock
  • HasContainers — runc/containerd/lxc/podman running → don’t kill user namespaces or overlay
  • HasIPsec — ip xfrm policy non-empty → don’t touch IPsec modules
  • HasBluetoothHardware — /sys/class/bluetooth non-empty → keep Bluetooth modules
  • HasThunderbolt — /sys/bus/thunderbolt/devices non-empty → keep thunderbolt
  • IsCPanel, HasCloudLinuxLVE, HasCageFS — hosting panel detection
  • HasKernelCare, HasKsplice — live-patching agents (affects module loading behavior)
  • IsProxmox — Proxmox boot backend selection
  • HasDKMS — out-of-tree module evidence → conservative on module rules

Rules gated by these signals render as SKIP (host profile: <reason>) in audit output. Operators override per-rule with state = force in kernsec.conf.


Using it

First run

cfm kernsec          # TUI on a TTY; text audit otherwise (read-only)
cfm kernsec text     # force plain-text output
cfm kernsec preview  # show what apply would do

Applying

cfm kernsec init     # write /etc/cfm/kernsec.conf with tier=1 if absent
cfm kernsec apply --dry-run   # show changes without writing
cfm kernsec apply             # interactive: preview + confirmation
cfm kernsec apply --yes       # unattended (Ansible, cron)

apply writes files in safe order: managed sysctl drop-in and modprobe file first (reversible), bootloader cmdline update next, then sysctl -w per-key last. Runtime sysctl application is last specifically so a failure in the bootloader step doesn’t leave Tier 2 sysctls (like user.max_user_namespaces=0) active in the running kernel without a persistent file backing them.

Monitoring

cfm kernsec status --check     # exits non-zero on any WARN; suitable for cron/Nagios
cfm kernsec status --json      # machine-readable, for fleet aggregation
cfm kernsec monitor enable     # install systemd drift-check timer
cfm kernsec monitor status     # show last timer runs

For fleet status collection:

for h in virgo titan orion rigel mars earth edge; do
  echo "=== $h ===" && ssh root@$h cfm kernsec status --json | jq .ok
done

Per-rule overrides

/etc/cfm/kernsec.conf accepts per-rule state overrides without changing the tier. For example, on mars the operator has overridden three rules: two are forced even though the host profile would otherwise skip them, and one is pinned to skip:

tier = 1

[rule "KSEC-BOOT-sidechannel-001"]
state = force   # tsx=off — force even on this specific host

[rule "KSEC-MOD-net.legacy-024"]
state = force   # phonet

[rule "KSEC-MOD-net.virt-001"]
state = skip    # vsock: needed on this KVM host; pinned to skip rather than relying on IsKVMHost detection

Valid states are default (follow tier), force (apply regardless of host profile), and skip (never apply regardless of tier).

Recovery

cfm kernsec disable --dry-run   # preview what disable removes
cfm kernsec disable             # persist tier=0, strip managed boot args
cfm kernsec rollback            # restore pre-apply bootloader snapshot

rollback uses the .cfm-kernsec.bak snapshots written before each apply. It removes only kernsec-managed cmdline keys — operator-provided args outside the managed set are preserved.


Runtime state output — reading the TUI

The TUI (screenshot above) shows three panes: Groups on the left, Rules in the middle, Detail on the right. The middle pane shows per-rule state:

  • OK — rule is applied and verified live
  • WARN — rule is configured but not yet active (reboot required for boot args, or module still loaded)
  • DIFF — sysctl present but with wrong value (drift from another tool)
  • MISSING — expected entry absent from managed file
  • SKIP — host profile blocks the rule, or the sysctl key/module isn’t exposed by this kernel
  • OFF — operator explicitly disabled (tier below rule’s tier, or state = skip)
  • DRIFT — active in current boot but missing from next-boot config
  • LOADED — module blacklisted but still loaded; needs reboot or rmmod
  • EXT — key owned by another CFM component; kernsec audits only

The “rules: 129 warnings: 10” counter in the TUI header and the “BOOT: divergence — reconcile with” notice are the two key signals to watch. Warnings mean drift from desired state; divergence means not all kernels have the same managed args applied (common during a kernel update when an old kernel still boots).


The killswitch connection

In May 2026, Linux stable kernel co-maintainer Sasha Levin proposed a feature called Killswitch — a mechanism to make a kernel function return a fixed value without executing its body, as a temporary mitigation while a patch cycle completes. The proposal explicitly names the problem: “when a security issue goes public, fleets stay exposed until a patched kernel is built, distributed, and rebooted into.”

kernsec already implements this philosophy at every layer below the kernel:

Boot-time killswitch — initcall_blacklist=algif_aead_init is, literally, a function-level killswitch applied at kernel init time. When Copy Fail (CVE-2026-31431) dropped with a working exploit before distro kernels were patched, this single boot arg made the vulnerable code path unreachable. kernsec manages, applies, and verifies this automatically.

Module-level killswitch — The install <module> /bin/false pattern is a killswitch for any kernel subsystem that lives in a loadable module. Every rule in modules.recent_cves and modules.net.legacy is a pre-emptive killswitch applied before a CVE drops, removing attack surface that the kernel will never need to load. When ksmbd had its LPE series in 2023–2025, hosts with kernsec deployed were unaffected because the module couldn’t load.

The Killswitch proposal’s named targets vs. kernsec coverage:

Subsystem | Killswitch proposal           | kernsec
AF_ALG    | af_alg_sendmsg engage -EPERM  | initcall_blacklist=algif_aead_init (boot) + algif_* blacklist (module)
ksmbd     | ksmbd_smb2_read engage -EPERM | modules.recent_cves blacklist
nf_tables | named candidate               | intentionally excluded (CFM depends on nf_tables)
vsock     | named candidate               | modules.net.virt blacklist (KVM-host gated)
ax25      | named candidate               | modules.net.legacy blacklist

Three of the five named Killswitch candidates are already covered by kernsec’s module blacklists. The AF_ALG case is covered at boot time. nf_tables is the only one that can’t be touched without breaking CFM’s own firewall backend.

The key difference is timing: kernsec’s module blacklists are applied proactively at install time, before any CVE is announced. The kernel Killswitch primitive is designed for the reactive case — a zero-day drops, you engage the killswitch before a patch is ready. The two approaches are complementary: kernsec eliminates the surface area that will never be needed; Killswitch handles the residual cases where a needed function turns out to be vulnerable.


What it does not do

kernsec deliberately does not:

  • Set kernel.modules_disabled=1 — this is a one-way runtime switch that requires careful late-boot orchestration so cfm and host services can finish loading required modules first. Planned as a follow-up with a proper systemd unit.
  • Edit /etc/fstab — mount changes require operator judgment on live hosting servers.
  • Manage NFS, CIFS, io_uring, or wifi modules — legitimate use exists on fleet nodes.
  • Implement Killswitch engage/disengage — the kernel interface doesn’t exist in any distro kernel yet. When it lands in AlmaLinux/CloudLinux kernels, kernsec’s audit layer is designed to surface the state via cfm kernsec status.

Implementation notes for the technically curious

kernsec is written in Go and lives in internal/kernsec/. The key design points:

Schema-driven rule registry — Every rule is a Go struct (SysctlRule, BootArg, ModuleRule, MountRule) with a stable KSEC-<class>-<group>-<NNN> ID. The resolver joins rules against the host profile and conf overrides to produce per-rule Decision values (Apply / SkipByConf / SkipByTier / SkipByHostProfile). The audit then attaches live probes to each decision.
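A rough sketch of how such a resolver might combine tier, per-rule state, and host profile. The Decision names come from the paragraph above; the rule struct, the resolve function, and the precedence order (skip first, then tier, then host profile unless forced) are assumptions based on the documented meaning of default, force, and skip, not the actual kernsec code.

```go
package main

import "fmt"

// Decision values named in the post.
type Decision string

const (
	Apply             Decision = "Apply"
	SkipByConf        Decision = "SkipByConf"
	SkipByTier        Decision = "SkipByTier"
	SkipByHostProfile Decision = "SkipByHostProfile"
)

// rule is a trimmed-down, hypothetical stand-in for the real rule structs.
type rule struct {
	ID      string
	Tier    int
	Blocked bool // true when a host-profile signal gates this rule
}

// resolve applies the assumed precedence:
//  1. state "skip" always wins,
//  2. a rule above the configured tier is skipped,
//  3. a host-profile block is honored unless state is "force".
func resolve(r rule, confTier int, state string) Decision {
	if state == "skip" {
		return SkipByConf
	}
	if r.Tier > confTier {
		return SkipByTier
	}
	if r.Blocked && state != "force" {
		return SkipByHostProfile
	}
	return Apply
}

func main() {
	vsock := rule{ID: "KSEC-MOD-net.virt-001", Tier: 1, Blocked: true}
	fmt.Println(resolve(vsock, 1, "default"))
	fmt.Println(resolve(vsock, 1, "force"))
	fmt.Println(resolve(vsock, 1, "skip"))
}
```

Keeping the resolver a pure function of rule, tier, and state makes each audit row reproducible from the conf file and the host profile alone.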

Atomic file writes with backup — All writes are atomic (write to temp, rename). The first write backs up the original file as <path>.cfm-kernsec.bak. Subsequent writes warn if the file contains lines outside the managed set (operator-edited managed files) without blocking the apply.

Per-key sysctl apply — apply calls sysctl -w key=value per key rather than sysctl -p, so a single rejected key (kernel version mismatch, security module blocking the write) doesn’t stop the rest. Keys not exposed by the running kernel are detected via /proc/sys/<path> existence check and skipped at render time.

Multi-backend boot arg management — Three backends (BLS/grubby, legacy GRUB, Proxmox) with a clean interface. Each backend owns reading and writing the current and next-boot cmdline. The managed key set is authoritative: apply strips any stale instance of a managed key before appending the desired set, preventing accumulation of duplicate args across kernel updates.
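The strip-then-append behaviour can be sketched as a pure function over the cmdline string. mergeCmdline and argKey are hypothetical names; the sketch splits on whitespace, so it does not handle quoted arguments that contain spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// argKey returns the part of an argument before "=", so "init_on_alloc=1"
// and "init_on_alloc=0" count as the same managed key.
func argKey(arg string) string {
	k, _, _ := strings.Cut(arg, "=")
	return k
}

// mergeCmdline drops every argument whose key is in managedKeys, keeps all
// other arguments in their original order, and appends the desired set.
func mergeCmdline(current string, managedKeys, desired []string) string {
	managed := make(map[string]bool)
	for _, k := range managedKeys {
		managed[k] = true
	}
	var out []string
	for _, a := range strings.Fields(current) {
		if !managed[argKey(a)] {
			out = append(out, a)
		}
	}
	return strings.Join(append(out, desired...), " ")
}

func main() {
	current := "ro quiet init_on_alloc=0 slab_nomerge oops=panic"
	managedKeys := []string{"init_on_alloc", "slab_nomerge", "oops"}
	desired := []string{"slab_nomerge", "init_on_alloc=1"}
	fmt.Println(mergeCmdline(current, managedKeys, desired))
}
```

Passing the managed key set separately from the desired set lets a key that is no longer wanted (here oops=panic after dropping back to Tier 1) be stripped without being re-added.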

Cross-component registry — internal/managedsysctl tracks which CFM component owns which sysctl key. kernsec checks this before writing any key and renders EXT rows for keys owned elsewhere. This prevents two CFM components from writing conflicting values to the same sysctl.
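An ownership registry of this kind can be as small as a map from key to owning component. The sketch below is illustrative only: the registry type and its claim and external methods are hypothetical and say nothing about the real internal/managedsysctl API.

```go
package main

import "fmt"

// registry maps a sysctl key to the component allowed to write it.
type registry struct {
	owner map[string]string
}

func newRegistry() *registry {
	return &registry{owner: make(map[string]string)}
}

// claim records component as the owner of key. It fails if a different
// component already owns the key.
func (r *registry) claim(key, component string) error {
	if cur, ok := r.owner[key]; ok && cur != component {
		return fmt.Errorf("%s is owned by %s", key, cur)
	}
	r.owner[key] = component
	return nil
}

// external reports whether key is owned by a component other than self,
// which is the condition for rendering an EXT row.
func (r *registry) external(key, self string) bool {
	cur, ok := r.owner[key]
	return ok && cur != self
}

func main() {
	r := newRegistry()
	_ = r.claim("net.ipv4.tcp_syncookies", "cfm-sysctl-tweaks")
	_ = r.claim("kernel.kptr_restrict", "kernsec")

	fmt.Println(r.claim("net.ipv4.tcp_syncookies", "kernsec"))    // conflicting claim is rejected
	fmt.Println(r.external("net.ipv4.tcp_syncookies", "kernsec")) // audited, never written
	fmt.Println(r.external("kernel.kptr_restrict", "kernsec"))
}
```

Checking ownership before every write is what keeps the audit able to show another component's live value without ever overwriting it.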