Seccomp Filtering

Rustbox installs a seccomp-BPF filter on every sandbox. The kernel runs the filter on every syscall the sandboxed process makes, before the syscall is dispatched to its handler. This is layer 5 of the 8-layer isolation model.

Deny-list, not allowlist

We use a deny-list approach: block the specific syscalls that enable sandbox escape, allow everything else.

Why not an allowlist? Because the languages we support have complex runtime requirements:

  • Python calls 100+ different syscalls during import sys. An allowlist would break on version upgrades.
  • Java spawns threads and uses futex, mmap, and clone. The JVM's syscall profile changes between minor versions.
  • The C++ standard library probes for features (io_uring, statx) at startup. Killing a process for probing would terminate innocent programs.

A deny-list that targets the specific syscalls attackers need is more robust across language runtimes and version upgrades.

The 51-syscall deny-list

| Family | Syscalls | Action | Why |
|---|---|---|---|
| io_uring | io_uring_setup, io_uring_enter, io_uring_register | ERRNO(ENOSYS) | Kernel LPE history (CVE-2021-41073, CVE-2023-2598) |
| Tracing | ptrace | KILL | Cross-process inspection |
| Process memory | process_vm_readv, process_vm_writev | ERRNO(EPERM) | Runtime crash handlers probe these at startup |
| Kernel subsystems | bpf, userfaultfd, perf_event_open | KILL | eBPF loading, page fault interception, perf abuse |
| Module loading | kexec_load, kexec_file_load, init_module, finit_module, delete_module | KILL | Kernel module/boot manipulation |
| Mount/swap | mount, umount2, pivot_root, swapon, swapoff | KILL | Filesystem manipulation |
| New mount API | fsopen, fsmount, fsconfig, fspick, move_mount, open_tree, mount_setattr | KILL | Linux 5.2+ mount manipulation |
| Namespace escape | unshare, chroot, setns | KILL | Nested namespace creation, chroot escape |
| DAC bypass | name_to_handle_at, open_by_handle_at | KILL | File handle manipulation (the technique behind the 2014 "Shocker" container escape) |
| System clock | reboot, settimeofday, clock_settime, acct | KILL | System state manipulation |
| Kernel keyring | add_key, keyctl, request_key | KILL | Not namespaced (CVE-2016-0728) |
| NUMA | mbind, set_mempolicy, move_pages | KILL | Memory policy manipulation |
| Execution domain | personality | ERRNO(EPERM) | Blocks READ_IMPLIES_EXEC |

Three response modes

Not all blocked syscalls deserve the same response:

  • Errno(ENOSYS) for probe syscalls (io_uring). The process gets "not supported" and falls back to a safe alternative. This prevents unnecessary crashes when runtimes probe for kernel features.

  • Errno(EPERM) for diagnostic syscalls (process_vm_readv/writev, personality). Runtime crash handlers degrade gracefully instead of receiving SIGSYS.

  • KillProcess for exploit-class syscalls (ptrace, bpf, mount). Immediate process termination. No second chances.
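The mapping from syscall to response mode can be sketched as a plain lookup. This is a minimal model for illustration, not Rustbox's actual code: the names `Response`, `response_for`, and the three constant slices are hypothetical, the syscall lists are copied from the table above, and a real filter matches on syscall numbers rather than names.

```rust
/// Hypothetical model of the three response modes plus the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Response {
    Allow,       // not on the deny-list
    Enosys,      // ERRNO(ENOSYS): "not supported", caller falls back
    Eperm,       // ERRNO(EPERM): caller degrades gracefully
    KillProcess, // immediate termination
}

// Syscall names taken from the deny-list table.
const ENOSYS_SYSCALLS: &[&str] = &["io_uring_setup", "io_uring_enter", "io_uring_register"];

const EPERM_SYSCALLS: &[&str] = &["process_vm_readv", "process_vm_writev", "personality"];

const KILL_SYSCALLS: &[&str] = &[
    "ptrace",
    "bpf", "userfaultfd", "perf_event_open",
    "kexec_load", "kexec_file_load", "init_module", "finit_module", "delete_module",
    "mount", "umount2", "pivot_root", "swapon", "swapoff",
    "fsopen", "fsmount", "fsconfig", "fspick", "move_mount", "open_tree", "mount_setattr",
    "unshare", "chroot", "setns",
    "name_to_handle_at", "open_by_handle_at",
    "reboot", "settimeofday", "clock_settime", "acct",
    "add_key", "keyctl", "request_key",
    "mbind", "set_mempolicy", "move_pages",
];

/// Deny-list semantics: anything not explicitly listed is allowed.
fn response_for(syscall: &str) -> Response {
    if ENOSYS_SYSCALLS.iter().any(|&s| s == syscall) {
        Response::Enosys
    } else if EPERM_SYSCALLS.iter().any(|&s| s == syscall) {
        Response::Eperm
    } else if KILL_SYSCALLS.iter().any(|&s| s == syscall) {
        Response::KillProcess
    } else {
        Response::Allow
    }
}
```

For example, `response_for("io_uring_setup")` yields `Response::Enosys`, while an unlisted syscall such as `read` yields `Response::Allow`.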

Implementation

The BPF filter is built using the seccompiler crate (from AWS Firecracker). Four BPF programs are stacked:

  1. ENOSYS filter (io_uring probes)
  2. EPERM filter (diagnostic syscalls)
  3. KILL filter (exploit syscalls)
  4. clone(NEWUSER) argument filter (blocks user namespace creation)

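On each syscall the kernel runs all four filters and applies the most restrictive action any of them returns: kill takes precedence over an errno return, which takes precedence over allow.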
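The precedence rule can be modeled in a few lines. The sketch below is an illustration of how Linux combines the return values of stacked filters, not Rustbox code: the helper names `errno`, `action_rank`, and `most_restrictive` are hypothetical, the `SECCOMP_RET_*` values are those from `<linux/seccomp.h>`, and the errno numbers assume x86_64. The kernel compares the action bits of each return value as a signed 32-bit integer and keeps the lowest, which is why `SECCOMP_RET_KILL_PROCESS` (high bit set) always wins.

```rust
// Return-value encodings from <linux/seccomp.h>.
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;
const SECCOMP_RET_ERRNO: u32 = 0x0005_0000;
const SECCOMP_RET_ALLOW: u32 = 0x7fff_0000;
const SECCOMP_RET_ACTION_FULL: u32 = 0xffff_0000; // mask selecting the action bits

// errno numbers on x86_64 (assumption: other architectures may differ).
const EPERM: u32 = 1;
const ENOSYS: u32 = 38;

/// Build an ERRNO return value: action in the high bits, errno in the low 16.
fn errno(code: u32) -> u32 {
    SECCOMP_RET_ERRNO | (code & 0xffff)
}

/// Rank an action the way the kernel does: action bits reinterpreted as i32.
/// Lower rank means more restrictive.
fn action_rank(ret: u32) -> i32 {
    (ret & SECCOMP_RET_ACTION_FULL) as i32
}

/// Combine the results of several stacked filters for one syscall.
/// With no filters installed, the syscall is allowed.
fn most_restrictive(results: &[u32]) -> u32 {
    results
        .iter()
        .copied()
        .min_by_key(|&r| action_rank(r))
        .unwrap_or(SECCOMP_RET_ALLOW)
}
```

If three filters return allow and one returns `errno(EPERM)`, the syscall fails with EPERM. If any filter returns `SECCOMP_RET_KILL_PROCESS`, the process is killed regardless of what the others return.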

clone(NEWUSER) special case

User namespace creation (clone with CLONE_NEWUSER) gets a dedicated argument-level BPF filter rather than a block on clone itself, because clone is essential for process creation and threading. Only clone calls whose flags argument has the CLONE_NEWUSER bit set are blocked; all other flag combinations pass through.
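The argument-level check amounts to a bit test on the flags value. The following is a minimal sketch of that logic in plain Rust rather than BPF. It assumes the x86_64 calling convention, where flags is the first argument to clone; `Verdict` and `check_clone` are hypothetical names, and `Deny` stands in for whichever action the real filter returns, which is not specified here. The flag values are those from `<linux/sched.h>`.

```rust
// Flag values from <linux/sched.h>.
const CLONE_VM: u64 = 0x0000_0100;
const CLONE_THREAD: u64 = 0x0001_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;

/// Hypothetical verdict type; the real filter returns a seccomp action.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Allow,
    Deny,
}

/// Masked-equality test on the clone flags argument:
/// deny only when the CLONE_NEWUSER bit is set, whatever else is set.
fn check_clone(flags: u64) -> Verdict {
    if flags & CLONE_NEWUSER == CLONE_NEWUSER {
        Verdict::Deny
    } else {
        Verdict::Allow
    }
}
```

A thread-creation call such as `check_clone(CLONE_VM | CLONE_THREAD)` is allowed, while `check_clone(CLONE_NEWUSER | CLONE_VM)` is denied.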