Seccomp Filtering

Rustbox installs a seccomp-BPF filter in every sandbox. The kernel runs the filter on every syscall at entry, before the syscall is dispatched to its handler. This is layer 5 of the 8-layer isolation model.

Deny-list, not allowlist

We use a deny-list approach: block the specific syscalls that enable sandbox escape, allow everything else.

Why not an allowlist? Because the languages we support have complex runtime requirements:

Python makes 100+ different syscalls during `import sys`. An allowlist would break on version upgrades.
Java spawns threads and relies on `futex`, `mmap`, and `clone`; the JVM's syscall profile changes between minor versions.
The C++ standard library probes for features (`io_uring`, `statx`) at startup. Killing a process for a probe would terminate innocent programs.

A deny-list that targets the specific syscalls attackers need is more robust across language runtimes and version upgrades.
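The difference between the two policies can be sketched in a few lines. This is a minimal model, not Rustbox code: the function names and the syscall sets used with them are hypothetical, chosen only to show why a runtime upgrade that introduces one new syscall breaks an allowlist but not a deny-list.

```rust
use std::collections::HashSet;

/// Allowlist policy: a syscall passes only if it was listed in advance.
fn allowlist_permits(allowed: &HashSet<&str>, syscall: &str) -> bool {
    allowed.contains(syscall)
}

/// Deny-list policy: a syscall passes unless it is explicitly blocked.
fn denylist_permits(denied: &HashSet<&str>, syscall: &str) -> bool {
    !denied.contains(syscall)
}

/// Returns the syscalls from `observed` that the given policy would reject.
fn rejected<'a>(observed: &[&'a str], permits: impl Fn(&str) -> bool) -> Vec<&'a str> {
    observed.iter().copied().filter(|s| !permits(*s)).collect()
}
```

Suppose an allowlist was recorded as `read`, `write`, `mmap`, and a runtime upgrade starts calling `statx`. `rejected` reports `statx` under the allowlist policy, while a deny-list containing only `ptrace` and `bpf` rejects nothing from the same trace.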

The 51-syscall deny-list

Family	Syscalls	Action	Why
io_uring	`io_uring_setup`, `io_uring_enter`, `io_uring_register`	ERRNO(ENOSYS)	Kernel LPE history (CVE-2021-41073, CVE-2023-2598)
Tracing	`ptrace`	KILL	Cross-process inspection
Process memory	`process_vm_readv`, `process_vm_writev`	ERRNO(EPERM)	Runtime crash handlers probe these at startup
Kernel subsystems	`bpf`, `userfaultfd`, `perf_event_open`	KILL	eBPF loading, page fault interception, perf abuse
Module loading	`kexec_load`, `kexec_file_load`, `init_module`, `finit_module`, `delete_module`	KILL	Kernel module/boot manipulation
Mount/swap	`mount`, `umount2`, `pivot_root`, `swapon`, `swapoff`	KILL	Filesystem manipulation
New mount API	`fsopen`, `fsmount`, `fsconfig`, `fspick`, `move_mount`, `open_tree`, `mount_setattr`	KILL	Linux 5.2+ mount manipulation
Namespace escape	`unshare`, `chroot`, `setns`	KILL	Nested namespace creation, chroot escape
DAC bypass	`name_to_handle_at`, `open_by_handle_at`	KILL	File handle manipulation (the 2014 "Shocker" container escape)
System state	`reboot`, `settimeofday`, `clock_settime`, `acct`	KILL	System state manipulation
Kernel keyring	`add_key`, `keyctl`, `request_key`	KILL	Not namespaced (CVE-2016-0728)
NUMA	`mbind`, `set_mempolicy`, `move_pages`	KILL	Memory policy manipulation
Execution domain	`personality`	ERRNO(EPERM)	Blocks READ_IMPLIES_EXEC

Three response modes

Not all blocked syscalls deserve the same response:

Errno(ENOSYS) for probe syscalls (io_uring). The process gets "not supported" and falls back to a safe alternative. This prevents unnecessary crashes when runtimes probe for kernel features.
Errno(EPERM) for diagnostic syscalls (process_vm_readv/writev, personality). Runtime crash handlers degrade gracefully instead of receiving SIGSYS.
KillProcess for exploit-class syscalls (ptrace, bpf, mount). Immediate process termination. No second chances.
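The three modes can be modelled as a lookup from syscall name to response. This is an illustrative sketch: the `Response` enum and `response_for` function are hypothetical names, and the match arms cover only the syscalls named in this section rather than the full deny-list. The errno values are the Linux numbers for `EPERM` and `ENOSYS`.

```rust
/// What the filter does with a syscall (hypothetical model of the three modes).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Response {
    /// Not on the deny-list: the syscall proceeds normally.
    Allow,
    /// The syscall fails with this errno; the process keeps running.
    Errno(i32),
    /// The whole process is terminated.
    KillProcess,
}

const EPERM: i32 = 1; // "Operation not permitted"
const ENOSYS: i32 = 38; // "Function not implemented"

/// Maps a syscall name to its response. Only a subset of the deny-list is shown.
fn response_for(syscall: &str) -> Response {
    match syscall {
        // Probe syscalls: report "not supported" so the runtime falls back.
        "io_uring_setup" | "io_uring_enter" | "io_uring_register" => Response::Errno(ENOSYS),
        // Diagnostic syscalls: crash handlers see a permission error.
        "process_vm_readv" | "process_vm_writev" | "personality" => Response::Errno(EPERM),
        // Exploit-class syscalls: terminate immediately.
        "ptrace" | "bpf" | "mount" => Response::KillProcess,
        // Deny-list semantics: everything else is allowed.
        _ => Response::Allow,
    }
}
```

The final wildcard arm is what makes this a deny-list: any syscall not matched above falls through to `Allow`.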

Implementation

The BPF filter is built using the seccompiler crate (from AWS Firecracker). Four BPF programs are stacked:

ENOSYS filter (io_uring probes)
EPERM filter (diagnostic syscalls)
KILL filter (exploit syscalls)
clone(NEWUSER) argument filter (blocks user namespace creation)

The kernel runs all four filters on every syscall and applies the most restrictive result: KILL takes precedence over ERRNO, which takes precedence over allowing the call.
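The way stacked filter results combine can be sketched as follows. This models only the three actions used in this section (the kernel defines more), and the `Verdict`, `severity`, and `combine` names are hypothetical. Per seccomp(2), when several filters return the same action, the first one seen in evaluation order supplies the value, so the sketch replaces the running result only when a strictly more severe verdict appears.

```rust
/// Result returned by a single filter for one syscall (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Allow,
    Errno(i32),
    KillProcess,
}

/// Higher number means more restrictive.
fn severity(v: Verdict) -> u8 {
    match v {
        Verdict::Allow => 0,
        Verdict::Errno(_) => 1,
        Verdict::KillProcess => 2,
    }
}

/// Combines per-filter verdicts, given in the order the kernel evaluates them.
/// The most restrictive verdict wins; ties keep the first one seen.
fn combine(verdicts: &[Verdict]) -> Verdict {
    let mut result = Verdict::Allow;
    for &v in verdicts {
        if severity(v) > severity(result) {
            result = v;
        }
    }
    result
}
```

For an `io_uring_setup` call, only the ENOSYS filter objects, so the combined result is `Errno(38)`. If any filter in the stack returns `KillProcess`, that verdict wins regardless of what the others return.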

clone(NEWUSER) special case

User namespace creation (clone with CLONE_NEWUSER) gets a dedicated argument-level BPF filter rather than a blanket block on clone, because clone is essential for process creation and threading. Only calls whose flags argument has the CLONE_NEWUSER bit set are blocked; every other clone call passes through.
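The argument check amounts to a bit test on the flags value. The sketch below shows that logic in plain Rust rather than as a BPF program. The `clone_is_blocked` function is a hypothetical name, and how Rustbox expresses the condition in seccompiler is not shown here. The flag constants are the values from the Linux UAPI headers.

```rust
// Flag values from <linux/sched.h>.
const CLONE_VM: u64 = 0x0000_0100;
const CLONE_THREAD: u64 = 0x0001_0000;
const CLONE_NEWUSER: u64 = 0x1000_0000;

/// Returns true when a clone call should be blocked, i.e. when the
/// CLONE_NEWUSER bit is set, regardless of which other flags accompany it.
fn clone_is_blocked(flags: u64) -> bool {
    (flags & CLONE_NEWUSER) != 0
}
```

A thread-creation call such as `clone_is_blocked(CLONE_VM | CLONE_THREAD)` is allowed, while any flag set that includes `CLONE_NEWUSER` is blocked. Masking rather than comparing for equality matters here: an attacker can combine `CLONE_NEWUSER` with arbitrary other flags.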