Seccomp Filtering
Rustbox installs a seccomp-BPF filter on every sandbox. The kernel runs the filter at syscall entry, before the call is dispatched, so every syscall a sandboxed process makes is checked. This is layer 5 of the 8-layer isolation model.
Deny-list, not allowlist
We use a deny-list approach: block the specific syscalls that enable sandbox escape, allow everything else.
Why not an allowlist? Because the languages we support have complex runtime requirements:
- Python calls 100+ different syscalls during `import sys`. An allowlist would break on version upgrades.
- Java spawns threads, uses futex, mmap, clone - the JVM's syscall profile changes between minor versions.
- C++ standard library probes for features (io_uring, statx) at startup. Blocking probes kills innocent programs.
A deny-list that targets the specific syscalls attackers need is more robust across language runtimes and version upgrades.
The 51-syscall deny-list
| Family | Syscalls | Action | Why |
|---|---|---|---|
| io_uring | io_uring_setup, io_uring_enter, io_uring_register | ERRNO(ENOSYS) | Kernel LPE history (CVE-2021-41073, CVE-2023-2598) |
| Tracing | ptrace | KILL | Cross-process inspection |
| Process memory | process_vm_readv, process_vm_writev | ERRNO(EPERM) | Runtime crash handlers probe these at startup |
| Kernel subsystems | bpf, userfaultfd, perf_event_open | KILL | eBPF loading, page fault interception, perf abuse |
| Module loading | kexec_load, kexec_file_load, init_module, finit_module, delete_module | KILL | Kernel module/boot manipulation |
| Mount/swap | mount, umount2, pivot_root, swapon, swapoff | KILL | Filesystem manipulation |
| New mount API | fsopen, fsmount, fsconfig, fspick, move_mount, open_tree, mount_setattr | KILL | Linux 5.2+ mount manipulation |
| Namespace escape | unshare, chroot, setns | KILL | Nested namespace creation, chroot escape |
| DAC bypass | name_to_handle_at, open_by_handle_at | KILL | File handle manipulation (used in the 2014 "Shocker" container breakout) |
| System clock | reboot, settimeofday, clock_settime, acct | KILL | System state manipulation |
| Kernel keyring | add_key, keyctl, request_key | KILL | Not namespaced (CVE-2016-0728) |
| NUMA | mbind, set_mempolicy, move_pages | KILL | Memory policy manipulation |
| Execution domain | personality | ERRNO(EPERM) | Blocks READ_IMPLIES_EXEC |
Three response modes
Not all blocked syscalls deserve the same response:
- `Errno(ENOSYS)` for probe syscalls (io_uring). The process gets "not supported" and falls back to a safe alternative. This prevents unnecessary crashes when runtimes probe for kernel features.
- `Errno(EPERM)` for diagnostic syscalls (process_vm_readv/writev, personality). Runtime crash handlers degrade gracefully instead of receiving SIGSYS.
- `KillProcess` for exploit-class syscalls (ptrace, bpf, mount). Immediate process termination. No second chances.
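
The mapping from syscall to response mode can be pictured as a simple lookup. The sketch below is illustrative only, not Rustbox's actual code: the `Response` enum and `response_for` function are hypothetical names, only a subset of the deny-list is shown, and the errno constants are the usual Linux values.

```rust
/// Hypothetical model of the three response modes plus the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Response {
    Allow,
    Errno(i32),
    KillProcess,
}

// Linux errno values (assumed: generic/x86_64 numbering).
pub const EPERM: i32 = 1;
pub const ENOSYS: i32 = 38;

/// Maps a syscall name to its response. Anything not on the
/// deny-list is allowed, which is what makes this a deny-list.
pub fn response_for(syscall: &str) -> Response {
    match syscall {
        // Probe syscalls: report "not supported" so runtimes fall back.
        "io_uring_setup" | "io_uring_enter" | "io_uring_register" => {
            Response::Errno(ENOSYS)
        }
        // Diagnostic syscalls: fail softly instead of raising SIGSYS.
        "process_vm_readv" | "process_vm_writev" | "personality" => {
            Response::Errno(EPERM)
        }
        // Exploit-class syscalls (subset): terminate the process.
        "ptrace" | "bpf" | "mount" | "unshare" | "setns" => Response::KillProcess,
        // Default: allow.
        _ => Response::Allow,
    }
}
```

A real filter matches on syscall numbers inside BPF rather than on names in user space; the string match here only illustrates the grouping.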
Implementation
The BPF filter is built using the seccompiler crate (from AWS Firecracker). Four BPF programs are stacked:
- ENOSYS filter (io_uring probes)
- EPERM filter (diagnostic syscalls)
- KILL filter (exploit syscalls)
- clone(NEWUSER) argument filter (blocks user namespace creation)
The kernel runs all four filters on every syscall and applies the most restrictive result: KILL takes precedence over ERRNO, which takes precedence over allow.
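
How stacked filters combine can be modeled in a few lines. This is a simplified sketch of the kernel's precedence rule for the three actions used here, not kernel or Rustbox code; `Verdict`, `severity`, and `combine` are hypothetical names introduced for illustration.

```rust
/// Hypothetical model of one filter's return value for a syscall.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Allow,
    Errno(i32),
    KillProcess,
}

impl Verdict {
    /// Higher number = more restrictive.
    fn severity(self) -> u8 {
        match self {
            Verdict::Allow => 0,
            Verdict::Errno(_) => 1,
            Verdict::KillProcess => 2,
        }
    }
}

/// Every filter is consulted; the most restrictive verdict wins.
/// With no filters installed, the syscall is allowed.
pub fn combine(verdicts: &[Verdict]) -> Verdict {
    let mut result = Verdict::Allow;
    for &v in verdicts {
        if v.severity() > result.severity() {
            result = v;
        }
    }
    result
}
```

Because each of the four filters covers a disjoint set of syscalls, at most one of them returns something other than allow for a given call, so in practice the combination simply surfaces that one filter's decision.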
clone(NEWUSER) special case
User namespace creation (clone with CLONE_NEWUSER) gets a dedicated argument-level BPF filter rather than a blanket block on clone, because clone is essential for process creation and threading. The filter inspects the flags argument and blocks only calls that set CLONE_NEWUSER; every other clone call passes through.
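
The argument check amounts to a masked comparison on the flags value. The sketch below models that test in plain Rust; it is not the BPF program itself, `clone_is_blocked` is a hypothetical name, and it assumes an architecture such as x86_64 where flags is the first argument to clone. The flag constants are the values from `<linux/sched.h>`.

```rust
// Flag values from <linux/sched.h>.
pub const CLONE_VM: u64 = 0x0000_0100;
pub const CLONE_THREAD: u64 = 0x0001_0000;
pub const CLONE_NEWUSER: u64 = 0x1000_0000;

/// Returns true when a clone call would create a user namespace.
/// Masking isolates the CLONE_NEWUSER bit, so any other flags set
/// alongside it do not affect the decision.
pub fn clone_is_blocked(flags: u64) -> bool {
    (flags & CLONE_NEWUSER) == CLONE_NEWUSER
}
```

An ordinary thread creation such as `CLONE_VM | CLONE_THREAD` passes, while any flag set that includes `CLONE_NEWUSER` is caught. In seccompiler terms this likely corresponds to a masked-equality condition on argument 0, though the exact rule Rustbox uses is not shown in this section.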