Building a container from scratch in Java: Part 1
Syscalls Without JNI: Building the Core
This series
- Part 1 - Syscalls Without JNI: Building the Core (this post)
- Part 2 - Cgroups, veth Pairs, OCI Images, and Lifecycle
- Part 3 - Deep Dives: FFM, pivot_root, and the OCI Registry Protocol
I was really awed by Julian Friedman's article, Building a Container in Go, when it came out roughly 10 years back: it brought out the essence of a container and did away with the magic. After that, there were some live demos as well on building a tiny barebones container from scratch (I think a nice one was from Liz Rice).
When you get started with the container ecosystem, it feels like somebody conjured it up from the innards of the Linux kernel. Namespaces, cgroups, pivot_root, bind mounts - these aren't abstractions, they're kernel primitives reached through syscalls, and for a long time that put them in C territory. More recently, much of the Docker ecosystem was implemented in Go. Java simply didn't have a clean way to make arbitrary syscalls without writing JNI glue in C, compiling a shared library, and managing the impedance mismatch between the GC heap and native memory.
That changed with the Foreign Function & Memory (FFM) API, finalised in Java 22. It lets you look up a native symbol by name, describe its signature with Java-side layouts, and call it - all from pure Java, through a method handle that the JIT can optimise like other method-handle call sites. No C compiler in your build, no System.loadLibrary, no unsafe casts.
This is a 3-post series that builds a working container runtime in Java 25 to explore what that looks like in practice. The design closely follows the Go article linked above - same two-process pattern, same kernel primitives - so if you've read that article, you'll find the structure familiar. Our focus is the Java angle: where FFM fits, what it costs, and where the platform abstraction boundaries need to be. This series also goes beyond Julian Friedman's article by adding cgroups, veth pairs, OCI images, and basic container lifecycle.
We support Linux fully (namespaces, pivot_root, cgroups in Part 2) and MacOS in a degraded mode (chroot-only, useful for development). Both platforms use FFM for their respective syscalls.
Outline
- What a container actually is
- The two-process pattern
- The platform abstraction: ContainerRuntime
- Calling the kernel from Java: the FFM API
- The parent: namespace setup and child spawning
- The child: the filesystem isolation sequence
- MacOS: the same interface, different syscalls
- What's next
What a container actually is
Before touching any code, it's worth being precise about what we're building. A container is a process - just like any other ordinary process - running in a set of isolated Linux kernel namespaces with a different view of the filesystem root. There's no VM boundary, no hypervisor, no separate kernel. The kernel that runs your container is the same one running everything else on the machine.
Three namespaces do the work in a basic container:
- UTS - the hostname namespace. The process gets its own hostname without affecting the host.
- Mount (MNT) - the mount namespace. Mount and unmount operations inside the container don't propagate to the host.
- PID - the PID namespace. The first process in the namespace becomes PID 1, and the namespace has its own PID tree; from inside the container, you can't see host processes.
Filesystem isolation is handled separately: pivot_root(2) atomically swaps the root filesystem, replacing the host's root with the container's rootfs image. That's the complete picture covered in Part 1. No magic.
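To make the namespace idea concrete, here is a minimal sketch that prints which UTS, mount, and PID namespaces the current process belongs to. It is not part of the JContainer code; the class name NamespaceIds is made up for illustration. It assumes a Linux host with /proc mounted and simply returns an empty map elsewhere.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the JContainer sources.
final class NamespaceIds {

    /** Reads the namespace links of the current process, e.g. "uts" -> "uts:[4026531838]". */
    static Map<String, String> read() {
        Map<String, String> ids = new LinkedHashMap<>();
        for (String ns : List.of("uts", "mnt", "pid")) {
            Path link = Path.of("/proc/self/ns", ns);
            try {
                // Each entry is a symlink whose target names the namespace type and its inode number.
                ids.put(ns, Files.readSymbolicLink(link).toString());
            } catch (IOException | UnsupportedOperationException e) {
                // Not Linux, or /proc is unavailable: leave this entry out.
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        read().forEach((ns, id) -> System.out.println(ns + " -> " + id));
    }
}
```

Run it once on the host and once inside the container: two processes share a namespace exactly when the corresponding identifiers match.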
The two-process pattern
The entry point is JContainer.java, and its structure is worth understanding before anything else. The program re-invokes itself: the first run is "parent mode" (dispatched by run), and it spawns a child process that re-enters the same JAR in "child mode" (dispatched by child). Part 2 adds lifecycle commands on top.
public static void main(String[] args) {
    switch (args[0]) {
        case "run" -> {
            ContainerRuntime runtime = createRuntime();
            ContainerParent.run(runtime, args);
        }
        case "child" -> {
            ContainerRuntime runtime = createRuntime();
            ContainerChild.run(runtime, args);
        }
        case "list" -> new ContainerLifecycle().list();
        case "stop" -> new ContainerLifecycle().stop(args[1]);
        case "logs" -> new ContainerLifecycle().logs(args[1]);
        case "rm" -> new ContainerLifecycle().rm(args[1]);
        // ...
    }
}

static ContainerRuntime createRuntime() {
    return isLinux() ? new LinuxRuntime() : new MacOSRuntime();
}
The reason for the self-re-invocation is the same as in the Go article: namespace setup has to happen in two phases. The parent creates the namespaces and spawns a child inside them. The child configures the filesystem and execs the payload. These can't happen in a single process because of how PID namespaces work - more on that shortly.
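createRuntime() above relies on an isLinux() helper that the listing doesn't show. One plausible shape, assuming it keys off the os.name system property (the class name PlatformCheck and the overload taking a string are illustrative, not taken from the project):

```java
import java.util.Locale;

// Hypothetical helper; the real isLinux() in JContainer may differ.
final class PlatformCheck {

    /** Pure function so the decision can be tested without changing system properties. */
    static boolean isLinux(String osName) {
        return osName != null && osName.toLowerCase(Locale.ROOT).contains("linux");
    }

    /** Convenience overload that inspects the running JVM. */
    static boolean isLinux() {
        return isLinux(System.getProperty("os.name"));
    }

    public static void main(String[] args) {
        System.out.println("os.name=" + System.getProperty("os.name") + " linux=" + isLinux());
    }
}
```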
The platform abstraction: ContainerRuntime
Before any implementation, the project defines an interface that cleanly separates what needs to happen from how it's done on each platform.
public interface ContainerRuntime {

    /**
     * Build the command list to spawn the child process.
     * On Linux, this wraps with unshare --pid --fork for PID namespace.
     * On MacOS, this is a plain Java invocation.
     */
    List<String> buildChildCommand(String javaPath, String classpath,
                                   String rootfs, String[] command,
                                   boolean networkEnabled);

    /**
     * Set up the parent process before spawning the child.
     * On Linux, creates UTS and mount namespaces via unshare(2).
     * On MacOS, this is a no-op.
     */
    void setupParent();

    /**
     * Set up filesystem isolation in the child process.
     * On Linux, performs bind mount + pivot_root + proc mount.
     * On MacOS, performs chroot.
     */
    void setupFilesystem(String rootfs);

    /**
     * Set the container hostname.
     * On Linux, calls sethostname(2) in the UTS namespace.
     * On MacOS, this is skipped - it would affect the host.
     */
    void setHostname(String hostname);

    void execCommand(String[] command);
}
This pays dividends in a few ways. ContainerChild is completely platform-agnostic - it just calls runtime.setHostname() and runtime.setupFilesystem() without any platform checks. Tests for LinuxRuntime and MacOSRuntime can verify each implementation independently. And when you read ContainerParent, you're reading orchestration logic, not interleaved kernel detail.
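As an illustration of the testability point, here is a sketch of a recording test double. To stay self-contained it declares a reduced stand-in interface (ChildRuntime) with only the three methods the child calls, and it copies the orchestration from ContainerChild.run() shown later in this post. The names RecordingRuntimeSketch, ChildRuntime, and RecordingRuntime are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RecordingRuntimeSketch {

    /** Reduced stand-in for ContainerRuntime: only what the child process uses. */
    interface ChildRuntime {
        void setHostname(String hostname);
        void setupFilesystem(String rootfs);
        void execCommand(String[] command);
    }

    /** Test double that records calls instead of touching the kernel. */
    static final class RecordingRuntime implements ChildRuntime {
        final List<String> calls = new ArrayList<>();

        @Override public void setHostname(String hostname) { calls.add("setHostname:" + hostname); }
        @Override public void setupFilesystem(String rootfs) { calls.add("setupFilesystem:" + rootfs); }
        @Override public void execCommand(String[] command) { calls.add("execCommand:" + String.join(" ", command)); }
    }

    /** Same sequence as ContainerChild.run() in the article. */
    static void runChild(ChildRuntime runtime, String[] args) {
        String rootfs = args[1];
        String[] command = Arrays.copyOfRange(args, 2, args.length);
        runtime.setHostname("container");
        runtime.setupFilesystem(rootfs);
        runtime.execCommand(command);
    }

    /** Drives the orchestration against the recording double and returns what it saw. */
    static List<String> demo() {
        RecordingRuntime runtime = new RecordingRuntime();
        runChild(runtime, new String[] {"child", "rootfs", "/bin/sh", "-c", "true"});
        return runtime.calls;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because the orchestration never checks the platform, a test like this can assert on call order without root privileges or a Linux host.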
Calling the kernel from Java: the FFM API
At the core of Syscalls.java is a pattern that repeats for every syscall binding. Let's walk through it with chroot - the simplest binding, and the template for everything else.
private static final Linker LINKER = Linker.nativeLinker();
private static final SymbolLookup LOOKUP = LINKER.defaultLookup();

private static final MethodHandle CHROOT = LINKER.downcallHandle(
        LOOKUP.find("chroot").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS));

public static int chroot(Arena arena, String path) {
    try {
        return (int) CHROOT.invokeExact(arena.allocateFrom(path));
    } catch (Throwable t) {
        throw new RuntimeException("chroot failed", t);
    }
}
Three objects do the work. Linker.nativeLinker() is the bridge between Java and the platform's C ABI. Its defaultLookup() method returns a SymbolLookup that resolves function names to native addresses in the platform's commonly used libraries, which in practice means libc. FunctionDescriptor.of(JAVA_INT, ADDRESS) tells the linker the call signature: an integer return value and one pointer argument (the return layout comes first). downcallHandle produces a MethodHandle that, when invoked, marshals arguments, calls the native function, and returns the result.
The Arena parameter is the memory lifetime mechanism. arena.allocateFrom(path) converts a Java string to a null-terminated char* in native memory. When the Arena closes - via try-with-resources - that memory is freed. Using Arena.ofConfined() gives you a region tied to the current thread, which is the right choice here since all our syscall sequences run single-threaded.
One thing to note: invokeExact is intentional. Unlike invoke, it performs no implicit type coercion - argument types must match the MethodHandle's signature exactly. That strictness is what lets the JIT treat this as a near-direct native call. Any mismatch is a WrongMethodTypeException at runtime, not a silent bad cast.
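The same pattern can be tried without root by binding a harmless libc function. The sketch below binds strlen(3); it is not part of Syscalls.java, and the class name StrlenSketch is invented. It assumes a 64-bit platform, where size_t maps to JAVA_LONG.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class StrlenSketch {
    private static final Linker LINKER = Linker.nativeLinker();

    // size_t strlen(const char *s); size_t is assumed to be 64 bits wide here.
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LINKER.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long strlen(String s) {
        // The confined arena frees the native copy of the string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(s);
            // invokeExact: the static types (MemorySegment) -> long must match the descriptor exactly.
            return (long) STRLEN.invokeExact(cString);
        } catch (Throwable t) {
            throw new RuntimeException("strlen failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("container"));
    }
}
```

Dropping the (long) cast, or casting to int, compiles but fails at the call with WrongMethodTypeException, which is the strictness described above.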
For Linux-only syscalls, Syscalls.java wraps the initialization in a static block guarded by a platform check, and sets the handles to null on non-Linux systems. Any attempt to call them from MacOSRuntime throws UnsupportedOperationException. This is enforced at the implementation boundary, not scattered across call sites.
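A minimal sketch of that guard, under the assumption that it boils down to "bind only when supported, fail loudly when a null handle is used". The helper names bindIf and require are invented here and may not match Syscalls.java. getpid(2) is used for the demonstration because it exists on both Linux and MacOS.

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class GuardedHandles {
    private static final Linker LINKER = Linker.nativeLinker();

    /** Returns a downcall handle when the platform supports the symbol, otherwise null. */
    static MethodHandle bindIf(boolean supported, String symbol, FunctionDescriptor descriptor) {
        if (!supported) {
            return null;
        }
        return LINKER.downcallHandle(LINKER.defaultLookup().find(symbol).orElseThrow(), descriptor);
    }

    /** Single enforcement point: call sites never need their own platform checks. */
    static MethodHandle require(MethodHandle handle, String symbol) {
        if (handle == null) {
            throw new UnsupportedOperationException(symbol + " is not available on this platform");
        }
        return handle;
    }

    // Bound unconditionally because getpid exists on both supported platforms.
    private static final MethodHandle GETPID =
            bindIf(true, "getpid", FunctionDescriptor.of(ValueLayout.JAVA_INT));

    static int getpid() {
        try {
            return (int) require(GETPID, "getpid").invokeExact();
        } catch (Throwable t) {
            throw new RuntimeException("getpid failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println("pid=" + getpid());
    }
}
```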
We'll go much deeper on FFM in Part 3 - including the pivot_root binding, which has no libc wrapper and must go through the raw syscall(2) trampoline with architecture-specific syscall numbers. For now the pattern above is everything you need to understand how the rest of Syscalls.java works.
The parent: namespace setup and child spawning
ContainerParent.run() is the orchestration hub. In Part 1 it does three things: set up the parent's namespaces, build the child command, and spawn it. Here are LinuxRuntime.setupParent() and buildChildCommand():
@Override
public void setupParent() {
    int rc = Syscalls.unshare(CLONE_NEWNS | CLONE_NEWUTS);
    if (rc != 0) {
        throw new RuntimeException("unshare(CLONE_NEWNS | CLONE_NEWUTS) failed with rc=" + rc);
    }
}

@Override
public List<String> buildChildCommand(String javaPath, String classpath,
                                      String rootfs, String[] command,
                                      boolean networkEnabled) {
    List<String> cmd = new ArrayList<>();
    cmd.add("unshare");
    cmd.add("--pid");
    if (networkEnabled) {
        cmd.add("--net");
    }
    cmd.add("--fork");
    cmd.add(javaPath);
    cmd.add("--enable-native-access=ALL-UNNAMED");
    cmd.add("-cp");
    cmd.add(classpath);
    cmd.add("org.jcontainer.JContainer");
    cmd.add("child");
    cmd.add(rootfs);
    cmd.addAll(List.of(command));
    return cmd;
}
setupParent() calls unshare(2) via FFM with two flags: CLONE_NEWNS creates a new mount namespace, and CLONE_NEWUTS creates a new UTS namespace. After this call, any mount operations or hostname changes in this process (and its descendants) are isolated from the host.
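Syscalls.unshare() itself isn't shown in this post. A sketch of what such a binding could look like follows, with the flag values from <linux/sched.h>. The class name UnshareSketch and the lazy lookup are choices made for this example, not necessarily the project's. Calling it with namespace flags needs CAP_SYS_ADMIN, so the main method only prints the flag mask.

```java
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class UnshareSketch {
    // Namespace flags as defined in <linux/sched.h>.
    static final int CLONE_NEWNS  = 0x00020000; // mount namespace
    static final int CLONE_NEWUTS = 0x04000000; // hostname / domain name
    static final int CLONE_NEWPID = 0x20000000; // applies to future children only
    static final int CLONE_NEWNET = 0x40000000; // network stack

    /** Looked up lazily so that loading this class is harmless on non-Linux hosts. */
    private static MethodHandle unshareHandle() {
        Linker linker = Linker.nativeLinker();
        // int unshare(int flags);
        return linker.downcallHandle(
                linker.defaultLookup().find("unshare").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));
    }

    /** Returns 0 on success and -1 on failure, mirroring the C function. */
    static int unshare(int flags) {
        try {
            return (int) unshareHandle().invokeExact(flags);
        } catch (Throwable t) {
            throw new RuntimeException("unshare failed", t);
        }
    }

    /** The mask used by setupParent() in the article. */
    static int parentFlags() {
        return CLONE_NEWNS | CLONE_NEWUTS;
    }

    public static void main(String[] args) {
        System.out.printf("parent flags = 0x%08x%n", parentFlags());
    }
}
```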
The PID namespace is conspicuously absent here, and for good reason: unshare(CLONE_NEWPID) creates a new PID namespace for future children of the calling process, not for the calling process itself. The first process forked after that call becomes PID 1 of the new namespace. This means you can't call unshare(CLONE_NEWPID) from within setupParent() and expect the current process to be PID 1. The Go article gets its fork by passing clone flags when it re-executes itself; ProcessBuilder offers no such hook, so here the unshare(1) command with --pid --fork does the job: it creates the PID namespace, forks, and the forked child (our JVM) starts as PID 1. Hence buildChildCommand() returns a command list that starts with unshare --pid --fork.
The Java binary path and classpath are resolved at runtime, not hardcoded:
static String resolveJavaPath() {
    return ProcessHandle.current().info().command()
            .orElseThrow(() -> new RuntimeException("Cannot resolve Java binary path"));
}

static String resolveClasspath() {
    return System.getProperty("java.class.path");
}
ProcessHandle.current().info().command() returns the path of the JVM executable that launched the current process - exactly what you need to re-invoke the same JVM in the child.
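How ContainerParent actually launches the command list isn't shown in this post. A minimal sketch, assuming a plain ProcessBuilder with inherited stdin/stdout/stderr so an interactive shell in the container stays attached to the terminal (the class and method names are invented):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

final class SpawnSketch {

    /** Same approach as resolveJavaPath() in the article. */
    static String resolveJavaPath() {
        return ProcessHandle.current().info().command()
                .orElseThrow(() -> new RuntimeException("Cannot resolve Java binary path"));
    }

    /** Starts the command with the parent's standard streams and returns its exit code. */
    static int spawnAndWait(List<String> command) {
        try {
            Process child = new ProcessBuilder(command).inheritIO().start();
            return child.waitFor();
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to start " + command, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for child", e);
        }
    }

    public static void main(String[] args) {
        // Harmless stand-in for the real child command: re-invoke the same JVM with -version.
        int exit = spawnAndWait(List.of(resolveJavaPath(), "-version"));
        System.out.println("child exited with " + exit);
    }
}
```

In the real runtime, the list passed to spawnAndWait would be the result of buildChildCommand().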
The child: the filesystem isolation sequence
Once the child starts, it's already inside new UTS, mount, and PID namespaces. ContainerChild.run() delegates immediately to the runtime:
public static void run(ContainerRuntime runtime, String[] args) {
    // args: ["child", rootfs, command, arg1, arg2, ...]
    String rootfs = args[1];
    String[] command = Arrays.copyOfRange(args, 2, args.length);
    runtime.setHostname("container");
    runtime.setupFilesystem(rootfs);
    runtime.execCommand(command);
}
On Linux, setHostname() calls sethostname(2) via FFM and sets the container's UTS namespace hostname - visible inside the container but invisible on the host, because we're already in an isolated UTS namespace.
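The sethostname binding isn't listed in this post, so the following is a sketch under assumptions: it uses the Linux prototype int sethostname(const char *name, size_t len) with size_t mapped to JAVA_LONG, and pairs it with gethostname(2) so the result can be read back. The class name HostnameSketch is invented. Only gethostname is exercised in main, since sethostname needs CAP_SYS_ADMIN and would rename the host if run outside a UTS namespace.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

final class HostnameSketch {
    private static final Linker LINKER = Linker.nativeLinker();

    // Both functions take (pointer, size_t) and return int; looked up lazily by name.
    private static MethodHandle bind(String symbol) {
        return LINKER.downcallHandle(
                LINKER.defaultLookup().find(symbol).orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
    }

    /** Reads the hostname visible to this process (i.e. in its UTS namespace). */
    static String gethostname() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment buffer = arena.allocate(256); // zero-initialised, larger than HOST_NAME_MAX
            int rc = (int) bind("gethostname").invokeExact(buffer, 256L);
            if (rc != 0) {
                throw new IllegalStateException("gethostname failed with rc=" + rc);
            }
            return buffer.getString(0); // reads up to the NUL terminator
        } catch (Throwable t) {
            throw new RuntimeException("gethostname failed", t);
        }
    }

    /** Linux prototype assumed. Do not call outside an isolated UTS namespace. */
    static int sethostname(String name) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cName = arena.allocateFrom(name);
            long length = cName.byteSize() - 1; // length excludes the trailing NUL
            return (int) bind("sethostname").invokeExact(cName, length);
        } catch (Throwable t) {
            throw new RuntimeException("sethostname failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(gethostname());
    }
}
```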
The interesting work is in LinuxRuntime.setupFilesystem():
@Override
public void setupFilesystem(String rootfs) {
    try (Arena arena = Arena.ofConfined()) {
        // Make the mount tree private so changes don't propagate to the host
        check(Syscalls.mount(arena, "none", "/", null, MS_REC | MS_PRIVATE, null),
                "mount / as private");

        // Bind mount rootfs onto itself (pivot_root requires new_root to be a mountpoint)
        check(Syscalls.mount(arena, rootfs, rootfs, null, MS_BIND, null),
                "bind mount rootfs");

        // Create directory for the old root
        File oldRoot = new File(rootfs, "oldrootfs");
        if (!oldRoot.exists() && !oldRoot.mkdirs()) {
            throw new RuntimeException("Failed to create " + oldRoot);
        }

        // Swap the root filesystem
        long rc = Syscalls.pivotRoot(arena, rootfs, oldRoot.getAbsolutePath());
        if (rc != 0) {
            throw new RuntimeException("pivot_root failed with rc=" + rc);
        }

        // Move to the new root
        check(Syscalls.chdir(arena, "/"), "chdir /");

        // Mount /proc for process visibility
        new File("/proc").mkdirs();
        check(Syscalls.mount(arena, "proc", "/proc", "proc", 0, null),
                "mount proc");

        // Detach the old root and clean up
        check(Syscalls.umount2(arena, "/oldrootfs", MNT_DETACH), "umount2 oldrootfs");
        new File("/oldrootfs").delete();
    }
}
Each step here is necessary. The first mount call with MS_PRIVATE | MS_REC makes the entire mount tree private to this namespace - without it, subsequent mount operations could propagate back to the host. The bind mount of rootfs onto itself exists purely to satisfy a pivot_root precondition: new_root must be a separate mount point, not just a directory.
pivot_root(2) then swaps the root: the container's rootfs becomes /, and the host's old root lands at /oldrootfs (relative to the new root). The chdir("/") that follows is mandatory - the working directory may still point into the old root after the swap. /proc is mounted fresh because we're now inside a PID namespace and need a procfs that reflects it; the host's /proc is only reachable under /oldrootfs at this point and disappears when the old root is detached. Finally, umount2 with MNT_DETACH removes the old root - MNT_DETACH requests a lazy unmount, meaning "detach from the mount tree immediately even if busy; clean up when the last reference drops" - and the now-empty directory is removed.
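The listing above leans on a check() helper, flag constants, and null string arguments that this post doesn't define. A sketch of how those pieces might look, with flag values taken from the Linux headers. The names MountHelpersSketch and cString are invented, and check() throws IllegalStateException here purely as an example choice.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;

final class MountHelpersSketch {
    // mount(2) flags from <sys/mount.h> on Linux.
    static final long MS_BIND    = 0x1000;   // bind mount
    static final long MS_REC     = 0x4000;   // apply recursively
    static final long MS_PRIVATE = 0x40000;  // no propagation to or from peers
    // umount2(2) flag: lazy unmount.
    static final int MNT_DETACH  = 2;

    /** Maps a Java null to a C NULL pointer; otherwise copies the string into native memory. */
    static MemorySegment cString(Arena arena, String value) {
        return value == null ? MemorySegment.NULL : arena.allocateFrom(value);
    }

    /** Turns a non-zero return code into an exception that names the failing step. */
    static void check(int rc, String step) {
        if (rc != 0) {
            throw new IllegalStateException(step + " failed with rc=" + rc);
        }
    }

    public static void main(String[] args) {
        System.out.printf("MS_REC | MS_PRIVATE = 0x%x%n", MS_REC | MS_PRIVATE);
        try (Arena arena = Arena.ofConfined()) {
            System.out.println("fstype null -> " + cString(arena, null));
        }
    }
}
```

The NULL mapping matters for calls such as mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL), where the filesystem type and data arguments are intentionally absent.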
Part 3 covers the full mechanics of pivot_root in depth, including why chroot isn't a substitute, and exactly why each precondition exists.
MacOS: the same interface, different syscalls
MacOSRuntime uses chroot(2) and chdir(2) - both available via Linker.nativeLinker() on MacOS, through the same FFM pattern shown above. No namespace flags, no pivot_root. The child command list has no unshare prefix.
@Override
public void setupFilesystem(String rootfs) {
    try (Arena arena = Arena.ofConfined()) {
        int rc = Syscalls.chroot(arena, rootfs);
        if (rc != 0) {
            throw new RuntimeException("chroot(" + rootfs + ") failed with rc=" + rc);
        }
        rc = Syscalls.chdir(arena, "/");
        if (rc != 0) {
            throw new RuntimeException("chdir(/) failed with rc=" + rc);
        }
    }
}
The difference in buildChildCommand() is equally stark: LinuxRuntime prefixes unshare --pid --fork, MacOSRuntime goes straight to the Java invocation. The ContainerRuntime interface makes this comparison easy - same method, two implementations, no branching logic elsewhere.
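One limitation of the error handling in both runtimes is that a return code of -1 says nothing about why a call failed. The FFM API can capture errno as part of the downcall. The sketch below is an optional refinement rather than something the project is known to do; the class name ErrnoSketch is invented. It demonstrates the technique on chdir(2) with a path that is assumed not to exist.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.StructLayout;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.VarHandle;

final class ErrnoSketch {
    private static final Linker LINKER = Linker.nativeLinker();

    // Layout of the capture buffer and an accessor for its "errno" field.
    private static final StructLayout CAPTURE_LAYOUT = Linker.Option.captureStateLayout();
    private static final VarHandle ERRNO =
            CAPTURE_LAYOUT.varHandle(MemoryLayout.PathElement.groupElement("errno"));

    // With captureCallState, the handle gains a leading MemorySegment parameter for the buffer.
    private static final MethodHandle CHDIR = LINKER.downcallHandle(
            LINKER.defaultLookup().find("chdir").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS),
            Linker.Option.captureCallState("errno"));

    /** Returns 0 on success, otherwise the errno value left behind by chdir(2). */
    static int chdirErrno(String path) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment state = arena.allocate(CAPTURE_LAYOUT);
            MemorySegment cPath = arena.allocateFrom(path);
            int rc = (int) CHDIR.invokeExact(state, cPath);
            return rc == 0 ? 0 : (int) ERRNO.get(state, 0L);
        } catch (Throwable t) {
            throw new RuntimeException("chdir failed", t);
        }
    }

    public static void main(String[] args) {
        // ENOENT is 2 on both Linux and MacOS.
        System.out.println("errno=" + chdirErrno("/no/such/dir/jcontainer-demo"));
    }
}
```

Reading errno through the capture buffer is necessary because the JVM may make other native calls between the downcall returning and your code inspecting the thread-local errno.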
What MacOS mode gives you is a development environment where you can iterate on the Java code and test basic filesystem isolation without needing a Linux box. What it doesn't give you: PID isolation, a mount namespace, a UTS namespace, or the hostname change. setupParent() on MacOS just prints a warning and returns.
Running it
# Build
mvn clean package

# Linux: download an Alpine miniroot
./scripts/setup-rootfs.sh

# MacOS: create a minimal rootfs (requires Docker)
./scripts/setup-rootfs-MacOS.sh

# Run (Linux - requires root for namespace creation)
sudo java --enable-native-access=ALL-UNNAMED \
    -cp target/jcontainer-1.0-SNAPSHOT.jar \
    org.jcontainer.JContainer run rootfs /bin/sh

# Run (MacOS - requires root for chroot)
sudo java --enable-native-access=ALL-UNNAMED \
    -cp target/jcontainer-1.0-SNAPSHOT.jar \
    org.jcontainer.JContainer run rootfs /bin/sh
Once inside the shell, there are three things worth checking on Linux:
/ # hostname
container
/ # ps aux
PID USER COMMAND
1 root /bin/sh
6 root ps aux
/ # ls /
bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
The hostname is container because we called sethostname(2) in the UTS namespace. PID 1 is your shell, not the host's init, because we entered a new PID namespace. The filesystem root is Alpine's rootfs, not the host's. Exit the shell and you're back on the host, with none of these changes persisting.
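The --enable-native-access=ALL-UNNAMED flag grants native access to code on the classpath (the unnamed module). FFM entry points such as Linker.downcallHandle are restricted methods: on recent JDKs, calling them from the classpath without the flag prints a warning rather than failing, and the JDK's stated direction is to deny such access by default in a future release. Passing the flag to both the parent and the re-invoked child keeps the output clean now and the code working later. In production you'd want to package the runtime as a named module and grant access to that module only, rather than to everything on the classpath.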
What's next
The implementation we have seen in Part 1 gives us isolation, but three things are still missing that matter for real workloads.
First, the container process can consume unbounded memory and CPU. There's nothing preventing it from forking a thousand processes or allocating all available RAM. That's where cgroups v2 come in - a filesystem-based hierarchy under /sys/fs/cgroup/ that puts hard limits on resource usage.
Second, the container has no usable network. When the child is started with --net, it lands in a new network namespace that contains only a loopback interface, and that interface starts down. Connecting it to the host requires creating a virtual ethernet pair (a veth pair) and moving one end into the container's namespace.
Third, you have to prepare the rootfs yourself before running anything. Part 2 adds OCI image support - specifying --image alpine:latest instead of a rootfs path, with JContainer pulling and caching the image via the Docker Registry v2 API using Java's built-in HttpClient.
Part 2 covers all three, plus container lifecycle management: IDs, state persistence, list, stop, and logs.