अभिव्यक्ति

Building a container from scratch in Java: Part 3

Deep Dives: FFM, pivot_root, and the OCI Registry Protocol

This series

Code for this series

Parts 1 and 2 kept three topics deliberately shallow - enough to understand what the code does. However, we are curious developers and we demand a deeper understanding! In this post I intend to help you gain a firmer footing in three key topics that can help us build more tools like containers should the need arise.

The three sections are independent. If you read Part 1 and want to understand FFM or pivot_root in depth, you don't need to have read Part 2, and vice versa for the OCI section. The sections appear in the same order as the topics were introduced across the series.

The Foreign Function and Memory API

Why FFM exists

Before Java 22, calling a native function from Java meant one of three things. JNI (Java Native Interface) required writing C glue code, compiling it into a shared library, loading it with System.loadLibrary, and manually managing the boundary between the GC-managed heap and native memory - a process both error-prone and hostile to the JIT. JNA and JNR improved the ergonomics by using reflection to generate native call stubs at runtime, but they're third-party libraries, reflection-based, and not something the JIT can treat as a first-class call. The Foreign Function & Memory API - developed under Project Panama - was incubated across several releases before being finalised in Java 22, and it addresses all three problems: it's part of the JDK, it's statically typed, and the JIT can fully inline the call.

The API has four core concepts. Understanding each one, and how they compose, is enough to read any FFM code you encounter.

Linker: the ABI bridge

private static final Linker LINKER = Linker.nativeLinker();

Linker.nativeLinker() returns the platform's ABI-aware linker - the component that knows how to translate a Java method invocation into a native function call following the C calling convention for the current platform (System V AMD64 ABI on Linux x86-64, AAPCS64 on ARM64, and so on). It handles argument marshalling, return value extraction, and stack alignment. You create one per JVM; it's designed to be a long-lived, shared object.

The term "downcall" refers to a call going down from Java into native code. (The API also supports "upcalls" - native code calling back into Java - but we don't use them here.)

SymbolLookup: resolving function addresses

private static final SymbolLookup LOOKUP = LINKER.defaultLookup();

defaultLookup() searches the same shared libraries the C runtime would: on Linux, that's the libraries loaded into the current process - libc.so, libm.so, libpthread.so, and any others linked at JVM startup. The result is a SymbolLookup you can query by name:

LOOKUP.find("chroot")  // → Optional<MemorySegment>

The returned MemorySegment is a raw function pointer - the native address of chroot in libc. Optional is appropriate here: if you ask for a symbol that doesn't exist in the loaded libraries, you get an empty Optional rather than a crash. In Syscalls.java we call .orElseThrow() on every lookup, which is correct for functions we know must exist on the target platform.

For Linux-only symbols like unshare and pivot_root, the lookups are only attempted when IS_LINUX is true. On macOS, those handles are set to null and any attempt to call them throws UnsupportedOperationException before even reaching the handle invocation.
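That guard pattern is simple enough to sketch standalone. The following is a hedged reconstruction, not the project's exact code - the detection logic via os.name is an assumption, though IS_LINUX and requireLinux are names the source uses:

```java
public class PlatformGuard {
    // Assumption: Linux is detected from the os.name system property
    static final boolean IS_LINUX =
            System.getProperty("os.name").toLowerCase().contains("linux");

    // Called at the top of every Linux-only binding, before the
    // (possibly null) method handle is ever touched
    static void requireLinux(String feature) {
        if (!IS_LINUX) {
            throw new UnsupportedOperationException(
                    feature + " is only supported on Linux");
        }
    }
}
```

The point of checking before the handle invocation is that a null MethodHandle would otherwise surface as an opaque NullPointerException deep inside the call machinery.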

FunctionDescriptor: the type signature

FunctionDescriptor describes the signature of the native function in terms of ValueLayout types - the FFM equivalent of C types. The mapping is straightforward:

C type              ValueLayout
int                 JAVA_INT
long                JAVA_LONG
size_t              JAVA_LONG (on 64-bit platforms)
char*, any pointer  ADDRESS
void                (use ofVoid())

FunctionDescriptor.of(returnType, argType...) describes a function with a return value. FunctionDescriptor.ofVoid(argType...) describes a void function.
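Before looking at the project's descriptors, here is a complete, self-contained downcall that ties the four concepts together. It binds libc's strlen rather than anything from Syscalls.java, and it needs Java 22 or later:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class StrlenDemo {
    // size_t strlen(const char *s) - size_t maps to JAVA_LONG on 64-bit platforms
    private static final MethodHandle STRLEN = Linker.nativeLinker().downcallHandle(
            Linker.nativeLinker().defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    static long strlen(String s) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            // allocateFrom copies the UTF-8 bytes plus a NUL terminator
            // into native memory and returns a MemorySegment pointing at it
            return (long) STRLEN.invokeExact(arena.allocateFrom(s));
        }
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(strlen("hello")); // prints 5
    }
}
```

Every binding in Syscalls.java follows this same shape: lookup, descriptor, handle, then an arena-scoped invocation.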

Let's look at the descriptors for each syscall binding in Syscalls.java and what they correspond to in C:

// int chroot(const char *path)
FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS)

// int chdir(const char *path)
FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS)

// int unshare(int flags)
FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT)

// int sethostname(const char *name, size_t len)
FunctionDescriptor.of(ValueLayout.JAVA_INT,
        ValueLayout.ADDRESS, ValueLayout.JAVA_LONG)

// int mount(const char *source, const char *target,
//           const char *filesystemtype, unsigned long mountflags,
//           const void *data)
FunctionDescriptor.of(ValueLayout.JAVA_INT,
        ValueLayout.ADDRESS, ValueLayout.ADDRESS,
        ValueLayout.ADDRESS, ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS)

// int umount2(const char *target, int flags)
FunctionDescriptor.of(ValueLayout.JAVA_INT,
        ValueLayout.ADDRESS, ValueLayout.JAVA_INT)

// long syscall(long number, ...)  <- for pivot_root
FunctionDescriptor.of(ValueLayout.JAVA_LONG,
        ValueLayout.JAVA_LONG,
        ValueLayout.ADDRESS, ValueLayout.ADDRESS)

The syscall descriptor is worth calling out. The real syscall(2) is variadic - it takes a syscall number followed by up to six arguments of any type. FFM can model variadic calls (via Linker.Option.firstVariadicArg), but the simpler approach - and the one taken here - is to define a fixed descriptor for the specific arity and argument types you need. For pivot_root, that's two pointer arguments after the syscall number, so the descriptor has three parameters total.
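In practice that means minting one descriptor per syscall "shape" you need. A sketch (the constant names here are assumptions, not from the source):

```java
import java.lang.foreign.*;

public class SyscallShapes {
    // long syscall(long number, const char *a, const char *b)
    // - the pivot_root shape: number plus two pointers
    static final FunctionDescriptor TWO_POINTERS = FunctionDescriptor.of(
            ValueLayout.JAVA_LONG,
            ValueLayout.JAVA_LONG, ValueLayout.ADDRESS, ValueLayout.ADDRESS);

    // long syscall(long number)
    // - the shape of an argument-free syscall such as getpid
    static final FunctionDescriptor NO_ARGS = FunctionDescriptor.of(
            ValueLayout.JAVA_LONG,
            ValueLayout.JAVA_LONG);
}
```

Each shape gets its own downcallHandle; the kernel only sees a register-based call either way, so treating syscall as a family of fixed-arity functions is safe on Linux.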

MethodHandle and downcallHandle: the call site

LINKER.downcallHandle(address, descriptor) binds a function pointer to a type signature and returns a MethodHandle - the JDK's general-purpose mechanism for representing callable things. The handle encodes both where to call and how to marshal.

private static final MethodHandle CHROOT = LINKER.downcallHandle(
        LOOKUP.find("chroot").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS));

Invocation uses invokeExact:

return (int) CHROOT.invokeExact(arena.allocateFrom(path));

invokeExact is stricter than invoke: argument types must match the handle's type signature exactly, with no implicit widening, boxing, or unboxing. A MemorySegment passed where ADDRESS is expected - fine. A String passed directly - WrongMethodTypeException at runtime. This strictness is what allows the JIT to fully inline the call: it can see exactly what types are flowing through, generate a direct native call instruction, and eliminate all reflection overhead. Using invoke instead would reintroduce type uncertainty that blocks inlining.

The (int) cast on the return value is also load-bearing. invokeExact is signature-polymorphic: the cast determines the symbolic type of the call site, and that type must match the handle's type exactly. Without the cast the call site's return type would be Object, the types would no longer match, and the invocation would fail at runtime with WrongMethodTypeException. (The try/catch you'll see around every invocation exists for a different reason: invokeExact is declared to throw Throwable, which must be caught or propagated.)

Arena: native memory lifetime

Every string we pass to a syscall needs to become a null-terminated char* in native memory - outside the GC heap, at a stable address. Arena manages that memory:

public static int chroot(Arena arena, String path) {
    try {
        return (int) CHROOT.invokeExact(arena.allocateFrom(path));
    } catch (Throwable t) {
        throw new RuntimeException("chroot failed", t);
    }
}

arena.allocateFrom(path) allocates native memory for the UTF-8 encoded string plus a null terminator, copies the bytes in, and returns a MemorySegment pointing to it. The allocation lives until the arena closes.

The callers in LinuxRuntime and MacOSRuntime always open an Arena.ofConfined() and close it with try-with-resources:

public void setupFilesystem(String rootfs) {
    try (Arena arena = Arena.ofConfined()) {
        Syscalls.mount(arena, "none", "/", null, MS_REC | MS_PRIVATE, null);
        Syscalls.mount(arena, rootfs, rootfs, null, MS_BIND, null);
        // ... rest of the sequence
    } // all allocations freed here
}

ofConfined() creates an arena tied to the current thread - allocations can only be accessed from the thread that owns it, and it can only be closed from that thread. This is the right choice for a single-threaded syscall sequence: it's the fastest arena variant (no synchronisation overhead), and confining it to the thread makes the lifetime obvious. All allocations for the entire setupFilesystem sequence share one arena and are freed together when the method returns.
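The confinement rule is enforced, not advisory. A small demonstration (standalone, not from the project) of what happens when a second thread touches a confined arena's memory:

```java
import java.lang.foreign.*;

public class ConfinedDemo {
    // Returns true if a second thread is rejected when reading
    // memory owned by a confined arena
    static boolean crossThreadAccessFails() throws InterruptedException {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_LONG);
            seg.set(ValueLayout.JAVA_LONG, 0, 42L);

            final boolean[] threw = {false};
            Thread t = new Thread(() -> {
                try {
                    seg.get(ValueLayout.JAVA_LONG, 0); // not the owning thread
                } catch (WrongThreadException e) {
                    threw[0] = true;
                }
            });
            t.start();
            t.join();
            return threw[0];
        } // memory freed here, on the owning thread
    }
}
```

If you ever need to hand native memory across threads, Arena.ofShared() is the variant to reach for - at the cost of the synchronisation the confined arena avoids.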

The pivot_root binding: no libc wrapper

pivot_root(2) is the one syscall in Syscalls.java that doesn't have a libc wrapper. The glibc maintainers deliberately omitted it on the grounds that it's a container implementation detail not meant for general application use. The syscall exists in the kernel; you just have to call it through the syscall(2) trampoline:

private static final MethodHandle SYSCALL = LINKER.downcallHandle(
        LOOKUP.find("syscall").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_LONG,
                ValueLayout.JAVA_LONG,
                ValueLayout.ADDRESS, ValueLayout.ADDRESS));

public static long pivotRoot(Arena arena, String newRoot, String putOld) {
    requireLinux("pivot_root");
    try {
        return (long) SYSCALL.invokeExact(
                LinuxConstants.sysPivotRoot(),
                arena.allocateFrom(newRoot),
                arena.allocateFrom(putOld));
    } catch (Throwable t) {
        throw new RuntimeException("pivot_root failed", t);
    }
}

The syscall number itself is architecture-dependent and lives in LinuxConstants:

public static final long SYS_PIVOT_ROOT_X86_64  = 155L;
public static final long SYS_PIVOT_ROOT_AARCH64 = 217L;

public static long sysPivotRoot() {
    String arch = System.getProperty("os.arch");
    return switch (arch) {
        case "amd64", "x86_64" -> SYS_PIVOT_ROOT_X86_64;
        case "aarch64"         -> SYS_PIVOT_ROOT_AARCH64;
        default -> throw new UnsupportedOperationException(
                "Unsupported architecture for pivot_root: " + arch);
    };
}

These numbers come from the kernel's syscall_64.tbl and unistd.h and are stable across kernel versions for a given architecture - syscall numbers are ABI, they don't change. The switch expression here is idiomatic Java 14+: exhaustive, expression-form, no fallthrough.

Compared to the Go article's approach - which uses the syscall package's Syscall function with the same numeric constants - the FFM version is structurally identical. The binding work that cgo would do implicitly in Go is made explicit here, which is arguably clearer.

pivot_root - what it does and why chroot isn't enough

The problem with chroot

chroot(2) changes the process's view of the filesystem root. From the process's perspective, the directory you pass becomes /. Everything above it is invisible. That's exactly what we want - and it's also the problem.

chroot changes a view, not a reality. The old root filesystem is still mounted. A process with CAP_SYS_CHROOT capability can call chroot again to escape, or use directory traversal tricks (opening a file descriptor before the chroot, then using /proc/self/fd/ to navigate out) to reach the host filesystem. This isn't a theoretical attack: breaking out of a chroot jail is a well-documented technique. For a development convenience on macOS, chroot is fine. For genuine isolation, it's not the right tool.

pivot_root(2) operates at the mount namespace level. It doesn't just change the process's view of the root - it rewires the actual mount tree in the namespace. After pivot_root and a follow-up umount2 with MNT_DETACH, the old root filesystem is gone from the namespace entirely. There's nothing to escape to.

The preconditions

pivot_root is famously finicky. The man 2 pivot_root page lists several preconditions that must all be satisfied simultaneously, and getting any one of them wrong produces an unhelpful EINVAL. Let's go through what LinuxRuntime.setupFilesystem() does in order and why each step exists.

Step 1: Make the mount tree private

Syscalls.mount(arena, "none", "/", null, MS_REC | MS_PRIVATE, null);

Linux mount namespaces support mount propagation: changes in one namespace can propagate to others that share a peer group. The default propagation type when a new mount namespace is created via clone or unshare is to inherit the parent namespace's configuration - which is typically shared. If we leave the tree shared, our bind mount and pivot_root could leak back to the host's mount namespace.

MS_PRIVATE makes a mount point private: changes to it don't propagate out, and changes from peers don't propagate in. MS_REC applies this recursively to the entire subtree rooted at /. This step makes our mount namespace a sealed environment.

Step 2: Bind mount the rootfs onto itself

Syscalls.mount(arena, rootfs, rootfs, null, MS_BIND, null);

This is the step that trips people up most often. pivot_root has an unconditional requirement: new_root must be a mount point - not just a directory, but a directory that is the target of an active mount. Our rootfs directory, even if it's full of Alpine's filesystem contents, is just a directory in the host's filesystem tree. It's not a mount point.

The solution is a bind mount: mounting a directory onto itself. MS_BIND tells the kernel to create a new mount entry whose source and target are the same path. After this call, rootfs is both a directory and a mount point - it satisfies the requirement. This is an unusual thing to do, but it's the canonical approach and the kernel explicitly supports it.

Step 3: Create the directory for the old root

File oldRoot = new File(rootfs, "oldrootfs");
oldRoot.mkdirs();

pivot_root(new_root, put_old) takes two paths: where to find the new root, and where to put the old root after the swap. put_old must be an existing directory at or underneath new_root - it doesn't need to be a mount point itself. Since new_root is the rootfs directory, we create oldrootfs/ inside it. After pivot_root, the host's old root filesystem will be accessible at /oldrootfs relative to the new root.

Step 4: Call pivot_root

long rc = Syscalls.pivotRoot(arena, rootfs, oldRoot.getAbsolutePath());

This is the atomic operation. The kernel:

  1. Makes new_root the new root of the mount namespace
  2. Moves the old root mount to put_old
  3. Updates the process's root directory (/) to point at new_root

From this point forward, / refers to the Alpine rootfs. The host's filesystem is still mounted, but only at /oldrootfs.

Step 5: chdir to the new root

Syscalls.chdir(arena, "/");

After pivot_root, the process's current working directory is still pointing at the old location in the old mount. chdir("/") updates it to the new root. This is mandatory - the kernel doesn't update the working directory automatically, and leaving it on the old mount would hold a reference to the detached filesystem.

Step 6: Mount /proc

new File("/proc").mkdirs();
Syscalls.mount(arena, "proc", "/proc", "proc", 0, null);

We're now inside a PID namespace, and /proc reflects the PID tree of the namespace it was mounted in. The host's /proc was detached along with the old root. We mount a fresh proc filesystem, which gives us a view of only the container's processes - PID 1 is our process.

Step 7: Detach and clean up the old root

Syscalls.umount2(arena, "/oldrootfs", MNT_DETACH);
new File("/oldrootfs").delete();

umount2 with MNT_DETACH detaches the old root mount immediately, even if there are open file descriptors pointing into it. Those file descriptors continue to work until they're closed, at which point the kernel frees the mount. Without MNT_DETACH, umount2 would fail with EBUSY if anything still had a reference to the old root. After unmounting, the /oldrootfs directory is deleted from the new root's filesystem.

At this point the container is fully isolated: the old root filesystem is gone from the namespace, / is Alpine, /proc shows only container processes. The sequence is complete.

On mount flags for further reading

Mount propagation - the shared/slave/private/unbindable subtree model - is a topic deep enough to warrant its own post. The Linux kernel documentation covers it thoroughly in Documentation/filesystems/sharedsubtree.rst, and Michael Kerrisk's The Linux Programming Interface chapter on mount namespaces is the definitive reference. We may return to it in a future post.

The Docker Registry v2 Protocol

What "docker pull alpine" actually does

The Docker CLI makes a docker pull look atomic, but it's a five-step protocol against an HTTP API. RegistryClient.java implements all five steps using Java's built-in HttpClient. Let's go through each one.

Step 1: Get a Bearer token

Docker Hub requires authentication even for public images. The auth endpoint is separate from the registry:

public String getToken(ImageRef ref) throws IOException, InterruptedException {
    String url = ImageRef.AUTH_URL
            + "?service=registry.docker.io&scope=repository:"
            + ref.repository() + ":pull";

    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .GET()
            .build();

    HttpResponse<String> response =
            httpClient.send(request, HttpResponse.BodyHandlers.ofString());

    JsonObject json = JsonParser.parseString(response.body()).getAsJsonObject();
    return json.get("token").getAsString();
}

AUTH_URL is https://auth.docker.io/token. The scope parameter declares what the token will be used for - in this case, a pull from a specific repository. The response is JSON containing a short-lived Bearer token. This token is then passed as Authorization: Bearer <token> on all subsequent registry API calls.

For private registries, you'd pass credentials in the auth request. For Docker Hub public images, the anonymous token is sufficient.

Step 2: Fetch the manifest (and handle fat manifests)

The manifest describes the image: its layers, their digests, and the config. The complication is that a single tag like alpine:latest can resolve to different images on different architectures. Docker Hub handles this with a fat manifest (also called a manifest list or OCI image index): a top-level manifest that contains a list of platform-specific manifests, each identified by OS and architecture.

public JsonObject getManifest(ImageRef ref, String token)
        throws IOException, InterruptedException {

    String url = ref.registryUrl() + "/v2/" + ref.repository()
               + "/manifests/" + ref.tag();

    // Accept all manifest formats so the registry can send whichever it has
    String acceptHeader = String.join(",",
            MANIFEST_V2, MANIFEST_LIST_V2, OCI_MANIFEST, OCI_INDEX);

    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Accept", acceptHeader)
            .header("Authorization", "Bearer " + token)
            .GET()
            .build();

    HttpResponse<String> response =
            httpClient.send(request, HttpResponse.BodyHandlers.ofString());

    JsonObject manifest = JsonParser.parseString(response.body()).getAsJsonObject();
    String mediaType = manifest.has("mediaType")
            ? manifest.get("mediaType").getAsString() : "";

    // If we got a fat manifest, resolve it to the platform-appropriate one
    if (MANIFEST_LIST_V2.equals(mediaType) || OCI_INDEX.equals(mediaType)
            || manifest.has("manifests")) {
        String platformDigest = selectPlatformDigest(manifest);
        return fetchManifestByDigest(ref, platformDigest, token);
    }

    return manifest;
}

The four Accept types cover both the Docker v2 format and the OCI format, and both single-platform and multi-platform variants. Without specifying all four, some registries will return a 404 or an unexpected format.

selectPlatformDigest() walks the fat manifest's manifests array looking for an entry where platform.os == "linux" and platform.architecture matches the current JVM's architecture (mapped from Java's os.arch to Docker's naming convention: x86_64 → amd64, aarch64 → arm64). If no match is found, it falls back to the first entry with a warning.
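The os.arch-to-Docker mapping can be isolated into a tiny helper. This is a hypothetical sketch mirroring the behaviour described above, not the project's actual method:

```java
public class ArchMap {
    // Maps Java's os.arch values to Docker/OCI architecture names.
    // Unknown values pass through unchanged and are left for the
    // fallback logic to deal with.
    static String dockerArch(String osArch) {
        return switch (osArch) {
            case "amd64", "x86_64" -> "amd64";
            case "aarch64"         -> "arm64";
            default                -> osArch;
        };
    }
}
```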

A single-platform manifest looks like this (abbreviated):

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 3408729,
      "digest": "sha256:f18232174bc91741fdf3da96d85011ef..."
    }
  ]
}

The layers array is what we need. Each entry's digest is a sha256: prefixed content address that identifies the layer blob on the registry.

Step 3: Extract layer digests

public static List<String> extractLayerDigests(JsonObject manifest) {
    List<String> digests = new ArrayList<>();
    JsonArray layers = manifest.getAsJsonArray("layers");
    if (layers == null) return digests;
    for (JsonElement layer : layers) {
        digests.add(layer.getAsJsonObject().get("digest").getAsString());
    }
    return digests;
}

Layers are ordered: they must be applied bottom-to-top. The first layer is the base filesystem; subsequent layers apply diffs on top. ImageManager processes them in order, which is why the OCI layer format includes whiteout files.

Step 4: Download each layer blob

public void downloadBlob(ImageRef ref, String digest, String token, Path dest)
        throws IOException, InterruptedException {

    String url = ref.registryUrl() + "/v2/" + ref.repository()
               + "/blobs/" + digest;

    HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("Authorization", "Bearer " + token)
            .GET()
            .build();

    HttpResponse<InputStream> response = httpClient.send(request,
            HttpResponse.BodyHandlers.ofInputStream());

    try (InputStream is = response.body()) {
        Files.copy(is, dest, StandardCopyOption.REPLACE_EXISTING);
    }
}

BodyHandlers.ofInputStream() streams the response body directly to disk without buffering it in memory - important for large layer files. The URL pattern is registry/v2/<repository>/blobs/<digest>. The digest is both the identifier and an implicit integrity check: the SHA-256 hash of the blob's bytes must equal the hex portion of the digest. RegistryClient doesn't verify this explicitly, but production clients (including containerd) do.
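Adding that verification is a few lines of JDK-only code. A sketch of what it could look like - the class and method names are assumptions, not part of RegistryClient:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class DigestCheck {
    // Returns true if the blob's SHA-256 matches a "sha256:<64 hex chars>" digest
    static boolean verify(byte[] blob, String digest) throws NoSuchAlgorithmException {
        if (!digest.startsWith("sha256:")) {
            throw new IllegalArgumentException("unsupported digest algorithm: " + digest);
        }
        byte[] hash = MessageDigest.getInstance("SHA-256").digest(blob);
        return HexFormat.of().formatHex(hash)
                .equals(digest.substring("sha256:".length()));
    }
}
```

In a real client you'd hash the stream as it's copied to disk (DigestInputStream wraps this neatly) rather than re-reading the file afterwards.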

Docker Hub often responds with a redirect to a CDN URL for the actual blob download. The HttpClient is configured with followRedirects(HttpClient.Redirect.NORMAL), which handles this transparently.

Step 5: Extract layers with whiteout support

LayerExtractor.extractLayer() decompresses and unpacks each .tar.gz layer into the rootfs directory using Apache Commons Compress. The straightforward part is the unpacking: directories, regular files, symlinks, and hard links are all handled, with POSIX permissions preserved from the tar entry's mode field.

The less obvious part is whiteout files - the OCI format's mechanism for representing deletions. In a layered image, if layer 3 deletes a file that was created in layer 1, it can't simply omit the file from the layer tarball (layer 1 has already been extracted). Instead, it includes a special whiteout file:

private static final String WHITEOUT_PREFIX = ".wh.";
private static final String OPAQUE_WHITEOUT  = ".wh..wh..opq";

// In the extraction loop:
if (OPAQUE_WHITEOUT.equals(fileName)) {
    // Opaque whiteout: delete all existing contents of this directory
    clearDirectory(target.getParent());
    continue;
}

if (fileName.startsWith(WHITEOUT_PREFIX)) {
    // Regular whiteout: delete the named file
    String deleteName = fileName.substring(WHITEOUT_PREFIX.length());
    deleteRecursive(target.getParent().resolve(deleteName));
    continue;
}

A regular whiteout .wh.foo means "delete the file foo in the same directory." An opaque whiteout .wh..wh..opq means "this directory was recreated from scratch in this layer; delete everything that was here before." Production container runtimes handle whiteouts at the overlay filesystem level, where layers are stacked without extraction. Since JContainer extracts layers sequentially into a flat rootfs, it has to process whiteouts manually during extraction - which is what LayerExtractor does.
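The naming rules are easy to get subtly wrong, because the opaque marker also starts with the regular prefix. A standalone classifier makes the ordering explicit (names here are assumptions, not LayerExtractor's actual API):

```java
import java.util.Optional;

public class Whiteouts {
    enum Kind { NONE, OPAQUE, REGULAR }

    static Kind classify(String fileName) {
        // Order matters: ".wh..wh..opq" itself starts with ".wh.",
        // so the opaque check must come first
        if (".wh..wh..opq".equals(fileName)) return Kind.OPAQUE;
        if (fileName.startsWith(".wh."))     return Kind.REGULAR;
        return Kind.NONE;
    }

    // For a regular whiteout, the sibling file name to delete
    static Optional<String> deleteTarget(String fileName) {
        return classify(fileName) == Kind.REGULAR
                ? Optional.of(fileName.substring(".wh.".length()))
                : Optional.empty();
    }
}
```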

The ImageManager cache round-trip is the last piece of the protocol:

~/.jcontainer/cache/
└── library/
    └── alpine/
        └── latest/
            ├── rootfs/       <- extracted filesystem
            └── .complete     <- marker: this pull finished cleanly

The .complete marker is only written after all layers have been extracted successfully. On the next run with --image alpine:latest, ImageManager.pull() sees both the marker and the rootfs/ directory and returns immediately without hitting the network. If the JVM was killed mid-pull, the marker is absent and the next invocation deletes the partial directory and starts fresh.

One detail worth knowing for the future: this cache design doesn't share layer data between images. If you pull alpine:latest and alpine:3.19, and they share a base layer, the layer gets extracted twice. Production registries address this with content-addressable storage keyed by layer digest. That's a natural next step if you wanted to extend this further.

Closing thoughts on Java as a systems language

The three topics in this post share a common thread: they surface the mechanics that higher-level tools - Docker, runc, containerd - make invisible. The FFM API does the same thing at the language level: it removes the abstraction that used to hide native calls behind JNI, and makes the calling convention explicit.

Across the three posts, the standard library carries most of the weight. ProcessHandle for process management. HttpClient for registry communication. java.nio.file for cgroup control files and container state. The FFM API for syscalls. Nothing here required a non-JDK dependency except Gson for JSON parsing and Commons Compress for tar extraction - both narrow, well-understood choices.

The result is a codebase where the Linux kernel behaviour is legible in Java source. That's the practical outcome of FFM: not just that you can call native functions from Java, but that doing so no longer requires context-switching between languages to understand what the code is doing.