Building a container from scratch in Java: Part 2

Cgroups, veth Pairs, OCI Images, and Lifecycle


Part 1 built the isolation primitive: a process running in new Linux namespaces with a swapped-out root filesystem. That's the hard kernel part. What it left behind is everything you'd actually need to make a container runtime usable - resource limits, network connectivity, a way to get container images without manually preparing a rootfs, and any notion of a container's lifecycle beyond "it ran and exited."

Part 2 adds four independent features, designed to be implementable and readable in isolation. This post honours that: each section stands on its own, so jump around if one topic interests you more than the others.

The first thing worth looking at, though, is ContainerParent.run() - it's the orchestration hub that wires all four features together, and reading it top-to-bottom gives you the execution order before we go feature by feature.

The new orchestration hub: ContainerParent.run()

In Part 1, ContainerParent had one job: set up namespaces and spawn the child. In Part 2 it coordinates six concerns in sequence:

public static void run(ContainerRuntime runtime, String[] args) throws Exception {
    ContainerConfig config = ContainerConfig.parse(args);

    // 1. Pull image if --image was specified
    String rootfs = config.rootfs();
    if (config.hasImage()) {
        ImageRef ref = ImageRef.parse(config.image());
        Path rootfsPath = new ImageManager().pull(ref);
        rootfs = rootfsPath.toString();
    }

    // 2. Set up parent-side isolation (Linux: unshare UTS+MNT; macOS: no-op)
    runtime.setupParent();

    // 3. Build and spawn the child process
    String javaPath = resolveJavaPath();
    String classpath = resolveClasspath();
    List<String> childCmd = runtime.buildChildCommand(
            javaPath, classpath, rootfs, config.command(), config.networkEnabled());

    // 4. Create cgroup and apply resource limits before the child gets going
    CgroupManager cgroup = null;
    if (config.hasResourceLimits() && JContainer.isLinux()) {
        cgroup = new CgroupManager(CGROUP_ROOT);
        cgroup.create();
        if (config.memoryBytes() != null)  cgroup.setMemoryLimit(config.memoryBytes());
        if (config.cpuPercent() != null)   cgroup.setCpuLimit(config.cpuPercent());
    }

    // 5. Start the child, register it, tee its output to log files
    ProcessBuilder pb = new ProcessBuilder(childCmd);
    pb.redirectInput(ProcessBuilder.Redirect.INHERIT);
    Process process = pb.start();

    ContainerState state = ContainerState.create(
            rootfs, config.image(), config.command(), process.pid());
    ContainerRegistry registry = new ContainerRegistry();
    registry.register(state);

    Path containerDir = registry.getContainerDir(state.id());
    Thread stdoutThread = teeStream(process.getInputStream(), System.out,
            containerDir.resolve("stdout.log"));
    Thread stderrThread = teeStream(process.getErrorStream(), System.err,
            containerDir.resolve("stderr.log"));

    // 6. Add child to cgroup, set up networking (needs child PID for net namespace)
    if (cgroup != null) cgroup.addProcess(process.pid());

    NetworkManager network = null;
    if (config.networkEnabled() && JContainer.isLinux()) {
        network = new NetworkManager();
        network.setup(process.pid());
    }

    int exitCode = process.waitFor();
    stdoutThread.join(5000);
    stderrThread.join(5000);
    registry.updateStatus(state.id(), ContainerState.STATUS_EXITED, exitCode);
    System.exit(exitCode);
    // finally block cleans up network and cgroup
}

A few sequencing decisions here aren't obvious. The cgroup is created before the child process starts, but the child PID is added to it after process.start() - you need the PID to add a process, and the cgroup needs to exist before the child can accumulate uncapped resource usage. Networking setup also happens after process.start() for the same reason: NetworkManager.setup() takes the child's PID to find its network namespace via /proc/<pid>/ns/net. The tee threads are started immediately after the process, before cgroup and network setup, so no output is lost during those operations.

ContainerConfig.parse() handles the new flags with a simple left-to-right scan: flags beginning with -- are consumed as key-value pairs, and the first non-flag token starts the positional arguments. With --image, all positional arguments are the command (the rootfs comes from the image cache). Without it, the first positional argument is the rootfs path.
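That scan is simple enough to sketch in a few lines. The names below are hypothetical (the real ContainerConfig knows its full flag set and converts types); the one structural assumption is that --net is a boolean flag taking no value:

```java
import java.util.*;

// Minimal sketch of the left-to-right scan described above.
record ParsedArgs(Map<String, String> flags, List<String> positionals) {
    static ParsedArgs scan(String[] args) {
        Map<String, String> flags = new LinkedHashMap<>();
        int i = 0;
        while (i < args.length && args[i].startsWith("--")) {
            if (args[i].equals("--net")) {                    // boolean flag: no value
                flags.put("net", "true");
                i += 1;
            } else {                                          // key-value flag: --memory 256m
                flags.put(args[i].substring(2), args[i + 1]);
                i += 2;
            }
        }
        // First non-flag token and everything after it are positional.
        return new ParsedArgs(flags, Arrays.asList(args).subList(i, args.length));
    }
}
```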

Cgroups v2: resource limits as filesystem I/O

Cgroups v2 is a unified kernel resource-accounting hierarchy mounted at /sys/fs/cgroup/. Every entry in that hierarchy is a directory. Setting a limit means writing a number to a file. Adding a process to a cgroup means writing its PID to another file. There are no syscalls, no FFM bindings, no native library. It's plain filesystem I/O via java.nio.file.

The relevant part of the cgroup tree for JContainer looks like this:

/sys/fs/cgroup/
└── jcontainer/                    ← parent group, shared across all containers
    └── <container-id>/            ← per-container group
        ├── memory.max             ← memory limit in bytes
        ├── cpu.max                ← cpu quota and period in microseconds
        └── cgroup.procs           ← write a PID here to add a process

CgroupManager owns the full lifecycle: create, configure, add process, and clean up on close.

public void create() throws IOException {
    Files.createDirectories(cgroupPath);
    enableControllers();
}

private void enableControllers() throws IOException {
    // Enable cpu and memory controllers on the parent group.
    // Without this, writing to memory.max or cpu.max in the child group fails.
    Path subtreeControl = parentPath.resolve("cgroup.subtree_control");
    Files.writeString(subtreeControl, "+cpu +memory\n");
}

The enableControllers() call is easy to miss and frequently the source of "permission denied" errors in cgroup setups. Before a child cgroup can use a controller, that controller must be listed in the parent group's cgroup.subtree_control file. Writing +cpu +memory there activates both. Only then will memory.max and cpu.max exist in the child group.

Setting limits follows directly:

public void setMemoryLimit(long bytes) throws IOException {
    Files.writeString(cgroupPath.resolve("memory.max"), bytes + "\n");
}

public void setCpuLimit(int percent) throws IOException {
    // cpu.max format: "$QUOTA $PERIOD" in microseconds.
    // A 100ms period with a quota of N ms gives N% of one core.
    long period = 100_000;  // 100ms
    long quota  = (long) percent * 1_000;  // percent ms out of every 100ms
    Files.writeString(cgroupPath.resolve("cpu.max"), quota + " " + period + "\n");
}

public void addProcess(long pid) throws IOException {
    Files.writeString(cgroupPath.resolve("cgroup.procs"), pid + "\n");
}

The cpu.max format deserves a note. It's a quota/period pair in microseconds. A period of 100,000 µs (100ms) and a quota of 50,000 µs means the cgroup can use 50ms of CPU every 100ms - i.e., 50% of one core. If you set --cpu 200, the quota becomes 200,000 µs out of a 100ms period, which the scheduler interprets as two full cores. The arithmetic in setCpuLimit() - percent * 1_000 - converts from "percent of a core" to "microseconds per 100ms period."

CgroupManager implements AutoCloseable, so cleanup is guaranteed even if the container exits abnormally:

@Override
public void close() throws IOException {
    Files.deleteIfExists(cgroupPath);
    // Remove parent group only if no other containers are using it
    if (Files.isDirectory(parentPath) && isDirectoryEmpty(parentPath)) {
        Files.deleteIfExists(parentPath);
    }
}

The parent group cleanup is conditional - if two containers are running simultaneously, both will have child groups under jcontainer/, and the parent should only be removed when the last one exits. Resource limits are Linux-only; on macOS, ContainerParent logs a warning and skips cgroup setup entirely.
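The isDirectoryEmpty() helper isn't shown in the post; one plausible NIO implementation (an assumption, not necessarily the repo's exact code) is:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

class DirCheck {
    // Hypothetical helper: a directory is empty if listing it yields nothing.
    // try-with-resources closes the stream, releasing the directory handle.
    static boolean isDirectoryEmpty(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isEmpty();
        }
    }
}
```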

Usage at the CLI:

sudo java ... JContainer run --memory 256m --cpu 50 rootfs /bin/sh
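The 256m suffix implies a size parser somewhere in ContainerConfig. It isn't shown in the post; a plausible sketch, assuming binary multiples for k/m/g:

```java
class SizeParser {
    // Hypothetical: "256m" -> bytes. Bare numbers are taken as bytes.
    static long parseSize(String s) {
        s = s.trim().toLowerCase();
        long mult = switch (s.charAt(s.length() - 1)) {
            case 'k' -> 1024L;
            case 'm' -> 1024L * 1024;
            case 'g' -> 1024L * 1024 * 1024;
            default  -> 1L;
        };
        String digits = mult == 1 ? s : s.substring(0, s.length() - 1);
        return Long.parseLong(digits) * mult;
    }
}
```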

Networking: veth pairs and network namespaces

A veth pair is a virtual ethernet cable with two ends: packets sent into one end emerge from the other, regardless of which network namespace each end lives in. The pattern for container networking is: create a pair on the host, move one end into the container's network namespace, configure IP addresses on both, bring the interfaces up.

All of this is orchestrated by NetworkManager using the ip command - no FFM needed. buildSetupCommands() returns the full sequence:

List<String[]> buildSetupCommands(long childPid) {
    String pid    = Long.toString(childPid);
    String nsNet  = "/proc/" + pid + "/ns/net";

    return List.of(
        // Create the veth pair: hostDev <--> eth0
        new String[]{"ip", "link", "add", hostDev, "type", "veth", "peer", "name", CONTAINER_DEV},

        // Move the container end into the child's network namespace
        new String[]{"ip", "link", "set", CONTAINER_DEV, "netns", pid},

        // Configure the host end
        new String[]{"ip", "addr", "add", HOST_IP + SUBNET, "dev", hostDev},
        new String[]{"ip", "link", "set", hostDev, "up"},

        // Configure the container end via nsenter (we're on the host, not inside the container)
        new String[]{"nsenter", "--net=" + nsNet, "ip", "addr", "add", CONTAINER_IP + SUBNET, "dev", CONTAINER_DEV},
        new String[]{"nsenter", "--net=" + nsNet, "ip", "link", "set", CONTAINER_DEV, "up"},
        new String[]{"nsenter", "--net=" + nsNet, "ip", "link", "set", "lo", "up"},
        new String[]{"nsenter", "--net=" + nsNet, "ip", "route", "add", "default", "via", HOST_IP}
    );
}

The IP scheme is fixed: the host side gets 10.0.0.1/24, the container gets 10.0.0.2/24. nsenter --net=/proc/<pid>/ns/net is the key tool - it executes a command inside an existing network namespace by entering it via the /proc file descriptor, without the process being a child of the container. This is how the host configures the container's end of the veth pair without needing to be inside the container process tree.

setup() runs each command in sequence. If any command fails, an IOException is thrown, which ContainerParent catches and demotes to a warning - the container continues without networking rather than failing to start. Teardown on close() is a single command: deleting the host-side interface automatically removes the peer.

String[] buildCleanupCommand() {
    // Deleting the host end removes the container end automatically
    return new String[]{"ip", "link", "delete", hostDev};
}
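setup() itself isn't reproduced in the post, but its essential shape is a fail-fast loop over these arrays with ProcessBuilder. A sketch (the error-message format is an assumption):

```java
import java.io.IOException;
import java.util.List;

class CommandRunner {
    // Run each command in order; throw on the first non-zero exit so the
    // caller (ContainerParent) can catch the IOException and demote it to
    // a warning, as described above.
    static void runAll(List<String[]> commands) throws IOException, InterruptedException {
        for (String[] cmd : commands) {
            Process p = new ProcessBuilder(cmd).inheritIO().start();
            int code = p.waitFor();
            if (code != 0) {
                throw new IOException("command failed (exit " + code + "): "
                        + String.join(" ", cmd));
            }
        }
    }
}
```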

The network namespace itself is created by unshare --net in the child command, which LinuxRuntime.buildChildCommand() adds when networkEnabled is true. The NetworkManager is then responsible for populating that empty namespace with a working interface.

Usage:

sudo java ... JContainer run --net rootfs /bin/sh
# Inside the container:
/ # ip addr show eth0
/ # ping 10.0.0.1

OCI image support: pulling from the Docker registry

Up to this point, running a container required manually preparing a rootfs directory. Part 2 replaces that with --image:

sudo java ... JContainer run --image alpine:latest /bin/sh

Three classes collaborate to make this work. ImageRef parses the image reference string. RegistryClient speaks the Docker Registry v2 API. LayerExtractor unpacks each layer into a rootfs directory. ImageManager orchestrates all three and manages a local cache so images are only downloaded once.

ImageRef: parsing image references

Image references look simple but hide a surprising amount of ambiguity. alpine means registry-1.docker.io/library/alpine:latest. myorg/myapp:v2 means the myorg namespace on Docker Hub. ghcr.io/myorg/myapp:v2 has an explicit registry host. The parsing logic handles all of these:

public static ImageRef parse(String ref) {
    String registry  = DEFAULT_REGISTRY;   // registry-1.docker.io
    String namespace = DEFAULT_NAMESPACE;  // library
    String tag       = DEFAULT_TAG;        // latest
    String image;

    // Split off the tag (colon that has no slash after it)
    String namePart = ref;
    int colonIdx = ref.lastIndexOf(':');
    if (colonIdx > 0 && !ref.substring(colonIdx).contains("/")) {
        tag      = ref.substring(colonIdx + 1);
        namePart = ref.substring(0, colonIdx);
    }

    String[] parts = namePart.split("/");
    if (parts.length == 1) {
        image = parts[0];                         // "alpine"
    } else if (parts.length == 2) {
        if (parts[0].contains(".") || parts[0].contains(":")) {
            registry = parts[0]; image = parts[1]; // "ghcr.io/myapp"
        } else {
            namespace = parts[0]; image = parts[1]; // "myorg/myapp"
        }
    } else {
        registry  = parts[0];                      // "ghcr.io/myorg/myapp"
        image     = parts[parts.length - 1];
        namespace = String.join("/",
                Arrays.copyOfRange(parts, 1, parts.length - 1));
    }

    return new ImageRef(registry, namespace, image, tag);
}

ImageRef is a Java record, so the parsed components are immutable and the repository() and fullName() helper methods compose from them.
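The record itself might look like the following. The helper bodies are assumptions inferred from how they're used elsewhere in the post (cache-hit logging prints names like library/alpine:latest):

```java
// Assumed shape of the record; the real helpers may format differently.
record ImageRef(String registry, String namespace, String image, String tag) {
    public String repository() { return namespace + "/" + image; }   // "library/alpine"
    public String fullName()   { return repository() + ":" + tag; }  // "library/alpine:latest"
}
```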

ImageManager: pull with caching

public Path pull(ImageRef ref) throws IOException, InterruptedException {
    Path imageDir = cacheDir
            .resolve(ref.namespace()).resolve(ref.image()).resolve(ref.tag());
    Path rootfs = imageDir.resolve("rootfs");
    Path marker = imageDir.resolve(".complete");

    // Fast path: use the cache if a previous pull completed cleanly
    if (Files.exists(marker) && Files.isDirectory(rootfs)) {
        System.err.println("Using cached image: " + ref.fullName());
        return rootfs;
    }

    // Clean up any partial download before starting fresh
    if (Files.exists(imageDir)) deleteRecursive(imageDir);
    Files.createDirectories(rootfs);

    String token           = registryClient.getToken(ref);
    JsonObject manifest    = registryClient.getManifest(ref, token);
    List<String> digests   = RegistryClient.extractLayerDigests(manifest);

    for (int i = 0; i < digests.size(); i++) {
        String digest    = digests.get(i);
        Path   layerFile = imageDir.resolve("layers")
                                   .resolve(digest.replace(":", "_") + ".tar.gz");

        System.err.printf("  Layer %d/%d: %s%n",
                i + 1, digests.size(), digest.substring(0, 19));

        registryClient.downloadBlob(ref, digest, token, layerFile);
        layerExtractor.extractLayer(layerFile, rootfs);
        Files.deleteIfExists(layerFile);  // save disk space as we go
    }

    Files.createFile(marker);  // mark the pull as complete
    return rootfs;
}

The .complete marker file is a deliberate guard: if the JVM is killed mid-pull, the imageDir will exist but the marker won't, so the next invocation starts fresh rather than using a partial rootfs. Layers are extracted and deleted immediately after extraction to avoid holding all layer tarballs on disk simultaneously.
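deleteRecursive() isn't shown either; the standard NIO approach walks the tree deepest-first (a sketch, which may differ from the repo's version):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

class Recursive {
    // Sorting the walk in reverse order deletes children before parents.
    static void deleteRecursive(Path root) throws IOException {
        if (!Files.exists(root)) return;
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try { Files.delete(p); }
                catch (IOException e) { throw new UncheckedIOException(e); }
            });
        }
    }
}
```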

The registry protocol itself - auth tokens, fat manifests, blob downloads - is the interesting part behind RegistryClient. We cover it fully in Part 3 along with the layer extraction details.

Images are cached under ~/.jcontainer/cache/<namespace>/<image>/<tag>/rootfs. On Linux the rootfs is then used as the input to the same namespace and pivot_root machinery from Part 1. On macOS it's used with chroot. Image support works on both platforms.

Container lifecycle: IDs, state, and logs

Part 1 containers were ephemeral: start, run, exit, forget. Part 2 assigns each container an 8-character hex ID, persists its metadata to disk, captures its output, and adds list, stop, logs, and rm commands.

State: a record backed by JSON

ContainerState is a Java record that serializes to JSON:

public record ContainerState(
        String  id,
        long    pid,
        String  startTime,
        String  rootfs,
        String  image,
        String[] command,
        String  status,
        Integer exitCode
) {
    public static final String STATUS_RUNNING = "running";
    public static final String STATUS_EXITED  = "exited";
    public static final String STATUS_STOPPED = "stopped";

    public ContainerState withStatus(String newStatus, Integer newExitCode) {
        return new ContainerState(
                id, pid, startTime, rootfs, image, command, newStatus, newExitCode);
    }

    public void save(Path containerDir) throws IOException {
        Files.createDirectories(containerDir);
        Files.writeString(containerDir.resolve("metadata.json"), GSON.toJson(this));
    }
}

Records are a natural fit here: the state is immutable, transitions produce new instances (withStatus()), and the save() / load() methods make persistence a one-liner. Each container gets a directory at ~/.jcontainer/containers/<id>/ containing metadata.json, stdout.log, and stderr.log.

ContainerRegistry manages the state directory. Its listAll() method includes a liveness check: for any container marked running, it verifies the PID is still alive via ProcessHandle.of(pid).map(ProcessHandle::isAlive). If the process has died without updating its status (e.g. the parent JVM was killed), the registry corrects the status to exited before returning it.
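The check itself reduces to a one-liner on ProcessHandle; here it is isolated (the surrounding status-correction code is paraphrased, not quoted):

```java
class Liveness {
    // ProcessHandle.of() returns an empty Optional for a PID that no longer
    // exists, so both "dead" and "gone" map to false.
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }
}
```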

Capturing output: the tee stream

For logs to work, the container's stdout and stderr need to be captured to files while simultaneously appearing on the terminal. ContainerParent.teeStream() does this with a simple byte pump on a daemon thread:

static Thread teeStream(InputStream input, OutputStream terminal, Path logFile) {
    Thread thread = new Thread(() -> {
        try (OutputStream log = Files.newOutputStream(logFile)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = input.read(buffer)) != -1) {
                terminal.write(buffer, 0, n);
                terminal.flush();
                log.write(buffer, 0, n);
                log.flush();
            }
        } catch (IOException ignored) {
            // Stream closed - expected when the process exits
        }
    }, "tee-" + logFile.getFileName());
    thread.setDaemon(true);
    thread.start();
    return thread;
}

One tee thread per stream. ProcessBuilder with redirectInput(INHERIT) lets stdin flow directly to the container (needed for interactive shells), while stdout and stderr go through separate teeStream threads. After the process exits, ContainerParent joins both threads with a 5-second timeout to flush any buffered output before exiting.

Stopping a container

ContainerLifecycle.stop() follows the standard SIGTERM → wait → SIGKILL pattern:

public void stop(String id) throws IOException, InterruptedException {
    ContainerState state = registry.get(id);

    ProcessHandle ph = ProcessHandle.of(state.pid())
            .filter(ProcessHandle::isAlive)
            .orElseThrow(() -> new IOException("Container " + id + " is not running"));

    ph.destroy();  // SIGTERM

    long deadline = System.currentTimeMillis() + STOP_TIMEOUT_MS;  // 10 seconds
    while (ph.isAlive() && System.currentTimeMillis() < deadline) {
        Thread.sleep(100);
    }

    if (ph.isAlive()) {
        System.err.println("Container did not stop gracefully, forcing...");
        ph.destroyForcibly();  // SIGKILL
    }

    registry.updateStatus(id, ContainerState.STATUS_STOPPED, null);
}

ProcessHandle.destroy() sends SIGTERM, giving the container process a chance to clean up. If it hasn't exited after 10 seconds, destroyForcibly() sends SIGKILL. ProcessHandle is purely Java - no FFM, no native code. The JDK has had cross-platform process management since Java 9, and it does the job cleanly here.

The full lifecycle in action:

$ sudo java ... JContainer run --image alpine:latest /bin/sh &
Container a3f2b1c0 started (PID 84291)

$ sudo java ... JContainer list
ID         PID      IMAGE                STATUS     STARTED
a3f2b1c0   84291    library/alpine:...   running    2026-02-22T10:14:33Z

$ sudo java ... JContainer logs a3f2b1c0
[stdout from the container session]

$ sudo java ... JContainer stop a3f2b1c0
Stopping container a3f2b1c0 (PID 84291)...
Container a3f2b1c0 stopped.

$ sudo java ... JContainer rm a3f2b1c0
Removed container a3f2b1c0

Testing strategy

The test suite is designed so that unit tests run anywhere without root or a Linux kernel. The pattern that makes this possible is consistent constructor injection: CgroupManager, ContainerRegistry, NetworkManager, and ImageManager all accept their dependencies (a base path, an HttpClient, etc.) as constructor arguments, with the production defaults wired up in the no-arg constructor.

CgroupManagerTest passes a Files.createTempDirectory() as the cgroup root - no /sys/fs/cgroup/ access needed. NetworkManagerTest verifies the ip command sequences produced by buildSetupCommands() as plain string arrays, without executing them. ContainerRegistryTest writes state to a temp directory and reads it back. RegistryClientTest uses a mock HttpClient to verify request URLs and headers.
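Stripped of the test framework, the CgroupManagerTest pattern is just: point the code at a temp directory, perform the write, assert on the file's contents. Inlined here for illustration (the write mirrors setMemoryLimit() from above; the helper name is hypothetical):

```java
import java.io.IOException;
import java.nio.file.*;

class CgroupWriteDemo {
    // The "cgroup" under test is just a temp directory, and the limit is
    // just the contents of memory.max - no /sys/fs/cgroup/ access needed.
    static String writeAndReadMemoryMax(Path cgroupRoot, long bytes) throws IOException {
        Path cgroupPath = cgroupRoot.resolve("jcontainer").resolve("abc12345");
        Files.createDirectories(cgroupPath);
        Files.writeString(cgroupPath.resolve("memory.max"), bytes + "\n");
        return Files.readString(cgroupPath.resolve("memory.max")).trim();
    }
}
```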

Integration tests are in a separate source set, annotated with @Tag("integration"), and require a prepared rootfs and root privileges. They're enabled via a Maven profile (mvn verify -Pintegration) and use JUnit's @EnabledOnOs to gate Linux-specific tests:

@Test
@EnabledOnOs(OS.LINUX)
void testContainerHostname() throws Exception {
    // Runs the container and asserts hostname == "container"
}

This split keeps the default mvn test cycle fast and platform-independent, while still having end-to-end coverage available when you need it.

What's next

Part 2 brings JContainer to a point where it's recognisable as a container runtime: pull an image, apply resource limits, give it a network interface, track it with an ID, inspect its logs. What separates it from production runtimes like runc is a long list of things we've deliberately left aside: overlay filesystems (we unpack layers flat, losing the ability to share base layers across containers), Linux capabilities and seccomp filtering (our container inherits the parent's capability set), a daemon process, a client/server model, and image layer content-addressable storage.

Part 3 goes back inside the code for the three topics that warranted more depth than the survey posts could give: the FFM API's mechanics end to end, the precise sequence of kernel operations that make pivot_root work and why chroot isn't equivalent, and the Docker Registry v2 protocol that RegistryClient implements - auth tokens, fat manifests, and layer blobs - one step at a time.