Virtualizing Syscalls with Dynasaur: Extending Linux into the Future

August 20, 2024

Our article “Is POSIX Outdated in the Cloud Era?” spurred a bit of discussion in the developer community (especially in our native field of high-performance cloud computing), and these discussions have solidified our position: while POSIX within Linux is here to stay, we’re not the only ones who want to innovate beyond what the Linux kernel is currently capable of.

So what can be done? We think that virtualizing syscalls at the userspace level is the most sensible path to making Linux and POSIX extensible without having to interfere with the standard.

What does “virtualizing syscalls” mean?

Applications interact with the host operating system by using syscalls. Everything a program sees about its parent environment is provided through syscalls: how much memory it’s using, what networking interfaces are present, IP addresses, disks, files… all interactions between a piece of software and the host are done via syscalls.

Virtualizing syscalls means adding a layer of abstraction between the application and the operating system, intercepting each interaction so that the underlying behavior can be altered, without the application knowing.
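As a rough mental model only, the layer described above can be pictured as a dispatch table that sits between the application and the host. The sketch below is a toy in Python, not how Dynasaur or any real interceptor works: the `SyscallLayer` class and its `virtualize` and `call` methods are hypothetical names invented for illustration, and the "syscalls" are just the standard library wrappers `os.getpid` and `os.getcwd`.

```python
import os
from typing import Any, Callable, Dict


class SyscallLayer:
    """Toy model of a syscall virtualization layer (hypothetical, illustration only)."""

    def __init__(self) -> None:
        # Calls with no registered handler are forwarded to the real OS wrapper.
        self._passthrough: Dict[str, Callable[..., Any]] = {
            "getpid": os.getpid,
            "getcwd": os.getcwd,
        }
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def virtualize(self, name: str, handler: Callable[..., Any]) -> None:
        """Replace the behavior of one call without the caller knowing."""
        self._handlers[name] = handler

    def call(self, name: str, *args: Any) -> Any:
        """Entry point the 'application' uses instead of talking to the host."""
        handler = self._handlers.get(name)
        if handler is not None:
            return handler(*args)
        return self._passthrough[name](*args)


layer = SyscallLayer()
real_pid = layer.call("getpid")        # forwarded to the host unchanged
layer.virtualize("getpid", lambda: 1)  # the application now believes it is PID 1
virtual_pid = layer.call("getpid")
```

The application-side code is identical before and after `virtualize` is called; only the answer changes, which is the property the rest of this article relies on.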

Wine is a well-known implementation of this: when a Windows application is run on Linux using Wine, its syscalls are intercepted, and every interaction it has with “Windows” is ersatz. Windows isn’t there, and the application is none the wiser that it is running in a totally different environment from the one it was written for.

Extending this concept to the whole POSIX stack and implementing fully virtualized syscalls across the Linux kernel would be a powerful tool for developers, as we’ve proven with our own implementation for cunoFS.

Why not just implement virtualized syscalls on a per-application basis?

There are a few familiar methods that provide outcomes similar to what we are proposing, but none of them offers the full abstraction needed to change what happens behind the scenes without changing how POSIX applications behave.

LD_PRELOAD doesn’t work in many cases, including semi-static and static binaries (static binaries have no need to resolve symbols dynamically, so the dynamic loader is not invoked).
ptrace is slow, and requires special privileges to work in a container.
gVisor can use KVM machine virtualization and is much faster than ptrace, but still needs special privileges, and runs its own guest kernel to do it.
Syscall User Dispatch is available on newer Linux kernels but is limited to x86_64 processors and requires switching to and from the kernel at the cost of performance.
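The LD_PRELOAD limitation above can be checked mechanically: preloading is handled by the dynamic loader, and an ELF binary only invokes that loader if it carries a PT_INTERP program header. The sketch below is a minimal check under stated assumptions: it handles little-endian ELF64 images only, the function names `requests_dynamic_loader` and `synthetic_elf` are our own, and the demo data is a header-only synthetic image rather than a real binary. A PT_INTERP segment is necessary for LD_PRELOAD to apply, but not sufficient (for example, code that issues raw syscall instructions bypasses any preloaded wrappers).

```python
import struct

ELF_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")  # 64-byte ELF64 file header
PROGRAM_HEADER = struct.Struct("<IIQQQQQQ")      # 56-byte ELF64 program header
PT_LOAD = 1    # ordinary loadable segment
PT_INTERP = 3  # segment naming the dynamic loader


def requests_dynamic_loader(image: bytes) -> bool:
    """Return True if a little-endian ELF64 image has a PT_INTERP segment."""
    if len(image) < ELF_HEADER.size:
        raise ValueError("too short to be an ELF64 file")
    fields = ELF_HEADER.unpack_from(image, 0)
    ident = fields[0]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    phoff, phentsize, phnum = fields[5], fields[9], fields[10]
    for index in range(phnum):
        # p_type is the first 32-bit field of each program header.
        (p_type,) = struct.unpack_from("<I", image, phoff + index * phentsize)
        if p_type == PT_INTERP:
            return True
    return False


def synthetic_elf(segment_types):
    """Build a header-only ELF64 image with the given segment types (demo data)."""
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 64-bit, little-endian, v1
    header = ELF_HEADER.pack(
        ident, 2, 62, 1, 0, ELF_HEADER.size, 0, 0,
        ELF_HEADER.size, PROGRAM_HEADER.size, len(segment_types), 0, 0, 0,
    )
    segments = b"".join(
        PROGRAM_HEADER.pack(p_type, 0, 0, 0, 0, 0, 0, 0) for p_type in segment_types
    )
    return header + segments


dynamic_image = synthetic_elf([PT_LOAD, PT_INTERP, PT_LOAD])
static_image = synthetic_elf([PT_LOAD, PT_LOAD])
```

To try it on a real program, read the file's bytes (for example with `open(path, "rb").read()`) and pass them to `requests_dynamic_loader`.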

The goal of broadly adopting syscall virtualization across the entire Linux kernel should be to provide a single “liquid operating system” distributed across multiple nodes, one that presents a consistent environment floating on top of any kind of hardware (running anywhere) to all applications, not just those that have been specifically written to implement one of the above solutions.

How cunoFS virtualizes syscalls 🦖

cunoFS has created a lightweight way of intercepting and virtualizing Linux syscalls to implement our filesystem compatibility layer. Using this, we can make object storage (which uses an HTTP API) appear as a local filesystem to any POSIX application, without actually mounting a filesystem on the machine.
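To make the idea concrete, here is a minimal sketch of path-based routing at the `open` level, written in Python purely for illustration. It is not cunoFS code: the `/objects/` prefix, the `virtual_open` and `fetch_object` functions, and the in-memory dictionary standing in for an object store are all assumptions made up for this example. A real implementation would intercept the underlying syscalls and issue HTTP requests instead.

```python
import io
import os
import tempfile

# Hypothetical prefix and in-memory "bucket"; neither reflects cunoFS's real layout.
OBJECT_PREFIX = "/objects/"
FAKE_OBJECT_STORE = {"demo-bucket/hello.txt": b"hello from object storage\n"}


def fetch_object(key: str) -> bytes:
    """Stand-in for an HTTP GET against an object store."""
    try:
        return FAKE_OBJECT_STORE[key]
    except KeyError:
        raise FileNotFoundError(key) from None


def virtual_open(path: str):
    """Open `path` for binary reading, serving object-store paths from memory."""
    if path.startswith(OBJECT_PREFIX):
        return io.BytesIO(fetch_object(path[len(OBJECT_PREFIX):]))
    return open(path, "rb")  # everything else goes to the real filesystem


# The caller uses the same code for both kinds of path.
with virtual_open("/objects/demo-bucket/hello.txt") as remote:
    remote_data = remote.read()

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"local bytes")
with virtual_open(tmp.name) as local:
    local_data = local.read()
os.unlink(tmp.name)
```

The point of the sketch is that the calling code cannot tell which branch served the data, and nothing was mounted to make the object path resolvable.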

We’ve called our syscall virtualization technology Dynasaur. It’s incredibly fast, using lightweight dynamic binary instrumentation built on our own ELF loader. Dynasaur syscall interception works with any POSIX application, does not require any special or elevated privileges, and operates entirely within user mode, without switching back and forth with the kernel.

While products such as DynamoRIO, Dyninst, and Pin also use dynamic binary instrumentation, our testing has shown them to be slow and prone to crashing. Dynasaur can intercept 100% of syscalls, and could potentially serve as the foundation of a general API for syscall virtualization (what we should do with our implementation moving forward is something we discuss later in this article).

Currently, Dynasaur works only with x86_64, but we are working on an ARM64 (AArch64) version.

What would virtualizing syscalls let developers do?

Beyond expanding the flexibility of filesystems in POSIX environments, we think distributed computing and the development of high-performance workloads would benefit greatly from implementing virtualized syscalls across the POSIX toolchain:

Liquid OS: Imagine if your processes had the same freedoms as virtual machines and were not limited to a single machine: they could be launched on, or migrated seamlessly between, many machines, whether local or remote, while all appearing in the same OS namespace. Memory-intensive jobs could move to machines with more memory, and CPU-intensive jobs to machines with more free cores or to spot instances.
Optimize high-performance computing workloads: The system could detect when a program is trying to open a TCP/IP connection to a machine in the same data center, and seamlessly replace this with a much faster InfiniBand link instead, completely invisible to the program.
Create better software development and debugging tools: Creating reproducible development environments is a challenge that would be solved with syscall virtualization, letting developers exactly reproduce what happened during a program’s execution, right down to the kernel level. Buggy behavior could be replayed exactly as it occurred, making debugging environments deterministic.
Scale and move workloads from machine to machine without interruption: Running workloads could be quickly migrated from one machine to another (or saved and resumed without interruption, even if the application is not designed to do so). For example, running programs (or entire interdependent stacks) could be moved to affordable cloud spot instances when they are available, and safely moved back if they are about to be deprovisioned.
Optimize applications for distributed environments: You could allocate memory to a program from a different machine (again, without the application being any the wiser), run processes on different machines as if they were on the same machine, and let any application run on distributed resources, even if they weren’t designed for it.
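The record-and-replay idea in the debugging item above can be sketched in a few lines. This is a toy model in Python under stated assumptions, not a description of Dynasaur: the `Recorder` and `Replayer` classes and the `program` function are hypothetical, and the "syscalls" are the standard library wrappers `time.time` and `os.getpid`. The first run executes each call for real and logs the result; the second run is fed the logged results and never touches the host.

```python
import os
import time
from typing import Any, Callable, List, Tuple


class Recorder:
    """Run each call for real and log (name, result) pairs."""

    def __init__(self) -> None:
        self.log: List[Tuple[str, Any]] = []

    def call(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        result = fn(*args)
        self.log.append((name, result))
        return result


class Replayer:
    """Serve results from a recorded log instead of touching the host."""

    def __init__(self, log: List[Tuple[str, Any]]) -> None:
        self._pending = list(log)

    def call(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        recorded_name, result = self._pending.pop(0)
        if recorded_name != name:
            raise RuntimeError(
                f"replay diverged: expected {recorded_name}, got {name}"
            )
        return result  # fn is deliberately never invoked during replay


def program(env) -> str:
    """A nondeterministic 'application': its output depends on the clock and PID."""
    now = env.call("time", time.time)
    pid = env.call("getpid", os.getpid)
    return f"pid {pid} ran at {now!r}"


recorder = Recorder()
first_run = program(recorder)
replayed_run = program(Replayer(recorder.log))
```

Because every source of nondeterminism the program sees passes through `env.call`, the replayed run produces the same output as the recorded one, which is what makes a captured failure repeatable.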

Traditional virtual machines do not provide the benefits described above: they have fixed memory, network interfaces, and other resources. And while you can pause, resume, and move virtual machines around, you can’t move the processes within them.

While Docker (and Kubernetes) lets you distribute containerized workloads over different machines, it still limits what you can do and adds overheads with a distinct userspace for each node in a pod or swarm.

We envision that running programs in a Linux environment with fully virtualized syscalls would let you move and scale individual processes, giving them direct access to resources from any number of physical or virtualized machines for more control and far greater performance.

Evolving Linux and POSIX functionality (without having to change the kernel or standard)

Let’s be clear: POSIX is a standard, and encouraging interference with a rigid and reliable standard is a fast way to compromise it.

What we’re suggesting is making the tools that implement it (primarily, Linux/BSD components) more extensible, not interfering with the standard itself. In fact, implementing virtualized syscalls is an ideal way to greatly expand the scope of functionality and flexibility available to applications without having to touch the kernel or POSIX standard itself.

We believe that the efficient method of trapping syscalls used in Dynasaur is the optimal path to achieving this, especially because, unlike the Linux kernel functionality that was added for Wine, we avoid expensive kernel context switches.

What can we do to make it happen?

We hope this post is the beginning of a discussion, not the concluding remarks to our argument.

Given the power that virtualizing syscalls would give developers, and what it could do for the open-source ecosystem and cloud industry, we want ideas on how best to proceed: how can we get the concepts behind Dynasaur into the components that comprise the mainstream POSIX operating environments as quickly as possible? Do we fork, contribute, or simply license for non-commercial use? Do you see potential use cases for syscall virtualization that we haven’t mentioned here?

Want to get involved or share your thoughts? Join us in the Hacker News comments, where we’re discussing the future of high-performance computing. We’re keen to hear about what you wish you could (but can’t yet) do using your current POSIX tools: maybe there’s something our technology can help with.
