pegainfer: A Native Rust Inference Engine from Scratch


Why Write an Inference Engine from Scratch in Rust?

Nowadays, Python-based inference frameworks like vLLM and sglang have captured the majority of the inference serving market, and have gradually become stable and mature. So why not vLLM or sglang — why write an inference engine from scratch in Rust?

The reasons fall into two parts.

Over the past year or so, I have had the opportunity to contribute code to both of the aforementioned frameworks, to participate in the design of some of their modules, and to gain some operational experience with both. Through that process, I found that Python severely limited my ability to pursue high-performance, more complex logic. Phrases like “we need to control CPU usage” and “Python can’t handle this much,” along with concerns about long-tail latency caused by the GIL, kept coming up again and again. I couldn’t help but think: everyone says GPUs are a hundred-billion-dollar business, so is this really what hundred-billion-dollar infrastructure looks like? Part of that frustration also comes from the language itself. There is probably no language more permissive than Python: unhandled errors, states that are never exhaustively matched, leaks. All of these are just logic bugs, and my precious time should not be wasted on problems that a compiler could catch by default.

The other part: I primarily work on KV cache and routing, and over the past year I have focused mostly on the KV cache-related modules of inference engines, so I sometimes couldn’t follow what colleagues working on other parts of the framework, or on kernels, were talking about. That created friction for the evolution of my KV cache system. Understanding the optimizations and behavior of inference frameworks is crucial, and future work will require co-designing the cache with the engine rather than treating the engine as an isolated single-node component. How do you come to understand something? I think of Feynman’s words: “What I cannot create, I do not understand.”

Based on these two reasons, I had actually tried writing one in C++ (with stdexec) back when I was still in Beijing last year. For some reason it always felt awkward; LLM capabilities, and my own ability to work with LLMs, were also weaker at the time, and the project faded away. Later, after vibing out my first reasonably mature open-source Rust LLM KV cache project, I realized that Rust is my native programming language, in the same way that Chinese is the language this post was originally written in: I flow much more naturally when writing Rust. After reflecting deeply on vibe coding and what it means for me, I suddenly found the right way to vibe (I’ll write a separate post later on how I collaborate and learn with LLMs). I felt it was time to pick this back up.

And so pegainfer was born. As before, 100% of the code was generated by opus 4.6 high; I did not write any of it by hand. I have reviewed all of the framework-level code, though I have not yet gone through the kernels completely.

The current state of pegainfer: the framework is written in Rust, the kernels in CUDA. It only supports processing one request at a time. There is no prefix caching, no sampler, no scheduler, no CUDA graph support, and the kernels have not been carefully optimized. On my 5070ti, running Qwen3 4B achieves a stable throughput of around 70 tokens/s, with accuracy aligned to HuggingFace (the accuracy alignment was done by opus itself, and the process was quite interesting).
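To make “one request at a time” concrete, here is a minimal sketch of the shape a single-request decode loop takes when there is no scheduler, no batching, and no prefix cache. The `Model` trait, the function names, and the toy model are my own illustration for this post, not pegainfer’s actual API.

```rust
// Illustrative sketch, not pegainfer's real API: the shape of a single-request
// decode loop when there is no scheduler, no batching, and no prefix cache.
trait Model {
    /// One forward pass over the current token sequence, returning the next
    /// token id. A real engine would return logits and hand them to a sampler,
    /// reusing a KV cache so each step only processes the newest token.
    fn forward_next(&mut self, tokens: &[u32]) -> u32;
}

fn generate(model: &mut impl Model, prompt: &[u32], eos: u32, max_new: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let next = model.forward_next(&tokens);
        tokens.push(next);
        if next == eos {
            break;
        }
    }
    // Return only the newly generated tokens.
    tokens[prompt.len()..].to_vec()
}

// A toy model that just counts up, so the sketch compiles and runs on its own.
struct Dummy(u32);

impl Model for Dummy {
    fn forward_next(&mut self, _tokens: &[u32]) -> u32 {
        self.0 += 1;
        self.0
    }
}

fn main() {
    let mut model = Dummy(0);
    let generated = generate(&mut model, &[10, 11, 12], 3, 8);
    println!("generated token ids: {generated:?}");
}
```

Every feature listed above as missing (sampler, prefix caching, scheduler, CUDA graphs) eventually attaches to or replaces some part of this loop, which is what makes the deliberately blank starting point workable.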

So at this stage, pegainfer is — diplomatically — “full of potential,” or less charitably — “missing basically everything.” This blankness is intentional. Originally I had planned to implement all of the “missing” things above before writing a blog post introducing it, to make things look more official. But I realized that the process of learning, exploring, and adding new features together with LLMs is actually more valuable — and more interesting. Writing it all up only after it’s done would just produce yet another inference engine post covering the same old optimization techniques everyone already knows.

So what is it like to write an inference engine in Rust? Much better than expected. I’m not sure how much of that is due to the evolution of opus 4.6, but the experience of writing code, testing, and benchmarking has been excellent: cargo run -r compiles and starts the engine, and cargo test -r runs the tests quickly, including some end-to-end ones. The Rust ecosystem experience is also very good. Here are some of the key libraries pegainfer depends on:

Most importantly, I no longer have to worry about the GIL, unhandled errors, or object lifetimes. Logging, metrics, traces — all out of the box.
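As a small illustration of what “out of the box” means here, the sketch below uses the widely used tracing and tracing-subscriber crates for structured logging. Treat the crate choice as an assumption on my part rather than a statement about pegainfer’s actual dependency list.

```rust
// Sketch only; assumes the `tracing` and `tracing-subscriber` crates as
// dependencies. Whether pegainfer uses these exact crates is an assumption.
use tracing::{info, instrument};

#[instrument] // records the function name and its arguments as a span
fn decode_step(step: usize) {
    info!(step, "finished decode step");
}

fn main() {
    // One line of setup gives structured, leveled logs on stdout.
    tracing_subscriber::fmt::init();
    for step in 0..3 {
        decode_step(step);
    }
}
```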

What’s Next

In the near term I’ll be gradually filling in the core modules of the inference engine: a sampler (temperature, top-p/top-k), prefix caching (radix tree), a continuous batching scheduler, CUDA Graphs, and kernel optimizations. As each module is completed, I’ll write a corresponding blog post documenting the process from design to implementation, not just the final result.
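As a taste of the first item on that list, here is a minimal sketch of a temperature plus top-k/top-p sampler over raw logits. It is an illustrative sketch only, not pegainfer’s planned implementation; the sample function signature, the plain Vec of logits, and passing in the uniform random number u are all assumptions made to keep the example dependency-free.

```rust
// Illustrative sketch only: a temperature + top-k + top-p sampler over raw logits.
// `u` is a uniform random number in [0, 1) supplied by the caller so the example
// needs no external crates; a real sampler would draw it from an RNG.
fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, u: f32) -> usize {
    // 1. Sort token ids by logit, descending.
    let mut ids: Vec<usize> = (0..logits.len()).collect();
    ids.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());

    // 2. Keep only the top-k candidates.
    ids.truncate(top_k.clamp(1, ids.len()));

    // 3. Temperature-scaled softmax over the survivors (subtract max for stability).
    let max_logit = logits[ids[0]];
    let mut probs: Vec<f32> = ids
        .iter()
        .map(|&i| ((logits[i] - max_logit) / temperature).exp())
        .collect();
    let sum: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= sum;
    }

    // 4. Top-p (nucleus) cutoff: keep the smallest prefix whose mass reaches top_p.
    let mut cumulative = 0.0;
    let mut cutoff = probs.len();
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if cumulative >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    let (ids, probs) = (&ids[..cutoff], &probs[..cutoff]);

    // 5. Renormalize and pick a token by inverse CDF using `u`.
    let mass: f32 = probs.iter().sum();
    let mut acc = 0.0;
    for (&id, &p) in ids.iter().zip(probs) {
        acc += p / mass;
        if u < acc {
            return id;
        }
    }
    *ids.last().unwrap() // numerical fallback
}

fn main() {
    let logits = vec![2.0_f32, 1.0, 0.5, -1.0];
    let token = sample(&logits, 0.8, 3, 0.9, 0.42);
    println!("sampled token id: {token}");
}
```

Passing u in explicitly also keeps the function trivially unit-testable, which matters once sampling sits inside a larger engine.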

The code is here: github.com/xiaguan/pegainfer