Member of Technical Staff, Inference Engine
You will build the core of our inference engine: the runtime that takes a set of weights and serves them as a low-latency, high-throughput endpoint. This is the layer where scheduling, batching, memory, and the model meet. Your work decides the latency every user feels and the cost of every token we serve.
ATBF Labs builds the inference engine for production AI. Every token a model serves in production runs through an inference stack, and that stack decides the latency, the cost, and the reliability of the product sitting on top of it. We build ours from first principles: custom GPU kernels, a purpose-built runtime, and a distributed serving layer that holds its tail latency under real load. We are a small team with a high bar, shipping to production from day one.
Key responsibilities
- Design and build the inference runtime: continuous batching, paged attention/KV-cache management, and request scheduling.
- Drive down tail latency (p99) and push up tokens-per-second across LLM, VLM, and multimodal workloads.
- Implement and tune speculative decoding, prefix caching, and structured-output decoding paths.
- Build the serving layer: load balancing, autoscaling, and graceful degradation under real production traffic.
- Profile end to end, find the bottleneck, and fix it, whether it lives in a kernel, the scheduler, or the network.
- Write the benchmarking and regression harness that keeps every release honest.
Requirements
- 5+ years building performance-critical systems in C++, Rust, CUDA, or similar.
- Strong systems fundamentals: memory, concurrency, scheduling, and the cost of a cache miss.
- Hands-on experience serving deep-learning models in production, or building the systems that do.
- Comfort reading a flamegraph and a kernel trace, and turning both into a measurable win.
Nice to have
- Experience with an inference framework (vLLM, TensorRT-LLM, SGLang, TGI) or building one.
- Familiarity with the transformer serving path: attention, KV-cache, batching, quantization.
- Experience with multi-GPU / multi-node serving and the collectives that make it work.
- Open-source contributions to ML systems or runtimes.
Example projects
- Cut p99 latency for a 70B model by reworking the batching scheduler under bursty load.
- Add a speculative-decoding path and measure the speed/quality tradeoff across draft models.
- Build a KV-cache eviction policy that holds throughput as context lengths grow.
Base salary plus meaningful equity. The range is a guideline; final numbers reflect experience, skills, and location. Full health, dental, and vision coverage included.
Solve hard problems
Inference is a systems problem from the kernel up. You will work on the parts that decide whether a model is usable in production: latency, throughput, and cost.
Own the whole stack
Small team, large surface area. You will have real ownership across kernels, runtime, and serving, and your work ships to customers, not to a backlog.
Measure everything
We make decisions on numbers, not vibes. Every change is benchmarked, every regression is caught, and the survey point marks exactly where we are.
Learn from the best
Work alongside people who have built and operated inference at scale, and who care more about a clean result than a clever one.
ATBF Labs is an equal-opportunity employer. We celebrate diversity and are committed to an inclusive environment for everyone who builds with us.