Member of Technical Staff, Performance & GPU Kernels

San Francisco, CAFull-time · On-site / HybridOpen

You will take ownership of performance at the metal: GPU kernels, memory movement, and the numerics that make low precision safe. A single well-written kernel can change the economics of an entire model. You will find those wins and ship them.

About ATBF Labs

ATBF Labs builds the inference engine for production AI. Every token a model serves in production runs through an inference stack, and that stack decides the latency, the cost, and the reliability of the product sitting on top of it. We build ours from first principles: custom GPU kernels, a purpose-built runtime, and a distributed serving layer that holds its tail latency under real load. We are a small team with a high bar, shipping to production from day one.

What you'll do

Key responsibilities

Write and optimize GPU kernels (CUDA, Triton, CUTLASS) for attention, GEMMs, and quantized paths.
Profile with Nsight / CUPTI to find kernel- and memory-level bottlenecks, then close them.
Implement mixed-precision and quantization schemes (FP8, INT4, and beyond) with measured quality bounds.
Optimize communication for multi-GPU and multi-node serving over NVLink and RDMA (InfiniBand, RoCE).
Co-design model execution with the runtime and research teams for hardware efficiency.
Debug numerical instabilities that only appear at scale, on a fraction of requests.

Minimum qualifications

Deep understanding of GPU architecture, parallel programming, and the memory hierarchy.
Proficiency in CUDA (or ROCm) and a GPU profiler (Nsight, nvprof, CUPTI).
Experience with performance-critical model execution and PyTorch internals.
A track record of measurable speedups, not just refactors.

Preferred qualifications

Experience with ML compilers (torch.compile, Triton, XLA, TVM).
Experience optimizing LLMs, VLMs, or video models for inference.
Familiarity with low-precision numerics and quantization-aware tradeoffs.
Contributions to open-source HPC or ML-systems projects.

Example projects

Implement a fused attention kernel for a novel variant and beat the reference by a measurable margin.
Ship an FP8 path and run experiments to find the optimal speed/quality tradeoff.
Optimize RDMA communication patterns to remove a stall in multi-node decode.

Compensation

$210,000 – $275,000 + equity

Base salary plus meaningful equity. The range is a guideline; final numbers reflect experience, skills, and location. Full health, dental, and vision coverage included.

Why ATBF Labs

Solve hard problems

Inference is a systems problem from the kernel up. You will work on the parts that decide whether a model is usable in production: latency, throughput, and cost.

Own the whole stack

Small team, large surface area. You will have real ownership across kernels, runtime, and serving, and your work ships to customers, not a backlog.

Measure everything

We make decisions on numbers, not vibes. Every change is benchmarked, every regression is caught, and the survey point marks exactly where we are.

Learn from the best

Work alongside people who have built and operated inference at scale, and who care more about a clean result than a clever one.

Apply for this role Or get in touch with a note and your work.

ATBF Labs is an equal-opportunity employer. We celebrate diversity and are committed to an inclusive environment for everyone who builds with us.