Arthur Rasmusson
Arthur Rasmusson is a kernel, virtualization, and machine learning engineer specializing in GPU drivers and AI inference. His experience spans open- and closed-source software development, where he has held the roles of Founding Engineer, Principal AI Engineer, and Machine Learning Engineer. As Founding Engineer of Arc Compute, he authored open-source documentation and GPU virtualization software for the Open-IOV project. At Cohere, his work on Just-In-Time Inference and Just-In-Time Training reduced auto-scaling latencies by orders of magnitude by introducing the GPUDirect Storage API for real-time scaling of hardware allocations in hybrid training and inference environments, meeting customer demand while maximizing utilization. Prior to joining Weka, he introduced Paged Attention over RDMA (PAoR) to the open-source AI community, which uses distributed filesystems in GPU clusters to improve LLM performance by eliminating the computational resources wasted on redundant cache-data generation in open-source inference servers, delivering orders-of-magnitude performance gains at scale. At Weka, as Principal AI Engineer, he authored the open-source software behind AI "token warehouses", contributing upstream to the open-source TensorRT-LLM project with the "KV Cache GPUDirect Storage" feature and implementing Python-native support for GPUDirect Storage APIs used in LMCache for the vLLM ecosystem. His work bridges low-level systems concepts with high-performance data paths for AI.
Session
This talk introduces BSD3 MAC LLM UI, a tiny, auditable LLM chat interface built for teams that value isolation, predictability, and a minimal attack surface. Written in C to suckless coding standards and released under the BSD 3-Clause license, the project provides a no-JavaScript HTML/CSS web UI and an optional GTK/Qt local GUI. On the backend it can route prompts to an OpenAI-compatible API or run fully offline via TensorRT-LLM (v0.21.0) C++ bindings, while remaining safe to deploy in MAC and security-by-compartmentalization environments (OpenBSD, Linux; OpenXT/Qubes hypervisors).
We’ll cover the core design: a small HTTP/1.1 server, stateless form posts (no database), strict caps and timeouts, and hardening with pledge(2) on OpenBSD and seccomp on Linux. We’ll show deployment patterns for localhost-only use, WireGuard segments, and Tor hidden services (the UI works in Tor Browser with JavaScript disabled). For developers, we’ll walk through the single-binary build with Makefile knobs (HAVE_TRTLLM, WITH_GTK/Qt, TLS_BACKEND), compile-time configuration via config.h, and how to switch between networked and no-network modes (including a future design for Qrexec/HMX transport to an inference VM). Attendees will leave with a practical template for building security-first, low-overhead LLM front ends that fit air-gapped, offline, or highly regulated stacks, without dragging in a mountain of dependencies.