FasterFS
Abstract
Aim
To design and implement a distributed storage system that (1) uses Linux kernel eBPF programs to detect frequently accessed data chunks from live network traffic and cache them locally, eliminating network round-trips; and (2) demonstrates that adding compute nodes linearly increases aggregate read throughput — making the "add nodes, go faster" principle concretely measurable on real hardware.
Google's GFS paper (2003) stated this plainly: the system was designed for commodity hardware because "we cannot make one machine fast enough." Google's web crawl index was too large to read from a single machine at the speed search required, so they split it across thousands of Linux servers reading in parallel — aggregate throughput scaled with node count. Meta's Tectonic (FAST '21) extends this to the exabyte scale: a single general-purpose distributed filesystem that replaced three separate siloed storage systems at Facebook. FasterFS applies the same principle at two-node scale, with eBPF-driven kernel caching layered on top.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Introduction
In quantitative finance research, analysts repeatedly read the same market data files — order books, tick data, price series — many times during backtesting and simulation runs. These files live on a central storage server, and every read travels over the network. Even on a fast local network, each round-trip costs 3 to 40 milliseconds. Over a full backtest touching hundreds of files, the cumulative delay is significant and slows research iteration.
Two separate problems arise as research teams and datasets grow:
Problem 1 — Repeated reads are expensive. The same data is fetched from the network over and over, even though it was just accessed. Caching helps, but conventionally requires modifying every application to manage its own cache.
Problem 2 — One machine can only go so fast. A single laptop has one disk, one network card, finite CPU. When the dataset grows beyond one machine's capacity to process it quickly, the only answer is horizontal scaling: add more machines. But the software must coordinate this.
Both problems have precedent at scale. XTX Markets built TernFS (2024) — a production distributed filesystem for quantitative machine learning at one of the world's largest algorithmic trading firms — precisely because their researchers were bottlenecked reading gigabytes of market data on a single machine during backtesting. Google encountered the same scaling ceiling in 2003: no single machine could read their web crawl index fast enough, and GFS was the result.
FasterFS addresses both. The first is solved by eBPF kernel programs that detect hot data automatically — no application changes. The second is solved by a distributed architecture where each node holds a partition of the data on its own local storage, and all nodes serve reads simultaneously — adding a node adds its full disk bandwidth to the cluster.
These are two distinct claims, measured independently:
- Caching speedup: MinIO 3.487 ms/chunk → FasterFS cached 0.388 ms/chunk = 8.3× faster on a single Linux node.
- Distributed throughput scaling: Same 600 MB dataset read in 17 seconds on 1 node, 8.7 seconds on 2 nodes — 1.95× speedup, measured on real hardware with two laptops over WiFi.
The kernel watches your traffic and learns what to cache. And every node you add brings its full disk bandwidth to the pool.
-------------------------------------------------------------------------------
Literature Survey and Technologies Used
Why Distributed Storage?
Storage capacity is rarely the bottleneck in research settings — each modern laptop holds a terabyte. The real bottleneck is **I/O throughput**: how fast data can be read from disk and processed. A single machine is limited by its own disk read speed (typically 40–500 MB/s). Two machines reading in parallel deliver twice the aggregate throughput. Four machines deliver four times. This is why distributed filesystems are used in data-intensive domains: not because one machine lacks space, but because it lacks bandwidth.
Google's GFS paper stated this plainly: "we cannot make one machine fast enough." Meta's Tectonic (FAST '21) extends this to exabyte scale — adding storage nodes adds raw bandwidth, and pooling across workloads eliminates the waste of siloed systems. FasterFS demonstrates the same principle at two-node scale: the same 600 MB dataset that takes 17 seconds on one node takes 8.7 seconds when two nodes read their partitions simultaneously.
Industry Systems Studied
DeepSeek 3FS — Fire-Flyer File System (2024)
DeepSeek's open-source distributed filesystem was built for large-scale AI model training, deployed across hundreds of storage nodes with InfiniBand RDMA networking. It uses a replication protocol called CRAQ (Chain Replication with Apportioned Queries) for consistency and NVMe drives for raw throughput. 3FS does not use eBPF, and requires specialized InfiniBand hardware — a different problem space from commodity Ethernet. FasterFS implements the same GraySort benchmark concept (generate → shuffle → sort) at laptop scale as Demo C.
XTX Markets TernFS (2024)
A production distributed filesystem used for quantitative machine learning at XTX Markets, one of the world's largest algorithmic trading firms. TernFS uses eBPF in a financial storage context — but for correctness rather than performance. It hooks into the kernel's file-close system call to ensure data is durably written before being marked complete. TernFS also uses Reed-Solomon erasure coding for fault tolerance and runs on commodity Ethernet, like FasterFS. It demonstrates that eBPF is production-viable in financial storage stacks.
Google File System (GFS, 2003)
The foundational paper for modern distributed filesystems. Ghemawat, Gobioff, and Leung built GFS on thousands of commodity Linux machines because a single fast machine was not fast enough for Google's web crawl index. The design separates a single metadata master from hundreds of chunk servers, each holding a partition of the data and serving reads in parallel. The paper's opening assumption — "component failures are the norm" — established the philosophy that every distributed filesystem since has inherited. FasterFS's chunk-partition-per-node architecture is a direct application of the same idea.
Meta Tectonic (FAST '21)
Meta's Tectonic replaced three separate siloed storage systems — Haystack (photos), f4 (warm blobs), and a data warehouse store — with a single general-purpose distributed filesystem at exabyte scale on commodity servers. Multi-tenancy is the central design challenge: different workloads share the same pool of storage nodes, so unused capacity in one workload is available to another. Tectonic uses Reed-Solomon erasure coding rather than full replication, storing ~1.4× the raw data while tolerating multiple simultaneous disk failures. Like GFS and FasterFS, metadata is separated from data: a distributed key-value store handles the namespace, storage nodes hold the bytes.
JuiceFS (Open Source)
JuiceFS is a POSIX-compatible distributed filesystem that stores file contents in object storage (MinIO or S3) and metadata in Redis. Its architecture is the direct model for FasterFS: a shared object store with stateless client processes providing the intelligence layer. The key lesson: performance gains come not from distributing the storage, but from making each client smarter about what it fetches.
LOBSTER Market Data (Nasdaq)
The LOBSTER dataset provides reconstructed limit order book snapshots from Nasdaq. FasterFS uses a real Apple Inc. (AAPL) order book from June 21, 2012: 91,997 rows of bid/ask prices across 50 price levels, split into 92 sequential chunks stored in the distributed object store. This is the actual workload used for all benchmarks.
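The chunking arithmetic above can be sketched as follows. The exact chunk size and boundary rule used by FasterFS are not stated, so this assumes a fixed 1,000-row chunk (91,997 rows ÷ 1,000 rows/chunk = 92 chunks, the last one partial); `chunk_bounds` is an illustrative name, not the actual FasterFS API.

```python
# Sketch: splitting the 91,997-row LOBSTER order book into 92 sequential
# chunks. The 1,000-row chunk size is an assumption consistent with the
# reported chunk count, not a confirmed FasterFS constant.

def chunk_bounds(total_rows: int, chunk_rows: int = 1000):
    """Yield (chunk_id, start_row, end_row) for sequential chunks."""
    for chunk_id, start in enumerate(range(0, total_rows, chunk_rows)):
        yield chunk_id, start, min(start + chunk_rows, total_rows)

bounds = list(chunk_bounds(91_997))
print(len(bounds))   # 92 chunks
print(bounds[-1])    # (91, 91000, 91997) — the last chunk is partial
```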
Core Technologies
| Technology | Role in FasterFS |
| --- | --- |
| Linux eBPF | Kernel programs that observe network traffic and detect hot data |
| TC (Traffic Control) Hook | Kernel attachment point where eBPF intercepts MinIO network packets |
| SOCK_OPS Hook | Kernel attachment point where eBPF applies TCP tuning to new MinIO connections |
| BPF Maps | Shared kernel data structures holding timestamps, access counts, and hot-chunk state |
| MinIO | S3-compatible object store — one per node in the distributed setup |
| FastAPI + Python | Web server exposing REST APIs, WebSocket streams, and the dashboard |
| Chart.js | Browser-side charting for live metrics and benchmark comparisons |
| JupyterLab | Interactive Python environment embedded in the dashboard |
| boto3 | Python SDK used to read and write objects in MinIO |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Methodology
Overall System Architecture
FasterFS runs one process per node. In the two-laptop setup used for benchmarks, each node has its own MinIO instance holding its partition of the LOBSTER data:
Linux Laptop (primary, chunks 0–45)
├── MinIO :9000 → holds LOBSTER chunks 0–45 locally
├── FasterFS server :8000 → dashboard, API, cluster coordinator
├── TC eBPF → kernel-level hot chunk detection
├── SOCK_OPS eBPF → automatic TCP tuning for MinIO
└── /tmp/fasterfs_cache/ → local disk cache for hot chunks
MacBook (client, chunks 46–91)
├── MinIO :9000 → holds LOBSTER chunks 46–91 locally
├── FasterFS server :8001 → API, registers with Linux primary
└── /tmp/fasterfs_cache/ → local disk cache for hot chunks
Every node registers itself with the primary at startup and sends periodic heartbeats. Chunk assignments ensure each node reads from its own local MinIO at loopback speed — no WiFi bottleneck during the benchmark.
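The contiguous assignment implied above (chunks 0–45 to the Linux primary, 46–91 to the MacBook) can be expressed as a small partitioning rule. This is a sketch of the idea, not the actual FasterFS coordinator code; `assign_chunks` is an illustrative name.

```python
# Sketch: split chunk IDs into near-equal contiguous ranges, one per node,
# so each node reads only from its own local MinIO.

def assign_chunks(num_chunks: int, num_nodes: int) -> list[range]:
    """Return one contiguous range of chunk IDs per node."""
    base, extra = divmod(num_chunks, num_nodes)
    ranges, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)  # spread the remainder
        ranges.append(range(start, start + size))
        start += size
    return ranges

print(assign_chunks(92, 2))  # [range(0, 46), range(46, 92)]
```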
How the Kernel Learns: The eBPF Pipeline
Two eBPF programs are loaded into the Linux kernel's Traffic Control (TC) hook, where every MinIO network packet passes.
Step 1 — Record the Request (Egress)
When an application sends a request to MinIO, the egress eBPF program intercepts the outgoing packet in kernel space. It records the current nanosecond-precision timestamp in a BPF hash map, indexed by TCP connection port.
Step 2 — Measure the Response (Ingress)
When MinIO's response arrives, the ingress program looks up the stored timestamp, computes the exact round-trip latency, and increments a per-connection access counter in a shared kernel map.
Step 3 — Detect Hot Chunks and Cache
When a chunk's access count crosses a threshold (2 accesses), the kernel marks it as "hot." The FasterFS server reads the BPF map and writes the chunk to local disk. On the next access, the chunk is served from local disk — not the network.
Step 4 — Serve from Cache
FasterFS checks local disk before going to MinIO. A cached chunk is served in ~0.39 ms instead of ~3.49 ms over loopback, or ~0.18 ms instead of ~57.8 ms over WiFi.
The application code is unchanged throughout. It calls the same data-reading function. The acceleration is transparent.
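The four-step read path can be sketched in user space. This is an analogue of the logic, not the real implementation — in FasterFS the access counting lives in kernel BPF maps, while here a Python dict stands in; the cache directory layout and the `fetch` callback (standing in for a MinIO GET) are illustrative assumptions.

```python
# User-space analogue of the FasterFS read path: count accesses, promote a
# chunk to the local disk cache once it crosses the hot threshold, and serve
# all later reads from disk instead of the network.
import os
from collections import defaultdict

HOT_THRESHOLD = 2  # accesses before a chunk is considered hot (matches Step 3)

class CacheFirstReader:
    def __init__(self, cache_dir: str, fetch):
        self.cache_dir = cache_dir            # e.g. /tmp/fasterfs_cache/
        self.fetch = fetch                    # callable: chunk_id -> bytes
        self.access_count = defaultdict(int)  # stands in for the BPF map

    def read(self, chunk_id: int) -> bytes:
        path = os.path.join(self.cache_dir, f"chunk_{chunk_id}")
        if os.path.exists(path):              # hot path: local disk, no network
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch(chunk_id)           # cold path: network round-trip
        self.access_count[chunk_id] += 1
        if self.access_count[chunk_id] >= HOT_THRESHOLD:
            with open(path, "wb") as f:       # promote to the local cache
                f.write(data)
        return data
```

The caller never changes: it invokes `read()` the same way whether the chunk comes from the network or the cache, which is the transparency property described above.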
TCP Connection Tuning
A second eBPF program runs at the SOCK_OPS hook and fires when a new TCP connection is established to port 9000 (MinIO). It immediately applies two optimizations:
- Disable Nagle's Algorithm — prevents TCP from buffering small packets, which adds latency on frequent small requests like chunk fetches.
- Enable Quick Acknowledgments — instructs the kernel to acknowledge packets immediately, reducing round-trip overhead.
Both apply automatically to every MinIO connection on every Linux node. Over 110 connections were tuned during testing.
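The two optimizations correspond to standard TCP socket options. The eBPF SOCK_OPS program applies them in-kernel to every new MinIO connection; the sketch below shows the same options set from user space on a single socket, for illustration only (`TCP_QUICKACK` is Linux-specific, hence the guard).

```python
# User-space equivalent of what the SOCK_OPS eBPF program applies in-kernel.
import socket

def tune_for_small_requests(sock: socket.socket) -> None:
    # Disable Nagle's algorithm: send small chunk requests immediately
    # instead of buffering them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Linux-only: acknowledge packets immediately rather than delaying ACKs.
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tune_for_small_requests(sock)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # 1
sock.close()
```

The advantage of the eBPF version is that no application ever has to call this: the kernel applies the options the moment any process connects to port 9000.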
The Map-Sharing Engineering Challenge
A critical non-obvious engineering problem was discovered during development. When the egress and ingress eBPF programs were loaded through separate commands, each created its own private copy of the kernel data structures (BPF maps). The egress program recorded timestamps in its map; the ingress program looked them up in a completely separate, always-empty map.
The result: no latency measurements, no hot chunk detections — both programs running but producing nothing.
The fix required loading both programs together in a single bpftool operation with a shared pinned-map directory, so both reference exactly the same kernel memory. This is the correct production pattern for multi-program eBPF deployments. Discovering and resolving this was one of the most technically significant contributions of the project.
Benchmark Design
Demo A — Caching Speedup (single-node)
Five sequential passes over all 92 LOBSTER chunks, measuring per-chunk latency:
| Pass | Backend | What It Measures |
| --- | --- | --- |
| Local CSV | Direct disk read | Theoretical minimum — no network |
| MinIO (cold) | Object store, no cache | Raw network round-trip cost |
| FasterFS Pass 1 (cold) | MinIO + empty cache | First access, no cache benefit |
| FasterFS Pass 2 (warm) | MinIO + cache writing | Cache fills as chunks cross threshold |
| FasterFS Pass 3 (hot) | Local disk cache | Every chunk served from disk |
The headline number is Pass 3 vs MinIO cold.
Demo B — Distributed Throughput Scaling
A separate benchmark measures raw read bandwidth. Each node reads its assigned chunk partition from its local MinIO (no cache) repeatedly until 300 MB has been read (600 MB total across two nodes). Wall-clock time is measured from when all nodes start to when all nodes finish. This directly answers: does adding a node reduce processing time?
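The Demo B measurement can be sketched as a timing loop. The reader callback stands in for a local MinIO fetch, and the function names are illustrative, not the actual benchmark harness.

```python
# Sketch of the Demo B measurement: cycle through a node's chunk partition
# until a byte budget is reached, then report wall-clock time and throughput.
import time

def measure_throughput(read_chunk, chunk_ids, byte_budget: int):
    """Return (elapsed_seconds, mb_per_second) for reading byte_budget bytes."""
    bytes_read = 0
    start = time.perf_counter()
    while bytes_read < byte_budget:
        for cid in chunk_ids:                 # repeated passes over the partition
            bytes_read += len(read_chunk(cid))
            if bytes_read >= byte_budget:
                break
    elapsed = time.perf_counter() - start
    return elapsed, (bytes_read / 1e6) / elapsed
```

In the two-node run, each node executes this loop against its own MinIO with a 300 MB budget; the cluster result is the wall-clock from the first node starting to the last node finishing.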
Demo C — Distributed Sort (GraySort at demo scale)
A three-phase distributed sort mirroring DeepSeek's GraySort structure:
- Generate (parallel): Each node generates 50 MB of random 64-byte records (8-byte sortable key + 56-byte payload) and uploads to its own MinIO.
- Partition (coordinator): The primary downloads all data, partitions records into N buckets by the first byte of the key, and uploads each partition to its destination node's MinIO.
- Sort (parallel): Each node downloads its partition, sorts by key in memory using NumPy's lexicographic sort, and uploads the sorted result.
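The partition and sort phases above can be sketched in pure Python. The actual implementation sorts with NumPy's lexicographic sort; here plain `sorted` on the 8-byte key prefix shows the same logic, and the bucket rule (first key byte, spread evenly over nodes) is an assumption consistent with the description.

```python
# Pure-Python sketch of the GraySort partition and sort phases over
# 64-byte records (8-byte sortable key + 56-byte payload).
import os

RECORD_SIZE = 64  # bytes

def make_records(n: int) -> list[bytes]:
    """Generate n random 64-byte records (the Generate phase)."""
    return [os.urandom(RECORD_SIZE) for _ in range(n)]

def partition(records: list[bytes], num_nodes: int) -> list[list[bytes]]:
    """Bucket records by the first byte of the key, one bucket per node."""
    buckets = [[] for _ in range(num_nodes)]
    for rec in records:
        buckets[rec[0] * num_nodes // 256].append(rec)
    return buckets

def sort_partition(records: list[bytes]) -> list[bytes]:
    """Sort a node's partition by its 8-byte key (NumPy in the real system)."""
    return sorted(records, key=lambda r: r[:8])
```

Because buckets are ranges of the key's first byte, concatenating the independently sorted buckets yields a globally sorted result — which is why the sort phase can run in parallel on every node.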
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Results
Demo A — Caching Speedup
Single-node, Linux laptop, loopback MinIO:
| Storage Path | Median Latency (P50) | Notes |
| --- | --- | --- |
| Local CSV (theoretical min) | 0.3 ms | Direct disk read, no network |
| MinIO cold fetch | 3.487 ms | Network round-trip to object store |
| FasterFS Pass 3 (hot cache) | 0.388 ms | Local disk cache hit, eBPF-detected |
| Speedup | 8.3× | Measured on physical hardware |
MacBook client, reading over WiFi from the Linux node's MinIO:
| Storage Path | Median Latency (P50) | Notes |
| --- | --- | --- |
| MinIO over WiFi | 57.8 ms | WiFi round-trip to Linux MinIO |
| FasterFS cached (local disk) | 0.177 ms | Mac's local disk, no WiFi involved |
| Speedup | 333× | Measured on physical hardware |
This 333× result represents the caching speedup specifically: WiFi latency eliminated by local disk. It is not the distributed throughput scaling result.
Demo B — Distributed Throughput Scaling
Same 600 MB dataset, read from local MinIO on each node:
| Configuration | Wall-clock time | Aggregate throughput | Per-node detail |
| --- | --- | --- | --- |
| 1 node (Linux, chunks 0–91) | 17.0 s | 35 MB/s | 37 MB/s from Linux MinIO |
| 2 nodes (Linux + MacBook, 46 chunks each) | 8.7 s | 69 MB/s | Linux: 37 MB/s; Mac: 186 MB/s |
| Speedup | 1.95× | 1.95× | Both measured on hardware |
The Mac's Apple Silicon SSD reads at 186 MB/s from its local MinIO — five times faster than the Linux node. Wall-clock time is limited by the slower Linux node; with homogeneous hardware, the speedup would approach exactly 2.0×.
The key result: adding one laptop cut processing time nearly in half, with no configuration changes to the application reading the data.
Demo C — Distributed Sort (GraySort, 100 MB across 2 nodes)
| Phase | Time | Notes |
| --- | --- | --- |
| Generate (parallel) | 0.71 s | Both nodes generate 50 MB simultaneously |
| Partition (coordinator) | 13.69 s | Primary shuffles data; bottleneck is WiFi download from Mac |
| Sort (parallel) | 0.79 s | Both nodes sort their partitions simultaneously |
| Total | 15.18 s | 100 MB sorted end-to-end across 2 nodes |
The generate and sort phases run genuinely in parallel — both nodes work simultaneously. The partition phase is coordinator-bound: the primary downloads remote data over WiFi, partitions it, and uploads back. With a fast interconnect (Ethernet or InfiniBand), partition time would drop by an order of magnitude.
Live eBPF Kernel Metrics
| Metric | Observed Value |
| --- | --- |
| Network reads observed by kernel | 435+ per benchmark run |
| Hot chunk events detected | 424+ |
| Average kernel-measured round-trip latency | ~481 microseconds |
| TCP connections tuned by SOCK_OPS | 110+ |
| Operating mode | LIVE kernel (simulation=false) |
All values read directly from kernel BPF maps — not estimated from user space.
The Kernel Heatmap in Action
The /kernel dashboard shows all 92 LOBSTER data chunks as a live grid. At the start of a benchmark, every cell is dark (cold). After two passes, the kernel has detected all 92 chunks as hot and cached them. During the third pass, the entire grid is active — every read served from local disk — while the latency timeline shows a sharp drop at that moment. This is a direct visual of the kernel learning access patterns in real time.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Conclusions and Future Scope
Conclusions
FasterFS demonstrates two independently measurable results on real hardware:
First: eBPF-driven kernel caching delivers an 8.3× reduction in read latency on a local network, and a 333× reduction over WiFi — without any changes to application code. The kernel observes TCP traffic, identifies hot data, and caches it to local disk transparently.
Second: Distributing data across nodes and reading in parallel scales throughput nearly linearly. The same 600 MB dataset that takes 17 seconds on one laptop takes 8.7 seconds on two — a 1.95× improvement. Adding commodity hardware directly adds bandwidth.
The project also resolved a subtle and non-obvious eBPF architectural problem — map isolation between independently loaded programs — whose correct solution (shared map directories via bpftool) is now documented and reproducible.
The architecture draws directly from production distributed storage systems: JuiceFS's smart-client model, and the principle demonstrated by TernFS that eBPF belongs in financial storage stacks. FasterFS demonstrates both on two commodity laptops connected over WiFi.
Future Scope
Complete eBPF-Driven Caching: Currently the cache write decision is made by a Python counter that mirrors the eBPF threshold. The eBPF ring buffer emits HOT_CHUNK events that are not yet consumed. Adding a ring buffer consumer thread would make the kernel the direct trigger for cache writes — the intended design.
Distributed Cache Coordination: Nodes currently cache independently. A shared cache index would allow nodes to serve each other's cached data, multiplying cache hit rates across the cluster.
XDP for Cross-Machine Traffic: Current eBPF programs monitor loopback traffic. XDP programs can intercept physical NIC traffic, enabling kernel-level detection when MinIO is on a separate machine.
Fault Tolerance: Following TernFS's approach, Reed-Solomon erasure coding across nodes would allow chunk reconstruction from surviving nodes after a failure.
Scale Testing: With 4–8 nodes of similar hardware, the throughput scaling curve (1× → 2× → 4× → 8×) would demonstrate near-linear scaling with homogeneous hardware, resolving the hardware imbalance observed in the current 2-node results.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
References and Links
- GitHub Repository: (insert link)
- Live Demo Dashboard: http://localhost:8000 (primary node)
1. DeepSeek-AI. Fire-Flyer File System (3FS). GitHub, 2024. https://github.com/deepseek-ai/3FS
2. XTX Markets. TernFS: eBPF-Backed Distributed Filesystem for Quant ML. Engineering Blog, 2024.
3. JuiceFS. Architecture Overview. https://juicefs.com/docs/community/architecture/
4. Gregg, B. BPF Performance Tools: Linux System and Application Observability. Addison-Wesley, 2019.
5. Nakryiko, A. BPF CO-RE (Compile Once – Run Everywhere). lwn.net, 2020.
6. LOBSTER. Limit Order Book Data. https://lobsterdata.com/
7. Linux Kernel Documentation. BPF: Extended Berkeley Packet Filter. kernel.org.
8. DeepSeek-AI. GraySort Benchmark on 3FS. GitHub, 2024.
9. Corbet, J. The BPF ring buffer. lwn.net, 2020.
10. Ghemawat, S., Gobioff, H., Leung, S.-T. The Google File System. SOSP, 2003.
11. Pan, S., et al. Facebook's Tectonic Filesystem: Efficiency from Exascale. FAST, 2021.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Team Details
Mentors
- Swaraj Singh
- Calisto Abel Mathias
Mentees
- Aryan Palimkar
- Ajay Sharma
- Ankit Kumar
- Pradyun Diwakar
Academic Year: 2025–2026
Club / Event: IEEE NITK Student Branch / Year Long Executive Project
Report Information
Team Members
Report Details
Created: April 6, 2026, 9:26 a.m.
Approved by: None
Approval date: None