Over the years, I’ve worked on multiple projects where customers tried to run data-intensive workloads in the cloud: simulations, analytics pipelines, long-running Kubernetes jobs, and more recently, AI-style processing.
In many of these cases, storage quickly became the main bottleneck.
This post summarizes what I learned while designing and tuning a high-performance shared filesystem on AWS for one of those consulting engagements. The goal wasn’t to build “yet another NFS server”, but to understand how far cloud infrastructure can be pushed when performance actually matters.
The initial setup was fairly standard: we started with Amazon EFS. It worked. It was reliable. It was easy to operate.
But as load increased, we started running into its limits.
For general-purpose workloads, EFS is fine. For HPC-style and AI-style pipelines, it often isn’t.
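A quick way to see why is to measure per-operation latency directly on the mount. The sketch below is a minimal, hypothetical micro-benchmark (not the exact tooling from the engagement): it times synchronous small-file creates under an assumed mount point `/mnt/efs`. Metadata-heavy HPC and AI pipelines issue exactly this kind of operation in bulk, and on a network filesystem each one pays at least one round trip to the server.

```python
import os
import statistics
import time

MOUNT_POINT = "/mnt/efs"  # hypothetical mount point; adjust to your setup
NUM_FILES = 200
FILE_SIZE = 4096          # 4 KiB payload: keeps the test metadata-dominated

def bench_small_files(root: str) -> list:
    """Time create + write + fsync + close for many small files.

    On a shared network filesystem, each of these steps can involve
    a server round trip, so per-file latency exposes the overhead
    that local disks hide.
    """
    workdir = os.path.join(root, "latency-probe")
    os.makedirs(workdir, exist_ok=True)
    payload = os.urandom(FILE_SIZE)
    latencies = []
    for i in range(NUM_FILES):
        path = os.path.join(workdir, f"f{i:05d}")
        t0 = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the server
        latencies.append(time.perf_counter() - t0)
    return latencies

if __name__ == "__main__":
    lat = bench_small_files(MOUNT_POINT)
    print(f"p50 = {statistics.median(lat) * 1000:.2f} ms")
    print(f"p95 = {statistics.quantiles(lat, n=20)[-1] * 1000:.2f} ms")
```

Running the same probe against instance-local NVMe gives a useful baseline: the gap between the two numbers is roughly the per-file tax a metadata-heavy pipeline pays on the shared filesystem.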
At that point, we had two choices: