Over the years, I’ve worked on multiple projects where customers tried to run data-intensive workloads in the cloud: simulations, analytics pipelines, long-running Kubernetes jobs, and more recently, AI-style processing.
In many of these cases, storage quickly became the main bottleneck.
This post summarizes what I learned while designing and tuning a high-performance shared filesystem on AWS for one of those consulting engagements. The goal wasn’t to build “yet another NFS server”, but to understand how far cloud infrastructure can be pushed when performance actually matters.
The initial setup was fairly standard: we started with Amazon EFS. It worked. It was reliable. It was easy to operate.
But as load increased, we started running into its limits.
For general-purpose workloads, EFS is fine. For HPC-style and AI-style pipelines, it often isn’t.
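A quick way to see why is to measure per-operation latency directly on the mount. The sketch below is a minimal, hypothetical micro-benchmark (not the exact tooling from the engagement): it times synchronous small-file creates under an assumed mount point `/mnt/efs`. Metadata-heavy HPC and AI pipelines issue exactly this kind of operation in bulk, and on a network filesystem each one pays at least one round trip to the server.

```python
import os
import statistics
import time

MOUNT_POINT = "/mnt/efs"  # hypothetical mount point; adjust to your setup
NUM_FILES = 200
FILE_SIZE = 4096          # 4 KiB payload: keeps the test metadata-dominated

def bench_small_files(root: str) -> list:
    """Time create + write + fsync + close for many small files.

    On a shared network filesystem, each of these steps can involve
    a server round trip, so per-file latency exposes the overhead
    that local disks hide.
    """
    workdir = os.path.join(root, "latency-probe")
    os.makedirs(workdir, exist_ok=True)
    payload = os.urandom(FILE_SIZE)
    latencies = []
    for i in range(NUM_FILES):
        path = os.path.join(workdir, f"f{i:05d}")
        t0 = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the server
        latencies.append(time.perf_counter() - t0)
    return latencies

if __name__ == "__main__":
    lat = bench_small_files(MOUNT_POINT)
    print(f"p50 = {statistics.median(lat) * 1000:.2f} ms")
    print(f"p95 = {statistics.quantiles(lat, n=20)[-1] * 1000:.2f} ms")
```

Running the same probe against instance-local NVMe gives a useful baseline: the gap between the two numbers is roughly the per-file tax a metadata-heavy pipeline pays on the shared filesystem.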
At that point, we had two choices: