What the research is:
Tectonic, our data-center-scale distributed file system, enables better resource utilization, promotes simpler services, and reduces operational complexity compared with our previous approach. Our previous storage infrastructure consisted of a set of use-case-specific storage systems. Clusters, or instances of these storage systems, scaled to tens of petabytes. As Facebook’s scale grew, this constellation of storage systems became increasingly resource inefficient and operationally complex.
Each Tectonic cluster scales to exabytes and serves storage needs for an entire data center. With Tectonic, our consolidated storage architecture promotes resource efficiency by harvesting resources that were otherwise stranded in smaller clusters. This consolidation has also significantly simplified our storage operations because we now have a single system and fewer clusters to manage.
How it works:
In building this system, we simultaneously solved three high-level challenges: scaling to exabytes, isolating performance between tenants, and enabling tenant-specific optimizations.
Exabyte-scale clusters are important for operational simplicity and resource sharing. Tectonic disaggregates the file system metadata into independently scalable layers, and hash-partitions each metadata layer into a scalable shared key-value store. Combined with a linearly scalable storage node layer, this disaggregated metadata allows the system to meet the storage needs of an entire data center.
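As a rough illustration of the hash-partitioning idea, the Python sketch below models each metadata layer as its own set of key-value shards and routes every key by a stable hash. The layer names (`name`, `file`, `block`), the shard counts, the key formats, and the `MetadataLayer` class are assumptions made for this example, not Tectonic's actual schema or API.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical; a real deployment would use far more


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard with a stable hash (Python's hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class MetadataLayer:
    """One independently scalable metadata layer, hash-partitioned across shards."""

    def __init__(self, name: str, num_shards: int = NUM_SHARDS):
        self.name = name
        self.shards = [{} for _ in range(num_shards)]  # stand-in for a key-value store

    def put(self, key: str, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)


# Hypothetical layers: path -> file ID, file ID -> block IDs, block ID -> storage nodes.
name_layer = MetadataLayer("name")
file_layer = MetadataLayer("file")
block_layer = MetadataLayer("block", num_shards=16)  # layers are sized independently

name_layer.put("/warehouse/t1/part-0", "file:42")
file_layer.put("file:42", ["block:7", "block:8"])
block_layer.put("block:7", ["node-a", "node-b"])
block_layer.put("block:8", ["node-c", "node-d"])


def locate(path: str):
    """Resolve a path to storage nodes with one lookup per layer."""
    file_id = name_layer.get(path)
    return [block_layer.get(block_id) for block_id in file_layer.get(file_id)]
```

Each layer is sized on its own here, which is the property the disaggregated design relies on. A production system would also need resharding and replication, both of which this sketch omits.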
Tectonic simplifies performance isolation by solving the isolation problem within each tenant for groups of applications that have similar traffic patterns and latency requirements. Instead of managing resources among hundreds of applications, Tectonic manages resources among only dozens of traffic groups.
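To make the grouping idea concrete, the sketch below first collapses per-application demand into per-group demand, then divides a fixed capacity among the groups. The application names, group labels, weights, and the weighted max-min (water-filling) policy are all assumptions chosen for illustration; the post does not describe Tectonic's actual allocation policy.

```python
from collections import defaultdict

# Hypothetical mapping of applications to (tenant, traffic group).
APP_TO_GROUP = {
    "ingest-job": ("warehouse", "batch"),
    "etl-job": ("warehouse", "batch"),
    "dashboard": ("warehouse", "interactive"),
    "photo-upload": ("blob", "interactive"),
    "photo-backup": ("blob", "batch"),
}


def group_demand(app_demand, app_to_group=APP_TO_GROUP):
    """Sum per-application demand into per-traffic-group demand."""
    totals = defaultdict(float)
    for app, demand in app_demand.items():
        totals[app_to_group[app]] += demand
    return dict(totals)


def allocate(capacity, weights, demand):
    """Weighted max-min fair split of capacity across groups (illustrative policy)."""
    alloc = {group: 0.0 for group in demand}
    active = {group for group in demand if demand[group] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[group] for group in active)
        satisfied, spent = set(), 0.0
        for group in active:
            share = remaining * weights[group] / total_weight
            give = min(share, demand[group] - alloc[group])
            alloc[group] += give
            spent += give
            if alloc[group] >= demand[group] - 1e-9:
                satisfied.add(group)
        remaining -= spent
        if not satisfied:
            break  # every active group took its full share; nothing left to hand out
        active -= satisfied  # redistribute what satisfied groups did not need
    return alloc


app_demand = {"ingest-job": 30, "etl-job": 50, "dashboard": 10,
              "photo-upload": 20, "photo-backup": 40}
demand = group_demand(app_demand)
weights = {group: (2 if group[1] == "interactive" else 1) for group in demand}
shares = allocate(100, weights, demand)
```

The allocator only ever sees the groups, so its input size tracks the number of traffic groups rather than the number of applications.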
Tectonic uses tenant-specific optimizations to match the performance of specialized storage systems. These optimizations are enabled by a client-driven microservice architecture that includes a rich set of client-side configurations for controlling how tenants interact with Tectonic.
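One way to picture client-side configuration is as a per-tenant settings object that the client library consults before each write. The sketch below is hypothetical throughout: the `ClientConfig` fields, the tenant names, the shard counts, and `plan_write` are invented for illustration and do not reflect Tectonic's real configuration options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConfig:
    """Hypothetical per-tenant client settings; field names are illustrative only."""
    durability: str = "replicated"    # "replicated" or "erasure_coded"
    replicas: int = 3                 # used when durability == "replicated"
    data_shards: int = 10             # used when durability == "erasure_coded"
    parity_shards: int = 4
    buffer_full_blocks: bool = False  # trade write latency for fewer, larger writes


# Hypothetical tenants with different priorities: storage efficiency vs. latency.
TENANT_CONFIGS = {
    "warehouse": ClientConfig(durability="erasure_coded", buffer_full_blocks=True),
    "blob": ClientConfig(durability="replicated", replicas=3),
}


def plan_write(tenant: str, payload_bytes: int, configs=TENANT_CONFIGS) -> dict:
    """Describe how a client would write a payload under the tenant's config."""
    cfg = configs[tenant]
    if cfg.durability == "erasure_coded":
        total = cfg.data_shards + cfg.parity_shards
        # Ceiling division: parity adds (total / data_shards) storage overhead.
        stored = (payload_bytes * total + cfg.data_shards - 1) // cfg.data_shards
        fanout = total
    elif cfg.durability == "replicated":
        stored = payload_bytes * cfg.replicas
        fanout = cfg.replicas
    else:
        raise ValueError(f"unknown durability mode: {cfg.durability}")
    return {"bytes_stored": stored, "fanout": fanout,
            "buffered": cfg.buffer_full_blocks}
```

Because the decision lives in the client, two tenants can share the same storage nodes while making different trade-offs, without the storage layer needing tenant-specific logic.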
Why it matters:
Most large-scale cloud services are dependent on storage. As cloud services become more popular, the need for data storage and processing is growing rapidly. Distributed storage systems must scale and evolve to store and process this data efficiently. For example, with growth in storage footprint, scalability of individual storage clusters can become a bottleneck.
Adopting Tectonic has helped our storage scale and yielded many operational and efficiency improvements. By moving our data warehouse onto Tectonic, we’ve reduced the number of data warehouse clusters by 10x, simplifying operations and freeing previously stranded resources. Tectonic delivers these efficiency improvements while providing performance that’s comparable to or better than that of our previous specialized storage systems.