What is Amazon S3?
What is Amazon S3 (Simple Storage Service)?
Amazon S3 (Simple Storage Service) is a massively scalable, highly durable cloud object storage platform provided by Amazon Web Services (AWS). Launched in 2006, it was one of the very first cloud services ever created. Today, it is universally recognized as the absolute physical foundation of the modern open data ecosystem. The vast majority of the world’s Data Lakes and Open Data Lakehouses physically reside on Amazon S3 hard drives.
S3 revolutionized data engineering by completely eliminating the concept of “buying hard drives.” It provided organizations with an infinitely elastic, flat namespace accessed entirely via secure HTTP REST APIs. An enterprise can store a single 10-kilobyte text file or ten petabytes of Apache Parquet data in an S3 bucket, and S3 seamlessly handles the physical hardware provisioning, load balancing, and data distribution entirely in the background.
Architecture and Extreme Durability
S3 is engineered to provide “Eleven Nines” (99.999999999%) of data durability.
If a data engineer uploads a massive dataset to an S3 bucket, S3 does not simply save it to a single hard drive. Automatically and invisibly, S3 duplicates that specific object and physically scatters the copies across a minimum of three completely distinct physical data centers (Availability Zones) separated by miles of geography. If a massive catastrophic event (like a flood or a fire) completely destroys an entire AWS data center, the data is entirely safe because identical copies exist in the other two facilities. The system automatically detects the loss and instantly generates new copies to restore the “Eleven Nines” mathematical guarantee.
The S3 API: The Lingua Franca of Data
The most profound impact of Amazon S3 is not the storage itself, but its API architecture.
The S3 API defines the exact programmatic commands required to interact with cloud storage (e.g., GET, PUT, DELETE, LIST). Because S3 achieved such massive market dominance early on, the S3 API became the de facto, undisputed standard language for data storage globally.
Today, almost every single massive data engine in the world (Apache Spark, Snowflake, Dremio, Trino) natively speaks the S3 API. Even competing storage platforms (like MinIO or on-premises object storage appliances) intentionally clone the exact S3 API architecture. This guarantees that a data engineer can write a complex pipeline to analyze data on AWS S3, and later migrate that exact same pipeline to an on-premises MinIO server without rewriting a single line of code, as long as the endpoint speaks “S3.”
Tiered Storage and Cost Optimization
To manage petabyte-scale Data Lakehouses affordably, data teams rely heavily on S3’s sophisticated Intelligent Tiering lifecycle policies.
Storing data that is accessed every single day (like recent sales data) requires high-performance, expensive flash storage (S3 Standard). However, storing ten-year-old financial data that auditors only query once a year on expensive flash storage is a massive waste of money. S3 allows data engineers to set automated lifecycle rules. If a Parquet file hasn’t been queried by the Data Lakehouse in 90 days, S3 automatically and invisibly moves that file down to “S3 Glacier” (cold, magnetic tape storage), drastically reducing the monthly storage cost by up to 90%, while keeping the metadata entirely intact for future retrieval.
Summary of Technical Value
Amazon S3 is the infrastructural bedrock of Big Data. By providing infinitely scalable, flawlessly durable, and heavily cost-optimized object storage accessed via a universally adopted API, S3 fundamentally enabled the decoupling of storage from compute. It is the absolute prerequisite architecture that allowed the modern Open Data Lakehouse to exist.
Learn More
To learn more about the Data Lakehouse, read the book “Lakehouse for Everyone” by Alex Merced. You can find this and other books by Alex Merced at books.alexmerced.com.