The Architecture of Grab's Data Lake

RelatedHighlights

For low throughput data, Grab uses Parquet with Copy on Write (CoW) .

Here's the main operations for Copy on Write:

Write Operations - Whenever there's a write, you create a new version of the file that includes the latest change. You can also keep the previous version for consistency and rollback purposes. This helps prevent data corruption, incon

For High Throughput data, Grab uses Apache Avro with a strategy called Merge on Read (MOR) .

Here's the main operations with Merge on Read:

Write Operations - When data is written, it's appended to the end of a log file. This is much more efficient than merging it in the current data and reduces the latency of writes.