Data Storage
For low-throughput data, Grab uses Parquet with Copy on Write (CoW).
Here are the main operations for Copy on Write:
- Write Operations - Whenever there's a write, you create a new version of the file that includes the latest change. You can also keep the previous version for consistency and rollback purposes. This helps prevent data corruption,
The Architecture of Grab's Data Lake
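The write path described above can be sketched in a few lines of Python. This is a toy illustration only, not Grab's implementation: it uses JSON files instead of Parquet, and the class name `CopyOnWriteStore`, the `data.vN.json` file naming, and the `rollback` method are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


class CopyOnWriteStore:
    """Toy copy-on-write store: every write produces a new full version file."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.version = 0  # 0 means "nothing written yet"

    def _path(self, version: int) -> Path:
        # Hypothetical naming scheme: one file per version.
        return self.root / f"data.v{version}.json"

    def read(self, version: Optional[int] = None) -> dict:
        """Read the current version, or an older one if requested."""
        version = self.version if version is None else version
        if version == 0:
            return {}
        return json.loads(self._path(version).read_text())

    def write(self, key: str, value) -> int:
        """Copy the latest data, apply the change, and save it as a new file."""
        data = self.read()
        data[key] = value
        new_version = self.version + 1
        self._path(new_version).write_text(json.dumps(data))
        self.version = new_version  # older files stay on disk untouched
        return new_version

    def rollback(self, version: int) -> None:
        """Point readers back at an earlier version that is still on disk."""
        if not 0 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        self.version = version
```

Readers only ever see complete version files, which is why this style suits data that is read often and written rarely: each write pays the cost of rewriting the whole file.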
Denormalization
Another way Reddit minimizes joins is by using denormalization.
They took all the metadata fields required for displaying an image post and put them together into a single JSONB field. Instead of fetching different fields and combining them, they can just fetch that single JSONB field.
This made it much more efficient to fetch all the...
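A minimal sketch of the single-field idea, under stated assumptions: it uses Python's built-in `sqlite3` with a JSON string in a TEXT column as a stand-in for a PostgreSQL JSONB column, and the table name `post_media`, the helper names, and the metadata keys are hypothetical, not Reddit's schema.

```python
import json
import sqlite3
from typing import Optional


def make_store() -> sqlite3.Connection:
    """Create an in-memory table with one denormalized metadata column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE post_media ("
        " post_id INTEGER PRIMARY KEY,"
        " metadata TEXT NOT NULL)"  # stand-in for a JSONB column
    )
    return conn


def save_media(conn: sqlite3.Connection, post_id: int, metadata: dict) -> None:
    """Store every display field for a post together in a single value."""
    conn.execute(
        "INSERT OR REPLACE INTO post_media (post_id, metadata) VALUES (?, ?)",
        (post_id, json.dumps(metadata)),
    )


def load_media(conn: sqlite3.Connection, post_id: int) -> Optional[dict]:
    """One primary-key lookup, no joins: fetch the whole metadata blob."""
    row = conn.execute(
        "SELECT metadata FROM post_media WHERE post_id = ?", (post_id,)
    ).fetchone()
    return None if row is None else json.loads(row[0])
```

The trade-off is the usual one for denormalization: reads become a single lookup, while any change to one field means rewriting the whole blob.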
filesystem_spec
A specification for pythonic filesystems.
Install
pip install fsspec
would install the base fsspec. Some optional features require extra dependencies, e.g. pip install fsspec[ssh] will install the dependencies for SSH backend support. Use pip install fsspec[full] for installation of all known...
fsspec • GitHub - fsspec/filesystem_spec: A specification that python filesystems should adhere to.
- Scalability is crucial - systems need to be designed with the assumption that query volume, document corpus size, indexing complexity etc. could increase by 10x. What works at one scale may completely break at a higher scale.
- Sharding the index, either by document or by word, is important to distribute the indexing and querying load across machines.
Claude
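The two sharding strategies in the bullet above can be compared with a small sketch. It is an in-memory toy, assuming whitespace tokenization and a CRC32-based shard assignment; every function name here is hypothetical.

```python
import zlib
from collections import defaultdict


def tokenize(text: str) -> list:
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def word_shard(word: str, num_shards: int) -> int:
    """Stable shard choice for a word (CRC32 is deterministic across runs)."""
    return zlib.crc32(word.encode("utf-8")) % num_shards


def shard_by_document(docs: dict, num_shards: int) -> list:
    """Each shard holds a complete inverted index for a subset of documents."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for word in tokenize(text):
            shard[word].add(doc_id)
    return shards


def shard_by_word(docs: dict, num_shards: int) -> list:
    """Each shard holds the full posting list for a subset of words."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        for word in tokenize(text):
            shards[word_shard(word, num_shards)][word].add(doc_id)
    return shards


def search_document_sharded(shards: list, word: str) -> set:
    """Document sharding: ask every shard, then union the partial results."""
    hits = set()
    for shard in shards:
        hits |= shard.get(word, set())
    return hits


def search_word_sharded(shards: list, word: str) -> set:
    """Word sharding: only the shard that owns the word is consulted."""
    owner = shards[word_shard(word, len(shards))]
    return set(owner.get(word, set()))
```

With document sharding every query fans out to all shards but indexing a new document touches only one; with word sharding a single-word query hits one shard but indexing a document may touch many.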
memary: Open-Source Longterm Memory for Autonomous Agents
memary demo
Why use memary?
Agents use LLMs that are currently constrained to finite context windows. memary overcomes this limitation by allowing your agents to store a large corpus of information in knowledge graphs, infer user knowledge through our memory modules, and only retrieve...
GitHub - kingjulio8238/memary: Longterm Memory for Autonomous Agents.
Data
For high-throughput data, Grab uses Apache Avro with a strategy called Merge on Read (MOR).
Here are the main operations with Merge on Read:
- Write Operations - When data is written, it's appended to the end of a log file. This is much more efficient than merging it into the current data and reduces the latency of writes.
- Read Operations - When you need
The Architecture of Grab's Data Lake
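A toy sketch of the append-only write path described above. The excerpt cuts off before describing reads, so the `read` and `compact` methods below follow the general meaning of merge on read (combine the base data with the log at query time) and are an assumption rather than a description of Grab's system. Plain Python dicts and lists stand in for Avro files, and the class name is hypothetical.

```python
class MergeOnReadTable:
    """Toy merge-on-read table: cheap appends, merge work deferred to reads."""

    def __init__(self, base=None):
        self.base = dict(base or {})  # last fully merged snapshot
        self.log = []                 # append-only list of (key, value) changes

    def write(self, key, value):
        """Writes only append to the log; the base data is not touched."""
        self.log.append((key, value))

    def read(self):
        """Assumed read path: replay the log over the base, later entries win."""
        merged = dict(self.base)
        for key, value in self.log:
            merged[key] = value
        return merged

    def compact(self):
        """Assumed maintenance step: fold the log into a new base snapshot."""
        self.base = self.read()
        self.log.clear()
```

The longer the log grows, the more work each read does, which is why systems in this style typically fold the log back into the base data from time to time.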
Rottnest : Data Lake Indices
You don't need ElasticSearch or some vector database to do full text search or vector search. Parquet + Rottnest is all you need. Rottnest is like Postgres indices for Parquet. Read more on what it can do for e.g. logs here.
Installation
Local installation: pip install rottnest.
Rottnest supports many different index...
Ziheng Wang • GitHub - marsupialtail/rottnest: Data lake indices
It turns out there's a handy feature in PostgreSQL called row constructor comparisons that allows me to compare tuples of columns. That's exactly what we need. Instead of doing CreateAt > ?1 OR (CreateAt = ?1 AND Id > ?2), we can do (CreateAt, Id) > (?1, ?2). And the row constructor comparisons are lexicographical, meaning that it's...
Making a Postgres query 1,000 times faster
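Python tuples also compare lexicographically, so the equivalence of the two predicates can be checked outside the database. The sketch below is plain Python, not PostgreSQL; the `Post` type, its field names, and the `next_page` helper are hypothetical and only mirror the CreateAt and Id columns from the excerpt.

```python
from typing import NamedTuple


class Post(NamedTuple):
    create_at: int
    id: str


def after_cursor_expanded(post: Post, create_at: int, id_: str) -> bool:
    """The long form: CreateAt > ?1 OR (CreateAt = ?1 AND Id > ?2)."""
    return post.create_at > create_at or (
        post.create_at == create_at and post.id > id_
    )


def after_cursor_tuple(post: Post, create_at: int, id_: str) -> bool:
    """The short form: (CreateAt, Id) > (?1, ?2), compared lexicographically."""
    return (post.create_at, post.id) > (create_at, id_)


def next_page(posts: list, cursor: tuple, page_size: int) -> list:
    """Keyset pagination: return the rows strictly after the cursor, in order."""
    ordered = sorted(posts, key=lambda p: (p.create_at, p.id))
    return [p for p in ordered if (p.create_at, p.id) > cursor][:page_size]
```

For example, with the cursor at `(100, "a")`, a post with `create_at=100, id="b"` qualifies under both forms, while one with `create_at=99, id="z"` does not.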
Datasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.
Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with...