Data Storage
Our Goals
We made it lightweight and kept the efficiency in mind:
- Self-contained
We ship a single dependency-free binary that runs on all Linux distributions
- Fast to deploy, safe to operate
We are sysadmins, we know the value of operator-friendly software
- Deploy everywhere on every machine
We do not have a dedicated backbone, and neither do you, so...
Garage - An open-source distributed object storage service
- Scalability is crucial - systems need to be designed with the assumption that query volume, document corpus size, indexing complexity etc. could increase by 10x. What works at one scale may completely break at a higher scale.
- Sharding the index, either by document or by word, is important to distribute the indexing and querying load across machines.
Claude
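The two sharding strategies named above (by document and by word) can be illustrated with a small in-memory sketch. Everything here is an assumption made for illustration: the shard count, the `shard_for` hash routing, and the toy documents are hypothetical and not taken from any particular search engine.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(key: str) -> int:
    # Stable hash so the same key always maps to the same shard.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def build_document_sharded(docs):
    # Sharding by document: each shard holds a full inverted index
    # for its own subset of documents.
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id)]
        for word in text.lower().split():
            index[word].add(doc_id)
    return shards


def build_word_sharded(docs):
    # Sharding by word: each shard holds the complete posting list
    # for its own subset of words.
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for doc_id, text in docs.items():
        for word in text.lower().split():
            shards[shard_for(word)][word].add(doc_id)
    return shards


def search_document_sharded(shards, word):
    # The query fans out to every shard and the partial results are merged.
    result = set()
    for index in shards:
        result |= index.get(word, set())
    return result


def search_word_sharded(shards, word):
    # A single-word query touches only the shard that owns the word.
    return set(shards[shard_for(word)].get(word, set()))


docs = {
    "d1": "distributed object storage",
    "d2": "object storage for Postgres",
    "d3": "vector search on Postgres",
}
by_doc = build_document_sharded(docs)
by_word = build_word_sharded(docs)
```

Both layouts return the same matches for a query. The difference is where the work lands: document sharding spreads every query across all shards, while word sharding sends a single-word query to one shard but concentrates popular words on whichever shard owns them.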
At the current pace of media content creation, Reddit expects their media metadata to be roughly 50 terabytes. This means they need to implement sharding and partition their tables across multiple Postgres instances.
Reddit shards their tables based on post_id where they use range-based partitioning. All posts with a post_id in a certain range will...
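A minimal sketch of range-based shard routing on post_id follows. The range boundaries and shard names are hypothetical, since the excerpt does not give Reddit's actual ranges or instance layout.

```python
import bisect

# Hypothetical exclusive upper bounds of each shard's post_id range.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
# Hypothetical names for the Postgres instances that own each range.
SHARD_NAMES = ["pg-shard-0", "pg-shard-1", "pg-shard-2"]


def shard_for_post(post_id: int) -> str:
    """Return the shard whose range contains post_id."""
    if post_id < 0:
        raise ValueError("post_id must be non-negative")
    # bisect_right counts the bounds that are <= post_id, which is the
    # index of the first range whose exclusive upper bound is above it.
    i = bisect.bisect_right(SHARD_UPPER_BOUNDS, post_id)
    if i >= len(SHARD_NAMES):
        raise ValueError("post_id is beyond the last configured range")
    return SHARD_NAMES[i]
```

With these assumed bounds, `shard_for_post(999_999)` resolves to `pg-shard-0` and `shard_for_post(1_000_000)` to `pg-shard-1`. Growing the keyspace means appending a new bound and shard name rather than rehashing existing rows.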
pg_vectorize: a VectorDB for Postgres
A Postgres extension that automates the transformation and orchestration of text to embeddings and provides hooks into the most popular LLMs. This allows you to do vector search and build LLM applications on existing data with as little as two function calls.
This project relies heavily on the work by pgvector...
GitHub - tembo-io/pg_vectorize: The simplest way to orchestrate vector search on Postgres
Overview
pg_lakehouse is an extension that transforms Postgres into an analytical query engine over object stores like S3 and table formats like Delta Lake. Queries are pushed down to Apache DataFusion, which delivers excellent analytical performance. Combinations of the following object stores, table formats, and file formats are supported.
Object...
https://github.com/paradedb/paradedb/tree/dev/pg_l...
memary: Open-Source Longterm Memory for Autonomous Agents
memary demo
Why use memary?
Agents use LLMs that are currently constrained to finite context windows. memary overcomes this limitation by allowing your agents to store a large corpus of information in knowledge graphs, infer user knowledge through our memory modules, and only retrieve...
GitHub - kingjulio8238/memary: Longterm Memory for Autonomous Agents.
Data
With Quary, engineers can:
- Connect to their Database
- Write SQL queries to transform, organize, and document tables in a database
- Create charts, dashboards and reports (in development)
- Test, collaborate & refactor iteratively through version control
- Deploy the organised, documented model back up to the database
View the documentation.
GitHub - quarylabs/quary: Open-source BI for engineers
Spice.ai OSS
What is Spice?
Spice is a small, portable runtime that provides developers with a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Spice makes it easy to build data-driven and data-intensive applications by streamlining the use of data and...
spiceai • GitHub - spiceai/spiceai: A unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
For high-throughput data, Grab uses Apache Avro with a strategy called Merge on Read (MOR).
Here are the main operations with Merge on Read:
- Write Operations - When data is written, it's appended to the end of a log file. This is much more efficient than merging it into the current data and reduces the latency of writes.
- Read Operations - When you need