Polars — Processing hundreds of GBs of textual data on a daily basis at MDPI
Polars
pola.rs
Kenneth D'Amica and added
The language is always only as good as its community. Let’s look at some of the existing open-source tools and frameworks built in and around Rust:
- DataFusion based on Apache Arrow: Apache Arrow DataFusion SQL Query Engine similar to Spark
- Polars: It’s a faster Pandas. Probably going to compete with DuckDB (?)
- Delta Lake Rust: A native Rust library fo
Data Engineering • Rust for Data Engineering
Nicolay Gerold added
Databases have gotten so good at this that the term is almost misleading now. “Base” suggests something rigid, without which the data would slip away. But the data is always there, just bits on a nameless hard disk. The structure and the accessibility that a modern database provides exist completely independently from that hard disk. That’s right...
DuckDB Doesn’t Need Data To Be a Database
Nicolay Gerold added
You’ve got a vector database that has all the right database fundamentals you require, has the right incremental indexing strategy for your use case, has a good story around your metadata filtering needs, and will keep its index up-to-date with latencies you can tolerate. Awesome.
Your ML team (or maybe OpenAI) comes out with a new version of their...
6 Hard Problems Scaling Vector Search
Nicolay Gerold added
Overview
pg_lakehouse is an extension that transforms Postgres into an analytical query engine over object stores like S3 and table formats like Delta Lake. Queries are pushed down to Apache DataFusion, which delivers excellent analytical performance. Combinations of the following object stores, table formats, and file formats are supported.
Object ...
https://github.com/paradedb/paradedb/tree/dev/pg_l...
Nicolay Gerold added
(1) The separation between storage and compute, as encouraged by data lake architectures (e.g. the implementation of P would look different in a traditional database like PostgreSQL, or a cloud warehouse like Snowflake). This architecture is the focus of the current system, and it is prevalent in most mid-to-large enterprises (its benefits that be...
Jacopo Tagliabue • Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie.
Nicolay Gerold added
GPU-accelerated databases are mind-blowing!
Imagine a database natively integrated with best-in-class AI foundational models:
• Zero warmup latency
• Massive GPU-backed scalability
• Ability to process your data with any model
• Ability to train and fine-tune models on your data
There are 1.7 million deployments of PostgreSQL worldwide, one o...
Nathan Storey added