Data Storage
filesystem_spec
A specification for pythonic filesystems.
Install
pip install fsspec
installs the base fsspec. Some optional features require an extra, e.g. pip install fsspec[ssh] installs the dependencies for SSH backend support. Use pip install fsspec[full] to install all known...
fsspec • GitHub - fsspec/filesystem_spec: A specification that python filesystems should adhere to.
At the current pace of media content creation, Reddit expects their media metadata to be roughly 50 terabytes. This means they need to implement sharding and partition their tables across multiple Postgres instances.
Reddit shards their tables based on post_id where they use range-based partitioning. All posts with a post_id in a certain range will...
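As a rough illustration of range-based partitioning on post_id, the sketch below routes an ID to a shard by binary search over range boundaries. The boundary values, connection strings, and the `shard_for` helper are all hypothetical; the excerpt does not say how Reddit's routing is actually implemented.

```python
from bisect import bisect_right

# Hypothetical boundaries: shard i holds post_ids in
# [LOWER_BOUNDS[i], LOWER_BOUNDS[i + 1]); the last shard is open-ended.
LOWER_BOUNDS = [0, 1_000_000, 2_000_000]
# Hypothetical connection strings, one per range.
SHARD_DSNS = ["postgres://shard-0", "postgres://shard-1", "postgres://shard-2"]


def shard_for(post_id: int) -> str:
    """Return the connection string of the shard whose range contains post_id."""
    if post_id < LOWER_BOUNDS[0]:
        raise ValueError(f"post_id {post_id} is below the first shard range")
    # bisect_right counts the lower bounds that are <= post_id;
    # subtracting 1 gives the index of the owning range.
    return SHARD_DSNS[bisect_right(LOWER_BOUNDS, post_id) - 1]
```

Range partitioning keeps neighbouring IDs on the same instance, so a lookup for a batch of consecutive post_ids usually touches one shard.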
Shortwave — rajhesh.panchanadhan@gmail.com [Gmail alternative]
WebDataset
WebDataset is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.
The WebDataset format
A WebDataset file is a TAR archive containing a series of data files. All successive data files with the same prefix are...
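To make the TAR layout concrete, here is a minimal standard-library sketch that writes a small shard in memory and reads it back, grouping successive members that share a prefix. It does not use the WebDataset library itself. The helpers `sample_key`, `write_shard`, and `read_samples` are hypothetical, and the grouping rule (everything before the first dot of the file name) is an assumption about the format, so check the WebDataset documentation before relying on it.

```python
import io
import posixpath
import tarfile
from itertools import groupby


def sample_key(member_name: str) -> str:
    """Assumed grouping key: the path up to the first dot of the file name."""
    directory, base = posixpath.split(member_name)
    return posixpath.join(directory, base.split(".", 1)[0])


def write_shard(files: dict[str, bytes]) -> bytes:
    """Pack name -> payload pairs into an uncompressed TAR archive, in order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_samples(shard: bytes) -> list[tuple[str, dict[str, bytes]]]:
    """Group consecutive members with the same key into (key, {ext: bytes})."""
    samples = []
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        members = [m for m in tar if m.isfile()]
        # groupby only merges adjacent items, which mirrors the
        # "successive files with the same prefix" wording.
        for key, group in groupby(members, key=lambda m: sample_key(m.name)):
            fields = {}
            for member in group:
                extension = member.name[len(key) + 1:]
                fields[extension] = tar.extractfile(member).read()
            samples.append((key, fields))
    return samples


shard = write_shard({
    "000.jpg": b"img0", "000.cls": b"0",
    "001.jpg": b"img1", "001.cls": b"1",
})
samples = read_samples(shard)
```

Because members are read in archive order and never looked up by name, the shard can be consumed as a stream, which is what makes the sequential I/O described above possible.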
WebDataset
PGlite - Postgres in WASM
PGlite is a WASM Postgres build packaged into a TypeScript client library that enables you to run Postgres in the browser, Node.js and Bun, with no need to install any other dependencies. It is only 3.7 MB gzipped.
import { PGlite } from "@electric-sql/pglite"
const db = new PGlite()
await db.query("select 'Hello world' as...
electric-sql • GitHub - electric-sql/pglite: Lightweight Postgres packaged as WASM into a TypeScript library for the browser, Node.js, Bun and Deno
Local database for development
Each table in the database had an accompanying script that would generate a subset of the data for use in local development, since the final database was too large to run on a developer's machine.
This let each developer work with a live, local copy of the database and enabled efficient development of changes.
I highly...
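A per-table subset script of the kind described might look like the sketch below. It uses in-memory SQLite databases as stand-ins for both the full database and the developer's local copy. The `copy_subset` helper, the `users` table, and the "first N rows" sampling rule are all assumptions made for illustration; the note does not describe how the original scripts chose their rows.

```python
import sqlite3


def copy_subset(source: sqlite3.Connection, target: sqlite3.Connection,
                table: str, limit: int) -> int:
    """Recreate `table` in target and copy its first `limit` rows (by rowid).

    `table` is interpolated into SQL, so pass trusted names only.
    """
    # Reuse the CREATE TABLE statement stored in SQLite's schema catalog.
    (ddl,) = source.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    target.execute(ddl)
    rows = source.execute(
        f'SELECT * FROM "{table}" ORDER BY rowid LIMIT ?', (limit,)
    ).fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        target.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
    target.commit()
    return len(rows)


# Stand-in for the full database: 1000 rows in a hypothetical users table.
full_db = sqlite3.connect(":memory:")
full_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
full_db.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user{i}",) for i in range(1000)],
)

# Stand-in for the developer's local database.
local_db = sqlite3.connect(":memory:")
copied = copy_subset(full_db, local_db, "users", limit=10)
```

A real script would also need to keep foreign keys consistent across tables, for example by selecting child rows that reference the parent rows already copied.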
Bill Mill • notes.billmill.org
Our Goals
We made it lightweight and kept efficiency in mind:
- Self-contained
We ship a single dependency-free binary that runs on all Linux distributions
- Fast to deploy, safe to operate
We are sysadmins, we know the value of operator-friendly software
- Deploy everywhere on every machine
We do not have a dedicated backbone, and neither do you, so...
Garage - An open-source distributed object storage service
Expose Delta Tables via REST APIs
Git repo for testing three architectures that expose Delta tables via REST APIs. See also my blog post here. The architectures can be described as follows:
- Architecture A: Direct, Web App with DuckDB. In this architecture, the APIs connect directly to the Delta table with no layer in between. This implies that all data...
GitHub - rebremer/expose-deltatable-via-restapi
A serverless vector database
built from first principles on object storage: 10-100x cheaper, usage-based pricing, massive scalability
turbopuffer
Classwords are suffixes added to database column names to indicate the type of data they contain. This improves readability and makes it easier to understand the database schema. Base classwords include text, calendar, numeric and domain-specific types. It is best to avoid redundancy in column names, as this can lead to unnecessary verbosity. Using...
Gemini - chat to supercharge your ideas
Text Classwords
identifier (or id)
code[_<standard>]
name
description (or desc)
indicator (or ind)
number
text
Calendar Classwords
date
datetime[<timezone>] (or dt[<timezone>])
timestamp[<timezone>] (or ts[<timezone>])
Numeric Classwords
count
amount[_<currency>]
<quantity_property>[_<unit_of_measure>]
ratio
factor
percent (or pct)
Domain-Specific Classwords
uri
address
email
sku
json
geojson
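One way to apply the lists above is a small check that reports which classword a column name ends with. The sketch below is illustrative only: `classword_of` and `QUALIFIABLE` are hypothetical names, column names are assumed to be snake_case, and only underscore-separated qualifiers (as in code[_<standard>] and amount[_<currency>]) are handled. Timezone-suffixed forms and the open-ended <quantity_property> pattern are not covered.

```python
from typing import Optional

# Base classwords and their abbreviations, copied from the lists above.
CLASSWORDS = {
    "identifier", "id", "code", "name", "description", "desc",
    "indicator", "ind", "number", "text",
    "date", "datetime", "dt", "timestamp", "ts",
    "count", "amount", "ratio", "factor", "percent", "pct",
    "uri", "address", "email", "sku", "json", "geojson",
}

# Classwords that may be followed by one "_<qualifier>" token (assumption:
# only the two forms the lists write with an underscore).
QUALIFIABLE = {"code", "amount"}


def classword_of(column: str) -> Optional[str]:
    """Return the classword a snake_case column name ends with, else None."""
    tokens = column.lower().split("_")
    if tokens[-1] in CLASSWORDS:
        return tokens[-1]
    # Allow a trailing qualifier, e.g. order_amount_usd -> amount.
    if len(tokens) >= 2 and tokens[-2] in QUALIFIABLE:
        return tokens[-2]
    return None
```

For example, `classword_of("customer_id")` yields `"id"` and `classword_of("order_amount_usd")` yields `"amount"`, while a name such as `nickname` has no recognised suffix and yields `None`.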