Data Processing

Why you should move your ETL stack to Modal

GitHub - google/magika: Detect file content types with deep learning

GitHub - hatchet-dev/hatchet: A distributed, fault-tolerant task queue

How Levels.fyi Built Scalable Search with PostgreSQL

Data Engineering Data Orchestration Trends: The Shift From Data Pipelines to Data Products

GitHub - spiceai/spiceai: A unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.

GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.

GitHub - databonsai/databonsai: Clean & curate your data with LLMs.

Overview — Apache Arrow Ballista documentation