Data Processing
DataTrove
DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a Slurm cluster. Its (relatively) low memory...
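A minimal pipeline sketch in the style of the DataTrove README; the paths, the lambda predicate, and the tasks count are placeholders, not anything the excerpt specifies:

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

# Read JSONL documents, keep only those passing a predicate, write the survivors out.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/"),                             # placeholder input folder
        LambdaFilter(lambda doc: "hugging" in doc.text),  # placeholder filter rule
        JsonlWriter("filtered-output/"),                  # placeholder output folder
    ],
    tasks=4,  # parallel local tasks
)
executor.run()

Swapping the local executor for the library's Slurm executor is what moves the same pipeline onto a cluster.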
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Nicolay Gerold added 10mo
What's a Data Diff?
A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies whe...
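A short sketch of the staging-versus-production check described above, using data-diff's Python API; the connection URIs, table name, and key column are placeholders:

from data_diff import connect_to_table, diff_tables

# Placeholder URIs: point these at production and staging.
prod = connect_to_table("postgresql://prod-host/db", "orders", "id")
staging = connect_to_table("postgresql://staging-host/db", "orders", "id")

# diff_tables yields ("+", row) / ("-", row) pairs for rows that exist
# or differ on only one side.
for sign, row in diff_tables(prod, staging):
    print(sign, row)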
datafold • GitHub - datafold/data-diff: Compare tables within or across databases
Nicolay Gerold added 10mo
Optimizing Further
Creating so many indices and aggregating so many tables is sub-optimal. To optimize this, we employ materialized views, which create a separate disk-based entity and hence support indexing. The only downside is that we have to keep it updated.
CREATE MATERIALIZED VIEW search_view AS
  SELECT c.name FROM company c UNION
  SELECT c.na...
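On the two points the excerpt raises (the view "supports indexing" and "we have to keep it updated"), a minimal maintenance sketch assuming Postgres and psycopg2; the DSN and index name are hypothetical, not from the article:

import psycopg2

conn = psycopg2.connect("dbname=levels")  # placeholder connection string
conn.autocommit = True  # CONCURRENTLY statements cannot run inside a transaction block
cur = conn.cursor()

# The materialized view is disk-backed, so it can be indexed like a table.
# UNION already deduplicates, and a UNIQUE index is required by
# REFRESH ... CONCURRENTLY anyway.
cur.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS search_view_name_idx "
    "ON search_view (name)"
)
# Rebuild the view's contents without blocking concurrent readers.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY search_view")
conn.close()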
How Levels.fyi Built Scalable Search with PostgreSQL
Nicolay Gerold added 7mo
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks; a usage sketch follows the feature list below.
Features
- Suite of tools for data processing using LLMs including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits and transient errors
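A usage sketch reconstructed from the databonsai README as I remember it; the category schema and input string are invented, so verify class and module names against the repo:

from databonsai.categorize import BaseCategorizer
from databonsai.llm_providers import OpenAIProvider

provider = OpenAIProvider()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical schema: category name -> description the LLM classifies against.
categories = {
    "Weather": "Insights and remarks about weather conditions.",
    "Sports": "Observations and comments on sports events.",
}
categorizer = BaseCategorizer(categories=categories, llm_provider=provider)
print(categorizer.categorize("It's been raining outside all day"))  # -> "Weather"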
databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.
Nicolay Gerold added 7mo
Programmable platform for data in motion
An open-source data streaming platform with in-line computation capabilities. Apply your custom programs to aggregate, correlate, and transform data records in real-time as they move over the network.
The programmable data streaming platform
Nicolay Gerold added 9mo
This job copied 12m rows from ClickHouse to Snowflake in 16 minutes using:
- 5 CPUs : at $0.192 / CPU hour that comes out to $0.26
- 4.4 GiB of memory: at $0.024 / GiB per hour that comes out to $0.03
Even if Fivetran had a ClickHouse connector (it doesn’t at the time of this writing), syncing 12m rows would cost ~$3300. The total cost of this Modal job...
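Spelling out the arithmetic behind those two bullets (nothing assumed beyond the figures quoted above):

# 16-minute job, billed per resource-hour.
hours = 16 / 60
cpu_cost = 5 * 0.192 * hours    # 5 CPUs at $0.192 per CPU-hour  -> ~$0.26
mem_cost = 4.4 * 0.024 * hours  # 4.4 GiB at $0.024 per GiB-hour -> ~$0.03
print(f"compute total: ${cpu_cost + mem_cost:.2f}")  # ~$0.28, versus ~$3300 quoted for Fivetran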
Why you should move your ETL stack to Modal
Nicolay Gerold added 7mo
(1) The separation between storage and compute, as encouraged by data lake architectures (e.g. the implementation of P would look different in a traditional database like PostgreSQL, or a cloud warehouse like Snowflake). This architecture is the focus of the current system, and it is prevalent in most mid-to-large enterprises (its benefits that be...
Jacopo Tagliabue • Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie.
Nicolay Gerold added 7mo
Overview
Ballista is a distributed compute platform primarily implemented in Rust and powered by Apache Arrow.
Ballista has scheduler and executor processes that are standard Rust executables and can be run directly, but Dockerfiles are provided to build images for use in containerized environments such as Docker, Docker Compose, and Kubernetes...
Overview — Apache Arrow Ballista documentation
Nicolay Gerold added 9mo
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs let developers gather data from different sources and feed it into their applications without disruption.
3️⃣ RudderStack is highly versatile and integrates with over 90 tools and data warehouse destinations...
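For flavor, a track-call sketch in the Segment-compatible style RudderStack's SDKs follow; the module path, attribute names, and every value below are assumptions from memory of the SDK docs, not from the excerpt:

import rudderstack.analytics as rudder_analytics  # assumed module path

rudder_analytics.write_key = "YOUR_WRITE_KEY"  # placeholder credential
rudder_analytics.dataPlaneUrl = "https://your-data-plane.example.com"  # placeholder endpoint

# Segment-style event: user id, event name, free-form properties.
rudder_analytics.track(
    "user-123",
    "Order Completed",
    {"order_id": "A-1001", "revenue": 42.0},
)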
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
data load tool (dlt) — the open-source Python library for data loading
Be it a Google Colab notebook, AWS Lambda function, an Airflow DAG, your local laptop,
or a GPT-4 assisted development playground, dlt can be dropped in anywhere.
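A minimal drop-in sketch; the rows, pipeline name, and DuckDB destination are illustrative choices rather than anything the excerpt prescribes:

import dlt

# Any iterable of dicts can act as a source; these rows are made up.
rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]

pipeline = dlt.pipeline(
    pipeline_name="demo",   # placeholder name
    destination="duckdb",   # assumes dlt's duckdb extra is installed
    dataset_name="mydata",
)
load_info = pipeline.run(rows, table_name="users")
print(load_info)  # summary of what was loaded and where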
dlt-hub • GitHub - dlt-hub/dlt: data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Nicolay Gerold added 9mo