Data Processing
DataTrove
DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a Slurm cluster. Its (relatively) low memory...
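A minimal pipeline sketch in the style of the DataTrove README; the paths, the lambda predicate, and the tasks count are placeholders, not anything the excerpt specifies:

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

# Read JSONL documents, keep only those passing a predicate, write the survivors out.
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/"),                             # placeholder input folder
        LambdaFilter(lambda doc: "hugging" in doc.text),  # placeholder filter rule
        JsonlWriter("filtered-output/"),                  # placeholder output folder
    ],
    tasks=4,  # parallel local tasks
)
executor.run()

Swapping the local executor for the library's Slurm executor is what moves the same pipeline onto a cluster.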
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
Nicolay Gerold added 10mo
What's a Data Diff?
A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies whe...
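A short sketch of the staging-versus-production check described above, using data-diff's Python API; the connection URIs, table name, and key column are placeholders:

from data_diff import connect_to_table, diff_tables

# Placeholder URIs: point these at production and staging.
prod = connect_to_table("postgresql://prod-host/db", "orders", "id")
staging = connect_to_table("postgresql://staging-host/db", "orders", "id")

# diff_tables yields ("+", row) / ("-", row) pairs for rows that exist
# or differ on only one side.
for sign, row in diff_tables(prod, staging):
    print(sign, row)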
datafold • GitHub - datafold/data-diff: Compare tables within or across databases
Nicolay Gerold added 10mo
Optimizing Further
Creating so many indices and aggregating so many tables is sub-optimal. To optimize this, we employ materialized views, which create a separate disk-based entity and hence support indexing. The only downside is that we have to keep it updated.
CREATE MATERIALIZED VIEW search_view AS
  SELECT c.name FROM company c UNION
  SELECT c.na...
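On the two points the excerpt raises (the view "supports indexing" and "we have to keep it updated"), a minimal maintenance sketch assuming Postgres and psycopg2; the DSN and index name are hypothetical, not from the article:

import psycopg2

conn = psycopg2.connect("dbname=levels")  # placeholder connection string
conn.autocommit = True  # CONCURRENTLY statements cannot run inside a transaction block
cur = conn.cursor()

# The materialized view is disk-backed, so it can be indexed like a table.
# UNION already deduplicates, and a UNIQUE index is required by
# REFRESH ... CONCURRENTLY anyway.
cur.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS search_view_name_idx "
    "ON search_view (name)"
)
# Rebuild the view's contents without blocking concurrent readers.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY search_view")
conn.close()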
How Levels.fyi Built Scalable Search with PostgreSQL
Nicolay Gerold added 7mo
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks; a usage sketch follows the feature list below.
Features
- Suite of tools for data processing using LLMs including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits and transient errors
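A usage sketch reconstructed from the databonsai README as I remember it; the category schema and input string are invented, so verify class and module names against the repo:

from databonsai.categorize import BaseCategorizer
from databonsai.llm_providers import OpenAIProvider

provider = OpenAIProvider()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical schema: category name -> description the LLM classifies against.
categories = {
    "Weather": "Insights and remarks about weather conditions.",
    "Sports": "Observations and comments on sports events.",
}
categorizer = BaseCategorizer(categories=categories, llm_provider=provider)
print(categorizer.categorize("It's been raining outside all day"))  # -> "Weather"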
databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.
Nicolay Gerold added 7mo
Programmable platform for data in motion
An open-source data streaming platform with in-line computation capabilities. Apply your custom programs to aggregate, correlate, and transform data records in real-time as they move over the network.
The programmable data streaming platform
Nicolay Gerold added 9mo
This job copied 12m rows from ClickHouse to Snowflake in 16 minutes using:
- 5 CPUs : at $0.192 / CPU hour that comes out to $0.26
- 4.4 GiB of memory: at $0.024 / GiB per hour that comes out to $0.03
Even if Fivetran had a ClickHouse connector (it doesn’t at the time of this writing), syncing 12m rows would cost ~$3300. The total cost of this Modal job...
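Spelling out the arithmetic behind those two bullets (nothing assumed beyond the figures quoted above):

# 16-minute job, billed per resource-hour.
hours = 16 / 60
cpu_cost = 5 * 0.192 * hours    # 5 CPUs at $0.192 per CPU-hour  -> ~$0.26
mem_cost = 4.4 * 0.024 * hours  # 4.4 GiB at $0.024 per GiB-hour -> ~$0.03
print(f"compute total: ${cpu_cost + mem_cost:.2f}")  # ~$0.28, versus ~$3300 quoted for Fivetran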
Why you should move your ETL stack to Modal
Nicolay Gerold added 7mo
(1) The separation between storage and compute, as encouraged by data lake architectures (e.g. the implementation of P would look different in a traditional database like PostgreSQL, or a cloud warehouse like Snowflake). This architecture is the focus of the current system, and it is prevalent in most mid-to-large enterprises (its benefits that be...
Jacopo Tagliabue • Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie.
Nicolay Gerold added 7mo
Overview
Ballista is a distributed compute platform primarily implemented in Rust and powered by Apache Arrow.
Ballista has scheduler and executor processes that are standard Rust executables and can be run directly, but Dockerfiles are provided to build images for use in containerized environments such as Docker, Docker Compose, and Kubernetes...
Overview — Apache Arrow Ballista documentation
Nicolay Gerold added 9mo
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs let developers gather data from different sources and feed it into their applications without disruption.
3️⃣ RudderStack is highly versatile and integrates with over 90 tools and data warehouse destinations...
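For flavor, a track-call sketch in the Segment-compatible style RudderStack's SDKs follow; the module path, attribute names, and every value below are assumptions from memory of the SDK docs, not from the excerpt:

import rudderstack.analytics as rudder_analytics  # assumed module path

rudder_analytics.write_key = "YOUR_WRITE_KEY"  # placeholder credential
rudder_analytics.dataPlaneUrl = "https://your-data-plane.example.com"  # placeholder endpoint

# Segment-style event: user id, event name, free-form properties.
rudder_analytics.track(
    "user-123",
    "Order Completed",
    {"order_id": "A-1001", "revenue": 42.0},
)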
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
data load tool (dlt) — the open-source Python library for data loading
Be it a Google Colab notebook, AWS Lambda function, an Airflow DAG, your local laptop,
or a GPT-4 assisted development playground, dlt can be dropped in anywhere.
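A minimal drop-in sketch; the rows, pipeline name, and DuckDB destination are illustrative choices rather than anything the excerpt prescribes:

import dlt

# Any iterable of dicts can act as a source; these rows are made up.
rows = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]

pipeline = dlt.pipeline(
    pipeline_name="demo",   # placeholder name
    destination="duckdb",   # assumes dlt's duckdb extra is installed
    dataset_name="mydata",
)
load_info = pipeline.run(rows, table_name="users")
print(load_info)  # summary of what was loaded and where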
dlt-hub • GitHub - dlt-hub/dlt: data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Nicolay Gerold added 9mo