Data Processing
What's a Data Diff?
A data diff is the value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality.
There is a lot you can do with data-diff: you can test SQL code by comparing development or staging environment data to production, or compare source and target data to identify discrepancies...
datafold • GitHub - datafold/data-diff: Compare tables within or across databases
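To make "value-level comparison" concrete, here is a minimal sketch in plain Python. It is not the data-diff API; the function name `diff_tables`, the sample rows, and the `id` key column are all assumptions made for illustration. It matches rows on a key and reports added rows, removed rows, and per-column value changes.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(source: list[Row], target: list[Row], key: str) -> dict[str, list]:
    """Compare two tables row by row, matching rows on a key column.

    Hypothetical helper for illustration only; not part of data-diff.
    """
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}

    # Keys present on only one side are removed or added rows.
    removed = sorted(src.keys() - tgt.keys())
    added = sorted(tgt.keys() - src.keys())

    # For keys present on both sides, record each column whose value differs.
    changed = []
    for k in sorted(src.keys() & tgt.keys()):
        columns = sorted(src[k].keys() | tgt[k].keys())
        deltas = {
            c: (src[k].get(c), tgt[k].get(c))
            for c in columns
            if src[k].get(c) != tgt[k].get(c)
        }
        if deltas:
            changed.append((k, deltas))

    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data standing in for a production and a development table.
prod = [
    {"id": 1, "plan": "free", "seats": 1},
    {"id": 2, "plan": "pro", "seats": 5},
    {"id": 3, "plan": "pro", "seats": 9},
]
dev = [
    {"id": 1, "plan": "free", "seats": 1},
    {"id": 2, "plan": "team", "seats": 5},
    {"id": 4, "plan": "free", "seats": 2},
]

result = diff_tables(prod, dev, key="id")
print(result)
```

This sketch loads both tables into memory, which only works for small inputs; comparing large tables within or across databases needs a different strategy.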
The backbone for versatile AI
Meet Instill Cloud, a no-code/low-code platform that accelerates AI application development by 10x. Effortlessly connect to diverse data sources, seamlessly integrate AI models, and deploy customized logic for your projects, no matter how complex, with lightning speed.
Instill AI
Optimizing Further
Creating so many indices and aggregating so many tables is sub-optimal. To optimize this, we employ materialized views, which create a separate disk-based entity and hence support indexing. The only downside is that we have to keep it updated.
CREATE MATERIALIZED VIEW search_view AS
  SELECT c.name FROM company c UNION
  SELECT...
How Levels.fyi Built Scalable Search with PostgreSQL
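The trade-off described above (a stored, indexable result set that must be kept updated) can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so this sketch emulates one with an ordinary table that is rebuilt on demand; in PostgreSQL the rebuild step would be `REFRESH MATERIALIZED VIEW`. The `city` table and the helper names are hypothetical and do not come from the article, whose full query is truncated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE company (name TEXT);
    CREATE TABLE city (name TEXT);  -- hypothetical second source table
    INSERT INTO company VALUES ('Acme'), ('Globex');
    INSERT INTO city VALUES ('Austin'), ('Acme');
    """
)


def refresh_search_view(conn: sqlite3.Connection) -> None:
    """Rebuild the stored result set (stand-in for REFRESH MATERIALIZED VIEW)."""
    conn.execute("DROP TABLE IF EXISTS search_view")
    conn.execute(
        "CREATE TABLE search_view AS "
        "SELECT name FROM company UNION SELECT name FROM city"
    )
    # Because the result is stored as a real table, it can carry its own index.
    conn.execute("CREATE INDEX search_view_name_idx ON search_view (name)")


def view_names(conn: sqlite3.Connection) -> list[str]:
    """Read every name currently stored in the emulated view."""
    return [row[0] for row in conn.execute("SELECT name FROM search_view ORDER BY name")]


refresh_search_view(conn)
initial = view_names(conn)

# New base-table rows do not appear until the view is refreshed.
conn.execute("INSERT INTO company VALUES ('Initech')")
stale = view_names(conn)
refresh_search_view(conn)
fresh = view_names(conn)

print(initial, stale, fresh, sep="\n")
```

The stale read in the middle is the downside the excerpt mentions: queries against the stored result are cheap, but something has to trigger the refresh.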
DataTrove
DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.
DataTrove processing pipelines are platform-agnostic, running out of the box locally or on a Slurm cluster. Its (relatively) low memory...
huggingface • GitHub - huggingface/datatrove: Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
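The idea of chaining prebuilt and custom processing blocks can be illustrated with plain Python generators. This is a sketch of the general pattern only and does not use DataTrove's classes; every name below (`normalize_whitespace`, `min_length_filter`, `exact_dedup`, `run_pipeline`) is hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# A block takes a stream of documents and yields a (possibly smaller) stream.
Block = Callable[[Iterable[str]], Iterator[str]]


def normalize_whitespace(docs: Iterable[str]) -> Iterator[str]:
    """Processing block: collapse runs of whitespace."""
    for doc in docs:
        yield " ".join(doc.split())


def min_length_filter(min_words: int) -> Block:
    """Filter block factory: keep documents with at least `min_words` words."""
    def block(docs: Iterable[str]) -> Iterator[str]:
        for doc in docs:
            if len(doc.split()) >= min_words:
                yield doc
    return block


def exact_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Dedup block: drop case-insensitive exact duplicates via a hash set."""
    seen: set[str] = set()
    for doc in docs:
        fingerprint = hashlib.sha1(doc.lower().encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            yield doc


def run_pipeline(docs: Iterable[str], blocks: list[Block]) -> list[str]:
    """Feed the documents through each block in order."""
    stream: Iterable[str] = docs
    for block in blocks:
        stream = block(stream)
    return list(stream)


corpus = [
    "The quick  brown fox jumps",
    "the quick brown fox jumps",
    "too short",
    "A completely different document here",
]
cleaned = run_pipeline(corpus, [normalize_whitespace, min_length_filter(4), exact_dedup])
print(cleaned)
```

Because each block is a generator, documents stream through one at a time and only the dedup block holds state (its set of fingerprints).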
Most commonly, ETL means moving data from some source system (e.g. a production database, Slack API) into an analytical data warehouse (e.g. Snowflake) where the data is easier to combine and analyze. Most data teams use a vendor like Fivetran or an orchestration platform like Airflow to do this.
Modal is a great solution for ETL if you are...
Why you should move your ETL stack to Modal
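The extract, transform, load flow described above can be sketched in a few lines of standard-library Python. This is not Modal, Fivetran, or Airflow code: the hard-coded records stand in for a source system, an in-memory SQLite database stands in for the warehouse, and all function and column names are assumptions.

```python
import sqlite3
from datetime import datetime


def extract() -> list[dict]:
    """Stand-in for a source system; returns made-up sample records."""
    return [
        {"customer": "ana", "amount_cents": "1250", "ts": "2024-01-05T10:00:00+00:00"},
        {"customer": "ben", "amount_cents": "300", "ts": "2024-01-05T11:30:00+00:00"},
        {"customer": "ana", "amount_cents": "not-a-number", "ts": "2024-01-06T09:15:00+00:00"},
    ]


def transform(records: list[dict]) -> list[tuple]:
    """Convert cents to currency units, reduce timestamps to dates, skip bad rows."""
    rows = []
    for rec in records:
        try:
            amount = int(rec["amount_cents"]) / 100
        except ValueError:
            continue  # malformed amount: drop the record
        day = datetime.fromisoformat(rec["ts"]).date().isoformat()
        rows.append((rec["customer"], amount, day))
    return rows


def load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Write the cleaned rows into the analytical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer TEXT, amount REAL, day TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))

totals = warehouse.execute(
    "SELECT customer, SUM(amount) FROM payments GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping the three stages as separate functions means any one of them can be swapped out, for example replacing `extract` with a real API client, without touching the others.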
Programmable platform for data in motion
An open-source data streaming platform with in-line computation capabilities. Apply your custom programs to aggregate, correlate, and transform data records in real-time as they move over the network.
The programmable data streaming platform
data load tool (dlt) — the open-source Python library for data loading
Be it a Google Colab notebook, AWS Lambda function, an Airflow DAG, your local laptop, or a GPT-4 assisted development playground, dlt can be dropped in anywhere.
dlt-hub • GitHub - dlt-hub/dlt: data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Overview
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow.
Ballista has a scheduler and an executor process that are standard Rust executables and can be executed directly, but Dockerfiles are provided to build images for use in containerized environments, such as Docker, Docker Compose, and...
Overview — Apache Arrow Ballista documentation
SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine. It can be used to format SQL or translate between 20 different dialects like DuckDB, Presto / Trino, Spark / Databricks, Snowflake, and BigQuery. It aims to read a wide variety of SQL inputs and output syntactically and semantically correct SQL in the targeted dialects.
It is...