Data Processing
Koheesio
CI/CD
Package
Meta
Koheesio, named after the Finnish word for cohesion, is a robust Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
The framework is versatile, aiming to support multiple implementations and working sea...
GitHub - Nike-Inc/koheesio: Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.
Nicolay Gerold added 6mo
Traditional ETL solutions are still quite powerful when it comes to:
- Common connectors with small-to-medium data volumes: we still have a lot of respect for companies like Fivetran, who have really nailed the user experience for the most common ETL use cases, like syncing Zendesk tickets or a production Postgres read replica into Snowflake. The only
Why you should move your ETL stack to Modal
Nicolay Gerold added 7mo
This job copied 12m rows from ClickHouse to Snowflake in 16 minutes using:
- 5 CPUs: at $0.192 / CPU hour that comes out to $0.26
- 4.4 GiB of memory: at $0.024 / GiB per hour that comes out to $0.03
Even if Fivetran had a ClickHouse connector (it doesn’t at the time of this writing), syncing 12m rows would cost ~$3300. The total cost of this Modal job...
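The two line items quoted above can be checked with a little arithmetic. The sketch below assumes both rates are billed pro rata over the 16-minute run (an assumption; the excerpt only gives the hourly rates), and the variable names are illustrative. The sum covers only the two items listed, which may not be everything the truncated sentence counts toward the total.

```python
# Rough cost check for the job described above, assuming pro-rata billing
# over the 16-minute runtime (an assumption; the excerpt only gives rates).
RUNTIME_HOURS = 16 / 60

CPU_COUNT = 5
CPU_RATE_PER_HOUR = 0.192      # dollars per CPU hour, from the text

MEMORY_GIB = 4.4
MEMORY_RATE_PER_HOUR = 0.024   # dollars per GiB hour, from the text

cpu_cost = CPU_COUNT * CPU_RATE_PER_HOUR * RUNTIME_HOURS
memory_cost = MEMORY_GIB * MEMORY_RATE_PER_HOUR * RUNTIME_HOURS
listed_total = cpu_cost + memory_cost  # sum of the two listed items only

print(f"CPU:    ${cpu_cost:.2f}")
print(f"Memory: ${memory_cost:.2f}")
print(f"Listed: ${listed_total:.2f}")
```

Rounded to cents, the CPU and memory figures match the $0.26 and $0.03 given in the list.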
Why you should move your ETL stack to Modal
Nicolay Gerold added 7mo
Most commonly, ETL means moving data from some source system (e.g. a production database, Slack API) into an analytical data warehouse (e.g. Snowflake) where the data is easier to combine and analyze. Most data teams use a vendor like Fivetran or an orchestration platform like Airflow to do this.
Modal is a great solution for ETL if you are primaril...
Why you should move your ETL stack to Modal
Nicolay Gerold added 7mo
1️⃣ RudderStack provides data pipelines to collect data from applications, websites and SaaS platforms.
2️⃣ Its API architecture and SDKs ensure developers can gather data from different sources and leverage them into their applications without disruptions.
3️⃣ RudderStack is highly versatile and integrates with 90+ tools and data warehouse dest...
Bap • Our 5 favourite open-source customer data platforms
Nicolay Gerold added 7mo
Clean & curate your data with LLMs
databonsai is a Python library that uses LLMs to perform data cleaning tasks.
Features
- Suite of tools for data processing using LLMs including categorization, transformation, and extraction
- Validation of LLM outputs
- Batch processing for token savings
- Retry logic with exponential backoff for handling rate limits an
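As an illustration of the last feature in the list, here is a minimal, generic sketch of retry logic with exponential backoff. It is not databonsai's implementation or API: the function name `retry_with_backoff`, its parameters, and the 10% jitter are all assumptions made for this example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing waits.

    Hypothetical helper for illustration; not part of databonsai.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Add up to 10% random jitter so parallel clients do not
            # all retry at the same instant.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the helper easy to test: a caller can substitute a function that records the delays instead of actually waiting. In practice one would usually catch only the rate-limit or timeout exceptions of the client in use rather than every `Exception`.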
databonsai • GitHub - databonsai/databonsai: clean & curate your data with LLMs.
Nicolay Gerold added 7mo
(1) The separation between storage and compute, as encouraged by data lake architectures (e.g. the implementation of P would look different in a traditional database like PostgreSQL, or a cloud warehouse like Snowflake). This architecture is the focus of the current system, and it is prevalent in most mid-to-large enterprises (its benefits that be...
Jacopo Tagliabue • Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie.
Nicolay Gerold added 7mo
Optimizing Further
Creating so many indices and aggregating so many tables is sub-optimal. To optimize this, we employ materialized views, which create a separate disk-based entity and hence support indexing. The only downside is that we have to keep it updated.
CREATE MATERIALIZED VIEW search_view AS
  SELECT c.name FROM company c UNION
  SELECT c.na...
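The trade-off described above (an indexable on-disk copy that must be kept up to date) can be tried out locally. The sketch below is an emulation, not the article's setup: it uses Python's built-in `sqlite3`, and since SQLite has no materialized views, a plain table rebuilt on demand stands in for one. In PostgreSQL the refresh step would be `REFRESH MATERIALIZED VIEW search_view`. The `job_title` table, the sample rows, and the index name are hypothetical, chosen only because the second `SELECT` in the excerpt is truncated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (name TEXT);
    CREATE TABLE job_title (name TEXT);   -- hypothetical second source table
    INSERT INTO company VALUES ('Acme'), ('Globex');
    INSERT INTO job_title VALUES ('Engineer');
""")


def refresh_search_view(conn):
    """Rebuild the stored copy of the UNION query and its index.

    Emulates PostgreSQL's REFRESH MATERIALIZED VIEW search_view.
    """
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS search_view")
        conn.execute("""
            CREATE TABLE search_view AS
              SELECT c.name FROM company c UNION
              SELECT j.name FROM job_title j
        """)
        # Because the result is stored on disk, it can be indexed.
        conn.execute("CREATE INDEX search_view_name_idx ON search_view (name)")


def search_names(conn):
    rows = conn.execute("SELECT name FROM search_view ORDER BY name")
    return [row[0] for row in rows]


refresh_search_view(conn)
conn.execute("INSERT INTO company VALUES ('Initech')")
stale = search_names(conn)   # the new company is not visible yet
refresh_search_view(conn)
fresh = search_names(conn)   # visible after the refresh
print(stale)
print(fresh)
```

The stale read before the second refresh is exactly the downside the excerpt mentions: the stored copy only reflects the source tables as of the last refresh.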
How Levels.fyi Built Scalable Search with PostgreSQL
Nicolay Gerold added 7mo
Spice.ai OSS
What is Spice?
Spice is a small, portable runtime that provides developers with a unified SQL query interface to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Spice makes it easy to build data-driven and data-intensive applications by streamlining the use of data and mach...
spiceai • GitHub - spiceai/spiceai: A unified SQL query interface and portable runtime to locally materialize, accelerate, and query data tables sourced from any database, data warehouse, or data lake.
Nicolay Gerold added 8mo
What is Hatchet?
Hatchet replaces difficult-to-manage legacy queues or pub/sub systems so you can design durable workloads that recover from failure and solve for problems like concurrency, fairness, and rate limiting. Instead of managing your own task queue or pub/sub system, you can use Hatchet to distribute your functions between a set of wor...
hatchet-dev • GitHub - hatchet-dev/hatchet: A distributed, fault-tolerant task queue
Nicolay Gerold added 8mo