
So I wrote a 5400-word lecture note on the basics of data engineering for my students, covering:
* data formats (row- vs. column-based, text vs. binary)
* ETL
* batch processing vs. stream processing
* training datasets
WIP. Feedback much...
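
To make the row- vs. column-based distinction from the list above concrete, here is a minimal sketch in plain Python. The records and field names (`user_id`, `country`, `spend`) are made up for illustration and do not come from the lecture note; real columnar formats add encoding and compression on top of this basic layout idea.

```python
# Hypothetical records; the field names are illustrative only.
rows = [
    {"user_id": 1, "country": "US", "spend": 12.5},
    {"user_id": 2, "country": "DE", "spend": 7.0},
    {"user_id": 3, "country": "US", "spend": 3.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: fetching one whole record is a single lookup.
second_record = rows[1]

# Column layout: aggregating one field touches only that field's values.
total_spend = sum(columns["spend"])
```

The row layout suits workloads that read or write whole records, while the column layout suits scans and aggregates over a few fields.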

Create a structured dataset using Instructor by @jxnlco
@pydantic, the backbone of Instructor, enables high customization with datatype hints for schema validation and seamlessly integrates with @lancedb for direct data insertion.
Check it out: https://t.co/IhJtQy14iv https://t.co/xcuHIOJEtX
ETL
The part of the system I'm most proud of, and on which I spent the most effort, is the ETL process.
We had a series of shell scripts for each data source we ingested (there were many), which would pull the data and put it in an S3 bucket.
Then, early in the morning, a cron job would spin up an EC2 instance, which would pull in the latest ETL code...
Bill Mill • notes.billmill.org
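
The shape of the pipeline in the excerpt (per-source pull scripts drop raw files in a staging area, then a scheduled job processes them) can be sketched roughly as follows. This is not the author's code: a local temporary directory stands in for the S3 bucket, SQLite stands in for whatever the real destination was, and the function names, CSV layout, and `facts` table are all hypothetical. The excerpt is cut off before it says what the ETL code did, so the transform and load steps here are purely illustrative.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract(source_name, raw_text, staging_dir):
    """Stand-in for a per-source pull script: drop the raw file into staging."""
    path = Path(staging_dir) / f"{source_name}.csv"
    path.write_text(raw_text)
    return path


def transform(path):
    """Parse a staged CSV and normalise it into (source, name, value) tuples."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            yield (path.stem, row["name"].strip().lower(), float(row["value"]))


def load(conn, records):
    """Insert transformed records into a single (hypothetical) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts (source TEXT, name TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(staging_dir, conn):
    """The scheduled job: process every staged file in a stable order."""
    for path in sorted(Path(staging_dir).glob("*.csv")):
        load(conn, transform(path))


staging = tempfile.mkdtemp()  # local directory standing in for the bucket
extract("source_a", "name,value\n Alpha ,1.5\nBeta,2\n", staging)
extract("source_b", "name,value\nGamma,3\n", staging)

conn = sqlite3.connect(":memory:")
run_etl(staging, conn)
loaded = conn.execute(
    "SELECT source, name, value FROM facts ORDER BY source, name"
).fetchall()
```

Keeping extraction separate from the scheduled transform-and-load step means a failed run can be retried against the already staged files without pulling from the sources again.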

LLMs have made exciting progress on hard tasks! But they still struggle to analyze complex, unstructured documents (including today's Gemini 1.5 Pro 002).
We (UC Berkeley) built 📜DocETL, an open-source, low-code system for LLM-powered data processing: https://t.co/VmJ1zyre6m

🚨BREAKING: New Python library for agentic data processing and ETL with AI
Introducing DocETL.
Here's what you need to know: https://t.co/94glNVRQfX

