DuckDB has looked interesting to me for a while.
You can query a wide variety of tabular data sources, e.g. `csv` and `parquet` files, *without* running a server.
2019 - 2023
- Data arrives as `.csv` / Excel files (via email or WhatsApp)
- The file sits in `~/Downloads` after viewing
- A couple of months later, I've completely forgotten what the data was for
2023 - 2025
- Load as much data as possible (without context) into Google BigQuery
- Scrapers insert data without de-duplication by primary (or composite primary) key
- Rely on dbt to de-duplicate with the usual window-function pattern:
```sql
ROW_NUMBER() OVER (
    PARTITION BY <key columns>
    ORDER BY <recency column> DESC
) AS rn
```
- End up with a cluttered, messy data warehouse that no one wants to use
Key takeaways
- Only quality data should enter the warehouse (less is more)
Planned workflow
Set up Marimo for data notebooks
(I use a `docker-compose.yml` file, but you can run it directly with Python)
```shell
uv pip install --system --upgrade "marimo[sql]" \
    --no-cache-dir && \
marimo edit /app \
    --host 0.0.0.0 \
    --port 8501 \
    --no-token
```
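For reference, a minimal `docker-compose.yml` along these lines, wrapping the same install-and-edit command. The service name, base image, and host volume path are my assumptions, not the original setup:

```yaml
services:
  marimo:
    image: python:3.12-slim          # assumed base image
    working_dir: /app
    volumes:
      - ./notebooks:/app             # assumed host path for notebooks
    ports:
      - "8501:8501"
    env_file:
      - .env                         # passes motherduck_token into the container
    command: >
      sh -c "pip install --no-cache-dir 'marimo[sql]' &&
             marimo edit /app --host 0.0.0.0 --port 8501 --no-token"
```

The `env_file` entry matters for the MotherDuck step below: DuckDB looks for the token in the environment.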
Use MotherDuck as the data warehouse
1. Create an access token in the MotherDuck console
2. Save the token in a `.env` file
```
motherduck_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
```
3. Start your Docker container
```
docker-compose up --build -d
```
4. In a new marimo notebook, connect to your Motherduck instance
```python
import duckdb

# Connect to MotherDuck; DuckDB authenticates using the
# motherduck_token environment variable
duckdb.sql("ATTACH IF NOT EXISTS 'md:db'")
```
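The token has to be in the environment before the `ATTACH` runs. A minimal stdlib-only sketch of loading it from the `.env` file, assuming plain `KEY=value` lines (in practice you might reach for `python-dotenv` instead):

```python
import os

def load_env(path=".env"):
    """Read simple KEY=value lines into os.environ (skips comments and blanks)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

Call `load_env()` before the `ATTACH` statement so DuckDB can pick up `motherduck_token`. (Inside the Docker setup above, `env_file` makes this unnecessary.)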
5. This line feels magical: using the DuckDB engine, SQL can select directly from a `.csv` file
```sql
SELECT * FROM 'md/mpob_closing_stocks_by_products.csv'
```
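When DuckDB's auto-detection guesses a delimiter or type wrong, the same query can go through the `read_csv` table function with explicit options; a sketch (the options shown are the common knobs, not ones this file necessarily needs):

```sql
SELECT *
FROM read_csv('md/mpob_closing_stocks_by_products.csv',
              header = true,
              delim = ',');
```

A glob like `FROM 'md/*.csv'` also works, reading every matching file as one table.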
6. I write some code to de-duplicate the data and insert it into the `mpob` schema in my data warehouse (be sure to run `USE db`, or DuckDB defaults to writing into the in-memory `memory` database)
```sql
USE db;
CREATE SCHEMA IF NOT EXISTS mpob;
CREATE TABLE mpob.closing_stocks_by_product AS
SELECT
    -- Metadata
    report_year,
    report_no,
    report_title,
    -- Data
    product,
    period,
    value AS mt,
    updated_at
FROM (
    SELECT
        value,
        product,
        period,
        report_year,
        report_title,
        report_no,
        data_source,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY product, period
            ORDER BY updated_at DESC
        ) AS rn
    FROM 'md/mpob_closing_stocks_by_products.csv'
) t
WHERE rn = 1;
```
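Note that `CREATE TABLE ... AS` errors if the table already exists, which bites when a notebook cell gets re-run. `CREATE OR REPLACE TABLE` is the re-runnable variant; only the first line of the query changes:

```sql
CREATE OR REPLACE TABLE mpob.closing_stocks_by_product AS
-- ... same de-duplication query as above
```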
Use `COPY FROM DATABASE` to transfer tables from the in-memory database to `my_database.db`
```sql
ATTACH 'my_database.db';
COPY FROM DATABASE memory TO my_database;
DETACH my_database;
```