DuckDB has looked interesting to me for a while.
You can query a wide variety of tabular data sources, e.g. `csv` and `parquet` files, *without* running a server.
2019 - 2023
- Data arrives as `.csv` / Excel files (via email or WhatsApp)
- The file sits in `~/Downloads` after viewing
- A couple of months later, I've completely forgotten what the data was for
2023 - 2025
- Load as much data as possible (without context) into Google BigQuery
- Scrapers insert data without de-duplication by primary (or composite primary) key
- Rely on dbt to de-duplicate with the usual window-function pattern:
```sql
ROW_NUMBER() OVER (
    PARTITION BY <key columns>
    ORDER BY <recency column> DESC
) AS rn
```
- End up with a cluttered, messy data warehouse that no one wants to use
Key takeaways
- Only quality data should enter the warehouse (less is more)
Planned workflow
Set up Marimo for data notebooks
(I use a `docker-compose.yml` file, but you can run it directly with Python)
```shell
uv pip install --system --upgrade "marimo[sql]" \
    --no-cache-dir && \
marimo edit /app \
    --host 0.0.0.0 \
    --port 8501 \
    --no-token
```
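For reference, a minimal `docker-compose.yml` along these lines, wrapping the same install-and-edit command. The service name, base image, and host volume path are my assumptions, not the original setup:

```yaml
services:
  marimo:
    image: python:3.12-slim          # assumed base image
    working_dir: /app
    volumes:
      - ./notebooks:/app             # assumed host path for notebooks
    ports:
      - "8501:8501"
    env_file:
      - .env                         # passes motherduck_token into the container
    command: >
      sh -c "pip install --no-cache-dir 'marimo[sql]' &&
             marimo edit /app --host 0.0.0.0 --port 8501 --no-token"
```

The `env_file` entry matters for the MotherDuck step below: DuckDB looks for the token in the environment.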
Use MotherDuck as the data warehouse
1. Create an access token in the MotherDuck console
2. Save the token in a `.env` file
```
motherduck_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9
```
3. Start your Docker container
```
docker-compose up --build -d
```
4. In a new marimo notebook, connect to your Motherduck instance
```python
import duckdb

# Connect to MotherDuck; DuckDB authenticates using the
# motherduck_token environment variable
duckdb.sql("ATTACH IF NOT EXISTS 'md:db'")
```
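The token has to be in the environment before the `ATTACH` runs. A minimal stdlib-only sketch of loading it from the `.env` file, assuming plain `KEY=value` lines (in practice you might reach for `python-dotenv` instead):

```python
import os

def load_env(path=".env"):
    """Read simple KEY=value lines into os.environ (skips comments and blanks)."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```

Call `load_env()` before the `ATTACH` statement so DuckDB can pick up `motherduck_token`. (Inside the Docker setup above, `env_file` makes this unnecessary.)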
5. This line feels magical: using the DuckDB engine, SQL can select directly from a `.csv` file
```sql
SELECT * FROM 'md/mpob_closing_stocks_by_products.csv'
```
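When DuckDB's auto-detection guesses a delimiter or type wrong, the same query can go through the `read_csv` table function with explicit options; a sketch (the options shown are the common knobs, not ones this file necessarily needs):

```sql
SELECT *
FROM read_csv('md/mpob_closing_stocks_by_products.csv',
              header = true,
              delim = ',');
```

A glob like `FROM 'md/*.csv'` also works, reading every matching file as one table.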
6. I write some code to de-duplicate the data and insert it into the `mpob` schema in my data warehouse (be sure to run `USE db`, or DuckDB defaults to writing into the in-memory `memory` database)
```sql
USE db;
CREATE SCHEMA IF NOT EXISTS mpob;
CREATE TABLE mpob.closing_stocks_by_product AS
SELECT
    -- Metadata
    report_year,
    report_no,
    report_title,
    -- Data
    product,
    period,
    value AS mt,
    updated_at
FROM (
    SELECT
        value,
        product,
        period,
        report_year,
        report_title,
        report_no,
        data_source,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY product, period
            ORDER BY updated_at DESC
        ) AS rn
    FROM 'md/mpob_closing_stocks_by_products.csv'
) t
WHERE rn = 1;
```
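Note that `CREATE TABLE ... AS` errors if the table already exists, which bites when a notebook cell gets re-run. `CREATE OR REPLACE TABLE` is the re-runnable variant; only the first line of the query changes:

```sql
CREATE OR REPLACE TABLE mpob.closing_stocks_by_product AS
-- ... same de-duplication query as above
```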
Use `COPY FROM DATABASE` to transfer tables from the in-memory database to `my_database.db`
```sql
ATTACH 'my_database.db';
COPY FROM DATABASE memory TO my_database;
DETACH my_database;
```