[Avg. reading time: 17 minutes]

Duck DB

DuckDB is a fast, open-source, in-process analytical database designed for efficient data analysis.

Its Portable - Single Binary - No external dependencies.

Key Features

Embeddable Design: Integrates directly into applications without the need for a separate server, simplifying deployment.

Columnar Storage: Optimized for analytical workloads by storing data in columns, enhancing query performance.

SQL Support: Offers a robust SQL interface for complex queries, compatible with various programming languages.

Extensibility: Supports extensions for custom data types, functions, and file formats.

Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.

Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.

Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.

Download the CLI Client

Windows
Mac
Linux).
For other programming languages, visit https://duckdb.org/docs/installation/
Unzip the file.
Open Command / Terminal and run the Executable.

DuckDB in Data Engineering

Download orders.parquet from

https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet

More files are available here

https://github.com/cwida/duckdb-data/releases/

Open Command Prompt or Terminal

duckdb –ui orders.duckdb

duckdb

# Create / Open a database

.open orders.duckdb

Duckdb allows you to read the contents of orders.parquet as is without needing a table. Double quotes around the file name orders.parquet is essential.

describe table  "orders.parquet"

select * from "orders.parquet" limit 3;

DuckDB supports CTAS syntax and helps to create tables from the actual file.

show tables;

create table orders  as select * from "orders.parquet";

select count(*) from orders;

DuckDB supports parallel query processing, and queries run fast.

This table has 1.5 million rows, and aggregation happens in less than a second.

select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();

DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.

COPY "orders.parquet" to 'orders.csv'  (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;

It also supports exporting existing Tables to Parquet files.

COPY "orders" to  'neworder.parquet' (FORMAT "PARQUET");

DuckDB supports Programming languages such as Python, R, JAVA, node.js, C/C++.

DuckDB ably supports Higher-level SQL programming such as Macros, Sequences, Window Functions.

Get sample data from Yellow Cab

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Copy yellow cabs data into yellowcabs folder

create table taxi_trips as select * from "yellowcabs/*.parquet";

SELECT
    PULocationID,
    EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
    AVG(fare_amount) AS avg_fare
FROM
    taxi_trips
GROUP BY
    PULocationID,
    hour_of_day;

Extensions

https://duckdb.org/docs/extensions/overview

INSTALL json;
LOAD json;

select * from 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json';

describe 'select * from https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json';

with cte as (
SELECT 
  unnest(features) as data
FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json')
select data.type, data.properties.name from cte

Load directly from HTTP location

select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'

Press CTRL+D to quit the CLI.

Duck DB in IoT

Edge Analytics: DuckDB enables “edge analytics, where the Python program performs real-time analysis of the data close to the source.” This reduces the need to send all raw data to the cloud, decreases latency, and enables immediate insights like anomaly detection.

Data Preprocessing and Aggregation: DuckDB is “lightweight and designed for complex analytics queries, making it ideal for handling aggregated IoT data on the edge before it is sent to more comprehensive reporting tools.”

Real-Time Processing: Enables immediate analysis of streaming data from sensors, facilitating prompt decision-making.

Privacy: Personal data analysis on edge devices, and pre-processing data for machine learning, especially when data privacy is paramount.

Resource Efficiency: Its minimal resource footprint makes it suitable for deployment on devices with limited computational capabilities.

Popular Edge Devices that support DuckDB

Single-Board computers

Raspberry Pi
Orange Pi 5 Plus
ASUS Tinker Board S
Banana Pi M5
Rock Pi 4

---Ver 6.0.5

Adv - IoT Upper Stack