[Avg. reading time: 17 minutes]
Duck DB
DuckDB is a fast, open-source, in-process analytical database designed for efficient data analysis.
Its Portable - Single Binary - No external dependencies.
Key Features
Embeddable Design: Integrates directly into applications without the need for a separate server, simplifying deployment.
Columnar Storage: Optimized for analytical workloads by storing data in columns, enhancing query performance.
SQL Support: Offers a robust SQL interface for complex queries, compatible with various programming languages.
Extensibility: Supports extensions for custom data types, functions, and file formats.
Automatic Parallelism: DuckDB has improved its automatic parallelism capabilities, meaning it can more effectively utilize multiple CPU cores without requiring manual tuning. This results in faster query execution for large datasets.
Parquet File Improvements: DuckDB has improved its handling of Parquet files, both in terms of reading speed and support for more complex data types and compression codecs. This makes DuckDB an even better choice for working with large datasets stored in Parquet format.
Query Caching: Improves the performance of repeated queries by caching the results of previous executions. This can be a game-changer for analytics workloads with similar queries being run multiple times.
Download the CLI Client
-
Linux).
-
For other programming languages, visit https://duckdb.org/docs/installation/
-
Unzip the file.
-
Open Command / Terminal and run the Executable.
DuckDB in Data Engineering
Download orders.parquet from
https://github.com/duckdb/duckdb-data/releases/download/v1.0/orders.parquet
More files are available here
https://github.com/cwida/duckdb-data/releases/
Open Command Prompt or Terminal
duckdb –ui orders.duckdb
or
duckdb
# Create / Open a database
.open orders.duckdb
Duckdb allows you to read the contents of orders.parquet as is without needing a table. Double quotes around the file name orders.parquet is essential.
describe table "orders.parquet"
select * from "orders.parquet" limit 3;
DuckDB supports CTAS syntax and helps to create tables from the actual file.
show tables;
create table orders as select * from "orders.parquet";
select count(*) from orders;
DuckDB supports parallel query processing, and queries run fast.
This table has 1.5 million rows, and aggregation happens in less than a second.
select now(); select o_orderpriority,count(*) cnt from orders group by o_orderpriority; select now();
DuckDB also helps to convert parquet files to CSV in a snap. It also supports converting CSV to Parquet.
COPY "orders.parquet" to 'orders.csv' (FORMAT "CSV", HEADER 1);Select * from "orders.csv" limit 3;
It also supports exporting existing Tables to Parquet files.
COPY "orders" to 'neworder.parquet' (FORMAT "PARQUET");
DuckDB supports Programming languages such as Python, R, JAVA, node.js, C/C++.
DuckDB ably supports Higher-level SQL programming such as Macros, Sequences, Window Functions.
Get sample data from Yellow Cab
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
Copy yellow cabs data into yellowcabs folder
create table taxi_trips as select * from "yellowcabs/*.parquet";
SELECT
PULocationID,
EXTRACT(HOUR FROM tpep_pickup_datetime) AS hour_of_day,
AVG(fare_amount) AS avg_fare
FROM
taxi_trips
GROUP BY
PULocationID,
hour_of_day;
Extensions
https://duckdb.org/docs/extensions/overview
INSTALL json;
LOAD json;
select * from 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json';
describe 'select * from https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json';
with cte as (
SELECT
unnest(features) as data
FROM 'https://github.com/duckdb/duckdb-data/releases/download/v1.0/canada.json')
select data.type, data.properties.name from cte
Load directly from HTTP location
select * from 'https://raw.githubusercontent.com/gchandra10/filestorage/main/sales_100.csv'
Press CTRL+D to quit the CLI.
Duck DB in IoT
Edge Analytics: DuckDB enables “edge analytics, where the Python program performs real-time analysis of the data close to the source.” This reduces the need to send all raw data to the cloud, decreases latency, and enables immediate insights like anomaly detection.
Data Preprocessing and Aggregation: DuckDB is “lightweight and designed for complex analytics queries, making it ideal for handling aggregated IoT data on the edge before it is sent to more comprehensive reporting tools.”
Real-Time Processing: Enables immediate analysis of streaming data from sensors, facilitating prompt decision-making.
Privacy: Personal data analysis on edge devices, and pre-processing data for machine learning, especially when data privacy is paramount.
Resource Efficiency: Its minimal resource footprint makes it suitable for deployment on devices with limited computational capabilities.
Popular Edge Devices that support DuckDB
Single-Board computers
- Raspberry Pi
- Orange Pi 5 Plus
- ASUS Tinker Board S
- Banana Pi M5
- Rock Pi 4
---