{"id":709,"date":"2026-04-13T15:38:06","date_gmt":"2026-04-13T07:38:06","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=709"},"modified":"2026-04-13T15:38:06","modified_gmt":"2026-04-13T07:38:06","slug":"an-implementation-guide-to-building-a-duckdb-python-analytics-pipeline-with-sql-dataframes-parquet-udfs-and-performance-profiling","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=709","title":{"rendered":"An Implementation Guide to Building a DuckDB-Python Analytics Pipeline with SQL, DataFrames, Parquet, UDFs, and Performance Profiling"},"content":{"rendered":"<p>In this tutorial, we build a comprehensive, hands-on understanding of <a href=\"https:\/\/github.com\/duckdb\/duckdb-python\"><strong>DuckDB-Python<\/strong><\/a> by working through its features directly in code on Colab. We start with the fundamentals of connection management and data generation, then move into real analytical workflows, including querying Pandas, Polars, and Arrow objects without manual loading, transforming results across multiple formats, and writing expressive SQL for window functions, pivots, macros, recursive CTEs, and joins. 
As we progress, we also explore performance-oriented capabilities such as bulk insertion, profiling, partitioned storage, multi-threaded access, remote file querying, and efficient export patterns, so we not only learn what DuckDB can do, but also how to use it as a serious analytical engine within Python.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import subprocess, sys\n\n\nfor pkg in [\"duckdb\", \"pandas\", \"pyarrow\", \"polars\"]:\n   try:\n       subprocess.check_call(\n           [sys.executable, \"-m\", \"pip\", \"install\", \"-q\", pkg],\n           stderr=subprocess.DEVNULL,\n       )\n   except subprocess.CalledProcessError:\n       subprocess.check_call(\n           [sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"--break-system-packages\", pkg],\n           stderr=subprocess.DEVNULL,\n       )\n\n\nimport duckdb\nimport pandas as pd\nimport pyarrow as pa\nimport pyarrow.parquet as pq\nimport polars as pl\nimport numpy as np\nimport json, os, time, threading, tempfile\nfrom datetime import date, datetime, timedelta\n\n\nprint(f\"DuckDB version : {duckdb.__version__}\")\nprint(f\"Pandas version : {pd.__version__}\")\nprint(f\"PyArrow version: {pa.__version__}\")\nprint(f\"Polars version : {pl.__version__}\")\nprint(\"=\" * 72)\n\n\nWORKDIR = tempfile.mkdtemp(prefix=\"duckdb_tutorial_\")\nos.chdir(WORKDIR)\nprint(f\"Working directory: 
{WORKDIR}\\n\")\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 1: Connection Management\")\nprint(\"=\" * 72)\n\n\ncon = duckdb.connect()\nprint(con.sql(\"SELECT 'Hello from in-memory DuckDB!' AS greeting\").fetchone()[0])\n\n\nDB_PATH = os.path.join(WORKDIR, \"tutorial.duckdb\")\npcon = duckdb.connect(DB_PATH)\npcon.sql(\"CREATE OR REPLACE TABLE persisted(id INT, val TEXT)\")\npcon.sql(\"INSERT INTO persisted VALUES (1,'alpha'), (2,'beta')\")\nprint(\"Persisted rows:\", pcon.sql(\"SELECT count(*) FROM persisted\").fetchone()[0])\npcon.close()\n\n\npcon2 = duckdb.connect(DB_PATH)\nprint(\"After re-open :\", pcon2.sql(\"SELECT * FROM persisted ORDER BY id\").fetchall())\npcon2.close()\n\n\ncon_cfg = duckdb.connect(config={\n   \"threads\": 2,\n   \"memory_limit\": \"512MB\",\n   \"default_order\": \"DESC\",\n})\nprint(\"Configured threads:\", con_cfg.sql(\"SELECT current_setting('threads')\").fetchone()[0])\ncon_cfg.close()\n\n\nwith duckdb.connect() as ctx:\n   ctx.sql(\"SELECT 42 AS answer\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 2: Synthetic Data Generation\")\nprint(\"=\" * 72)\n\n\ncon = duckdb.connect()\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE TABLE sales AS\n   SELECT\n       i                                                     AS order_id,\n       '2023-01-01'::DATE + (i % 365)::INT                  AS order_date,\n       CASE (i % 5)\n           WHEN 0 THEN 'Electronics'\n           WHEN 1 THEN 'Clothing'\n           WHEN 2 THEN 'Groceries'\n           WHEN 3 THEN 'Furniture'\n           ELSE        'Books'\n       END                                                   AS category,\n       ROUND(10 + random() * 990, 2)                         AS amount,\n       CASE (i % 3)\n           WHEN 0 THEN 'US'\n           WHEN 1 THEN 'EU'\n           ELSE        'APAC'\n       END                                                   AS region,\n       CASE WHEN random() &lt; 0.1 THEN TRUE ELSE FALSE END    AS returned\n   FROM generate_series(1, 100000) 
t(i)\n\"\"\")\n\n\ncon.sql(\"SUMMARIZE sales\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 3: Zero-Copy DataFrame Integration\")\nprint(\"=\" * 72)\n\n\npdf = pd.DataFrame({\n   \"product\": [\"Widget\", \"Gadget\", \"Doohickey\", \"Thingamajig\"],\n   \"price\":   [9.99, 24.50, 4.75, 15.00],\n   \"stock\":   [120, 45, 300, 78],\n})\nprint(\"Query Pandas DF directly:\")\ncon.sql(\"SELECT product, price * stock AS inventory_value FROM pdf ORDER BY inventory_value DESC\").show()\n\n\nplf = pl.DataFrame({\n   \"city\":   [\"Montreal\", \"Toronto\", \"Vancouver\", \"Calgary\"],\n   \"temp_c\": [-12.5, -5.0, 3.0, -18.0],\n})\nprint(\"Query Polars DF directly:\")\ncon.sql(\"SELECT city, temp_c, temp_c * 9\/5 + 32 AS temp_f FROM plf WHERE temp_c &lt; 0\").show()\n\n\narrow_tbl = pa.table({\n   \"sensor_id\": [1, 2, 3, 4, 5],\n   \"reading\":   [23.1, 47.8, 12.3, 99.0, 55.5],\n})\nprint(\"Query Arrow table directly:\")\ncon.sql(\"SELECT sensor_id, reading FROM arrow_tbl WHERE reading &gt; 30\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 4: Result Conversion\")\nprint(\"=\" * 72)\n\n\nq = \"SELECT category, SUM(amount) AS total FROM sales GROUP BY category ORDER BY total DESC\"\n\n\nprint(\"\u2192 Python list  :\", con.sql(q).fetchall()[:2], \"...\")\nprint(\"\u2192 Pandas DF    :\\n\", con.sql(q).df().head(3))\nprint(\"\u2192 Polars DF    :\\n\", con.sql(q).pl().head(3))\nprint(\"\u2192 Arrow Table  :\", con.sql(q).arrow().schema)\nprint(\"\u2192 NumPy arrays :\", {k: v[:2] for k, v in con.sql(q).fetchnumpy().items()})\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the full DuckDB-Python environment by installing the required libraries and importing all the necessary modules for the tutorial. We create our working directory, initialize DuckDB connections, and explore both in-memory and persistent database usage along with basic configuration options. 
We also generate a large synthetic sales dataset and begin working with DuckDB\u2019s direct integration with Pandas, Polars, and PyArrow, which shows us how naturally DuckDB fits into Python-based data workflows.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 72)\nprint(\"SECTION 5: Relational API\")\nprint(\"=\" * 72)\n\n\nrel = (\n   con.table(\"sales\")\n      .filter(\"NOT returned\")\n      .aggregate(\"category, region, SUM(amount) AS revenue, COUNT(*) AS orders\")\n      .filter(\"revenue &gt; 1000000\")\n      .order(\"revenue DESC\")\n      .limit(10)\n)\nprint(\"Relational API result:\")\nrel.show()\n\n\ntop_cats = con.sql(\"SELECT DISTINCT category FROM sales ORDER BY category LIMIT 3\")\nprint(\"Top categories relation fed into next query:\")\ncon.sql(\"SELECT s.* FROM sales s SEMI JOIN top_cats ON s.category = top_cats.category LIMIT 5\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 6: Window Functions &amp; Advanced SQL\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   WITH daily AS (\n       SELECT\n           order_date,\n           region,\n           SUM(amount) AS daily_rev\n       FROM sales\n       WHERE NOT returned\n       GROUP BY order_date, region\n   )\n   SELECT\n       order_date,\n       region,\n       daily_rev,\n       SUM(daily_rev) OVER (\n           PARTITION BY region ORDER BY order_date\n       ) AS cum_revenue,\n   
    AVG(daily_rev) OVER (\n           PARTITION BY region ORDER BY order_date\n           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW\n       ) AS rolling_7d_avg\n   FROM daily\n   QUALIFY row_number() OVER (PARTITION BY region ORDER BY order_date DESC) &lt;= 3\n   ORDER BY region, order_date DESC\n\"\"\").show()\n\n\nprint(\"PIVOT table:\")\ncon.sql(\"\"\"\n   PIVOT sales\n   ON region\n   USING SUM(amount)\n   GROUP BY category\n   ORDER BY category\n\"\"\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 7: Complex \/ Nested Types\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE TABLE users AS\n   SELECT\n       i AS user_id,\n       {'first': 'User_' || i::TEXT, 'last': 'Surname_' || (i % 100)::TEXT}  AS name,\n       [i * 10, i * 20, i * 30]                                              AS scores,\n       MAP {'tier': CASE WHEN i % 2 = 0 THEN 'gold' ELSE 'silver' END,\n            'region': CASE WHEN i % 3 = 0 THEN 'US' ELSE 'EU' END}          AS metadata\n   FROM generate_series(1, 5) t(i)\n\"\"\")\n\n\nprint(\"Struct field access, list indexing, map extraction:\")\ncon.sql(\"\"\"\n   SELECT\n       user_id,\n       name.first                  AS first_name,\n       scores[1]                   AS first_score,\n       list_aggregate(scores,'sum') AS total_score,\n       metadata['tier']            AS tier\n   FROM users\n\"\"\").show()\n\n\nprint(\"Unnesting a list column:\")\ncon.sql(\"\"\"\n   SELECT user_id, unnest(scores) AS individual_score\n   FROM users\n   WHERE user_id &lt;= 3\n\"\"\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 8: Python UDFs\")\nprint(\"=\" * 72)\n\n\ndef celsius_to_fahrenheit(c):\n   return c * 9 \/ 5 + 32\n\n\ncon.create_function(\"c2f\", celsius_to_fahrenheit, [\"DOUBLE\"], \"DOUBLE\")\ncon.sql(\"SELECT city, temp_c, c2f(temp_c) AS temp_f FROM plf\").show()\n\n\nimport pyarrow.compute as pc\n\n\ndef vectorized_discount(prices):\n   \"\"\"Apply a 15% discount to all prices.\"\"\"\n   return 
pc.multiply(prices, 0.85)\n\n\ncon.create_function(\n   \"discount\",\n   vectorized_discount,\n   [\"DOUBLE\"],\n   \"DOUBLE\",\n   type=\"arrow\",\n)\nprint(\"Vectorized UDF (discount):\")\ncon.sql(\"SELECT product, price, discount(price) AS sale_price FROM pdf\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 9: File I\/O\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.csv'  (HEADER, DELIMITER ',')\")\ncon.sql(\"COPY (SELECT * FROM sales LIMIT 1000) TO 'sales_sample.parquet' (FORMAT PARQUET)\")\ncon.sql(\"COPY (SELECT * FROM sales LIMIT 100)  TO 'sales_sample.json'    (FORMAT JSON, ARRAY true)\")\n\n\nprint(\"Files written:\", [f for f in os.listdir('.') if 'sales_sample' in f])\n\n\nprint(\"\\nCSV row count     :\", con.sql(\"SELECT count(*) FROM 'sales_sample.csv'\").fetchone()[0])\nprint(\"Parquet row count :\", con.sql(\"SELECT count(*) FROM 'sales_sample.parquet'\").fetchone()[0])\nprint(\"JSON row count    :\", con.sql(\"SELECT count(*) FROM 'sales_sample.json'\").fetchone()[0])\n\n\nprint(\"\\nParquet file metadata:\")\ncon.sql(\"SELECT * FROM parquet_metadata('sales_sample.parquet')\").show()\n\n\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We move from basic querying into more expressive analytical patterns using DuckDB\u2019s relational API and advanced SQL features. We work with filtered and aggregated sales relations, apply window functions for cumulative and rolling calculations, and reshape data using a pivot table for cross-category analysis. 
We also explore complex nested types, list unnesting, Python UDFs, and file I\/O, which help us see how DuckDB handles both structured analytics and practical data engineering tasks in one place.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 72)\nprint(\"SECTION 10: Hive-Partitioned Parquet\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   COPY sales TO 'partitioned_data' (\n       FORMAT PARQUET,\n       PARTITION_BY (region, category),\n       OVERWRITE_OR_IGNORE\n   )\n\"\"\")\nprint(\"Partitioned directory tree (first 15 items):\")\nfor i, (root, dirs, files) in enumerate(os.walk(\"partitioned_data\")):\n   for f in files:\n       fp = os.path.join(root, f)\n       print(f\"  {fp}\")\n   if i &gt; 15:\n       print(\"  ...\")\n       break\n\n\nprint(\"\\nReading partitioned data back (US + Electronics only):\")\ncon.sql(\"\"\"\n   SELECT count(*) AS cnt, MIN(amount), MAX(amount)\n   FROM read_parquet('partitioned_data\/**\/*.parquet', hive_partitioning = true)\n   WHERE region = 'US' AND category = 'Electronics'\n\"\"\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 11: Prepared Statements\")\nprint(\"=\" * 72)\n\n\nresult = con.execute(\n   \"SELECT * FROM sales WHERE category = $1 AND amount &gt; $2 LIMIT 5\",\n   [\"Electronics\", 500.0]\n).fetchdf()\nprint(\"Parameterized query result:\\n\", result)\n\n\ncon.sql(\"SET VARIABLE target_region = 
'EU'\")\ncon.sql(\"\"\"\n   SELECT category, AVG(amount) AS avg_amt\n   FROM sales\n   WHERE region = getvariable('target_region')\n   GROUP BY category\n   ORDER BY avg_amt DESC\n\"\"\").show()\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 12: Transactions\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"CREATE OR REPLACE TABLE accounts(id INT, balance DOUBLE)\")\ncon.sql(\"INSERT INTO accounts VALUES (1, 1000), (2, 500)\")\n\n\ncon.begin()\ntry:\n   con.sql(\"UPDATE accounts SET balance = balance - 200 WHERE id = 1\")\n   con.sql(\"UPDATE accounts SET balance = balance + 200 WHERE id = 2\")\n   con.commit()\n   print(\"Transaction committed successfully\")\nexcept Exception as e:\n   con.rollback()\n   print(f\"Transaction rolled back: {e}\")\n\n\ncon.sql(\"SELECT * FROM accounts\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 13: Appender (Bulk Insert)\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"CREATE OR REPLACE TABLE fast_load(id INT, name TEXT, value DOUBLE)\")\n\n\nbulk_df = pd.DataFrame({\n   \"id\":    range(50_000),\n   \"name\":  [f\"item_{i}\" for i in range(50_000)],\n   \"value\": [float(i) * 1.1 for i in range(50_000)],\n})\n\n\nstart = time.perf_counter()\ncon.append(\"fast_load\", bulk_df)\nelapsed = time.perf_counter() - start\n\n\nprint(f\"Bulk-inserted {con.sql('SELECT count(*) FROM fast_load').fetchone()[0]:,} rows in {elapsed:.4f}s\")\ncon.sql(\"SELECT * FROM fast_load LIMIT 5\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 14: Replacement Scans (query any Python var)\")\nprint(\"=\" * 72)\n\n\nmy_dict_data = pd.DataFrame({\n   \"name\": [\"Alice\", \"Bob\", \"Charlie\"],\n   \"age\":  [30, 25, 35],\n})\ncon.sql(\"SELECT * FROM my_dict_data WHERE age &gt; 28\").show()\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We focus on storage and execution patterns that are especially useful in real workflows, starting with Hive-style partitioned Parquet output and selective reads from partitioned data. 
We then use parameterized queries, runtime variables, and transaction control to make our queries safer, more dynamic, and more reliable. Finally, we test high-speed bulk insertion via the appender interface and use replacement scans to query Python objects directly by name, further reinforcing DuckDB\u2019s tight integration with the Python runtime.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 72)\nprint(\"SECTION 15: SQL Macros\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE MACRO revenue_tier(amt) AS\n       CASE\n           WHEN amt &gt; 800 THEN 'High'\n           WHEN amt &gt; 400 THEN 'Medium'\n           ELSE                'Low'\n       END\n\"\"\")\nprint(\"Scalar macro:\")\ncon.sql(\"SELECT category, amount, revenue_tier(amount) AS tier FROM sales LIMIT 8\").show()\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE MACRO top_by_category(cat, n) AS TABLE\n       SELECT * FROM sales WHERE category = cat ORDER BY amount DESC LIMIT n\n\"\"\")\nprint(\"Table macro \u2014 top 3 Electronics:\")\ncon.sql(\"SELECT * FROM top_by_category('Electronics', 3)\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 16: Recursive CTE\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE TABLE employees(id INT, name TEXT, manager_id INT);\n   INSERT INTO employees VALUES\n       (1, 'CEO',     NULL),\n       (2, 'VP Eng',  1),\n       (3, 'VP 
Sales',1),\n       (4, 'Sr Eng',  2),\n       (5, 'Jr Eng',  4),\n       (6, 'Sales Rep', 3);\n\"\"\")\n\n\nprint(\"Org chart via recursive CTE:\")\ncon.sql(\"\"\"\n   WITH RECURSIVE org AS (\n       SELECT id, name, manager_id, 0 AS depth, name AS path\n       FROM employees WHERE manager_id IS NULL\n       UNION ALL\n       SELECT e.id, e.name, e.manager_id, o.depth + 1,\n              o.path || ' \u2192 ' || e.name\n       FROM employees e JOIN org o ON e.manager_id = o.id\n   )\n   SELECT repeat('  ', depth) || name AS hierarchy, path\n   FROM org\n   ORDER BY path\n\"\"\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 17: Full-Text Search (FTS)\")\nprint(\"=\" * 72)\n\n\ntry:\n   con.install_extension(\"fts\")\n   con.load_extension(\"fts\")\n\n\n   con.sql(\"\"\"\n       CREATE OR REPLACE TABLE documents(id INT, body TEXT);\n       INSERT INTO documents VALUES\n           (1, 'DuckDB is a fast in-process analytical database'),\n           (2, 'Python integration allows querying Pandas DataFrames'),\n           (3, 'Parquet files can be read directly without loading'),\n           (4, 'Window functions and CTEs make complex analytics easy'),\n           (5, 'The columnar engine processes data blazingly fast');\n   \"\"\")\n\n\n   con.sql(\"PRAGMA create_fts_index('documents', 'id', 'body', stemmer='english')\")\n\n\n   print(\"FTS search for 'fast analytical':\")\n   con.sql(\"\"\"\n       SELECT id, body, fts_main_documents.match_bm25(id, 'fast analytical') AS score\n       FROM documents\n       WHERE score IS NOT NULL\n       ORDER BY score DESC\n   \"\"\").show()\nexcept Exception as e:\n   print(f\"(Skipped \u2014 FTS extension not available: {e})\")\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 18: AsOf Joins\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   CREATE OR REPLACE TABLE stock_prices AS\n   SELECT * FROM (VALUES\n       ('2024-01-01 09:30'::TIMESTAMP, 'AAPL', 150.0),\n       ('2024-01-01 10:00'::TIMESTAMP, 'AAPL', 152.5),\n       
('2024-01-01 10:30'::TIMESTAMP, 'AAPL', 151.0),\n       ('2024-01-01 11:00'::TIMESTAMP, 'AAPL', 153.0)\n   ) AS t(ts, ticker, price);\n\n\n   CREATE OR REPLACE TABLE trades AS\n   SELECT * FROM (VALUES\n       ('2024-01-01 09:45'::TIMESTAMP, 'AAPL', 100),\n       ('2024-01-01 10:15'::TIMESTAMP, 'AAPL', 200),\n       ('2024-01-01 10:50'::TIMESTAMP, 'AAPL', 150)\n   ) AS t(trade_ts, ticker, shares);\n\"\"\")\n\n\nprint(\"AsOf Join \u2014 match each trade to the most recent price:\")\ncon.sql(\"\"\"\n   SELECT\n       t.trade_ts,\n       t.shares,\n       p.ts   AS price_ts,\n       p.price,\n       t.shares * p.price AS trade_value\n   FROM trades t\n   ASOF JOIN stock_prices p\n       ON t.ticker = p.ticker AND t.trade_ts &gt;= p.ts\n\"\"\").show()\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We build reusable query logic using scalar and table macros, reducing repetition and making our SQL more modular. We then create a recursive CTE to traverse an employee hierarchy, showing how DuckDB can handle structured recursive logic cleanly within SQL. 
After that, we explore full-text search and AsOf joins, demonstrating that DuckDB supports not only standard analytics but also more advanced search and time-aware matching.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 72)\nprint(\"SECTION 19: Query Profiling\")\nprint(\"=\" * 72)\n\n\nprint(\"EXPLAIN output:\")\ncon.sql(\"EXPLAIN SELECT category, SUM(amount) FROM sales GROUP BY category\").show()\n\n\ncon.sql(\"PRAGMA enable_profiling = 'json'\")\ncon.sql(\"PRAGMA profiling_output = 'profile.json'\")\ncon.sql(\"SELECT category, AVG(amount), STDDEV(amount) FROM sales GROUP BY category\")\ncon.sql(\"PRAGMA disable_profiling\")\n\n\nif os.path.exists(\"profile.json\"):\n   with open(\"profile.json\") as f:\n       profile = json.load(f)\n   print(f\"Query timing: {profile.get('timing', 'N\/A')}s\")\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 20: Multi-Threaded Queries\")\nprint(\"=\" * 72)\n\n\nresults = {}\n\n\ndef worker(thread_id, db_path=None):\n   \"\"\"Each thread creates its own connection.\"\"\"\n   local_con = duckdb.connect()\n   local_con.sql(f\"\"\"\n       CREATE TABLE t AS SELECT i, i*{thread_id} AS val\n       FROM generate_series(1, 10000) t(i)\n   \"\"\")\n   total = local_con.sql(\"SELECT SUM(val) FROM t\").fetchone()[0]\n   results[thread_id] = total\n   local_con.close()\n\n\nthreads = [threading.Thread(target=worker, args=(tid,)) for 
tid in range(4)]\nfor t in threads:\n   t.start()\nfor t in threads:\n   t.join()\n\n\nprint(\"Thread results:\", results)\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 21: Quick Benchmark \u2014 DuckDB vs Pandas\")\nprint(\"=\" * 72)\n\n\nN = 1_000_000\nbig_df = pd.DataFrame({\n   \"group\": np.random.choice([\"A\",\"B\",\"C\",\"D\",\"E\"], N),\n   \"value\": np.random.randn(N) * 100,\n   \"ts\":    pd.date_range(\"2020-01-01\", periods=N, freq=\"s\"),\n})\n\n\nstart = time.perf_counter()\n_ = big_df.groupby(\"group\")[\"value\"].agg([\"sum\",\"mean\",\"std\",\"min\",\"max\"])\npd_time = time.perf_counter() - start\n\n\nstart = time.perf_counter()\n_ = con.sql(\"\"\"\n   SELECT \"group\",\n          SUM(value), AVG(value), STDDEV(value), MIN(value), MAX(value)\n   FROM big_df\n   GROUP BY \"group\"\n\"\"\").fetchall()\nduck_time = time.perf_counter() - start\n\n\nprint(f\"Pandas  : {pd_time:.4f}s\")\nprint(f\"DuckDB  : {duck_time:.4f}s\")\nprint(f\"Speedup : {pd_time\/duck_time:.1f}x\")\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 22: Reading Remote Parquet over HTTPS\")\nprint(\"=\" * 72)\n\n\ntry:\n   con.install_extension(\"httpfs\")\n   con.load_extension(\"httpfs\")\n\n\n   print(\"Querying a remote Parquet file (DuckDB taxi sample):\")\n   con.sql(\"\"\"\n       SELECT count(*) AS total_rows\n       FROM read_parquet(\n           'https:\/\/blobs.duckdb.org\/data\/yellow_tripdata_2010-01.parquet'\n       )\n   \"\"\").show()\nexcept Exception as e:\n   print(f\"(Skipped \u2014 httpfs not available or network restricted: {e})\")\n\n\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 23: Custom Enum Types\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad')\")\ncon.sql(\"\"\"\n   CREATE OR REPLACE TABLE survey(\n       respondent_id INT,\n       feeling       mood,\n       comment       TEXT\n   )\n\"\"\")\ncon.sql(\"\"\"\n   INSERT INTO survey VALUES\n       (1, 'happy',   'Great product!'),\n       (2, 'neutral', 'It 
is okay'),\n       (3, 'sad',     'Did not work'),\n       (4, 'happy',   'Love it')\n\"\"\")\ncon.sql(\"SELECT feeling, count(*) AS cnt FROM survey GROUP BY feeling ORDER BY feeling\").show()\nprint()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We inspect how DuckDB executes queries by using EXPLAIN plans and JSON profiling output to better understand performance behavior. We also demonstrate thread-based parallel execution by creating separate DuckDB connections in multiple worker threads and safely collecting their results. To complete the performance angle, we benchmark DuckDB against Pandas on a large grouped aggregation, then extend the workflow with remote Parquet access over HTTPS and custom ENUM type creation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 72)\nprint(\"SECTION 24: Lambda Functions in SQL\")\nprint(\"=\" * 72)\n\n\ncon.sql(\"\"\"\n   SELECT\n       [1, 2, 3, 4, 5]                                   AS original,\n       list_transform([1, 2, 3, 4, 5], x -&gt; x * x)       AS squared,\n       list_filter([1, 2, 3, 4, 5], x -&gt; x % 2 = 0)     AS evens_only,\n       list_reduce([1, 2, 3, 4, 5], (a, b) -&gt; a + b)     AS total\n\"\"\").show()\nprint()\n\n\nprint(\"=\" * 72)\nprint(\"SECTION 25: Multi-Format Export\")\nprint(\"=\" * 72)\n\n\nsummary_q = \"\"\"\n   SELECT\n       category,\n       region,\n       COUNT(*)       AS orders,\n       
SUM(amount)    AS revenue,\n       AVG(amount)    AS avg_order\n   FROM sales\n   WHERE NOT returned\n   GROUP BY category, region\n   ORDER BY revenue DESC\n\"\"\"\n\n\ncon.sql(f\"COPY ({summary_q}) TO 'summary.csv'     (HEADER)\")\ncon.sql(f\"COPY ({summary_q}) TO 'summary.parquet'  (FORMAT PARQUET, COMPRESSION ZSTD)\")\ncon.sql(f\"COPY ({summary_q}) TO 'summary.json'     (FORMAT JSON, ARRAY TRUE)\")\n\n\nsizes = {f: os.path.getsize(f) for f in ['summary.csv','summary.parquet','summary.json']}\nprint(\"File sizes:\")\nfor name, sz in sizes.items():\n   print(f\"  {name:25s}  {sz:&gt;8,} bytes\")\n\n\nprint()\n\n\ncon.close()\n\n\nprint(\"=\" * 72)\nprint(\"TUTORIAL COMPLETE\")\nprint(\"=\" * 72)\nprint(\"\"\"\nSections covered:\n 1.  Connection management (in-memory, persistent, config)\n 2.  Synthetic data generation with generate_series()\n 3.  Zero-copy querying of Pandas, Polars, PyArrow\n 4.  Result conversion (.df(), .pl(), .arrow(), .fetchnumpy())\n 5.  Relational API (filter \/ aggregate \/ order \/ limit)\n 6.  Window functions, QUALIFY, PIVOT\n 7.  Complex types (struct, list, map, unnest)\n 8.  Python UDFs (scalar &amp; vectorized)\n 9.  File I\/O (CSV, Parquet, JSON)\n 10. Hive-partitioned Parquet writes &amp; reads\n 11. Prepared statements &amp; parameterized queries\n 12. Transactions (BEGIN \/ COMMIT \/ ROLLBACK)\n 13. Appender for high-speed bulk insert\n 14. Replacement scans (query Python dicts by name)\n 15. SQL macros (scalar + table macros)\n 16. Recursive CTEs\n 17. Full-text search (FTS extension)\n 18. AsOf joins\n 19. Query profiling &amp; EXPLAIN\n 20. Multi-threaded access\n 21. Performance benchmark (DuckDB vs Pandas)\n 22. Remote Parquet over HTTPS (httpfs)\n 23. Custom ENUM types\n 24. Lambda functions in SQL\n 25. 
Multi-format export with compression\n\"\"\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We explore DuckDB\u2019s support for lambda-powered list transformations directly in SQL, providing a concise way to manipulate list data within queries. We then export a summarized analytical result in multiple file formats, including CSV, Parquet, and JSON, highlighting DuckDB\u2019s flexibility for downstream data delivery. We close the tutorial by disconnecting and printing a structured recap of all covered sections, providing a clear end-to-end picture of the full DuckDB-Python workflow.<\/p>\n<p>In conclusion, we provided a practical, code-level view of how DuckDB-Python fits into modern data workflows. We saw that it is not just a SQL engine embedded in Python, but a highly flexible analytical system that works smoothly with DataFrames, Arrow tables, local files, remote datasets, custom functions, and advanced SQL constructs in a single environment. We moved from simple in-memory queries to more production-relevant patterns such as persistent databases, parameterized execution, transactions, partitioned Parquet, profiling, and performance benchmarking, which helps us understand both the breadth and depth of DuckDB\u2019s capabilities. 
Also, we came away with a reusable Colab-ready workflow that shows us how to combine Python\u2019s ecosystem with DuckDB\u2019s speed, SQL expressiveness, and interoperability to build fast, elegant, and scalable data analysis pipelines.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the<strong>\u00a0<a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/duckdb_python_tutorial_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Implementation Codes here<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">130k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.?\u00a0<strong><a href=\"https:\/\/forms.gle\/MTNLpmJtsFA3VRVd9\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Connect with us<\/mark><\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/13\/an-implementation-guide-to-building-a-duckdb-python-analytics-pipeline-with-sql-dataframes-parquet-udfs-and-performance-profiling\/\">An Implementation Guide to Building a DuckDB-Python Analytics Pipeline with SQL, DataFrames, Parquet, UDFs, and Performance Profiling<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-709","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/709","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=709"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/post
s\/709\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=709"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=709"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=709"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}