{"id":248,"date":"2026-01-09T22:50:15","date_gmt":"2026-01-09T14:50:15","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=248"},"modified":"2026-01-09T22:50:15","modified_gmt":"2026-01-09T14:50:15","slug":"how-to-build-portable-in-database-feature-engineering-pipelines-with-ibis-using-lazy-python-apis-and-duckdb-execution","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=248","title":{"rendered":"How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution"},"content":{"rendered":"<p>In this tutorial, we demonstrate how we use <a href=\"https:\/\/github.com\/ibis-project\/ibis\"><strong>Ibis<\/strong><\/a> to build a portable, in-database feature engineering pipeline that looks and feels like Pandas but executes entirely inside the database. We show how we connect to DuckDB, register data safely inside the backend, and define complex transformations using window functions and aggregations without ever pulling raw data into local memory. By keeping all transformations lazy and backend-agnostic, we demonstrate how to write analytics code once in Python and rely on Ibis to translate it into efficient SQL. 
Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/ibis_portable_in_database_feature_engineering_Marktechpost.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">full code here<\/a><\/strong>.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip -q install \"ibis-framework[duckdb,examples]\" duckdb pyarrow pandas\n\n\nimport ibis\nfrom ibis import _  # deferred expression helper used in the pipeline\n\n\nprint(\"Ibis version:\", ibis.__version__)\n\n\n# With no path argument, DuckDB runs as an in-memory database\ncon = ibis.duckdb.connect()\n# Interactive mode executes an expression when it is displayed\nibis.options.interactive = True<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install the required libraries and initialize the Ibis environment. We open an in-memory DuckDB connection and enable interactive mode, which makes Ibis execute an expression and show a preview of the result whenever it is displayed in the notebook. The expressions themselves remain lazy: Ibis runs nothing until we display or explicitly execute them. 
<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\"># Load the example dataset through our connection when fetch() accepts\n# a backend argument; otherwise fall back to the default backend\ntry:\n   base_expr = ibis.examples.penguins.fetch(backend=con)\nexcept TypeError:\n   base_expr = ibis.examples.penguins.fetch()\n\n\n# Register the data as a DuckDB table if it is not already there\nif \"penguins\" not in con.list_tables():\n   try:\n       con.create_table(\"penguins\", base_expr, overwrite=True)\n   except Exception:\n       # Fallback: materialize locally, then upload the result\n       con.create_table(\"penguins\", base_expr.execute(), overwrite=True)\n\n\nt = con.table(\"penguins\")\nprint(t.schema())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We load the Penguins dataset and explicitly register it as a table in the DuckDB catalog so that it is available for SQL execution. We then obtain a lazy reference to that table with con.table and print its schema to verify the column names and types. 
<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def penguin_feature_pipeline(penguins):\n   # Row-level derived features; rows with a missing sex also map to 0\n   base = penguins.mutate(\n       bill_ratio=_.bill_length_mm \/ _.bill_depth_mm,\n       is_male=(_.sex == \"male\").ifelse(1, 0),\n   )\n\n\n   # Drop rows with missing measurements or grouping keys\n   cleaned = base.filter(\n       _.bill_length_mm.notnull()\n       &amp; _.bill_depth_mm.notnull()\n       &amp; _.body_mass_g.notnull()\n       &amp; _.flipper_length_mm.notnull()\n       &amp; _.species.notnull()\n       &amp; _.island.notnull()\n       &amp; _.year.notnull()\n   )\n\n\n   # Partition by species for the per-species statistics\n   w_species = ibis.window(group_by=[cleaned.species])\n   # Frame: current row plus the two preceding rows per island, by year\n   w_island_year = ibis.window(\n       group_by=[cleaned.island],\n       order_by=[cleaned.year],\n       preceding=2,\n       following=0,\n   )\n\n\n   feat = cleaned.mutate(\n       species_avg_mass=cleaned.body_mass_g.mean().over(w_species),\n       species_std_mass=cleaned.body_mass_g.std().over(w_species),\n       # z-score of body mass within its species\n       mass_z=(\n           cleaned.body_mass_g\n           - cleaned.body_mass_g.mean().over(w_species)\n       ) \/ cleaned.body_mass_g.std().over(w_species),\n       island_mass_rank=cleaned.body_mass_g.rank().over(\n           
ibis.window(group_by=[cleaned.island])\n       ),\n       rolling_3yr_island_avg_mass=cleaned.body_mass_g.mean().over(\n           w_island_year\n       ),\n   )\n\n\n   # Collapse to one row per species, island, and year\n   return feat.group_by([\"species\", \"island\", \"year\"]).agg(\n       n=feat.count(),\n       avg_mass=feat.body_mass_g.mean(),\n       avg_flipper=feat.flipper_length_mm.mean(),\n       avg_bill_ratio=feat.bill_ratio.mean(),\n       avg_mass_z=feat.mass_z.mean(),\n       avg_rolling_3yr_mass=feat.rolling_3yr_island_avg_mass.mean(),\n       pct_male=feat.is_male.mean(),\n   ).order_by([\"species\", \"island\", \"year\"])<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define a reusable feature engineering pipeline using pure Ibis expressions. We compute derived features, filter out rows with missing values, and use window functions and grouped aggregations to build per-species z-scores, per-island ranks, and a rolling average while keeping the entire pipeline lazy. Note that the rolling window is row-based: with preceding=2 and following=0 it averages the current row and the two preceding rows within each island, ordered by year, rather than covering three calendar years.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\"># Building the expression runs no query; compile() only renders the SQL\nfeatures = penguin_feature_pipeline(t)\nprint(con.compile(features))\n\n\n# Execution happens here and returns only the aggregated rows\ntry:\n   df = features.to_pandas()\nexcept Exception:\n   df = 
features.execute()\n\n\ndisplay(df.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We invoke the feature pipeline and print the DuckDB SQL that Ibis generates, which lets us confirm that the transformations are pushed down to the database. We then run the pipeline and pull only the final aggregated results into a pandas DataFrame for inspection.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\"># Materialize the features as a DuckDB table without leaving the database\ncon.create_table(\"penguin_features\", features, overwrite=True)\n\n\nfeat_tbl = con.table(\"penguin_features\")\n\n\n# Fetch only the first ten rows for a quick check\ntry:\n   preview = feat_tbl.limit(10).to_pandas()\nexcept Exception:\n   preview = feat_tbl.limit(10).execute()\n\n\ndisplay(preview)\n\n\n# Colab-style path; change it when running elsewhere\nout_path = \"\/content\/penguin_features.parquet\"\n# COPY ... TO is DuckDB-specific SQL\ncon.raw_sql(f\"COPY penguin_features TO '{out_path}' (FORMAT PARQUET);\")\nprint(out_path)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We materialize the engineered features as a table directly inside DuckDB and fetch its first ten rows to verify the result. 
We also export the results to a Parquet file with a DuckDB COPY statement, which shows how we can hand off database-computed features to downstream analytics or machine learning workflows.<\/p>\n<p>In conclusion, we constructed, compiled, and executed a feature engineering workflow inside DuckDB using Ibis. We inspected the generated SQL, materialized the results directly in the database, and exported them to Parquet for downstream use. The pipeline itself is written in backend-agnostic Ibis expressions; only the final export step uses DuckDB-specific SQL. This approach reinforces the core idea behind Ibis: we keep computation close to the data, minimize unnecessary data movement, and maintain a single, reusable Python codebase that scales from local experimentation to production databases.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/01\/09\/how-to-build-portable-in-database-feature-engineering-pipelines-with-ibis-using-lazy-python-apis-and-duckdb-execution\/\">How to Build Portable, In-Database Feature Engineering Pipelines with Ibis Using Lazy Python APIs and DuckDB Execution<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we demonstra&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-248","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/248","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=248"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/ind
ex.php?rest_route=\/wp\/v2\/posts\/248\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=248"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=248"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=248"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}