{"id":513,"date":"2026-03-06T07:07:09","date_gmt":"2026-03-05T23:07:09","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=513"},"modified":"2026-03-06T07:07:09","modified_gmt":"2026-03-05T23:07:09","slug":"a-coding-guide-to-build-a-scalable-end-to-end-machine-learning-data-pipeline-using-daft-for-high-performance-structured-and-image-data-processing","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=513","title":{"rendered":"A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing"},"content":{"rendered":"<p>In this tutorial, we explore how we use <a href=\"https:\/\/github.com\/Eventual-Inc\/Daft\"><strong>Daft<\/strong><\/a> as a high-performance, Python-native data engine to build an end-to-end analytical pipeline. We start by loading a real-world MNIST dataset, then progressively transform it using UDFs, feature engineering, aggregations, joins, and lazy execution. Also, we demonstrate how to seamlessly combine structured data processing, numerical computation, and machine learning. 
By the end, we are not just manipulating data; we are building a complete model-ready pipeline powered by Daft\u2019s scalable execution engine.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">!pip -q install daft pyarrow pandas numpy scikit-learn\n\n\nimport os\nos.environ[\"DO_NOT_TRACK\"] = \"true\"\n\n\nimport numpy as np\nimport pandas as pd\nimport daft\nfrom daft import col\n\n\nprint(\"Daft version:\", getattr(daft, \"__version__\", \"unknown\"))\n\n\nURL = \"https:\/\/github.com\/Eventual-Inc\/mnist-json\/raw\/master\/mnist_handwritten_test.json.gz\"\n\n\ndf = daft.read_json(URL)\nprint(\"\\nSchema (sampled):\")\nprint(df.schema())\n\n\nprint(\"\\nPeek:\")\ndf.show(5)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We install Daft and its supporting libraries directly in Google Colab to ensure a clean, reproducible environment. We set the DO_NOT_TRACK environment variable to opt out of usage analytics and print the installed version to confirm everything is working correctly. We then load the MNIST JSON test set directly from its remote URL with Daft\u2019s native reader and inspect the schema and first rows. 
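<\/p>
<p>As a quick illustration of the input format, here is a minimal sketch (the record layout is an assumption based on the schema preview above, with hypothetical toy values) showing that each JSON record carries a flat list of 784 grayscale pixels plus an integer label:<\/p>

```python
import json

# Hypothetical MNIST-style record: 784 pixel values plus an integer label.
# This mirrors the structure Daft's read_json infers from the remote file.
record = {'image': [0] * 784, 'label': 7}
line = json.dumps(record)

parsed = json.loads(line)
print(len(parsed['image']), parsed['label'])  # 784 7
```

<p>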
By doing this, we establish a stable foundation for building our end-to-end data pipeline.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">def to_28x28(pixels):\n   arr = np.array(pixels, dtype=np.float32)\n   if arr.size != 784:\n       return None\n   return arr.reshape(28, 28)\n\n\ndf2 = (\n   df\n   .with_column(\n       \"img_28x28\",\n       col(\"image\").apply(to_28x28, return_dtype=daft.DataType.python())\n   )\n   .with_column(\n       \"pixel_mean\",\n       col(\"img_28x28\").apply(lambda x: float(np.mean(x)) if x is not None else None,\n                              return_dtype=daft.DataType.float32())\n   )\n   .with_column(\n       \"pixel_std\",\n       col(\"img_28x28\").apply(lambda x: float(np.std(x)) if x is not None else None,\n                              return_dtype=daft.DataType.float32())\n   )\n)\n\n\nprint(\"\\nAfter reshaping + simple features:\")\ndf2.select(\"label\", \"pixel_mean\", \"pixel_std\").show(5)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We reshape each raw pixel list into a 28\u00d728 array with a row-wise function and compute quick per-image summary statistics, the mean and standard deviation of the pixel values. 
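<\/p>
<p>To make the reshaping logic concrete, here is a small standalone sketch (plain NumPy, independent of Daft) of the same to_28x28 helper and the mean\/std features it feeds:<\/p>

```python
import numpy as np

# Standalone check of the reshaping step: a flat list of 784 values becomes
# a 28x28 array, and any other length is rejected with None.
def to_28x28(pixels):
    arr = np.array(pixels, dtype=np.float32)
    if arr.size != 784:
        return None
    return arr.reshape(28, 28)

img = to_28x28(list(range(784)))
print(img.shape)            # (28, 28)
print(to_28x28([1, 2, 3]))  # None

# The same per-image mean/std features the pipeline derives:
print(round(float(np.mean(img)), 1))  # 391.5
```

<p>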
These checks let us validate the images before moving on to richer feature engineering.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">@daft.udf(return_dtype=daft.DataType.list(daft.DataType.float32()), batch_size=512)\ndef featurize(images_28x28):\n   out = []\n   for img in images_28x28.to_pylist():\n       if img is None:\n           out.append(None)\n           continue\n       img = np.asarray(img, dtype=np.float32)\n       row_sums = img.sum(axis=1) \/ 255.0\n       col_sums = img.sum(axis=0) \/ 255.0\n       total = img.sum() + 1e-6\n       ys, xs = np.indices(img.shape)\n       cy = float((ys * img).sum() \/ total) \/ 28.0\n       cx = float((xs * img).sum() \/ total) \/ 28.0\n       vec = np.concatenate([row_sums, col_sums, np.array([cy, cx, img.mean()\/255.0, img.std()\/255.0], dtype=np.float32)])\n       out.append(vec.astype(np.float32).tolist())\n   return out\n\n\ndf3 = df2.with_column(\"features\", featurize(col(\"img_28x28\")))\n\n\nprint(\"\\nFeature column created (list[float]):\")\ndf3.select(\"label\", \"features\").show(2)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define a batch UDF that condenses each reshaped image into a compact feature vector of normalized row sums, column sums, intensity-centroid coordinates, and global mean and standard deviation. 
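<\/p>
<p>The feature construction inside the UDF can be sketched for a single image in plain NumPy; the bright-square input below is a hypothetical example chosen so the centroid is easy to verify by hand:<\/p>

```python
import numpy as np

# One image's feature vector: 28 normalized row sums + 28 column sums +
# centroid (cy, cx) + global mean and std, i.e. 60 floats in total.
img = np.zeros((28, 28), dtype=np.float32)
img[10:18, 10:18] = 255.0  # a bright square centered at row/col 13.5

row_sums = img.sum(axis=1) / 255.0
col_sums = img.sum(axis=0) / 255.0
total = img.sum() + 1e-6
ys, xs = np.indices(img.shape)
cy = float((ys * img).sum() / total) / 28.0
cx = float((xs * img).sum() / total) / 28.0
vec = np.concatenate([row_sums, col_sums,
                      np.array([cy, cx, img.mean() / 255.0, img.std() / 255.0],
                               dtype=np.float32)])

print(vec.shape)  # (60,)
print(round(cy * 28, 1), round(cx * 28, 1))  # 13.5 13.5
```

<p>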
By applying these transformations, we convert raw image data into structured and model-friendly representations.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">label_stats = (\n   df3.groupby(\"label\")\n      .agg(\n          col(\"label\").count().alias(\"n\"),\n          col(\"pixel_mean\").mean().alias(\"mean_pixel_mean\"),\n          col(\"pixel_std\").mean().alias(\"mean_pixel_std\"),\n      )\n      .sort(\"label\")\n)\n\n\nprint(\"\\nLabel distribution + summary stats:\")\nlabel_stats.show(10)\n\n\ndf4 = df3.join(label_stats, on=\"label\", how=\"left\")\n\n\nprint(\"\\nJoined label stats back onto each row:\")\ndf4.select(\"label\", \"n\", \"mean_pixel_mean\", \"mean_pixel_std\").show(5)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We group the rows by label to compute per-class counts and average pixel statistics, then join these summaries back onto every row for contextual enrichment. 
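<\/p>
<p>For comparison, the same groupby-and-join-back pattern can be sketched in pandas on hypothetical toy data; this is an illustrative equivalent, not the Daft API itself:<\/p>

```python
import pandas as pd

# Toy frame standing in for the per-image statistics: aggregate per-label
# counts and means, then merge the summaries back onto every row.
df = pd.DataFrame({'label': [0, 0, 1], 'pixel_mean': [10.0, 20.0, 30.0]})
stats = (df.groupby('label')
           .agg(n=('label', 'size'), mean_pixel_mean=('pixel_mean', 'mean'))
           .reset_index())
joined = df.merge(stats, on='label', how='left')

print(joined['n'].tolist())                # [2, 2, 1]
print(joined['mean_pixel_mean'].tolist())  # [15.0, 15.0, 30.0]
```

<p>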
This demonstrates how we combine scalable computation with advanced analytics within Daft.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-python\">small = df4.select(\"label\", \"features\").collect().to_pandas()\n\n\nsmall = small.dropna(subset=[\"label\", \"features\"]).reset_index(drop=True)\n\n\nX = np.vstack(small[\"features\"].apply(np.array).values).astype(np.float32)\ny = small[\"label\"].astype(int).values\n\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import accuracy_score, classification_report\n\n\n# Hold out 20% of the rows for evaluation\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)\n\n\nclf = LogisticRegression(max_iter=1000, n_jobs=None)\nclf.fit(X_train, y_train)\n\n\npred = clf.predict(X_test)\nacc = accuracy_score(y_test, pred)\n\n\nprint(\"\\nBaseline accuracy (feature-engineered LogisticRegression):\", round(acc, 4))\nprint(\"\\nClassification report:\")\nprint(classification_report(y_test, pred, digits=4))\n\n\nout_df = df4.select(\"label\", \"features\", \"pixel_mean\", \"pixel_std\", \"n\")\nout_path = \"\/content\/daft_mnist_features.parquet\"\nout_df.write_parquet(out_path)\n\n\nprint(\"\\nWrote parquet to:\", out_path)\n\n\ndf_back = daft.read_parquet(out_path)\nprint(\"\\nRead-back check:\")\ndf_back.show(3)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We materialize the selected columns into pandas, split them into train and test sets, and train a baseline Logistic Regression model. 
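<\/p>
<p>The training step can be sketched end to end on synthetic features (hypothetical data, not the MNIST vectors); the key point is that the train\/test split must be created before fitting:<\/p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the engineered feature matrix: 200 rows, 4 features,
# with a target that is linear in the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split first, then fit; without this line, X_train/y_train are undefined.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc > 0.8)  # a linear model separates this target easily
```

<p>Skipping the split and calling clf.fit(X_train, y_train) directly would raise a NameError, which is why the pipeline above performs the split immediately after the imports.<\/p>
<p>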
We evaluate performance to validate the usefulness of our engineered features. Also, we persist the processed dataset to Parquet format, completing our end-to-end pipeline from raw data ingestion to production-ready storage.<\/p>\n<p>In this tutorial, we built a production-style data workflow using Daft, moving from raw JSON ingestion to feature engineering, aggregation, model training, and Parquet persistence. We demonstrated how to integrate advanced UDF logic, perform efficient groupby and join operations, and materialize results for downstream machine learning, all within a clean, scalable framework. Through this process, we saw how Daft enables us to handle complex transformations while remaining Pythonic and efficient. We finished with a reusable, end-to-end pipeline that showcases how we can combine modern data engineering and machine learning workflows in a unified environment.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/daft_end_to_end_ml_pipeline_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes here<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
Are you on Telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">Now you can join us on Telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/05\/a-coding-guide-to-build-a-scalable-end-to-end-machine-learning-data-pipeline-using-daft-for-high-performance-structured-and-image-data-processing\/\">A Coding Guide to Build a Scalable End-to-End Machine Learning Data Pipeline Using Daft for High-Performance Structured and Image Data Processing<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore h&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-513","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/513","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=513"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/513\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&pare
nt=513"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=513"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=513"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}