{"id":412,"date":"2026-02-14T04:40:18","date_gmt":"2026-02-13T20:40:18","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=412"},"modified":"2026-02-14T04:40:18","modified_gmt":"2026-02-13T20:40:18","slug":"in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=412","title":{"rendered":"[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data"},"content":{"rendered":"<p>In this tutorial, we build a complete, production-grade synthetic data pipeline using <a href=\"https:\/\/github.com\/sdv-dev\/CTGAN\"><strong>CTGAN<\/strong><\/a> and the SDV ecosystem. We start from raw mixed-type tabular data and progressively move toward constrained generation, conditional sampling, statistical validation, and downstream utility testing. Rather than stopping at sample generation, we focus on understanding how well synthetic data preserves structure, distributions, and predictive signal. 
This tutorial demonstrates how CTGAN can be used responsibly and rigorously in real-world data science workflows.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">!pip -q install \"ctgan\" \"sdv\" \"sdmetrics\" \"scikit-learn\" \"pandas\" \"numpy\" \"matplotlib\"\n\n\nimport numpy as np\nimport pandas as pd\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n\n\nimport ctgan, sdv, sdmetrics\nfrom ctgan import load_demo, CTGAN\n\n\nfrom sdv.metadata import SingleTableMetadata\nfrom sdv.single_table import CTGANSynthesizer\n\n\nfrom sdv.cag import Inequality, FixedCombinations\nfrom sdv.sampling import Condition\n\n\nfrom sdmetrics.reports.single_table import DiagnosticReport, QualityReport\n\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.preprocessing import OneHotEncoder\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\n\n\nimport matplotlib.pyplot as plt\n\n\nprint(\"Versions:\")\nprint(\"ctgan:\", ctgan.__version__)\nprint(\"sdv:\", sdv.__version__)\nprint(\"sdmetrics:\", sdmetrics.__version__)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the environment by installing all required libraries and importing the full dependency stack. 
We explicitly load CTGAN, SDV, SDMetrics, and downstream ML tooling to ensure compatibility across the pipeline. We also surface library versions to make the experiment reproducible and debuggable.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">real = load_demo().copy()\nreal.columns = [c.strip().replace(\" \", \"_\") for c in real.columns]\n\n\ntarget_col = \"income\"\nreal[target_col] = real[target_col].astype(str)\n\n\ncategorical_cols = real.select_dtypes(include=[\"object\"]).columns.tolist()\nnumerical_cols = [c for c in real.columns if c not in categorical_cols]\n\n\nprint(\"Rows:\", len(real), \"Cols:\", len(real.columns))\nprint(\"Categorical:\", len(categorical_cols), \"Numerical:\", len(numerical_cols))\ndisplay(real.head())\n\n\nctgan_model = CTGAN(\n   epochs=30,\n   batch_size=500,\n   verbose=True\n)\nctgan_model.fit(real, discrete_columns=categorical_cols)\nsynthetic_ctgan = ctgan_model.sample(5000)\nprint(\"Standalone CTGAN sample:\")\ndisplay(synthetic_ctgan.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We load the CTGAN Adult demo dataset and perform minimal normalization on column names and data types. We explicitly identify categorical and numerical columns, which is critical for both CTGAN training and evaluation. 
We then train a baseline standalone CTGAN model and generate synthetic samples for comparison.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">metadata = SingleTableMetadata()\nmetadata.detect_from_dataframe(data=real)\nmetadata.update_column(column_name=target_col, sdtype=\"categorical\")\n\n\nconstraints = []\n\n\nif len(numerical_cols) &gt;= 2:\n   col_lo, col_hi = numerical_cols[0], numerical_cols[1]\n   constraints.append(Inequality(low_column_name=col_lo, high_column_name=col_hi))\n   print(f\"Added Inequality constraint: {col_hi} &gt;= {col_lo}\")\n\n\nif len(categorical_cols) &gt;= 2:\n   c1, c2 = categorical_cols[0], categorical_cols[1]\n   constraints.append(FixedCombinations(column_names=[c1, c2]))\n   print(f\"Added FixedCombinations constraint on: [{c1}, {c2}]\")\n\n\nsynth = CTGANSynthesizer(\n   metadata=metadata,\n   epochs=30,\n   batch_size=500\n)\n\n\nif constraints:\n   synth.add_constraints(constraints)\n\n\nsynth.fit(real)\n\n\nsynthetic_sdv = synth.sample(num_rows=5000)\nprint(\"SDV CTGANSynthesizer sample:\")\ndisplay(synthetic_sdv.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We build a metadata object by auto-detecting column types from the dataframe and explicitly mark the target column as categorical. We introduce structural constraints using SDV\u2019s constraint-augmented generation (CAG) API: an inequality between the first two numerical columns and a fixed-combinations rule that restricts the first two categorical columns to pairs seen in the real data. The columns are picked by position purely for illustration. 
We then train a CTGAN-based SDV synthesizer that respects these constraints during generation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">loss_df = synth.get_loss_values()\ndisplay(loss_df.tail())\n\n\n# Normalize headers (e.g. \"Generator Loss\" becomes \"generator_loss\") so the lowercase candidates below can match\nloss_df.columns = [str(c).strip().lower().replace(\" \", \"_\") for c in loss_df.columns]\n\n\nx_candidates = [\"epoch\", \"step\", \"steps\", \"iteration\", \"iter\", \"batch\", \"update\"]\nxcol = next((c for c in x_candidates if c in loss_df.columns), None)\n\n\ng_candidates = [\"generator_loss\", \"gen_loss\", \"g_loss\"]\nd_candidates = [\"discriminator_loss\", \"disc_loss\", \"d_loss\"]\ngcol = next((c for c in g_candidates if c in loss_df.columns), None)\ndcol = next((c for c in d_candidates if c in loss_df.columns), None)\n\n\nplt.figure(figsize=(10,4))\n\n\nif xcol is None:\n   x = np.arange(len(loss_df))\nelse:\n   x = loss_df[xcol].to_numpy()\n\n\nif gcol is not None:\n   plt.plot(x, loss_df[gcol].to_numpy(), label=gcol)\nif dcol is not None:\n   plt.plot(x, loss_df[dcol].to_numpy(), label=dcol)\n\n\nplt.xlabel(xcol if xcol is not None else \"index\")\nplt.ylabel(\"loss\")\nplt.legend()\nplt.title(\"CTGAN training losses (SDV wrapper)\")\nplt.show()\n\n\ncond_col = categorical_cols[0]\ncommon_value = real[cond_col].value_counts().index[0]\nconditions = [Condition({cond_col: common_value}, num_rows=2000)]\n\n\nsynthetic_cond = synth.sample_from_conditions(\n   conditions=conditions,\n   max_tries_per_batch=200,\n   batch_size=5000\n)\n\n\nprint(\"Conditional 
sampling requested:\", 2000, \"got:\", len(synthetic_cond))\nprint(\"Conditional sample distribution (top 5):\")\nprint(synthetic_cond[cond_col].value_counts().head(5))\ndisplay(synthetic_cond.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We extract and visualize the dynamics of generator and discriminator losses using a version-robust plotting strategy. We perform conditional sampling to generate data under specific attribute constraints and verify that the conditions are satisfied. This demonstrates how CTGAN behaves under guided generation scenarios.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">metadata_dict = metadata.to_dict()\n\n\ndiagnostic = DiagnosticReport()\ndiagnostic.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)\nprint(\"Diagnostic score:\", diagnostic.get_score())\n\n\nquality = QualityReport()\nquality.generate(real_data=real, synthetic_data=synthetic_sdv, metadata=metadata_dict, verbose=True)\nprint(\"Quality score:\", quality.get_score())\n\n\ndef show_report_details(report, title):\n   print(f\"\\n===== {title} details =====\")\n   # get_properties() returns a DataFrame; iterate its \"Property\" column, not the column labels\n   props = report.get_properties()\n   for p in props[\"Property\"]:\n       print(f\"\\n--- {p} ---\")\n       details = report.get_details(property_name=p)\n       try:\n           display(details.head(10))\n       except Exception:\n           display(details)\n\n\nshow_report_details(diagnostic, 
\"DiagnosticReport\")\nshow_report_details(quality, \"QualityReport\")\n\n\ntrain_real, test_real = train_test_split(\n   real, test_size=0.25, random_state=42, stratify=real[target_col]\n)\n\n\ndef make_pipeline(cat_cols, num_cols):\n   pre = ColumnTransformer(\n       transformers=[\n           (\"cat\", OneHotEncoder(handle_unknown=\"ignore\"), cat_cols),\n           (\"num\", \"passthrough\", num_cols),\n       ],\n       remainder=\"drop\"\n   )\n   clf = LogisticRegression(max_iter=200)\n   return Pipeline([(\"pre\", pre), (\"clf\", clf)])\n\n\npipe_syn = make_pipeline(categorical_cols, numerical_cols)\npipe_syn.fit(synthetic_sdv.drop(columns=[target_col]), synthetic_sdv[target_col])\n\n\nproba_syn = pipe_syn.predict_proba(test_real.drop(columns=[target_col]))[:, 1]\ny_true = (test_real[target_col].astype(str).str.contains(\"&gt;\")).astype(int)\nauc_syn = roc_auc_score(y_true, proba_syn)\nprint(\"Synthetic-train -&gt; Real-test AUC:\", auc_syn)\n\n\npipe_real = make_pipeline(categorical_cols, numerical_cols)\npipe_real.fit(train_real.drop(columns=[target_col]), train_real[target_col])\n\n\nproba_real = pipe_real.predict_proba(test_real.drop(columns=[target_col]))[:, 1]\nauc_real = roc_auc_score(y_true, proba_real)\nprint(\"Real-train -&gt; Real-test AUC:\", auc_real)\n\n\nmodel_path = \"ctgan_sdv_synth.pkl\"\nsynth.save(model_path)\nprint(\"Saved synthesizer to:\", model_path)\n\n\nfrom sdv.utils import load_synthesizer\nsynth_loaded = load_synthesizer(model_path)\n\n\nsynthetic_loaded = synth_loaded.sample(1000)\nprint(\"Loaded synthesizer sample:\")\ndisplay(synthetic_loaded.head())<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We evaluate synthetic data using SDMetrics diagnostic and quality reports and a property-level inspection. We validate downstream usefulness by training a classifier on synthetic data and testing it on real data. 
Finally, we serialize the trained synthesizer and confirm that it can be reloaded and sampled reliably.<\/p>\n<p>In conclusion, we demonstrated that synthetic data generation with CTGAN becomes significantly more powerful when paired with metadata, constraints, and rigorous evaluation. By validating both statistical similarity and downstream task performance, we ensured that the synthetic data is not only realistic but also useful. This pipeline serves as a strong foundation for privacy-preserving analytics, data sharing, and simulation workflows. With careful configuration and evaluation, CTGAN can be safely deployed in real-world data science systems.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Data%20Science\/ctgan_sdv_synthetic_data_pipeline_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes here<\/a>.<\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/02\/13\/in-depth-guide-the-complete-ctgan-sdv-pipeline-for-high-fidelity-synthetic-data\/\">[In-Depth Guide] The Complete CTGAN + SDV Pipeline for High-Fidelity Synthetic Data<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we build a c&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-412","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/412","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=412"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/412\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=412"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fw
p%2Fv2%2Fcategories&post=412"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}