{"id":714,"date":"2026-04-13T04:17:15","date_gmt":"2026-04-12T20:17:15","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=714"},"modified":"2026-04-13T04:17:15","modified_gmt":"2026-04-12T20:17:15","slug":"a-coding-implementation-of-molmoact-for-depth-aware-spatial-reasoning-visual-trajectory-tracing-and-robotic-action-prediction","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=714","title":{"rendered":"A Coding Implementation of MolmoAct for Depth-Aware Spatial Reasoning, Visual Trajectory Tracing, and Robotic Action Prediction"},"content":{"rendered":"<p>In this tutorial, we walk through <a href=\"https:\/\/github.com\/allenai\/molmoact\"><strong>MolmoAct<\/strong><\/a> step by step and build a practical understanding of how action-reasoning models can reason in space from visual observations. We set up the environment, load the model, prepare multi-view image inputs, and explore how MolmoAct produces depth-aware reasoning, visual traces, and actionable robot outputs from natural language instructions. 
As we move through the workflow, we run inference and also examine how the model parses actions, visualizes trajectories, and supports more advanced processing pipelines for robotics-oriented tasks.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f527.png\" alt=\"\ud83d\udd27\" class=\"wp-smiley\" \/> SECTION 1: INSTALLATION AND SETUP\")\nprint(\"=\" * 80)\n\n\nimport subprocess\nimport sys\n\n\ndef install_packages():\n   \"\"\"Install all required packages for MolmoAct\"\"\"\n   packages = [\n       \"torch&gt;=2.0.0\",\n       \"torchvision\",\n       \"transformers==4.52\",\n       \"accelerate\",\n       \"einops\",\n       \"Pillow\",\n       \"numpy\",\n       \"matplotlib\",\n       \"requests\",\n       \"scipy\",\n       \"huggingface_hub\",\n   ]\n  \n   for package in packages:\n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e6.png\" alt=\"\ud83d\udce6\" class=\"wp-smiley\" \/> Installing {package}...\")\n       subprocess.check_call([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", package])\n  \n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> All packages installed 
successfully!\")\n\n\ninstall_packages()\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4da.png\" alt=\"\ud83d\udcda\" class=\"wp-smiley\" \/> SECTION 2: IMPORTS AND CONFIGURATION\")\nprint(\"=\" * 80)\n\n\nimport torch\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom PIL import Image\nimport requests\nfrom io import BytesIO\nfrom typing import List, Tuple, Dict, Optional, Union\nimport json\nimport time\nfrom dataclasses import dataclass\nimport warnings\nimport re\n\n\nwarnings.filterwarnings(\"ignore\", category=FutureWarning)\nwarnings.filterwarnings(\"ignore\", category=UserWarning)\n\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nprint(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f5a5.png\" alt=\"\ud83d\udda5\" class=\"wp-smiley\" \/>  Device: {device}\")\nif torch.cuda.is_available():\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3ae.png\" alt=\"\ud83c\udfae\" class=\"wp-smiley\" \/> GPU: {torch.cuda.get_device_name(0)}\")\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4be.png\" alt=\"\ud83d\udcbe\" class=\"wp-smiley\" \/> GPU Memory: {torch.cuda.get_device_properties(0).total_memory \/ 1e9:.2f} GB\")\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f916.png\" alt=\"\ud83e\udd16\" class=\"wp-smiley\" \/> SECTION 3: MOLMOACT MODEL LOADER\")\nprint(\"=\" * 80)\n\n\n@dataclass\nclass MolmoActConfig:\n   \"\"\"Configuration for MolmoAct model\"\"\"\n   model_name: str = \"allenai\/MolmoAct-7B-D-0812\"\n   torch_dtype: str = \"bfloat16\"\n   device_map: str = \"auto\"\n   max_new_tokens: int = 256\n   temperature: float = 0.0\n   do_sample: bool = False<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the tutorial and prepare the environment needed to run MolmoAct in Google Colab. 
We install all required packages, import the core libraries, and configure the runtime to detect whether GPU acceleration is available. We also define the base configuration class that stores the main model settings we use throughout the rest of the tutorial.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class MolmoActModel:\n   \"\"\"\n   MolmoAct Model Wrapper for Easy Inference\n  \n   This class provides a high-level interface for:\n   - Loading and managing the model\n   - Running inference with proper prompting\n   - Parsing outputs (depth, trace, actions)\n   - Batch processing\n   \"\"\"\n  \n   def __init__(self, config: Optional[MolmoActConfig] = None):\n       self.config = config or MolmoActConfig()\n       self.model = None\n       self.processor = None\n       self._loaded = False\n      \n   def load(self) -&gt; None:\n       \"\"\"Load the MolmoAct model and processor\"\"\"\n       if self._loaded:\n           print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/> Model already loaded!\")\n           return\n          \n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f504.png\" alt=\"\ud83d\udd04\" class=\"wp-smiley\" \/> Loading MolmoAct model: {self.config.model_name}\")\n       print(\"   This may take a few minutes on 
first run...\")\n      \n       from transformers import AutoModelForImageTextToText, AutoProcessor\n      \n       dtype = getattr(torch, self.config.torch_dtype)\n      \n       print(\"   <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e5.png\" alt=\"\ud83d\udce5\" class=\"wp-smiley\" \/> Loading model weights...\")\n       self.model = AutoModelForImageTextToText.from_pretrained(\n           self.config.model_name,\n           trust_remote_code=True,\n           torch_dtype=dtype,\n           device_map=self.config.device_map,\n       )\n      \n       print(\"   <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e5.png\" alt=\"\ud83d\udce5\" class=\"wp-smiley\" \/> Loading processor...\")\n       try:\n           self.processor = AutoProcessor.from_pretrained(\n               self.config.model_name,\n               trust_remote_code=True,\n           )\n           if hasattr(self.processor, 'tokenizer'):\n               self.processor.tokenizer.padding_side = \"left\"\n       except TypeError as e:\n           if \"prompt_templates\" in str(e):\n               print(\"   <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/> Handling custom processor configuration...\")\n               from transformers.dynamic_module_utils import get_class_from_dynamic_module\n              \n               processor_class = get_class_from_dynamic_module(\n                   \"processing_molmoact.MolmoActProcessor\",\n                   self.config.model_name,\n                   trust_remote_code=True,\n               )\n              \n               from transformers import AutoTokenizer, AutoImageProcessor\n              \n               tokenizer = AutoTokenizer.from_pretrained(\n                   self.config.model_name,\n                   trust_remote_code=True,\n                   padding_side=\"left\",\n         
      )\n              \n               image_processor = AutoImageProcessor.from_pretrained(\n                   self.config.model_name,\n                   trust_remote_code=True,\n               )\n              \n               self.processor = processor_class(\n                   image_processor=image_processor,\n                   tokenizer=tokenizer,\n               )\n           else:\n               raise e\n      \n       self._loaded = True\n       print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Model loaded successfully!\")\n       self._print_model_info()\n      \n   def _print_model_info(self) -&gt; None:\n       \"\"\"Print model information\"\"\"\n       total_params = sum(p.numel() for p in self.model.parameters())\n       trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)\n       print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4ca.png\" alt=\"\ud83d\udcca\" class=\"wp-smiley\" \/> Model Statistics:\")\n       print(f\"   Total Parameters: {total_params \/ 1e9:.2f}B\")\n       print(f\"   Trainable Parameters: {trainable_params \/ 1e9:.2f}B\")\n       print(f\"   Model dtype: {next(self.model.parameters()).dtype}\")\n      \n   def build_prompt(self, instruction: str) -&gt; str:\n       \"\"\"\n       Build the reasoning prompt for MolmoAct\n      \n       The prompt structure is crucial for MolmoAct to generate:\n       1. Depth perception tokens\n       2. Visual trajectory trace\n       3. Action predictions\n       \"\"\"\n       prompt = (\n           f\"The task is {instruction}. \"\n           \"What is the action that the robot should take. \"\n           f\"To figure out the action that the robot should take to {instruction}, \"\n           \"let's think through it step by step. \"\n           \"First, what is the depth map for the first image? 
\"\n           \"Second, what is the trajectory of the end effector in the first image? \"\n           \"Based on the depth map of the first image and the trajectory of the end effector in the first image, \"\n           \"along with other images from different camera views as additional information, \"\n           \"what is the action that the robot should take?\"\n       )\n       return prompt<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We begin building the main MolmoAct model wrapper that makes inference easier to manage. We load the model and processor, handle custom processor initialization logic, and print useful model statistics once loading is complete. We also define a prompt-building method that helps us structure the reasoning query to guide the model toward depth, trace, and action generation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">   @torch.inference_mode()\n   def generate(\n       self,\n       images: List[Image.Image],\n       instruction: str,\n       max_new_tokens: Optional[int] = None,\n   ) -&gt; Dict:\n       \"\"\"\n       Generate action reasoning from images and instruction\n      \n       Args:\n           images: List of PIL Images\n           instruction: Task instruction\n           max_new_tokens: Override default max tokens\n          \n       Returns:\n           Dictionary containing:\n           - text: Generated reasoning text\n           - 
depth: Parsed depth tokens\n           - trace: Parsed visual trace coordinates\n           - action: Parsed action values\n       \"\"\"\n       if not self._loaded:\n           raise RuntimeError(\"Model not loaded! Call .load() first.\")\n      \n       prompt = self.build_prompt(instruction)\n       max_tokens = max_new_tokens or self.config.max_new_tokens\n      \n       text = self.processor.apply_chat_template(\n           [{\"role\": \"user\", \"content\": [dict(type=\"text\", text=prompt)]}],\n           tokenize=False,\n           add_generation_prompt=True,\n       )\n      \n       inputs = self.processor(\n           images=[images],\n           text=text,\n           padding=True,\n           return_tensors=\"pt\",\n       )\n      \n       inputs = {k: v.to(self.model.device) for k, v in inputs.items()}\n      \n       with torch.autocast(\"cuda\", enabled=True, dtype=torch.bfloat16):\n           generated_ids = self.model.generate(\n               **inputs,\n               max_new_tokens=max_tokens,\n               do_sample=self.config.do_sample,\n           )\n      \n       generated_tokens = generated_ids[:, inputs['input_ids'].size(1):]\n       generated_text = self.processor.batch_decode(\n           generated_tokens,\n           skip_special_tokens=True,\n           clean_up_tokenization_spaces=False\n       )[0]\n      \n       result = {\n           \"text\": generated_text,\n           \"depth\": self._safe_parse_depth(generated_text),\n           \"trace\": self._safe_parse_trace(generated_text),\n           \"action\": self._safe_parse_action(generated_text, unnorm_key=\"molmoact\"),\n           \"action_raw\": self._safe_parse_action(generated_text, unnorm_key=None),\n       }\n      \n       return result\n  \n   def _safe_parse_depth(self, text: str) -&gt; List[str]:\n       \"\"\"Safely parse depth tokens from generated text\"\"\"\n       try:\n           if hasattr(self.model, 'parse_depth'):\n               return 
self.model.parse_depth(text)\n       except Exception:\n           pass\n      \n       depth_pattern = r'&lt;DEPTH_START&gt;.*?&lt;DEPTH_END&gt;'\n       matches = re.findall(depth_pattern, text, re.DOTALL)\n       return matches if matches else []\n  \n   def _safe_parse_trace(self, text: str) -&gt; List[List[List[int]]]:\n       \"\"\"Safely parse visual trace coordinates from generated text\"\"\"\n       try:\n           if hasattr(self.model, 'parse_trace'):\n               return self.model.parse_trace(text)\n       except Exception:\n           pass\n      \n       coord_pattern = r'\\[(\\d+),\\s*(\\d+)\\]|\\((\\d+),\\s*(\\d+)\\)'\n       matches = re.findall(coord_pattern, text)\n      \n       traces = []\n       current_trace = []\n       for match in matches:\n           x = int(match[0] or match[2])\n           y = int(match[1] or match[3])\n           if 0 &lt;= x &lt;= 256 and 0 &lt;= y &lt;= 256:\n               current_trace.append([x, y])\n      \n       if current_trace:\n           traces.append(current_trace)\n      \n       return traces\n  \n   def _safe_parse_action(self, text: str, unnorm_key: Optional[str] = None) -&gt; List[List[float]]:\n       \"\"\"Safely parse action values from generated text\"\"\"\n       try:\n           if hasattr(self.model, 'parse_action'):\n               return self.model.parse_action(text, unnorm_key=unnorm_key)\n       except Exception:\n           pass\n      \n       float_pattern = r'[-+]?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?'\n       all_floats = re.findall(float_pattern, text)\n      \n       actions = []\n       floats = [float(f) for f in all_floats]\n      \n       for i in range(len(floats) - 6):\n           potential_action = floats[i:i+7]\n           if all(-5 &lt; v &lt; 5 for v in potential_action[:6]):\n               actions.append(potential_action)\n               break\n      \n       return actions\n  \n   def batch_generate(\n       self,\n       batch_data: List[Tuple[List[Image.Image], str]],\n       progress: bool = 
True\n   ) -&gt; List[Dict]:\n       \"\"\"\n       Process multiple observations in batch\n       \"\"\"\n       results = []\n       total = len(batch_data)\n      \n       for i, (images, instruction) in enumerate(batch_data):\n           if progress:\n               print(f\"\\r<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f504.png\" alt=\"\ud83d\udd04\" class=\"wp-smiley\" \/> Processing {i+1}\/{total}...\", end=\"\", flush=True)\n          \n           result = self.generate(images, instruction)\n           results.append(result)\n      \n       if progress:\n           print(f\"\\r<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Processed {total} observations!\")\n      \n       return results\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3a8.png\" alt=\"\ud83c\udfa8\" class=\"wp-smiley\" \/> SECTION 4: VISUALIZATION UTILITIES\")\nprint(\"=\" * 80)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement the core generation pipeline that takes images and an instruction and produces structured reasoning outputs. We process the inputs, run inference, decode the generated response, and extract depth, trace, and action information from the model output. 
We also add safe parsing methods and batch-processing support, enabling us to handle multiple observations more reliably and efficiently.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">class MolmoActVisualizer:\n   \"\"\"Visualization utilities for MolmoAct outputs\"\"\"\n  \n   def __init__(self, figsize: Tuple[int, int] = (12, 8)):\n       self.figsize = figsize\n       self.colors = plt.cm.viridis(np.linspace(0, 1, 10))\n  \n   def plot_trace(\n       self,\n       image: Image.Image,\n       trace: List[List[int]],\n       title: str = \"Visual Reasoning Trace\",\n       save_path: Optional[str] = None\n   ) -&gt; None:\n       \"\"\"Plot visual trace overlaid on image\"\"\"\n       fig, ax = plt.subplots(figsize=self.figsize)\n      \n       img_array = np.array(image)\n       ax.imshow(img_array)\n      \n       if trace and len(trace) &gt; 0:\n           h, w = img_array.shape[:2]\n           trace_array = np.array(trace)\n          \n           x_coords = trace_array[:, 0] * w \/ 256\n           y_coords = trace_array[:, 1] * h \/ 256\n          \n           ax.plot(x_coords, y_coords, 'w-', linewidth=2, alpha=0.7)\n           ax.plot(x_coords, y_coords, 'c-', linewidth=1, alpha=0.9)\n          \n           for i, (x, y) in enumerate(zip(x_coords, y_coords)):\n               color_idx = int(i * 9 \/ max(len(x_coords) - 1, 1))\n               ax.scatter(x, y, 
c=[self.colors[color_idx]], s=100,\n                         edgecolors='white', linewidths=2, zorder=5)\n               ax.annotate(f'{i+1}', (x, y), textcoords=\"offset points\",\n                          xytext=(5, 5), fontsize=10, color='white',\n                          fontweight='bold')\n          \n           ax.scatter(x_coords[0], y_coords[0], c='lime', s=200,\n                     marker='o', edgecolors='white', linewidths=3,\n                     zorder=6, label='Start')\n           ax.scatter(x_coords[-1], y_coords[-1], c='red', s=200,\n                     marker='X', edgecolors='white', linewidths=3,\n                     zorder=6, label='End')\n      \n       ax.set_title(title, fontsize=14, fontweight='bold')\n       ax.axis('off')\n       ax.legend(loc='upper right')\n      \n       plt.tight_layout()\n      \n       if save_path:\n           plt.savefig(save_path, dpi=150, bbox_inches='tight')\n           print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4be.png\" alt=\"\ud83d\udcbe\" class=\"wp-smiley\" \/> Saved visualization to {save_path}\")\n      \n       plt.show()\n  \n   def plot_action(\n       self,\n       action: List[float],\n       action_labels: Optional[List[str]] = None,\n       title: str = \"Predicted Robot Action\",\n       save_path: Optional[str] = None\n   ) -&gt; None:\n       \"\"\"Plot action values as a bar chart\"\"\"\n       if action_labels is None:\n           action_labels = [\n               '\u0394x (forward)', '\u0394y (left)', '\u0394z (up)',\n               'Rx (roll)', 'Ry (pitch)', 'Rz (yaw)',\n               'Gripper'\n           ]\n      \n       fig, ax = plt.subplots(figsize=(10, 5))\n      \n       colors = ['#3498db', '#3498db', '#3498db',\n                 '#e74c3c', '#e74c3c', '#e74c3c',\n                 '#2ecc71']\n      \n       x = np.arange(len(action))\n       bars = ax.bar(x, action, color=colors, edgecolor='white', linewidth=1.5)\n      \n      
 for bar, val in zip(bars, action):\n           height = bar.get_height()\n           ax.annotate(f'{val:.3f}',\n                      xy=(bar.get_x() + bar.get_width() \/ 2, height),\n                      xytext=(0, 3 if height &gt;= 0 else -12),\n                      textcoords=\"offset points\",\n                      ha='center', va='bottom' if height &gt;= 0 else 'top',\n                      fontsize=9, fontweight='bold')\n      \n       ax.set_xticks(x)\n       ax.set_xticklabels(action_labels, rotation=45, ha='right')\n       ax.set_ylabel('Value', fontsize=12)\n       ax.set_title(title, fontsize=14, fontweight='bold')\n       ax.axhline(y=0, color='gray', linestyle='--', alpha=0.5)\n       ax.grid(axis='y', alpha=0.3)\n      \n       from matplotlib.patches import Patch\n       legend_elements = [\n           Patch(facecolor='#3498db', label='Position'),\n           Patch(facecolor='#e74c3c', label='Rotation'),\n           Patch(facecolor='#2ecc71', label='Gripper')\n       ]\n       ax.legend(handles=legend_elements, loc='upper right')\n      \n       plt.tight_layout()\n      \n       if save_path:\n           plt.savefig(save_path, dpi=150, bbox_inches='tight')\n      \n       plt.show()<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We create visualization utilities that help us inspect the model\u2019s reasoning outputs intuitively. We overlay predicted traces onto images and build action plots to better understand the model\u2019s spatial and control decisions. 
We use these visual tools to make the output easier to interpret and analyze during experimentation.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\"> def plot_comparison(\n       self,\n       images: List[Image.Image],\n       traces: List[List[List[int]]],\n       titles: Optional[List[str]] = None,\n       save_path: Optional[str] = None\n   ) -&gt; None:\n       \"\"\"Plot multiple images with their traces side by side\"\"\"\n       n = len(images)\n       fig, axes = plt.subplots(1, n, figsize=(5*n, 5))\n      \n       if n == 1:\n           axes = [axes]\n      \n       for idx, (ax, img, trace) in enumerate(zip(axes, images, traces)):\n           img_array = np.array(img)\n           ax.imshow(img_array)\n          \n           if trace and len(trace) &gt; 0:\n               h, w = img_array.shape[:2]\n               trace_array = np.array(trace)\n               x_coords = trace_array[:, 0] * w \/ 256\n               y_coords = trace_array[:, 1] * h \/ 256\n              \n               ax.plot(x_coords, y_coords, 'c-', linewidth=2, alpha=0.9)\n               ax.scatter(x_coords, y_coords, c='yellow', s=50,\n                         edgecolors='white', linewidths=1, zorder=5)\n          \n           title = titles[idx] if titles else f\"View {idx+1}\"\n           ax.set_title(title, fontsize=12, fontweight='bold')\n           ax.axis('off')\n      \n       
plt.tight_layout()\n      \n       if save_path:\n           plt.savefig(save_path, dpi=150, bbox_inches='tight')\n      \n       plt.show()\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2699.png\" alt=\"\u2699\" class=\"wp-smiley\" \/> SECTION 5: ACTION PROCESSING UTILITIES\")\nprint(\"=\" * 80)\n\n\nclass ActionProcessor:\n   \"\"\"Utilities for processing MolmoAct action outputs\"\"\"\n  \n   DEFAULT_STATS = {\n       \"molmoact\": {\n           \"mean\": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],\n           \"std\": [0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.5],\n       }\n   }\n  \n   def __init__(self, stats: Optional[Dict] = None):\n       self.stats = stats or self.DEFAULT_STATS\n  \n   def unnormalize(self, action: List[float], key: str = \"molmoact\") -&gt; np.ndarray:\n       \"\"\"Unnormalize action values\"\"\"\n       action = np.array(action)\n      \n       if key and key in self.stats:\n           mean = np.array(self.stats[key][\"mean\"])\n           std = np.array(self.stats[key][\"std\"])\n           action = action * std + mean\n      \n       return action\n  \n   def normalize(self, action: np.ndarray, key: str = \"molmoact\") -&gt; np.ndarray:\n       \"\"\"Normalize action values\"\"\"\n       action = np.array(action)\n      \n       if key and key in self.stats:\n           mean = np.array(self.stats[key][\"mean\"])\n           std = np.array(self.stats[key][\"std\"])\n           action = (action - mean) \/ std\n      \n       return action\n  \n   def process_gripper(self, action: np.ndarray, threshold: float = 0.5) -&gt; Tuple[np.ndarray, bool]:\n       \"\"\"Process gripper action value\"\"\"\n       gripper_value = action[-1]\n       gripper_open = gripper_value &gt; threshold\n       return action[:-1], gripper_open\n  \n   def smooth_actions(self, actions: List[np.ndarray], window_size: int = 3) -&gt; List[np.ndarray]:\n       \"\"\"Smooth action sequence using moving 
average\"\"\"\n       if len(actions) &lt; window_size:\n           return actions\n      \n       actions_array = np.array(actions)\n       smoothed = np.zeros_like(actions_array)\n      \n       for i in range(len(actions)):\n           start = max(0, i - window_size \/\/ 2)\n           end = min(len(actions), i + window_size \/\/ 2 + 1)\n           smoothed[i] = actions_array[start:end].mean(axis=0)\n      \n       return [smoothed[i] for i in range(len(smoothed))]\n  \n   @staticmethod\n   def action_to_pose_delta(action: np.ndarray, scale: float = 1.0) -&gt; Dict[str, np.ndarray]:\n       \"\"\"Convert action to position and rotation deltas\"\"\"\n       return {\n           \"position_delta\": action[:3] * scale,\n           \"rotation_delta\": action[3:6],\n           \"gripper\": action[6] if len(action) &gt; 6 else 1.0\n       }\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f680.png\" alt=\"\ud83d\ude80\" class=\"wp-smiley\" \/> SECTION 6: EXAMPLE USAGE AND DEMO\")\nprint(\"=\" * 80)<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We complete the visualization module and then introduce utilities for processing predicted actions. We define functions for normalization, unnormalization, gripper-state handling, smoothing, and conversion of actions into pose deltas. 
We use these utilities to transform raw model outputs into forms that are more useful for robotics analysis and downstream control.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def load_example_images() -&gt; Tuple[Image.Image, Image.Image]:\n   \"\"\"Load example images from HuggingFace\"\"\"\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e5.png\" alt=\"\ud83d\udce5\" class=\"wp-smiley\" \/> Loading example images...\")\n  \n   url1 = \"https:\/\/huggingface.co\/allenai\/MolmoAct-7B-D-0812\/resolve\/main\/example_1.png\"\n   url2 = \"https:\/\/huggingface.co\/allenai\/MolmoAct-7B-D-0812\/resolve\/main\/example_2.png\"\n  \n   headers = {\"User-Agent\": \"python-requests\"}\n  \n   r1 = requests.get(url1, headers=headers, timeout=30)\n   r1.raise_for_status()\n   r2 = requests.get(url2, headers=headers, timeout=30)\n   r2.raise_for_status()\n  \n   img1 = Image.open(BytesIO(r1.content)).convert(\"RGB\")\n   img2 = Image.open(BytesIO(r2.content)).convert(\"RGB\")\n  \n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Loaded images: {img1.size} and {img2.size}\")\n  \n   return img1, img2\n\n\n\n\ndef display_images(img1: Image.Image, img2: Image.Image) -&gt; None:\n   \"\"\"Display the example images\"\"\"\n   fig, axes = 
plt.subplots(1, 2, figsize=(12, 5))\n  \n   axes[0].imshow(img1)\n   axes[0].set_title(\"Side View (Exocentric)\", fontsize=12, fontweight='bold')\n   axes[0].axis('off')\n  \n   axes[1].imshow(img2)\n   axes[1].set_title(\"Wrist View (Egocentric)\", fontsize=12, fontweight='bold')\n   axes[1].axis('off')\n  \n   plt.tight_layout()\n   plt.show()\n\n\n\n\ndef run_demo():\n   \"\"\"\n   Run the complete MolmoAct demo\n   \"\"\"\n   print(\"\\n\" + \"=\" * 80)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3ac.png\" alt=\"\ud83c\udfac\" class=\"wp-smiley\" \/> RUNNING MOLMOACT DEMO\")\n   print(\"=\" * 80)\n  \n   img1, img2 = load_example_images()\n   display_images(img1, img2)\n  \n   print(\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4e6.png\" alt=\"\ud83d\udce6\" class=\"wp-smiley\" \/> Initializing MolmoAct...\")\n   config = MolmoActConfig(\n       model_name=\"allenai\/MolmoAct-7B-D-0812\",\n       torch_dtype=\"bfloat16\",\n       max_new_tokens=256,\n   )\n   model = MolmoActModel(config)\n  \n   model.load()\n  \n   instruction = \"close the box\"\n   print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3af.png\" alt=\"\ud83c\udfaf\" class=\"wp-smiley\" \/> Task Instruction: '{instruction}'\")\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f504.png\" alt=\"\ud83d\udd04\" class=\"wp-smiley\" \/> Generating action reasoning...\")\n  \n   start_time = time.time()\n   result = model.generate([img1, img2], instruction)\n   inference_time = time.time() - start_time\n  \n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/23f1.png\" alt=\"\u23f1\" class=\"wp-smiley\" \/>  Inference time: {inference_time:.2f}s\")\n  \n   print(\"\\n\" + \"-\" * 60)\n   print(\"<img decoding=\"async\" 
src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4dd.png\" alt=\"\ud83d\udcdd\" class=\"wp-smiley\" \/> GENERATED REASONING:\")\n   print(\"-\" * 60)\n   print((result['text'][:500] + \"...\") if len(result['text']) &gt; 500 else result['text'])\n  \n   print(\"\\n\" + \"-\" * 60)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f50d.png\" alt=\"\ud83d\udd0d\" class=\"wp-smiley\" \/> PARSED OUTPUTS:\")\n   print(\"-\" * 60)\n  \n   print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f30a.png\" alt=\"\ud83c\udf0a\" class=\"wp-smiley\" \/> Depth Tokens: {result['depth'][0][:50]}...\" if result['depth'] else \"No depth tokens\")\n   print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4cd.png\" alt=\"\ud83d\udccd\" class=\"wp-smiley\" \/> Visual Trace: {result['trace']}\")\n   print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3ae.png\" alt=\"\ud83c\udfae\" class=\"wp-smiley\" \/> Action (unnormalized): {result['action']}\")\n   print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3ae.png\" alt=\"\ud83c\udfae\" class=\"wp-smiley\" \/> Action (raw): {result['action_raw']}\")\n  \n   print(\"\\n\" + \"-\" * 60)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f3a8.png\" alt=\"\ud83c\udfa8\" class=\"wp-smiley\" \/> VISUALIZATIONS:\")\n   print(\"-\" * 60)\n  \n   visualizer = MolmoActVisualizer()\n  \n   if result['trace'] and len(result['trace']) &gt; 0:\n       visualizer.plot_trace(\n           img1,\n           result['trace'][0],\n           title=f\"Visual Trace for: '{instruction}'\"\n       )\n  \n   if result['action'] and len(result['action']) &gt; 0:\n       visualizer.plot_action(\n           result['action'][0],\n           title=f\"Predicted Action for: '{instruction}'\"\n       
)\n  \n   print(\"\\n\" + \"-\" * 60)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2699.png\" alt=\"\u2699\" class=\"wp-smiley\" \/>  ACTION PROCESSING:\")\n   print(\"-\" * 60)\n  \n   if result['action'] and len(result['action']) &gt; 0:\n       processor = ActionProcessor()\n       action = np.array(result['action'][0])\n      \n       pose_delta = processor.action_to_pose_delta(action)\n       print(f\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4d0.png\" alt=\"\ud83d\udcd0\" class=\"wp-smiley\" \/> Position Delta: {pose_delta['position_delta']}\")\n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f504.png\" alt=\"\ud83d\udd04\" class=\"wp-smiley\" \/> Rotation Delta: {pose_delta['rotation_delta']}\")\n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/270b.png\" alt=\"\u270b\" class=\"wp-smiley\" \/> Gripper State: {'OPEN' if pose_delta['gripper'] &gt; 0.5 else 'CLOSED'}\")\n  \n   print(\"\\n\" + \"=\" * 80)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> DEMO COMPLETED!\")\n   print(\"=\" * 80)\n  \n   return model, result\n\n\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f52c.png\" alt=\"\ud83d\udd2c\" class=\"wp-smiley\" \/> SECTION 7: ADVANCED FEATURES\")\nprint(\"=\" * 80)\n\n\nclass MolmoActRollout:\n   \"\"\"Rollout controller for continuous action generation\"\"\"\n  \n   def __init__(\n       self,\n       model: MolmoActModel,\n       action_chunk_size: int = 8,\n       smoothing_window: int = 3\n   ):\n       self.model = model\n       self.action_chunk_size = action_chunk_size\n       self.smoothing_window = smoothing_window\n       self.processor = ActionProcessor()\n       
self.action_history = []\n       self.reset()\n  \n   def reset(self):\n       \"\"\"Reset rollout state\"\"\"\n       self.action_history = []\n       self.step_count = 0\n  \n   def step(self, images: List[Image.Image], instruction: str) -&gt; Dict:\n       \"\"\"Execute one step of the rollout\"\"\"\n       result = self.model.generate(images, instruction)\n      \n       if result['action'] and len(result['action']) &gt; 0:\n           action = np.array(result['action'][0])\n           self.action_history.append(action)\n           self.step_count += 1\n          \n           if len(self.action_history) &gt;= self.smoothing_window:\n               smoothed = self.processor.smooth_actions(\n                   self.action_history[-self.smoothing_window:],\n                   self.smoothing_window\n               )[-1]\n           else:\n               smoothed = action\n          \n           result['smoothed_action'] = smoothed\n           result['step'] = self.step_count\n      \n       return result\n  \n   def get_action_statistics(self) -&gt; Dict:\n       \"\"\"Get statistics of collected actions\"\"\"\n       if not self.action_history:\n           return {}\n      \n       actions = np.array(self.action_history)\n      \n       return {\n           \"mean\": actions.mean(axis=0).tolist(),\n           \"std\": actions.std(axis=0).tolist(),\n           \"min\": actions.min(axis=0).tolist(),\n           \"max\": actions.max(axis=0).tolist(),\n           \"num_steps\": len(self.action_history)\n       }\n\n\n\n\ndef demonstrate_custom_stats():\n   \"\"\"Demonstrate using custom normalization statistics\"\"\"\n   print(\"\\n\" + \"-\" * 60)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4d0.png\" alt=\"\ud83d\udcd0\" class=\"wp-smiley\" \/> CUSTOM STATISTICS DEMO\")\n   print(\"-\" * 60)\n  \n   custom_stats = {\n       \"franka\": {\n           \"mean\": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],\n           
\"std\": [0.05, 0.05, 0.05, 0.3, 0.3, 0.3, 0.5],\n       },\n       \"ur5\": {\n           \"mean\": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5],\n           \"std\": [0.08, 0.08, 0.08, 0.4, 0.4, 0.4, 0.5],\n       }\n   }\n  \n   processor = ActionProcessor(custom_stats)\n  \n   normalized_action = np.array([0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.8])\n  \n   print(\"Normalized action:\", normalized_action)\n   print(\"\\nUnnormalized for different robots:\")\n  \n   for robot in [\"franka\", \"ur5\"]:\n       unnorm = processor.unnormalize(normalized_action, key=robot)\n       print(f\"  {robot}: {unnorm}\")\n\n\n\n\nprint(\"\\n\" + \"=\" * 80)\nprint(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4a1.png\" alt=\"\ud83d\udca1\" class=\"wp-smiley\" \/> SECTION 8: TIPS AND BEST PRACTICES\")\nprint(\"=\" * 80)\n\n\ntips = \"\"\"\n\u2554\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2557\n\u2551                       MolmoAct Best Practices                                 \u2551\n\u2560\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2563\n\u2551                                                                       
       \u2551\n\u2551  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f5bc.png\" alt=\"\ud83d\uddbc\" class=\"wp-smiley\" \/>  IMAGE INPUTS:                                                           \u2551\n\u2551  \u2022 Use 2 camera views: side (exocentric) + wrist (egocentric)               \u2551\n\u2551  \u2022 Ensure good lighting and clear visibility                                \u2551\n\u2551  \u2022 Match camera setup to training distribution (Franka\/DROID-like)         \u2551\n\u2551                                                                              \u2551\n\u2551  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4dd.png\" alt=\"\ud83d\udcdd\" class=\"wp-smiley\" \/> INSTRUCTIONS:                                                            \u2551\n\u2551  \u2022 Keep instructions clear and concise                                       \u2551\n\u2551  \u2022 Use action-oriented language (\"pick up\", \"push\", \"close\")                \u2551\n\u2551  \u2022 Avoid ambiguous references                                               \u2551\n\u2551                                                                              \u2551\n\u2551  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a1.png\" alt=\"\u26a1\" class=\"wp-smiley\" \/> PERFORMANCE:                                                              \u2551\n\u2551  \u2022 Use bfloat16 for faster inference                                        \u2551\n\u2551  \u2022 Batch similar observations when possible                                  \u2551\n\u2551  \u2022 Consider vLLM for production deployment                                  \u2551\n\u2551                                                                              \u2551\n\u2551  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f527.png\" alt=\"\ud83d\udd27\" class=\"wp-smiley\" \/> 
FINE-TUNING:                                                             \u2551\n\u2551  \u2022 Collect 50-100 demonstrations for new tasks                              \u2551\n\u2551  \u2022 Use LoRA for efficient adaptation                                        \u2551\n\u2551  \u2022 Include depth perception in training data                                \u2551\n\u2551                                                                              \u2551\n\u2551  <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/>  SAFETY:                                                                  \u2551\n\u2551  \u2022 Always inspect visual traces before execution                            \u2551\n\u2551  \u2022 Implement force limits and collision detection                           \u2551\n\u2551  \u2022 Test in simulation before real-world deployment                          \u2551\n\u2551                                                                              \u2551\n\u255a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u255d\n\"\"\"\n\n\nprint(tips)\n\n\nif __name__ == \"__main__\":\n   print(\"\\n\" + \"=\" * 80)\n   print(\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f916.png\" alt=\"\ud83e\udd16\" class=\"wp-smiley\" \/> MOLMOACT ADVANCED TUTORIAL - MAIN EXECUTION\")\n   print(\"=\" * 80)\n  \n   print(\"\"\"\n   This tutorial provides a comprehensive guide to MolmoAct.\n  \n   To run the full demo (requires GPU with ~16GB 
VRAM):\n       model, result = run_demo()\n  \n   To just load images and explore:\n       img1, img2 = load_example_images()\n       display_images(img1, img2)\n  \n   For advanced features:\n       demonstrate_custom_stats()\n  \n   Happy robotics! <img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f916.png\" alt=\"\ud83e\udd16\" class=\"wp-smiley\" \/>\n   \"\"\")\n  \n   model, result = run_demo()\n  \n   print(\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/1f4f7.png\" alt=\"\ud83d\udcf7\" class=\"wp-smiley\" \/> Loading and displaying example images...\")\n   try:\n       img1, img2 = load_example_images()\n       display_images(img1, img2)\n       print(\"\\n<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/2705.png\" alt=\"\u2705\" class=\"wp-smiley\" \/> Images loaded and displayed!\")\n   except Exception as e:\n       print(f\"<img decoding=\"async\" src=\"https:\/\/s.w.org\/images\/core\/emoji\/17.0.2\/72x72\/26a0.png\" alt=\"\u26a0\" class=\"wp-smiley\" \/> Could not load images: {e}\")\n       print(\"This is expected in environments without internet access.\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We bring everything together through example image loading, demo execution, rollout logic, and best-practice guidance. We run the end-to-end workflow, visualize outputs, process predicted actions, and extend the setup to support continuous rollout and custom statistics. We conclude by presenting the main execution block, which enables us to explore MolmoAct as a complete practical pipeline for spatial reasoning and robot action generation.<\/p>\n<p>In conclusion, we gained a comprehensive, hands-on view of how MolmoAct can be used for spatial reasoning and action generation in a structured, interpretable way. 
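The rollout and custom-statistics steps both hinge on the elementwise unnormalization rule action * std + mean. Here is a minimal standalone sketch of that rule; the STATS dictionary and unnormalize helper are hypothetical stand-ins for ActionProcessor, reusing the Franka statistics from the custom-stats demo:

```python
import numpy as np

# Hypothetical standalone sketch of ActionProcessor-style unnormalization:
# a normalized 7-DoF action (dx, dy, dz, droll, dpitch, dyaw, gripper)
# is mapped back to physical units elementwise as action * std + mean.
STATS = {
    "franka": {
        "mean": np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]),
        "std": np.array([0.05, 0.05, 0.05, 0.3, 0.3, 0.3, 0.5]),
    }
}


def unnormalize(action, key="franka"):
    """Map a normalized action back to physical units for the given robot key."""
    stats = STATS[key]
    return np.asarray(action, dtype=float) * stats["std"] + stats["mean"]


delta = unnormalize([0.5, -0.3, 0.2, 0.1, -0.1, 0.05, 0.8])
print(delta)  # position deltas scale by 0.05; gripper 0.8 maps to 0.9
```

Adding a second entry to STATS (as the demo does for "ur5") retargets the same normalized action to a robot with different motion ranges without retraining.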
We went beyond basic inference by visualizing traces, processing action outputs, experimenting with rollout-style control, and understanding how the model can fit into broader simulation and robotics workflows. Through this end-to-end implementation, we saw how MolmoAct brings together vision, reasoning, and action prediction into a single practical pipeline that we can study, adapt, and extend for more advanced embodied AI applications.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Computer%20Vision\/molmoact_spatial_reasoning_action_prediction_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Full Codes here<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/04\/12\/a-coding-implementation-of-molmoact-for-depth-aware-spatial-reasoning-visual-trajectory-tracing-and-robotic-action-prediction\/\">A Coding Implementation of MolmoAct for Depth-Aware Spatial Reasoning, Visual Trajectory Tracing, and Robotic Action Prediction<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we walk thro&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-714","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/714","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=714"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/714\/re
visions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=714"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=714"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=714"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}