{"id":612,"date":"2026-03-26T07:13:43","date_gmt":"2026-03-25T23:13:43","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=612"},"modified":"2026-03-26T07:13:43","modified_gmt":"2026-03-25T23:13:43","slug":"how-to-build-a-vision-guided-web-ai-agent-with-molmoweb-4b-using-multimodal-reasoning-and-action-prediction","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=612","title":{"rendered":"How to Build a Vision-Guided Web AI Agent with MolmoWeb-4B Using Multimodal Reasoning and Action Prediction"},"content":{"rendered":"<p>In this tutorial, we explore <a href=\"https:\/\/huggingface.co\/collections\/allenai\/molmoweb\"><strong>MolmoWeb<\/strong><\/a>, Ai2\u2019s open multimodal web agent that understands and interacts with websites directly from screenshots, without relying on HTML or DOM parsing. We set up the full environment in Colab, load the MolmoWeb-4B model with efficient 4-bit quantization, and build the exact prompting workflow that lets the model reason about a web task and predict browser actions. 
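Conceptually, everything below instantiates one perceive-reason-act loop. Before diving into the real model, here is a minimal, self-contained sketch of that loop with a stubbed stand-in for the inference call (the `predict_action` policy here is purely illustrative, not MolmoWeb's):

```python
# Minimal sketch of the screenshot-based agent loop this tutorial builds.
# `predict_action` is a hypothetical stand-in for the real MolmoWeb call.
def predict_action(task, history, screenshot):
    # A real agent would run the vision-language model here; we stub a
    # deterministic two-step policy for illustration.
    if not history:
        return {"thought": "Open the target site.",
                "action": 'goto("https://example.com")'}
    return {"thought": "Task finished.", "action": 'send_msg("done")'}

def run_agent(task, max_steps=5):
    history = []
    for step in range(1, max_steps + 1):
        screenshot = None  # placeholder for a captured browser screenshot
        decision = predict_action(task, history, screenshot)
        history.append({"index": step, **decision})
        if decision["action"].startswith("send_msg"):
            break  # the agent answers the user and stops
    return history

trace = run_agent("Find the latest Molmo paper")
print([h["action"] for h in trace])
```

The same skeleton reappears later with a live browser in place of the `screenshot = None` placeholder.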
Also, we test the model on blank pages, synthetic web screenshots, and multi-step browsing scenarios to understand how screenshot-based web agents actually think, act, and maintain context across steps.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 70)\nprint(\"SECTION 1: Installing dependencies...\")\nprint(\"=\" * 70)\n\n\nimport subprocess, sys\n\n\ndef pip_install(*packages):\n   subprocess.check_call(\n       [sys.executable, \"-m\", \"pip\", \"install\", \"-q\"] + list(packages)\n   )\n\n\npip_install(\n   \"transformers&gt;=4.48.0\",\n   \"accelerate\",\n   \"bitsandbytes\",\n   \"jinja2\",\n   \"Pillow\",\n   \"requests\",\n   \"datasets\",\n   \"matplotlib\",\n   \"torch\",\n)\n\n\nimport torch\nimport re\nimport json\nimport textwrap\nfrom PIL import Image, ImageDraw, ImageFont\nimport requests\nfrom io import BytesIO\nfrom jinja2 import Template\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as patches\nfrom transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig\n\n\nprint(f\"PyTorch {torch.__version__}  |  CUDA available: {torch.cuda.is_available()}\")\nif torch.cuda.is_available():\n   print(f\"   GPU: {torch.cuda.get_device_name(0)}\")\n   mem_gb = torch.cuda.get_device_properties(0).total_memory \/ 1e9\n   print(f\"   VRAM: {mem_gb:.1f} GB\")\n\n\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"SECTION 2: Loading 
MolmoWeb-4B model...\")\nprint(\"=\" * 70)\n\n\nCHECKPOINT = \"allenai\/MolmoWeb-4B\"\n\n\nQUANTIZE = True\n\n\nif QUANTIZE:\n   print(\"Using 4-bit NF4 quantization (fits ~6 GB VRAM)\")\n   bnb_config = BitsAndBytesConfig(\n       load_in_4bit=True,\n       bnb_4bit_quant_type=\"nf4\",\n       bnb_4bit_compute_dtype=torch.bfloat16,\n       bnb_4bit_use_double_quant=True,\n   )\n   model = AutoModelForImageTextToText.from_pretrained(\n       CHECKPOINT,\n       trust_remote_code=True,\n       quantization_config=bnb_config,\n       device_map=\"auto\",\n   )\nelse:\n   print(\"Loading in full bfloat16 precision\")\n   model = AutoModelForImageTextToText.from_pretrained(\n       CHECKPOINT,\n       trust_remote_code=True,\n       torch_dtype=torch.bfloat16,\n       device_map=\"auto\",\n   )\n\n\nprocessor = AutoProcessor.from_pretrained(\n   CHECKPOINT,\n   trust_remote_code=True,\n   padding_side=\"left\",\n)\n\n\nprint(f\"Model loaded: {CHECKPOINT}\")\nprint(f\"   Device map: {model.hf_device_map if hasattr(model, 'hf_device_map') else 'single device'}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We set up the entire environment by installing all required dependencies and importing the core libraries needed for the tutorial. We ensure the runtime is properly configured for GPU usage and verify CUDA availability and device details. 
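The 4-bit choice comes down to simple arithmetic: NF4 stores weights at roughly half a byte per parameter, versus two bytes for bfloat16. A back-of-envelope estimate (illustrative numbers only; the fixed overhead allowance for activations, KV cache, and quantization metadata is an assumption, not a measured figure):

```python
def estimate_vram_gb(num_params_billions, bytes_per_param, overhead_gb=1.5):
    """Rough VRAM needed: weight bytes plus a fixed overhead allowance."""
    return num_params_billions * bytes_per_param + overhead_gb

# A ~4B-parameter model at different precisions (illustrative only):
print(f"bf16 : {estimate_vram_gb(4, 2.0):.1f} GB")  # 2 bytes/param
print(f"int8 : {estimate_vram_gb(4, 1.0):.1f} GB")  # 1 byte/param
print(f"nf4  : {estimate_vram_gb(4, 0.5):.1f} GB")  # ~0.5 bytes/param
```

This is why the NF4 path fits comfortably on a free-tier T4, while full bfloat16 would be tight.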
By the end of this step, we will have established a stable foundation for running MolmoWeb efficiently in Colab.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"n\" + \"=\" * 70)\nprint(\"SECTION 3: Understanding the prompt template &amp; action space\")\nprint(\"=\" * 70)\n\n\nMOLMOWEB_THINK_TEMPLATE = Template(\"\"\"\n# GOAL\n{{ task_description }}\n\n\n# PREVIOUS STEPS\n{% for action in past_actions -%}\n## Step {{ action['index'] }}\nTHOUGHT: {{ action['thought'] }}\nACTION: {{ action['action'] }}\n{% endfor %}\n# CURRENTLY ACTIVE PAGE\nPage {{ page_index }}: {{ page_title }} | {{ page_url }}\n\n\n# NEXT STEP\n\n\n\"\"\")\n\n\nSYSTEM_MESSAGE = \"molmo_web_think\"\n\n\nprint(\"\"\"\nMolmoWeb Action Space:\n goto(url)        - Navigate to a URL\n click(x, y)      - Click at normalised coordinates (0.0-1.0)\n type(\"text\")     - Type text into focused element\n scroll(dir)      - Scroll the page (up\/down)\n press(\"key\")     - Press a key (Enter, Tab, etc.)\n new_tab()        - Open a new tab\n switch_tab(n)    - Switch to tab n\n go_back()        - Navigate back\n send_msg(\"text\") - Reply to the user with an answer\n\"\"\")\n\n\n\n\nprint(\"=\" * 70)\nprint(\"SECTION 4: Defining helper functions\")\nprint(\"=\" * 70)\n\n\n\n\ndef build_prompt(task_description, past_actions=None, page_title=None,\n                page_url=\"about:blank\", page_index=0):\n   
\"\"\"Build the full MolmoWeb prompt from components.\"\"\"\n   if past_actions is None:\n       past_actions = []\n   user_message = MOLMOWEB_THINK_TEMPLATE.render(\n       task_description=task_description,\n       past_actions=past_actions,\n       page_title=page_title,\n       page_url=page_url,\n       page_index=page_index,\n   )\n   return f\"{SYSTEM_MESSAGE}: {user_message}\"\n\n\n\n\ndef run_inference(prompt, image, max_new_tokens=300):\n   \"\"\"Run a single forward pass through MolmoWeb and return decoded text.\"\"\"\n   messages = [\n       {\n           \"role\": \"user\",\n           \"content\": [\n               {\"type\": \"text\", \"text\": prompt},\n               {\"type\": \"image\", \"image\": image},\n           ],\n       }\n   ]\n   inputs = processor.apply_chat_template(\n       messages,\n       tokenize=True,\n       add_generation_prompt=True,\n       return_tensors=\"pt\",\n       return_dict=True,\n       padding=True,\n   )\n   inputs = {k: v.to(model.device) for k, v in inputs.items()}\n\n\n   with torch.inference_mode(), torch.autocast(\"cuda\", dtype=torch.bfloat16):\n       output = model.generate(**inputs, max_new_tokens=max_new_tokens)\n\n\n   generated_tokens = output[0, inputs[\"input_ids\"].size(1):]\n   return processor.decode(generated_tokens, skip_special_tokens=True)\n\n\n\n\ndef parse_thought_and_action(raw_output):\n   \"\"\"\n   Parse MolmoWeb output into thought and action components.\n\n\n   MolmoWeb outputs typically look like:\n       THOUGHT: I need to navigate to arxiv.org to find the paper.\n       ACTION: goto(\"https:\/\/arxiv.org\")\n\n\n   Returns a dict with 'thought' and 'action' keys.\n   \"\"\"\n   thought = \"\"\n   action = \"\"\n\n\n   thought_match = re.search(r\"THOUGHT:\\s*(.+?)(?=\\nACTION:|\\Z)\", raw_output, re.DOTALL)\n   action_match = re.search(r\"ACTION:\\s*(.+?)(?=\\n|$)\", raw_output, re.DOTALL)\n\n\n   if thought_match:\n       thought = thought_match.group(1).strip()\n   if action_match:\n       action = action_match.group(1).strip()\n\n\n   if not thought and not action:\n       lines = raw_output.strip().split(\"\\n\")\n       if len(lines) &gt;= 2:\n           thought = lines[0].strip()\n           action = lines[-1].strip()\n       else:\n           thought = raw_output.strip()\n\n\n   return {\"thought\": thought, \"action\": action}<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We load the MolmoWeb-4B model with 4-bit quantization to fit within the memory constraints of a free-tier GPU. We configure the model with BitsAndBytes for efficient inference and initialize the processor required for multimodal inputs. This step ensures that the model is ready to accept both text prompts and screenshot inputs for web agent reasoning.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def parse_click_coords(action_str):\n   \"\"\"\n   Extract normalised (x, y) coordinates from a click action string.\n   e.g., 'click(0.45, 0.32)' -&gt; (0.45, 0.32)\n   Returns None if the action is not a click.\n   \"\"\"\n   match = re.search(r\"click\\(\\s*([\\d.]+)\\s*,\\s*([\\d.]+)\\s*\\)\", action_str)\n   if match:\n       return float(match.group(1)), float(match.group(2))\n   return None\n\n\n\n\ndef parse_action_details(action_str):\n   \"\"\"\n   Parse a MolmoWeb action string into a structured dict.\n   Returns:  {\"type\": \"click\", \"x\": 0.45, \"y\": 0.32}\n             {\"type\": \"goto\", \"url\": 
\"https:\/\/...\"}\n             {\"type\": \"type\", \"text\": \"query text\"}\n             {\"type\": \"scroll\", \"direction\": \"down\"}\n             {\"type\": \"press\", \"key\": \"Enter\"}\n             {\"type\": \"send_msg\", \"message\": \"The answer is ...\"}\n             {\"type\": \"unknown\", \"raw\": \"...\"}\n   \"\"\"\n   action_str = action_str.strip()\n\n\n   m = re.match(r'click\\(\\s*([\\d.]+)\\s*,\\s*([\\d.]+)\\s*\\)', action_str)\n   if m:\n       return {\"type\": \"click\", \"x\": float(m.group(1)), \"y\": float(m.group(2))}\n\n\n   m = re.match(r'goto\\(\\s*[\"\\'](.+?)[\"\\']\\s*\\)', action_str)\n   if m:\n       return {\"type\": \"goto\", \"url\": m.group(1)}\n\n\n   m = re.match(r'type\\(\\s*[\"\\'](.+?)[\"\\']\\s*\\)', action_str)\n   if m:\n       return {\"type\": \"type\", \"text\": m.group(1)}\n\n\n   m = re.match(r'scroll\\(\\s*[\"\\']?(up|down)[\"\\']?\\s*\\)', action_str)\n   if m:\n       return {\"type\": \"scroll\", \"direction\": m.group(1)}\n\n\n   m = re.match(r'press\\(\\s*[\"\\'](.+?)[\"\\']\\s*\\)', action_str)\n   if m:\n       return {\"type\": \"press\", \"key\": m.group(1)}\n\n\n   m = re.match(r'send_msg\\(\\s*[\"\\'](.+?)[\"\\']\\s*\\)', action_str, re.DOTALL)\n   if m:\n       return {\"type\": \"send_msg\", \"message\": m.group(1)}\n\n\n   m = re.match(r'(new_tab|go_back|switch_tab)\\(\\s*(\\d*)\\s*\\)', action_str)\n   if m:\n       result = {\"type\": m.group(1)}\n       if m.group(2):\n           result[\"tab\"] = int(m.group(2))\n       return result\n\n\n   return {\"type\": \"unknown\", \"raw\": action_str}\n\n\n\n\ndef visualise_click(image, action_str, title=\"MolmoWeb Prediction\"):\n   \"\"\"\n   Draw the predicted click location on the screenshot and display it.\n   Coordinates are normalised (0-1); we convert to pixel space.\n   \"\"\"\n   coords = parse_click_coords(action_str)\n\n\n   fig, ax = plt.subplots(1, 1, figsize=(12, 7))\n   ax.imshow(image)\n   ax.set_title(title, fontsize=14)\n\n\n   if coords:\n       x_norm, y_norm = coords\n       w, h = image.size\n       x_px, 
y_px = x_norm * w, y_norm * h\n\n\n       circle = patches.Circle(\n           (x_px, y_px), radius=18, linewidth=3,\n           edgecolor=\"red\", facecolor=\"none\"\n       )\n       ax.add_patch(circle)\n       ax.plot(x_px, y_px, \"r+\", markersize=20, markeredgewidth=3)\n\n\n       ax.annotate(\n           f\"click({x_norm:.3f}, {y_norm:.3f})\",\n           (x_px, y_px), xytext=(x_px + 25, y_px - 25),\n           fontsize=11, color=\"white\",\n           bbox=dict(boxstyle=\"round,pad=0.3\", facecolor=\"red\", alpha=0.8),\n           arrowprops=dict(arrowstyle=\"-&gt;\", color=\"red\", lw=2),\n       )\n   else:\n       ax.text(\n           0.5, 0.02, f\"Action: {action_str}\", transform=ax.transAxes,\n           fontsize=12, ha=\"center\", color=\"white\",\n           bbox=dict(boxstyle=\"round,pad=0.4\", facecolor=\"blue\", alpha=0.8),\n       )\n\n\n   ax.axis(\"off\")\n   plt.tight_layout()\n   plt.show()\n\n\n\n\ndef download_image(url, size=(1280, 720)):\n   \"\"\"Download an image from a URL and resize to browser viewport dimensions.\"\"\"\n   response = requests.get(url, timeout=15)\n   img = Image.open(BytesIO(response.content)).convert(\"RGB\")\n   img = img.resize(size, Image.LANCZOS)\n   return img\n\n\n\n\ndef create_synthetic_webpage(title=\"Example Page\", elements=None):\n   \"\"\"\n   Create a synthetic webpage screenshot for testing.\n   'elements' is a list of dicts: {\"type\": \"button\"|\"input\"|\"text\"|\"link\",\n                                    \"text\": str, \"pos\": (x, y)}\n   \"\"\"\n   img = Image.new(\"RGB\", (1280, 720), color=(255, 255, 255))\n   draw = ImageDraw.Draw(img)\n\n\n   draw.rectangle([0, 0, 1280, 50], fill=(240, 240, 240))\n   draw.rectangle([180, 10, 900, 40], outline=(200, 200, 200), width=1, fill=\"white\")\n   draw.text((200, 16), f\"https:\/\/www.example.com\", fill=(100, 100, 100))\n\n\n   for cx in [30, 60, 90]:\n       draw.ellipse([cx - 8, 17, cx + 8, 33], fill=(200, 200, 200))\n\n\n   draw.text((50, 
70), title, fill=\"black\")\n\n\n   if elements:\n       for el in elements:\n           x, y = el[\"pos\"]\n           if el[\"type\"] == \"button\":\n               draw.rectangle([x, y, x + 150, y + 35], fill=(66, 133, 244))\n               draw.text((x + 10, y + 8), el[\"text\"], fill=\"white\")\n           elif el[\"type\"] == \"input\":\n               draw.rectangle([x, y, x + 300, y + 35], outline=(180, 180, 180), width=2)\n               draw.text((x + 10, y + 8), el[\"text\"], fill=(150, 150, 150))\n           elif el[\"type\"] == \"text\":\n               draw.text((x, y), el[\"text\"], fill=\"black\")\n           elif el[\"type\"] == \"link\":\n               draw.text((x, y), el[\"text\"], fill=(66, 133, 244))\n\n\n   return img\n\n\n\n\nprint(\"Helper functions defined successfully.\")\n\n\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"SECTION 5: Single-step inference - blank page (cold start)\")\nprint(\"=\" * 70)\nprint(\"The agent starts at about:blank and must decide its first action.\\n\")\n\n\nblank_image = Image.new(\"RGB\", (1280, 720), color=\"white\")\n\n\ntask = \"Go to arxiv.org and find the latest paper about Molmo from Ai2\"\n\n\nprompt = build_prompt(\n   task_description=task,\n   page_url=\"about:blank\",\n   page_index=0,\n)\n\n\nprint(f\"Task: {task}\")\nprint(\"Screenshot: blank white image (about:blank)\")\nprint(\"Running inference...\\n\")\n\n\nraw_output = run_inference(prompt, blank_image)\n\n\nprint(f\"Raw model output:\\n{raw_output}\\n\")\n\n\nparsed = parse_thought_and_action(raw_output)\nprint(f\"Thought: {parsed['thought']}\")\nprint(f\"Action:  {parsed['action']}\")\n\n\naction_details = parse_action_details(parsed[\"action\"])\nprint(f\"Parsed:  {action_details}\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We define the structured prompt template and system message that guide the model\u2019s reasoning and action generation. 
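The GOAL / PREVIOUS STEPS / CURRENTLY ACTIVE PAGE layout can also be reproduced without Jinja. A plain-Python equivalent of the same template (a self-contained sketch, useful for quickly inspecting what the model actually receives; `build_prompt_text` is a local helper, not part of MolmoWeb):

```python
def build_prompt_text(task, past_actions, page_index, page_title, page_url):
    """Plain-Python equivalent of the Jinja prompt template."""
    # One "## Step" block per previous action, mirroring the template's loop.
    steps = "".join(
        f"## Step {a['index']}\nTHOUGHT: {a['thought']}\nACTION: {a['action']}\n"
        for a in past_actions
    )
    return (
        f"# GOAL\n{task}\n\n# PREVIOUS STEPS\n{steps}"
        f"# CURRENTLY ACTIVE PAGE\nPage {page_index}: {page_title} | {page_url}\n\n"
        f"# NEXT STEP\n"
    )

print(build_prompt_text(
    "Search Google for MolmoWeb",
    [{"index": 1, "thought": "Go to Google first.",
      "action": 'goto("https://www.google.com")'}],
    1, "Google", "https://www.google.com",
))
```

Printing the rendered prompt like this is a quick sanity check that the action history accumulates in the order the model expects.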
We clearly establish how tasks, past actions, and current page context are formatted before being sent to the model. This forms the core interface that allows MolmoWeb to behave like a step-by-step web agent.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"n\" + \"=\" * 70)\nprint(\"SECTION 6: Single-step inference - webpage screenshot\")\nprint(\"=\" * 70)\n\n\nsearch_page = create_synthetic_webpage(\n   title=\"Google\",\n   elements=[\n       {\"type\": \"text\", \"text\": \"Google\", \"pos\": (560, 200)},\n       {\"type\": \"input\", \"text\": \"Search Google or type a URL\", \"pos\": (390, 340)},\n       {\"type\": \"button\", \"text\": \"Google Search\", \"pos\": (490, 400)},\n       {\"type\": \"button\", \"text\": \"I'm Feeling Lucky\", \"pos\": (660, 400)},\n   ]\n)\n\n\ntask_search = \"Search Google for 'MolmoWeb Ai2 open source web agent'\"\n\n\nprompt_search = build_prompt(\n   task_description=task_search,\n   page_title=\"Google\",\n   page_url=\"https:\/\/www.google.com\",\n   page_index=1,\n   past_actions=[\n       {\n           \"index\": 1,\n           \"thought\": \"I need to go to Google to perform a search.\",\n           \"action\": 'goto(\"https:\/\/www.google.com\")',\n       }\n   ],\n)\n\n\nprint(f\"Task: {task_search}\")\nprint(\"Screenshot: synthetic Google search page\")\nprint(\"Running inference...n\")\n\n\nraw_search = 
run_inference(prompt_search, search_page)\n\n\nprint(f\"Raw model output:n{raw_search}n\")\n\n\nparsed_search = parse_thought_and_action(raw_search)\nprint(f\"Thought: {parsed_search['thought']}\")\nprint(f\"Action:  {parsed_search['action']}\")\n\n\nvisualise_click(search_page, parsed_search[\"action\"], title=\"MolmoWeb -&gt; Google Search\")\n\n\n\n\nprint(\"n\" + \"=\" * 70)\nprint(\"SECTION 7: Multi-step agent loop (simulated)\")\nprint(\"=\" * 70)\nprint(\"\"\"\nIn production, MolmoWeb runs in a loop:\n 1. Capture screenshot from browser\n 2. Build prompt with task + action history\n 3. Run model -&gt; get thought + action\n 4. Execute action in browser (Playwright)\n 5. Repeat until send_msg() or max steps\n\n\nBelow we simulate 3 steps with synthetic screenshots.\n\"\"\")\n\n\ntask_multi = \"Go to the Ai2 website and find information about MolmoWeb\"\n\n\nprint(\"--- Step 1: about:blank ---\")\nstep1_img = Image.new(\"RGB\", (1280, 720), color=\"white\")\nstep1_prompt = build_prompt(task_multi, page_url=\"about:blank\", page_index=0)\nstep1_raw = run_inference(step1_prompt, step1_img)\nstep1_parsed = parse_thought_and_action(step1_raw)\nprint(f\"  Thought: {step1_parsed['thought']}\")\nprint(f\"  Action:  {step1_parsed['action']}\")\n\n\nhistory = [{\"index\": 1, \"thought\": step1_parsed[\"thought\"], \"action\": step1_parsed[\"action\"]}]\n\n\nprint(\"n--- Step 2: Ai2 homepage ---\")\nstep2_img = create_synthetic_webpage(\n   title=\"Allen Institute for AI\",\n   elements=[\n       {\"type\": \"text\", \"text\": \"AI for the Common Good\", \"pos\": (50, 120)},\n       {\"type\": \"link\", \"text\": \"Open Models\", \"pos\": (50, 180)},\n       {\"type\": \"link\", \"text\": \"Molmo\", \"pos\": (50, 210)},\n       {\"type\": \"link\", \"text\": \"MolmoWeb\", \"pos\": (50, 240)},\n       {\"type\": \"link\", \"text\": \"OLMo\", \"pos\": (50, 270)},\n       {\"type\": \"link\", \"text\": \"Research\", \"pos\": (50, 310)},\n       {\"type\": \"link\", 
\"text\": \"News\", \"pos\": (50, 340)},\n       {\"type\": \"input\", \"text\": \"Search...\", \"pos\": (800, 70)},\n   ]\n)\n\n\nstep2_prompt = build_prompt(\n   task_multi,\n   past_actions=history,\n   page_title=\"Allen Institute for AI\",\n   page_url=\"https:\/\/allenai.org\",\n   page_index=1,\n)\nstep2_raw = run_inference(step2_prompt, step2_img)\nstep2_parsed = parse_thought_and_action(step2_raw)\nprint(f\"  Thought: {step2_parsed['thought']}\")\nprint(f\"  Action:  {step2_parsed['action']}\")\n\n\nvisualise_click(step2_img, step2_parsed[\"action\"], title=\"Step 2: Ai2 Homepage\")\n\n\nhistory.append({\"index\": 2, \"thought\": step2_parsed[\"thought\"], \"action\": step2_parsed[\"action\"]})\n\n\nprint(\"n--- Step 3: MolmoWeb blog page ---\")\nstep3_img = create_synthetic_webpage(\n   title=\"MolmoWeb: An open agent for automating web tasks\",\n   elements=[\n       {\"type\": \"text\", \"text\": \"March 24, 2026 | Ai2\", \"pos\": (50, 110)},\n       {\"type\": \"text\", \"text\": \"Web agents that navigate and complete tasks\", \"pos\": (50, 160)},\n       {\"type\": \"text\", \"text\": \"in a browser on your behalf.\", \"pos\": (50, 185)},\n       {\"type\": \"link\", \"text\": \"Models on HuggingFace\", \"pos\": (50, 240)},\n       {\"type\": \"link\", \"text\": \"Tech Report (PDF)\", \"pos\": (50, 270)},\n       {\"type\": \"link\", \"text\": \"Training Data\", \"pos\": (50, 300)},\n       {\"type\": \"link\", \"text\": \"GitHub Code\", \"pos\": (50, 330)},\n       {\"type\": \"link\", \"text\": \"Live Demo\", \"pos\": (50, 360)},\n       {\"type\": \"text\", \"text\": \"MolmoWeb-8B achieves 78.2% pass@1 on WebVoyager\", \"pos\": (50, 420)},\n       {\"type\": \"text\", \"text\": \"94.7% pass@4 with test-time scaling\", \"pos\": (50, 450)},\n   ]\n)\n\n\nstep3_prompt = build_prompt(\n   task_multi,\n   past_actions=history,\n   page_title=\"MolmoWeb: An open agent for automating web tasks\",\n   page_url=\"https:\/\/allenai.org\/blog\/molmoweb\",\n  
 page_index=2,\n)\nstep3_raw = run_inference(step3_prompt, step3_img)\nstep3_parsed = parse_thought_and_action(step3_raw)\nprint(f\"  Thought: {step3_parsed['thought']}\")\nprint(f\"  Action:  {step3_parsed['action']}\")\n\n\nprint(f\"\\nFull action history after 3 steps:\")\nhistory.append({\"index\": 3, \"thought\": step3_parsed[\"thought\"], \"action\": step3_parsed[\"action\"]})\nfor a in history:\n   print(f\"  Step {a['index']}: {a['action']}\")\n\n\n\n\nprint(\"\\n\" + \"=\" * 70)\nprint(\"SECTION 8: Action parsing &amp; routing demo\")\nprint(\"=\" * 70)\n\n\ndemo_actions = [\n   'click(0.45, 0.32)',\n   'goto(\"https:\/\/arxiv.org\")',\n   'type(\"MolmoWeb Ai2 web agent\")',\n   'scroll(down)',\n   'press(\"Enter\")',\n   'send_msg(\"The latest paper is titled Molmo2.\")',\n   'go_back()',\n   'new_tab()',\n]\n\n\nprint(\"\\nParsing various MolmoWeb action strings:\\n\")\nfor a in demo_actions:\n   parsed_a = parse_action_details(a)\n   print(f\"  Input:  {a}\")\n   print(f\"  Output: {parsed_a}\\n\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We implement helper functions for prompt construction, model inference, and parsing outputs into structured thoughts and actions. We also build utilities for extracting click coordinates, interpreting action types, and visualizing model predictions on screenshots. 
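The two core parsing helpers boil down to a pair of regular expressions. A standalone sketch of the same logic (note the escaped parentheses and the `\s` / `\d` character classes, which are easy to lose when copying code around):

```python
import re

# Standalone versions of the THOUGHT/ACTION and click-coordinate parsers.
def parse_thought_action(raw):
    # THOUGHT spans up to the ACTION line (or end of string); ACTION is
    # the remainder of its own line.
    t = re.search(r"THOUGHT:\s*(.+?)(?=\nACTION:|\Z)", raw, re.DOTALL)
    a = re.search(r"ACTION:\s*(.+?)(?=\n|$)", raw)
    return {"thought": t.group(1).strip() if t else "",
            "action": a.group(1).strip() if a else ""}

def click_coords(action):
    # Normalised coordinates, e.g. 'click(0.45, 0.32)' -> (0.45, 0.32).
    m = re.search(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", action)
    return (float(m.group(1)), float(m.group(2))) if m else None

out = parse_thought_action("THOUGHT: Click the search box.\nACTION: click(0.45, 0.32)")
print(out["action"], click_coords(out["action"]))  # click(0.45, 0.32) (0.45, 0.32)
```

Keeping the regexes this explicit makes it easy to unit-test the parsers before wiring them to live model output.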
These components, collectively, enable us to simulate and analyze the agent\u2019s behavior in a controlled environment.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">print(\"=\" * 70)\nprint(\"SECTION 9: Batch inference on multiple tasks\")\nprint(\"=\" * 70)\nprint(\"Running the model on several different cold-start tasks.n\")\n\n\nbatch_tasks = [\n   \"What is the weather in Seattle right now?\",\n   \"Find the cheapest nonstop flights from NYC to London\",\n   \"Look up the Ai2 careers page and list open positions\",\n   \"Search Amazon for a USB-C hub with at least 4 ports\",\n]\n\n\nblank = Image.new(\"RGB\", (1280, 720), color=\"white\")\n\n\nfor i, task_text in enumerate(batch_tasks, 1):\n   prompt_b = build_prompt(task_description=task_text, page_url=\"about:blank\")\n   raw_b = run_inference(prompt_b, blank, max_new_tokens=200)\n   parsed_b = parse_thought_and_action(raw_b)\n   action_d = parse_action_details(parsed_b[\"action\"])\n\n\n   print(f\"Task {i}: {task_text}\")\n   print(f\"  Thought: {parsed_b['thought']}\")\n   print(f\"  Action:  {parsed_b['action']}\")\n   print(f\"  Parsed:  {action_d}n\")\n\n\n\n\nprint(\"=\" * 70)\nprint(\"SECTION 10: Exploring the MolmoWebMix training dataset\")\nprint(\"=\" * 70)\nprint(\"\"\"\nMolmoWebMix consists of three main subsets:\n 1. MolmoWeb-HumanTrajs    - 30k human-recorded web task trajectories\n 2. 
MolmoWeb-SyntheticTrajs - Synthetic trajectories from axtree agents\n 3. MolmoWeb-SyntheticQA    - 2.2M screenshot QA pairs for visual grounding\n\"\"\")\n\n\ntry:\n   from datasets import load_dataset\n\n\n   print(\"Loading a sample from MolmoWeb-HumanTrajs (streaming mode)...n\")\n   ds = load_dataset(\n       \"allenai\/MolmoWeb-HumanTrajs\",\n       split=\"train\",\n       streaming=True,\n   )\n\n\n   print(\"Sample entries from MolmoWeb-HumanTrajs:n\")\n   for i, example in enumerate(ds):\n       if i &gt;= 3:\n           break\n\n\n       print(f\"  Example {i + 1}:\")\n       keys = list(example.keys())\n       print(f\"    Keys: {keys}\")\n\n\n       for k in keys:\n           val = example[k]\n           if isinstance(val, str):\n               display = val[:120] + (\"...\" if len(val) &gt; 120 else \"\")\n               print(f\"    {k}: {display}\")\n           elif isinstance(val, list):\n               print(f\"    {k}: list of {len(val)} items\")\n           elif isinstance(val, dict):\n               print(f\"    {k}: dict with keys {list(val.keys())[:5]}\")\n           elif isinstance(val, (bytes, bytearray)):\n               print(f\"    {k}: binary data ({len(val)} bytes)\")\n           else:\n               print(f\"    {k}: {val}\")\n       print()\n\n\n   print(\"Dataset exploration complete.\")\n   print(\"Full datasets: https:\/\/huggingface.co\/collections\/allenai\/molmoweb-data\")\n\n\nexcept Exception as e:\n   print(f\"Could not load dataset: {e}\")\n   print(\"You can explore it at: https:\/\/huggingface.co\/collections\/allenai\/molmoweb-data\")\n\n\n\n\nprint(\"n\" + \"=\" * 70)\nprint(\"BONUS: Full production agent loop (reference, not runnable in Colab)\")\nprint(\"=\" * 70)\n\n\nprint('''\nimport asyncio\nfrom playwright.async_api import async_playwright\n\n\nasync def run_molmoweb_agent(task: str, max_steps: int = 15):\n   \"\"\"Full MolmoWeb agent loop with a live Chromium browser.\"\"\"\n\n\n   async with async_playwright() 
as pw:\n       browser = await pw.chromium.launch(headless=True)\n       page = await browser.new_page(viewport={\"width\": 1280, \"height\": 720})\n\n\n       action_history = []\n\n\n       for step in range(1, max_steps + 1):\n           screenshot_bytes = await page.screenshot()\n           screenshot = Image.open(BytesIO(screenshot_bytes)).convert(\"RGB\")\n\n\n           prompt = build_prompt(\n               task_description=task,\n               past_actions=action_history,\n               page_title=await page.title(),\n               page_url=page.url,\n               page_index=step,\n           )\n\n\n           raw = run_inference(prompt, screenshot)\n           parsed = parse_thought_and_action(raw)\n           action = parse_action_details(parsed[\"action\"])\n\n\n           print(f\"Step {step}: {parsed['thought']}\")\n           print(f\"  -&gt; {parsed['action']}\")\n\n\n           if action[\"type\"] == \"goto\":\n               await page.goto(action[\"url\"], wait_until=\"domcontentloaded\")\n           elif action[\"type\"] == \"click\":\n               x_px = int(action[\"x\"] * 1280)\n               y_px = int(action[\"y\"] * 720)\n               await page.mouse.click(x_px, y_px)\n           elif action[\"type\"] == \"type\":\n               await page.keyboard.type(action[\"text\"])\n           elif action[\"type\"] == \"press\":\n               await page.keyboard.press(action[\"key\"])\n           elif action[\"type\"] == \"scroll\":\n               delta = -500 if action[\"direction\"] == \"up\" else 500\n               await page.mouse.wheel(0, delta)\n           elif action[\"type\"] == \"go_back\":\n               await page.go_back()\n           elif action[\"type\"] == \"send_msg\":\n               print(f\"\\nAgent answer: {action['message']}\")\n               break\n\n\n           action_history.append({\n               \"index\": step,\n               \"thought\": parsed[\"thought\"],\n               \"action\": 
parsed[\"action\"],\n           })\n\n\n           await asyncio.sleep(1.5)\n\n\n       await browser.close()\n       return action_history\n\n\n# Usage:\n# asyncio.run(run_molmoweb_agent(\"Find the latest Ai2 research papers\"))\n''')\n\n\n\n\nprint(\"=\" * 70)\nprint(\"Tutorial Complete!\")\nprint(\"=\" * 70)\nprint(\"\"\"\nWhat you learned:\n - Loading MolmoWeb-4B with 4-bit quantization on a free Colab T4\n - The structured prompt template (GOAL \/ PREVIOUS STEPS \/ ACTIVE PAGE)\n - Single-step inference on blank and real-looking screenshots\n - Multi-step agent loop with accumulated action history\n - Parsing model outputs into structured action dictionaries\n - Visualising click coordinates overlaid on screenshots\n - Batch inference across different task types\n - Exploring the MolmoWebMix training dataset\n - Production agent architecture with Playwright\n\n\nResources:\n Models:  https:\/\/huggingface.co\/collections\/allenai\/molmoweb\n Data:    https:\/\/huggingface.co\/collections\/allenai\/molmoweb-data\n Code:    https:\/\/github.com\/allenai\/molmoweb\n Paper:   https:\/\/allenai.org\/papers\/molmoweb\n Blog:    https:\/\/allenai.org\/blog\/molmoweb\n Demo:    https:\/\/molmoweb.allen.ai\/\n\"\"\")<\/code><\/pre>\n<\/div>\n<\/div>\n<p>We run full demonstrations including single-step inference, multi-step agent loops, batch task execution, and dataset exploration. We simulate realistic browsing scenarios, track action history, and observe how the model evolves its decisions across steps. This completes the end-to-end pipeline and gives us a clear understanding of how MolmoWeb operates as a functional web agent.<\/p>\n<p>In conclusion, we built a strong practical understanding of how MolmoWeb works as a screenshot-driven web agent in a Colab-friendly Python workflow. 
We saw how to structure prompts, run inference on visual browser states, parse reasoning and actions, visualize predicted click locations, and simulate multi-step task execution with accumulated history. We also extended the tutorial beyond basic inference by exploring batch predictions, inspecting the MolmoWebMix training data, and studying a production-style browser loop that connects the model to a live Playwright session. Through this process, we run the model and also understand the full pipeline required to turn a multimodal model into a functioning web agent.<\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/Agentic%20AI%20Codes\/molmoweb_multimodal_web_agent_marktechpost.py\" target=\"_blank\" rel=\"noreferrer noopener\">Notebook here<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/25\/how-to-build-a-vision-guided-web-ai-agent-with-molmoweb-4b-using-multimodal-reasoning-and-action-prediction\/\">How to Build a Vision-Guided Web AI Agent with MolmoWeb-4B Using Multimodal Reasoning and Action Prediction<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>In this tutorial, we explore M&hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-612","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/612","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=612"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/612\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=612"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http
s:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=612"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=612"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}