{"id":593,"date":"2026-03-23T09:33:41","date_gmt":"2026-03-23T01:33:41","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=593"},"modified":"2026-03-23T09:33:41","modified_gmt":"2026-03-23T01:33:41","slug":"how-bm25-and-rag-retrieve-information-differently","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=593","title":{"rendered":"How BM25 and RAG Retrieve Information Differently?"},"content":{"rendered":"<p>When you type a query into a search engine, something has to decide which documents are actually relevant \u2014 and how to rank them. <strong>BM25 (Best Matching 25)<\/strong>, the algorithm powering search engines like Elasticsearch and Lucene, has been the dominant answer to that question for decades.\u00a0<\/p>\n<p>It scores documents by looking at three things: how often your query terms appear in a document, how rare those terms are across the entire collection, and whether a document is unusually long. The clever part is that BM25 doesn\u2019t reward keyword stuffing \u2014 a word appearing 20 times doesn\u2019t make a document 20 times more relevant, thanks to term frequency saturation. But BM25 has a fundamental blind spot: it only matches the words you typed, not what you meant. Search for <em>\u201cfinding similar content without exact word overlap\u201d<\/em> and BM25 returns a blank stare.\u00a0<\/p>\n<p>This is exactly the gap that <strong>Retrieval-Augmented Generation (RAG)<\/strong> with vector embeddings was built to fill \u2014 by matching meaning, not just keywords. In this article, we\u2019ll break down how each approach works, where each one wins, and why production systems increasingly use both together.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How BM25 Works<\/strong><\/h2>\n<p>At its core, BM25 assigns a relevance score to every document in the collection for a given query, then ranks documents by that score. 
For each term in your query, BM25 asks three questions: <em>How often does this term appear in the document? How rare is this term across all documents? And is this document unusually long?<\/em> The final score is the sum of weighted answers to these questions across all query terms.<\/p>\n<p>The term frequency component is where BM25 gets clever. Rather than counting raw occurrences, it applies <strong>saturation<\/strong> \u2014 the score grows quickly at first but flattens out as frequency increases. A term appearing 5 times contributes much more than a term appearing once, but a term appearing 50 times contributes barely more than one appearing 20 times. This is controlled by the parameter <strong>k\u2081<\/strong> (typically set between 1.2 and 2.0). Set it low and the saturation kicks in fast; set it high and raw frequency matters more. This single design choice is what makes BM25 resistant to keyword stuffing \u2014 repeating a word a hundred times in a document won\u2019t game the score.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Length Normalization and IDF<\/strong><\/h3>\n<p>The second tuning parameter, <strong>b<\/strong> (typically 0.75), controls how much a document\u2019s length is penalized. A long document naturally contains more words, so it has more chances to include your query term \u2014 not because it\u2019s more relevant, but simply because it\u2019s longer. BM25 compares each document\u2019s length to the average document length in the collection and scales the term frequency score down accordingly. Setting <strong>b = 0<\/strong> disables this penalty entirely; <strong>b = 1<\/strong> applies full normalization.<\/p>\n<p>Finally, <strong>IDF<\/strong> (Inverse Document Frequency) ensures that rare terms carry more weight than common ones. If the word <em>\u201cretrieval\u201d<\/em> appears in only 3 out of 10,000 documents, it\u2019s a strong signal of relevance when matched. 
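The saturation and IDF behaviour described above can be sketched numerically. This is a minimal illustration of one common BM25 formulation (exact variants differ slightly between libraries), reusing the 3-in-10,000 example; `tf_saturation` and `idf` are hypothetical helper names:

```python
import math

def tf_saturation(tf: float, k1: float = 1.5, b: float = 0.75,
                  doc_len: int = 100, avg_len: int = 100) -> float:
    # Term-frequency score with saturation and length normalization:
    # grows quickly for small tf, then flattens as tf increases.
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / (tf + norm)

def idf(n_docs_with_term: int, n_docs: int = 10_000) -> float:
    # Rare terms score high; terms present in nearly every document score near zero.
    return math.log((n_docs - n_docs_with_term + 0.5) / (n_docs_with_term + 0.5) + 1)

for tf in (1, 5, 20, 50):
    print(f"tf={tf:2d} -> {tf_saturation(tf):.3f}")   # fast growth, then a plateau
print(f"idf, term in 3 of 10k docs:   {idf(3):.2f}")
print(f"idf, term in all 10k docs:    {idf(10_000):.5f}")
```

Running this shows the plateau directly: the jump from 20 to 50 occurrences adds far less score than the jump from 1 to 5.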
If the word <em>\u201cthe\u201d<\/em> appears in all 10,000, matching it tells you almost nothing. IDF is what makes BM25 pay attention to the words that actually discriminate between documents. One important caveat: because BM25 operates purely on term frequency, it has no awareness of word order, context, or meaning \u2014 matching <em>\u201cbank\u201d<\/em> in a query about finance and <em>\u201cbank\u201d<\/em> in a document about rivers looks identical to BM25. That bag-of-words limitation is fundamental, not a tuning problem.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"999\" height=\"525\" data-attachment-id=\"78528\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-375\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" data-orig-size=\"999,525\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41-300x158.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" alt=\"\" class=\"wp-image-78528\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"996\" height=\"402\" data-attachment-id=\"78534\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-381\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-45.png\" data-orig-size=\"996,402\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-45-300x121.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-45.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-45.png\" alt=\"\" class=\"wp-image-78534\" \/><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>How is BM25 different from Vector Search<\/strong><\/h2>\n<p>BM25 and vector search answer the same question \u2014 <em>which documents are relevant to this query?<\/em> \u2014 but through fundamentally different lenses. BM25 is a keyword-matching algorithm: it looks for the exact words from your query inside each document, scores them based on frequency and rarity, and ranks accordingly. It has no understanding of language \u2014 it sees text as a bag of tokens, not meaning.\u00a0<\/p>\n<p>Vector search, by contrast, converts both the query and every document into dense numerical vectors using an embedding model, then finds documents whose vectors point in the same direction as the query vector \u2014 measured by cosine similarity. 
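Cosine similarity itself is a one-liner. Here is a toy sketch with hand-made 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and the vector values below are invented purely for illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # 1 = same direction, 0 = orthogonal, -1 = opposite direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings": related concepts point roughly the same way,
# unrelated concepts do not.
cardiac_arrest = np.array([0.9, 0.1, 0.2])
heart_failure  = np.array([0.8, 0.2, 0.3])
pasta_recipe   = np.array([0.1, 0.9, 0.1])

print(cosine(cardiac_arrest, heart_failure))  # high: near-synonyms
print(cosine(cardiac_arrest, pasta_recipe))   # low: unrelated topic
```

The ranking step in vector search is exactly this computation, repeated between the query vector and every stored document vector.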
This means vector search can match <em>\u201ccardiac arrest\u201d<\/em> to a document about <em>\u201cheart failure\u201d<\/em> even though none of the words overlap, because the embedding model has learned that these concepts live close together in semantic space.\u00a0<\/p>\n<p>The tradeoff is practical: BM25 requires no model, no GPU, and no API call \u2014 it\u2019s fast, lightweight, and fully explainable. Vector search requires an embedding model at index time and query time, adds latency and cost, and produces scores that are harder to interpret. Neither is strictly better; they fail in opposite directions, which is exactly why hybrid search \u2014 combining both \u2014 has become the production standard.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"999\" height=\"525\" data-attachment-id=\"78530\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-377\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" data-orig-size=\"999,525\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41-300x158.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-41.png\" alt=\"\" class=\"wp-image-78530\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1868\" height=\"780\" 
data-attachment-id=\"78536\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/screenshot-2026-03-22-at-5-50-51-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.50.51-PM-1.png\" data-orig-size=\"1868,780\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-22 at 5.50.51\u202fPM\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.50.51-PM-1-300x125.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.50.51-PM-1-1024x428.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.50.51-PM-1.png\" alt=\"\" class=\"wp-image-78536\" \/><\/figure>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Comparing BM25 and Vector Search in Python<\/strong><\/h2>\n<h3 class=\"wp-block-heading\"><strong>Installing the dependencies<\/strong><\/h3>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">pip install rank_bm25 
openai numpy<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import math\nimport re\nimport numpy as np\nfrom collections import Counter\nfrom rank_bm25 import BM25Okapi\nfrom openai import OpenAI<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">import os\nfrom getpass import getpass \nos.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div 
class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">client = OpenAI()<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Defining the Corpus<\/strong><\/h3>\n<p>Before comparing BM25 and vector search, we need a shared knowledge base to search over. We define 12 short text chunks covering a range of topics \u2014 Python, machine learning, BM25, transformers, embeddings, RAG, databases, and more. The topics are deliberately varied: some chunks are closely related (BM25 and TF-IDF, embeddings and cosine similarity), while others are completely unrelated (PostgreSQL, Django). This variety is what makes the comparison meaningful \u2014 a retrieval method that works well should surface the relevant chunks and ignore the noise.<\/p>\n<p>This corpus acts as our stand-in for a real document store. In a production RAG pipeline, these chunks would come from splitting and cleaning actual documents \u2014 PDFs, wikis, knowledge bases. 
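As a rough sketch of what that splitting step can look like, here is a naive fixed-window chunker with overlap; `chunk_text` is a hypothetical helper, and production pipelines usually split on sentence or section boundaries instead:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Naive fixed-window chunking with overlap between consecutive chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

document = ("word " * 200).strip()   # stand-in for a cleaned source document
chunks = chunk_text(document)
print(len(chunks), "chunks; first chunk starts:", chunks[0][:30])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.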
Here, we keep them short and hand-crafted so the retrieval behaviour is easy to trace and reason about.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">CHUNKS = [\n    # 0\n    \"Python is a high-level, interpreted programming language known for its simple and readable syntax. \"\n    \"It supports multiple programming paradigms including procedural, object-oriented, and functional programming.\",\n \n    # 1\n    \"Machine learning is a subset of artificial intelligence that enables systems to learn from data \"\n    \"without being explicitly programmed. Common algorithms include linear regression, decision trees, and neural networks.\",\n \n    # 2\n    \"BM25 stands for Best Match 25. It is a bag-of-words retrieval function used by search engines \"\n    \"to rank documents based on the query terms appearing in each document. \"\n    \"BM25 uses term frequency and inverse document frequency with length normalization.\",\n \n    # 3\n    \"Transformer architecture introduced the self-attention mechanism, which allows the model to weigh \"\n    \"the importance of different words in a sentence regardless of their position. \"\n    \"BERT and GPT are both based on the Transformer architecture.\",\n \n    # 4\n    \"Vector embeddings represent text as dense numerical vectors in a high-dimensional space. \"\n    \"Similar texts are placed closer together. 
This allows semantic search -- finding documents \"\n    \"that mean the same thing even if they use different words.\",\n \n    # 5\n    \"TF-IDF stands for Term Frequency-Inverse Document Frequency. It reflects how important a word is \"\n    \"to a document relative to the entire corpus. Rare words get higher scores than common ones like 'the'.\",\n \n    # 6\n    \"Retrieval-Augmented Generation (RAG) combines a retrieval system with a language model. \"\n    \"The retriever finds relevant documents; the generator uses them as context to produce an answer. \"\n    \"This reduces hallucinations and allows the model to cite sources.\",\n \n    # 7\n    \"Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design. \"\n    \"It includes an ORM, authentication system, and admin panel out of the box.\",\n \n    # 8\n    \"Cosine similarity measures the angle between two vectors. A score of 1 means identical direction, \"\n    \"0 means orthogonal, and -1 means opposite. It is commonly used to compare text embeddings.\",\n \n    # 9\n    \"Gradient descent is an optimization algorithm used to minimize a loss function by iteratively \"\n    \"moving in the direction of the steepest descent. It is the backbone of training neural networks.\",\n \n    # 10\n    \"PostgreSQL is an open-source relational database known for its robustness and support for advanced \"\n    \"SQL features like window functions, CTEs, and JSON storage.\",\n \n    # 11\n    \"Sparse retrieval methods like BM25 rely on exact keyword matches and fail when the query uses \"\n    \"synonyms or paraphrases not present in the document. 
Dense retrieval using embeddings handles \"\n    \"this by matching semantic meaning rather than surface form.\",\n]\n \nprint(f\"Corpus loaded: {len(CHUNKS)} chunks\")\nfor i, c in enumerate(CHUNKS):\n    print(f\"  [{i:02d}] {c[:75]}...\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Building the BM25 Retriever<\/strong><\/h3>\n<p>With the corpus defined, we can build the BM25 index. The process has two steps: tokenization and indexing. The tokenize function lowercases the text and splits on any non-alphanumeric character \u2014 so \u201cTF-IDF\u201d becomes [\u201ctf\u201d, \u201cidf\u201d] and \u201cbag-of-words\u201d becomes [\u201cbag\u201d, \u201cof\u201d, \u201cwords\u201d]. This is intentionally simple: BM25 is a bag-of-words model, so there is no stemming, no stopword removal, and no linguistic preprocessing. Every word is treated as an independent token.<\/p>\n<p>Once every chunk is tokenized, BM25Okapi builds the index \u2014 computing document lengths, average document length, and IDF scores for every unique term in the corpus. This happens once at startup. At query time, bm25_search tokenizes the incoming query the same way, calls get_scores to compute a BM25 relevance score for every chunk in parallel, then sorts and returns the top-k results. 
The sanity check at the bottom runs a test query to confirm the index is working before we move on to the embedding retriever.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def tokenize(text: str) -&gt; list[str]:\n    \"\"\"Lowercase and split on non-alphanumeric characters.\"\"\"\n    return re.findall(r'\\w+', text.lower())\n \n# Build BM25 index over the corpus\ntokenized_corpus = [tokenize(chunk) for chunk in CHUNKS]\nbm25 = BM25Okapi(tokenized_corpus)\n \ndef bm25_search(query: str, top_k: int = 3) -&gt; list[dict]:\n    \"\"\"Return top-k chunks ranked by BM25 score.\"\"\"\n    tokens = tokenize(query)\n    scores = bm25.get_scores(tokens)\n    ranked = np.argsort(scores)[::-1][:top_k]\n    return [\n        {\"chunk_id\": int(i), \"score\": round(float(scores[i]), 4), \"text\": CHUNKS[i]}\n        for i in ranked\n    ]\n \n# Quick sanity check\nresults = bm25_search(\"how does BM25 rank documents\", top_k=3)\nprint(\"BM25 test -- query: 'how does BM25 rank documents'\")\nfor r in results:\n    print(f\"  [{r['chunk_id']}] score={r['score']}  {r['text'][:70]}...\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Building the Embedding Retriever<\/strong><\/h3>\n<p>The embedding retriever works differently from BM25 at every step. 
Instead of counting tokens, it converts each chunk into a dense numerical vector \u2014 a list of 1,536 numbers \u2014 using OpenAI\u2019s text-embedding-3-small model. Each number represents a dimension in semantic space, and chunks that mean similar things end up with vectors that point in similar directions, regardless of the words they use.<\/p>\n<p>The index build step calls the embedding API once per chunk and stores the resulting vectors in memory. This is the key cost difference from BM25: building the BM25 index is pure arithmetic on your own machine, while building the embedding index requires one API call per chunk and produces vectors you need to store. For 12 chunks this is trivial; at a million chunks, this becomes a real infrastructure decision.<\/p>\n<p>At query time, embedding_search embeds the incoming query using the same model \u2014 this matters: the query and the chunks must live in the same vector space \u2014 then computes cosine similarity between the query vector and every stored chunk vector. Cosine similarity measures the angle between two vectors: a score of 1 means identical direction, 0 means completely unrelated, and negative values mean opposite meaning. The chunks are then ranked by this score and the top-k are returned. 
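Once both retrievers return ranked lists, the hybrid combination mentioned earlier is commonly done with Reciprocal Rank Fusion. This is a hedged sketch, assuming only result dicts that carry a chunk_id (the shape both retrievers in this article return); `rrf_fuse` and the toy rankings are invented for illustration:

```python
def rrf_fuse(result_lists: list[list[dict]], k: int = 60, top_k: int = 3) -> list[tuple[int, float]]:
    # Reciprocal Rank Fusion: each list contributes 1/(k + rank) per chunk,
    # so a chunk ranked highly by either retriever floats to the top.
    scores: dict[int, float] = {}
    for results in result_lists:
        for rank, r in enumerate(results, start=1):
            scores[r["chunk_id"]] = scores.get(r["chunk_id"], 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy illustration with invented rankings: chunk 5 appears in both lists,
# so it outranks chunks that only one retriever found.
bm25_top  = [{"chunk_id": 2}, {"chunk_id": 5}]
dense_top = [{"chunk_id": 5}, {"chunk_id": 8}]
print(rrf_fuse([bm25_top, dense_top]))
```

Because RRF only uses rank positions, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.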
The same sanity check query from the BM25 section runs here too, so you can see the first direct comparison between the two approaches on identical input.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">EMBED_MODEL = \"text-embedding-3-small\"\n \ndef get_embedding(text: str) -&gt; np.ndarray:\n    response = client.embeddings.create(model=EMBED_MODEL, input=text)\n    return np.array(response.data[0].embedding)\n \ndef cosine_similarity(a: np.ndarray, b: np.ndarray) -&gt; float:\n    return float(np.dot(a, b) \/ (np.linalg.norm(a) * np.linalg.norm(b)))\n \n# Embed all chunks once (this is the \"index build\" step in RAG)\nprint(\"Building embedding index... (one API call per chunk)\")\nchunk_embeddings = [get_embedding(chunk) for chunk in CHUNKS]\nprint(f\"Done. 
Each embedding has {len(chunk_embeddings[0])} dimensions.\")\n \ndef embedding_search(query: str, top_k: int = 3) -&gt; list[dict]:\n    \"\"\"Return top-k chunks ranked by cosine similarity to the query embedding.\"\"\"\n    query_emb = get_embedding(query)\n    scores = [cosine_similarity(query_emb, emb) for emb in chunk_embeddings]\n    ranked = np.argsort(scores)[::-1][:top_k]\n    return [\n        {\"chunk_id\": int(i), \"score\": round(float(scores[i]), 4), \"text\": CHUNKS[i]}\n        for i in ranked\n    ]\n \n# Quick sanity check\nresults = embedding_search(\"how does BM25 rank documents\", top_k=3)\nprint(\"\\nEmbedding test -- query: 'how does BM25 rank documents'\")\nfor r in results:\n    print(f\"  [{r['chunk_id']}] score={r['score']}  {r['text'][:70]}...\")<\/code><\/pre>\n<\/div>\n<\/div>\n<h3 class=\"wp-block-heading\"><strong>Side-by-Side Comparison Function<\/strong><\/h3>\n<p>This is the core of the experiment. The compare function runs the same query through both retrievers simultaneously and prints the results in a two-column layout \u2014 BM25 on the left, embeddings on the right \u2014 so the differences are immediately visible at the same rank position.<\/p>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">def compare(query: str, top_k: int = 3):\n    bm25_results    = bm25_search(query, top_k)\n    embed_results   = embedding_search(query, top_k)\n \n    
print(f\"\\n{'\u2550'*70}\")\n    print(f'  QUERY: \"{query}\"')\n    print(f\"{'\u2550'*70}\")\n \n    print(f\"\\n  {'BM25 (keyword)':&lt;35}  {'Embedding RAG (semantic)'}\")\n    print(f\"  {'\u2500'*33}  {'\u2500'*33}\")\n \n    for rank, (b, e) in enumerate(zip(bm25_results, embed_results), 1):\n        b_preview = b['text'][:55].replace('\\n', ' ')\n        e_preview = e['text'][:55].replace('\\n', ' ')\n        same = \"\u2b05 same\" if b['chunk_id'] == e['chunk_id'] else \"\"\n        print(f\"  #{rank} [{b['chunk_id']:02d}] {b['score']:.4f}  {b_preview}...\")\n        print(f\"     [{e['chunk_id']:02d}] {e['score']:.4f}  {e_preview}...  {same}\")\n        print()<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"dm-code-snippet dark dm-normal-version default no-background-mobile\">\n<div class=\"control-language\">\n<div class=\"dm-buttons\">\n<div class=\"dm-buttons-left\">\n<div class=\"dm-button-snippet red-button\"><\/div>\n<div class=\"dm-button-snippet orange-button\"><\/div>\n<div class=\"dm-button-snippet green-button\"><\/div>\n<\/div>\n<div class=\"dm-buttons-right\"><a><span class=\"dm-copy-text\">Copy Code<\/span><span class=\"dm-copy-confirmed\">Copied<\/span><span class=\"dm-error-message\">Use a different Browser<\/span><\/a><\/div>\n<\/div>\n<pre class=\" no-line-numbers\"><code class=\" no-wrap language-php\">compare(\"BM25 term frequency inverse document frequency\")\ncompare(\"what is RAG and why does it reduce hallucinations\")\ncompare(\"cosine similarity between vectors\")<\/code><\/pre>\n<\/div>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1002\" height=\"438\" data-attachment-id=\"78531\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-378\/\" 
data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-42.png\" data-orig-size=\"1002,438\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-42-300x131.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-42.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-42.png\" alt=\"\" class=\"wp-image-78531\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1008\" height=\"443\" data-attachment-id=\"78533\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-380\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-44.png\" data-orig-size=\"1008,443\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-44-300x132.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-44.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-44.png\" alt=\"\" class=\"wp-image-78533\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full 
is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1006\" height=\"397\" data-attachment-id=\"78532\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/image-379\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-43.png\" data-orig-size=\"1006,397\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"image\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-43-300x118.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-43.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/image-43.png\" alt=\"\" class=\"wp-image-78532\" \/><\/figure>\n<\/div>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1754\" height=\"520\" data-attachment-id=\"78538\" data-permalink=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/screenshot-2026-03-22-at-5-55-15-pm-2\/\" data-orig-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.55.15-PM-1.png\" data-orig-size=\"1754,520\" data-comments-opened=\"1\" data-image-meta='{\"aperture\":\"0\",\"credit\":\"\",\"camera\":\"\",\"caption\":\"\",\"created_timestamp\":\"0\",\"copyright\":\"\",\"focal_length\":\"0\",\"iso\":\"0\",\"shutter_speed\":\"0\",\"title\":\"\",\"orientation\":\"0\"}' data-image-title=\"Screenshot 2026-03-22 at 5.55.15\u202fPM\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.55.15-PM-1-300x89.png\" data-large-file=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.55.15-PM-1-1024x304.png\" src=\"https:\/\/www.marktechpost.com\/wp-content\/uploads\/2026\/03\/Screenshot-2026-03-22-at-5.55.15-PM-1.png\" alt=\"\" class=\"wp-image-78538\" \/><\/figure>\n<\/div>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<p>Check out\u00a0the\u00a0<strong><a href=\"https:\/\/github.com\/Marktechpost\/AI-Tutorial-Codes-Included\/blob\/main\/RAG\/BM25_Vector_Search.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">Full Notebook here<\/a>.\u00a0<\/strong>Also,\u00a0feel free to follow us on\u00a0<strong><a href=\"https:\/\/x.com\/intent\/follow?screen_name=marktechpost\" target=\"_blank\" rel=\"noreferrer noopener\"><mark>Twitter<\/mark><\/a><\/strong>\u00a0and don\u2019t forget to join our\u00a0<strong><a href=\"https:\/\/www.reddit.com\/r\/machinelearningnews\/\" target=\"_blank\" rel=\"noreferrer noopener\">120k+ ML SubReddit<\/a><\/strong>\u00a0and Subscribe to\u00a0<strong><a href=\"https:\/\/www.aidevsignals.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">our Newsletter<\/a><\/strong>. Wait! 
are you on telegram?\u00a0<strong><a href=\"https:\/\/t.me\/machinelearningresearchnews\" target=\"_blank\" rel=\"noreferrer noopener\">now you can join us on telegram as well.<\/a><\/strong><\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/03\/22\/how-bm25-and-rag-retrieve-information-differently\/\">How BM25 and RAG Retrieve Information Differently?<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>When you type a query into a s&hellip;<\/p>\n","protected":false},"author":1,"featured_media":594,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-593","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/593","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=593"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/593\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/594"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=593"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=593"},{"taxonomy":"post_tag","embedd
able":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=593"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}