{"id":987,"date":"2026-05-28T17:08:20","date_gmt":"2026-05-28T09:08:20","guid":{"rendered":"https:\/\/connectword.dpdns.org\/?p=987"},"modified":"2026-05-28T17:08:20","modified_gmt":"2026-05-28T09:08:20","slug":"perplexity-ai-open-sources-unigram-tokenizer-that-achieves-5x-lower-p50-latency-than-hugging-face-tokenizers-crate","status":"publish","type":"post","link":"https:\/\/connectword.dpdns.org\/?p=987","title":{"rendered":"Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate"},"content":{"rendered":"<p class=\"wp-block-paragraph\">Perplexity AI\u2019s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in <a href=\"https:\/\/github.com\/perplexityai\/pplx-garden\">pplx-garden<\/a>, their inference technology repository.<\/p>\n<p class=\"wp-block-paragraph\">At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face <code>tokenizers<\/code> crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE\u2019s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity\u2019s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Why Tokenization Became a Bottleneck<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers.<\/p>\n<p class=\"wp-block-paragraph\">A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. 
When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.<\/p>\n<p class=\"wp-block-paragraph\">Perplexity\u2019s work targets <strong>XLM-RoBERTa<\/strong>, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What Is Unigram Tokenization?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a <strong>most-probable-path problem<\/strong>. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value.<\/p>\n<p class=\"wp-block-paragraph\">The algorithm used to find that best path is the <strong>Viterbi algorithm<\/strong>, a dynamic programming technique from 1967. Byte positions form the graph layers, and vocabulary tokens are edges that each span a contiguous byte range. The DP recurrence iterates over byte positions and updates the best-scoring path at each position.<\/p>\n<p class=\"wp-block-paragraph\">The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary <strong>trie<\/strong> (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.<\/p>\n<h2 class=\"wp-block-heading\"><strong>What Was Slow in the Hugging Face Implementation<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">The Hugging Face <code>tokenizers<\/code> crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. 
At 514 tokens (512 + BOS\/EOS injection), <strong>the reference implementation had three costly patterns:<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Bottleneck<\/th>\n<th>Mechanism<\/th>\n<th>Measured impact<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Allocation per match<\/td>\n<td><code>String::from_utf8<\/code> + <code>AHashMap<\/code> lookup per trie match<\/td>\n<td>7,295 allocations at 514 tokens; 299,171 at 16K<\/td>\n<\/tr>\n<tr>\n<td>Pointer chase per byte<\/td>\n<td><code>AHashMap<\/code> at every trie node; 4 dependent loads per byte step<\/td>\n<td>Dependent-load latency dominates the hot path<\/td>\n<\/tr>\n<tr>\n<td>L2 thrashing on long inputs<\/td>\n<td>DP table and output buffers freshly allocated each call<\/td>\n<td>L2 miss rate climbs from 8% at 128 tokens to 50% at 16K<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs when cumulative allocations overflow the per-core L2 cache.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Establishing a Baseline Before Changing the Trie<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup).<\/p>\n<p class=\"wp-block-paragraph\">This baseline already cut p50 latency to 155 \u00b5s at 514 tokens, down from 326 \u00b5s in the reference. Instructions retired dropped 2.4x. 
The remaining cost was the HashMap pointer chase itself, which the next step addressed.<\/p>\n<h2 class=\"wp-block-heading\"><strong>The Three Optimizations<\/strong><\/h2>\n<h3 class=\"wp-block-heading\"><strong>Optimization 1: Double-Array Trie<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The Hugging Face trie stores children in a <code>HashMap<\/code> at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a <strong>double-array trie<\/strong>, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.<\/p>\n<p class=\"wp-block-paragraph\">A double-array trie encodes the entire trie in two flat integer arrays, <code>base<\/code> and <code>check<\/code>. A child lookup is: <code>next = base[node] + byte<\/code>, then verify <code>check[next] == node<\/code>. That is two array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa\u2019s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.<\/p>\n<p class=\"wp-block-paragraph\">Unlike SentencePiece and IREE, which are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie directly in the Viterbi loop and dropped that overhead entirely.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Result at 514 tokens:<\/strong> p50 dropped from 155 \u00b5s (zero-allocation baseline) to 68 \u00b5s. Wall-clock fell 4.8x from the original reference.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Optimization 2: Bitmap and Inline Packing<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">The double-array trie still requires two dependent array loads per byte step: first the parent\u2019s <code>base<\/code> offset, then the <code>check<\/code> array to confirm the transition is valid. 
Perplexity replaced the check array with a <strong>per-node bitmap<\/strong> (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions.<\/p>\n<p class=\"wp-block-paragraph\">A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.<\/p>\n<p class=\"wp-block-paragraph\">They also packed all four per-node fields (bitmap, base, token ID, and score) into a single <strong>64-byte cache line<\/strong>, matching CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.<\/p>\n<p class=\"wp-block-paragraph\">Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes x 64 bytes). The hot working set per encode remains ~100 KB.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Result at 514 tokens:<\/strong> Additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.<\/p>\n<h3 class=\"wp-block-heading\"><strong>Optimization 3: Huge Pages for the Trie<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode budget.<\/p>\n<p class=\"wp-block-paragraph\">Perplexity backed the trie with <strong>2 MB huge pages<\/strong> via <code>mmap<\/code> with the <code>MAP_HUGETLB<\/code> flag. The same 50 MB now spans 25 pages, well within the TLB. This requires <code>vm.nr_hugepages<\/code> configured at boot. 
In production, 10,561 huge pages are reserved; the trie uses 24.<\/p>\n<p class=\"wp-block-paragraph\"><strong>Result:<\/strong> 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Final Benchmark Results<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. <strong>At 514 tokens:<\/strong><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th>Engine<\/th>\n<th>p50 Latency<\/th>\n<th>Instructions<\/th>\n<th>Allocations<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Hugging Face (<code>tokenizers<\/code> crate)<\/td>\n<td>349 \u00b5s<\/td>\n<td>3.60M<\/td>\n<td>7,295<\/td>\n<\/tr>\n<tr>\n<td>SentencePiece (C++)<\/td>\n<td>128 \u00b5s<\/td>\n<td>1.83M<\/td>\n<td>1,559<\/td>\n<\/tr>\n<tr>\n<td>IREE tokenizer (C)<\/td>\n<td>112 \u00b5s<\/td>\n<td>2.28M<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>Perplexity (final, all 3 optimizations)<\/td>\n<td>~63 \u00b5s<\/td>\n<td>1.04M<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p class=\"wp-block-paragraph\">Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs where the reference\u2019s per-token allocations overflow L2 and L3.<\/p>\n<p class=\"wp-block-paragraph\">One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C\/C++ binaries. The sentencepiece crate allocates a fresh list of token pieces on each call. 
The overhead is measurable but amortizes at long inputs.<\/p>\n<p class=\"wp-block-paragraph\">The final Perplexity encoder produces token-exact output against the reference. In production, it uses <code>rayon<\/code> to parallelize across cores.<\/p>\n<h2 class=\"wp-block-heading\"><strong>Marktechpost\u2019s Visual Explainer<\/strong><\/h2>\n<div role=\"region\" aria-label=\"Perplexity Unigram Tokenizer guide\">\n<div class=\"pplx-track\">\n<p>    <!-- Slide 1: Cover --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Open Source Release<\/div>\n<h2>Perplexity AI Rewrites Its Unigram Tokenizer, Cuts CPU Utilization 5-6x<\/h2>\n<p>Perplexity reimplemented their Unigram tokenizer from scratch in Rust and open-sourced it in <code>pplx-garden<\/code>. Three targeted optimizations removed wasted work from the hot path.<\/p>\n<div class=\"pplx-hero-stat\">\n<div class=\"pplx-stat-pill\"><span class=\"val\">5x<\/span><span class=\"lbl\">Lower p50 vs HuggingFace tokenizers crate<\/span><\/div>\n<div class=\"pplx-stat-pill\"><span class=\"val\">5-6x<\/span><span class=\"lbl\">CPU utilization reduction in production<\/span><\/div>\n<div class=\"pplx-stat-pill\"><span class=\"val\">0<\/span><span class=\"lbl\">Heap allocations on the hot path<\/span><\/div>\n<\/div>\n<p class=\"pplx-source\">Source: research.perplexity.ai<\/p>\n<\/div>\n<p>    <!-- Slide 2: Why It Matters --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">The Problem<\/div>\n<h2>Why CPU Tokenization Became a Bottleneck<\/h2>\n<p>LLM inference cost is usually framed around GPU work: KV caches, attention kernels, expert routing. But small models tell a different story.<\/p>\n<div class=\"pplx-step-list\">\n<div class=\"pplx-step\">\n<div class=\"pplx-step-num\">1<\/div>\n<div class=\"pplx-step-body\">\n<div class=\"step-title\">Rerankers and embedders are small<\/div>\n<div class=\"step-desc\">Two to three orders of magnitude smaller than frontier transformers. 
GPU compute finishes in single-digit milliseconds.<\/div>\n<\/div><\/div>\n<div class=\"pplx-step\">\n<div class=\"pplx-step-num\">2<\/div>\n<div class=\"pplx-step-body\">\n<div class=\"step-title\">Tokenization runs on CPU before each call<\/div>\n<div class=\"step-desc\">Every input passes through CPU-side tokenization first, turning text into vocabulary IDs.<\/div>\n<\/div><\/div>\n<div class=\"pplx-step\">\n<div class=\"pplx-step-num\">3<\/div>\n<div class=\"pplx-step-body\">\n<div class=\"step-title\">Batch size amplifies the cost<\/div>\n<div class=\"step-desc\">A reranker scoring hundreds of documents per request means tokenization runs hundreds of times per query.<\/div>\n<\/div><\/div>\n<\/div>\n<\/div>\n<p>    <!-- Slide 3: What Is Unigram --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Background<\/div>\n<h2>What Is Unigram Tokenization?<\/h2>\n<p>Introduced by Kudo (2018), implemented in SentencePiece. Perplexity targets <strong>XLM-RoBERTa<\/strong> with a 250K-token Unigram vocabulary.<\/p>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">Most-probable-path problem<\/div>\n<div class=\"card-val\">Each vocabulary token carries a learned log-probability. The tokenizer picks the segmentation whose token scores sum highest.<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">Viterbi algorithm (1967)<\/div>\n<div class=\"card-val\">A dynamic programming method that finds the best path. Byte positions are graph layers; vocabulary tokens are edges.<\/div>\n<\/div>\n<\/div>\n<div class=\"pplx-highlight-box\">\n<p>The hot path is the inner trie walk at each byte position. 
On a 16K-token input, this executes hundreds of thousands of trie transitions and retires tens of millions of instructions per encode.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 4: Bottlenecks --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Root Cause<\/div>\n<h2>Three Bottlenecks in the Hugging Face Reference<\/h2>\n<p>Measured at 514 tokens (512 + BOS\/EOS) on Intel Xeon Platinum 8488C:<\/p>\n<table class=\"pplx-bench\">\n<thead>\n<tr>\n<th>Bottleneck<\/th>\n<th>Mechanism<\/th>\n<th>Impact<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Allocation per match<\/td>\n<td><code>String::from_utf8<\/code> + <code>AHashMap<\/code> lookup per trie match<\/td>\n<td>7,295 allocs at 514 tokens; 299,171 at 16K<\/td>\n<\/tr>\n<tr>\n<td>Pointer chase per byte<\/td>\n<td><code>AHashMap<\/code> at every trie node; 4 dependent loads per step<\/td>\n<td>Dependent-load latency dominates<\/td>\n<\/tr>\n<tr>\n<td>L2 thrashing<\/td>\n<td>DP table and output buffers freshly allocated each call<\/td>\n<td>L2 miss rate: 8% at 128 tokens, 50% at 16K<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"pplx-highlight-box\">\n<p>Per-token allocation is constant: ~2 KB and ~18 allocations per token regardless of input size.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 5: Baseline --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Step 0: Baseline<\/div>\n<h2>Zero-Allocation Port Before Changing the Trie<\/h2>\n<p>Before touching the trie structure, Perplexity isolated how much cost came from unnecessary allocations alone. 
They kept the same HashMap trie but made two changes:<\/p>\n<ul class=\"pplx-list\">\n<li>Caller-owned scratch struct reused across calls, removing per-encode DP table allocation<\/li>\n<li>Token IDs stored directly in trie nodes, removing per-match <code>String<\/code> allocation and secondary hash-map lookup<\/li>\n<\/ul>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">Reference p50<\/div>\n<div class=\"card-val\">326 \u00b5s<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">Baseline p50<\/div>\n<div class=\"card-val\">155 \u00b5s <span>(2.1x lower)<\/span><\/div>\n<\/div>\n<\/div>\n<p>Allocations alone were the dominant cost. Instructions retired dropped 2.4x. The HashMap pointer chase was now the remaining bottleneck.<\/p>\n<\/div>\n<p>    <!-- Slide 6: Opt 1 Double Array Trie --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Optimization 1<\/div>\n<h2>Double-Array Trie<\/h2>\n<p>The HashMap trie costs 4 dependent loads per byte step. The <strong>double-array trie<\/strong> (Aoe, 1989) replaces it with flat integer arrays <code>base<\/code> and <code>check<\/code>.<\/p>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">HashMap trie (reference)<\/div>\n<div class=\"card-val\">Hash byte, load bucket, follow pointer to child, follow pointer to child\u2019s HashMap. 4 dependent loads per step.<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">Double-array trie<\/div>\n<div class=\"card-val\"><code>next = base[node] + byte<\/code><br \/>Verify <code>check[next] == node<\/code><br \/>2 array reads, 1 add, 1 compare. No hashing.<\/div>\n<\/div>\n<\/div>\n<div class=\"pplx-highlight-box\">\n<p>250K vocab fits in ~9 MB contiguous memory. Hot working set per encode is ~100 KB, fitting in L2 cache. 
Result: p50 drops from 155 \u00b5s to 68 \u00b5s, wall-clock 4.8x faster than original reference.<\/p>\n<\/div>\n<\/div>\n<p>    <!-- Slide 7: Opt 2 Bitmap --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Optimization 2<\/div>\n<h2>Bitmap + 64-Byte Cache-Line Packing<\/h2>\n<p>The double-array trie still needs two dependent array loads per step. Perplexity replaced the <code>check<\/code> array with a per-node bitmap.<\/p>\n<ul class=\"pplx-list\">\n<li>Per-node bitmap: four 64-bit words (32 bytes), one bit per possible byte value. A single bit test replaces the second array load.<\/li>\n<li>All four per-node fields (bitmap, base, token ID, score) packed into one 64-byte cache line.<\/li>\n<li>One trie step now loads a single cache line covering validity, child offset, and terminal data.<\/li>\n<\/ul>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">L2 accesses at 514 tokens<\/div>\n<div class=\"card-val\">4,600 (Darts) <span class=\"badge-orange pplx-badge\">vs<\/span> 1,800 (Bitmap)<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">Trie size trade-off<\/div>\n<div class=\"card-val\">~9 MB (Darts) grows to ~50 MB (780K nodes x 64 bytes)<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>    <!-- Slide 8: Opt 3 Huge Pages --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Optimization 3<\/div>\n<h2>2 MB Huge Pages for the Trie<\/h2>\n<p>At 50 MB with 4 KB pages, the trie spans ~12,000 virtual pages. Intel Sapphire Rapids holds only 96 entries in the first-level data TLB. TLB misses trigger page-table walks.<\/p>\n<div class=\"pplx-highlight-box\">\n<p>~9,000 cycles spent in page-table walks per 512-token encode, about 3% of the per-encode budget.<\/p>\n<\/div>\n<p>Fix: back the trie with 2 MB huge pages via <code>mmap<\/code> with <code>MAP_HUGETLB<\/code>. The same 50 MB spans 25 pages, well within TLB capacity. 
In production, 10,561 huge pages are reserved; the trie uses 24.<\/p>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">At 514 tokens<\/div>\n<div class=\"card-val\">65.4 \u00b5s without huge pages vs 63.1 \u00b5s with (-3.4%)<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">At 4,098 tokens<\/div>\n<div class=\"card-val\">773 \u00b5s without huge pages vs 679 \u00b5s with (-12.0%)<\/div>\n<\/div><\/div>\n<\/div>\n<p>    <!-- Slide 9: Benchmarks --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Results<\/div>\n<h2>Final Benchmark at 514 Tokens<\/h2>\n<p>Single-threaded, pinned core, Intel Xeon Platinum 8488C. 10,000 iterations after 1,000 warmup rounds.<\/p>\n<table class=\"pplx-bench\">\n<thead>\n<tr>\n<th>Engine<\/th>\n<th>p50 Latency<\/th>\n<th>Instructions<\/th>\n<th>Allocations<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Hugging Face (Rust)<\/td>\n<td>349 \u00b5s<\/td>\n<td>3.60M<\/td>\n<td>7,295<\/td>\n<\/tr>\n<tr>\n<td>SentencePiece (C++)<\/td>\n<td>128 \u00b5s<\/td>\n<td>1.83M<\/td>\n<td>1,559<\/td>\n<\/tr>\n<tr>\n<td>IREE tokenizer (C)<\/td>\n<td>112 \u00b5s<\/td>\n<td>2.28M<\/td>\n<td>1<\/td>\n<\/tr>\n<tr class=\"highlight\">\n<td>Perplexity (final)<\/td>\n<td>~63 \u00b5s<\/td>\n<td>1.04M<\/td>\n<td>0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. 
Note: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x overhead vs native binaries due to per-call allocations.<\/p>\n<\/div>\n<p>    <!-- Slide 10: Takeaways --><\/p>\n<div class=\"pplx-slide\">\n<div class=\"pplx-tag\">Key Takeaways<\/div>\n<h2>What Engineers Should Know<\/h2>\n<ul class=\"pplx-list\">\n<li>CPU tokenization is invisible in GPU profiling traces but real in end-to-end latency for small models.<\/li>\n<li>Removing per-encode heap allocations (zero-allocation baseline) cut p50 from 326 \u00b5s to 155 \u00b5s before any trie change.<\/li>\n<li>Double-array trie brought p50 to 68 \u00b5s. Bitmap packing and huge pages brought it to ~63 \u00b5s.<\/li>\n<li>The Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead vs native binaries.<\/li>\n<li>Source code is available at <code>github.com\/perplexityai\/pplx-garden<\/code> under MIT license.<\/li>\n<\/ul>\n<div class=\"pplx-two-col\">\n<div class=\"pplx-card\">\n<div class=\"card-label\">Production impact<\/div>\n<div class=\"card-val\">5-6x CPU utilization reduction + double-digit ms off reranker latency<\/div>\n<\/div>\n<div class=\"pplx-card\">\n<div class=\"card-label\">Target model<\/div>\n<div class=\"card-val\">XLM-RoBERTa, 250K-token SentencePiece Unigram vocabulary<\/div>\n<\/div><\/div>\n<\/div>\n<\/div>\n<p>  <button class=\"pplx-arrow pplx-prev\" aria-label=\"Previous slide\"><\/button><\/p>\n<p>  <br \/>\n  <button class=\"pplx-arrow pplx-next\" aria-label=\"Next slide\"><\/button><\/p>\n<p>  <\/p>\n<div class=\"pplx-nav\" role=\"tablist\" aria-label=\"Slide navigation\"><\/div>\n<\/div>\n<h2 class=\"wp-block-heading\"><strong>Key Takeaways<\/strong><\/h2>\n<ul class=\"wp-block-list\">\n<li>Perplexity rebuilt their Unigram tokenizer targeting XLM-RoBERTa&#8217;s 250K-token SentencePiece vocabulary<\/li>\n<li>The new encoder achieves zero steady-state heap allocations and ~63 \u00b5s p50 at 514 tokens<\/li>\n<li>Three optimizations: 
double-array trie, bitmap + 64-byte cache-line packing, and 2 MB huge pages for the trie<\/li>\n<li>Intermediate result: a zero-allocation HashMap port alone cut p50 from 326 \u00b5s to 155 \u00b5s before the trie was changed<\/li>\n<li>Production impact: 5-6x CPU utilization reduction and double-digit ms reduction in reranker latency<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<\/p><p class=\"wp-block-paragraph\">Check out the <strong><a href=\"https:\/\/github.com\/perplexityai\/pplx-garden\" target=\"_blank\" rel=\"noreferrer noopener\">Repo<\/a><\/strong> and <strong><a href=\"https:\/\/research.perplexity.ai\/articles\/improving-unigram-tokenizer-cpu-performance\" target=\"_blank\" rel=\"noreferrer noopener\">Technical details<\/a><\/strong>.<\/p>\n<p>The post <a href=\"https:\/\/www.marktechpost.com\/2026\/05\/28\/perplexity-ai-open-sources-unigram-tokenizer-that-achieves-5x-lower-p50-latency-than-hugging-face-tokenizers-crate\/\">Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate<\/a> appeared first on <a href=\"https:\/\/www.marktechpost.com\/\">MarkTechPost<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Perplexity AI\u2019s research team &hellip;<\/p>\n","protected":false},"author":1,"featured_media":29,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-987","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts\/987","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=987"}],"version-history":[{"count":0,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/post
s\/987\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=\/wp\/v2\/media\/29"}],"wp:attachment":[{"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=987"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=987"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/connectword.dpdns.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=987"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}