---
title: Document Chunking and Snippetization Overview
slug: ai-assistant/enterprise-search/document-chunking-and-snippetization-overview
---

When you connect a knowledge source — a PDF, a wiki article, a SharePoint file — Moveworks breaks it into searchable pieces called **snippets** (also referred to as chunks). This two-stage process ensures that search results are relevant, appropriately sized, and structured around your content's natural layout.

## Stage 1: Parsing Your Document

Before any chunking happens, Moveworks reads and interprets the document based on its file type.

### PDFs

PDF text is extracted using one of three engines (selected automatically or per-configuration):

* **PDFMiner** *(default)* — Splits the document into page ranges and extracts text in parallel for speed.
* **PyPDF** — Extracts text one page at a time.
* **PDFium** — A high-fidelity parser backed by Google's PDF rendering engine. After extraction, it identifies paragraph boundaries by detecting sentence-ending punctuation (`.`, `!`, `?`) followed by a capitalized new sentence.
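The paragraph-boundary heuristic described for PDFium can be sketched with a simple regex. This is an illustrative approximation of the documented behavior, not the engine's actual implementation:

```python
import re

# Assumed heuristic from the description above: a boundary exists where
# sentence-ending punctuation (., !, ?) is followed by whitespace and a
# capitalized word starting the next sentence.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_paragraphs(text: str) -> list[str]:
    """Split raw extracted PDF text at the detected boundaries."""
    return [seg.strip() for seg in BOUNDARY.split(text) if seg.strip()]
```

Note that lowercase continuations (e.g., abbreviations mid-sentence) are deliberately not treated as boundaries under this heuristic.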

**Limits that apply to all PDFs:**

* Maximum file size: **25 MB**
* Maximum pages processed: **100**
* Certain PDF generators (e.g., ArchiCAD) produce non-standard files and are skipped with an error.
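The limits above amount to a simple pre-flight check. A minimal sketch (the function and error message are illustrative, not Moveworks' actual error codes):

```python
MAX_PDF_BYTES = 25 * 1024 * 1024  # 25 MB file-size cap
MAX_PDF_PAGES = 100               # only the first 100 pages are processed

def check_pdf_limits(size_bytes: int, page_count: int) -> int:
    """Reject oversized files; return how many pages will be processed."""
    if size_bytes > MAX_PDF_BYTES:
        raise ValueError("PDF exceeds the 25 MB limit")
    return min(page_count, MAX_PDF_PAGES)
```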

### HTML / Web Articles / Knowledge Base Pages

Moveworks walks the HTML structure and maps it to a document tree:

| **HTML Element**                    | **How It's Treated**            |
| ----------------------------------- | ------------------------------- |
| Headings (H1–H6)                    | Recognized as section titles    |
| Paragraphs, divs, sections          | Become structural groupings     |
| Lists (ordered & unordered)         | Preserved as list structures    |
| Tables                              | Preserved as table structures   |
| Code / pre-formatted blocks         | Preserved with exact formatting |
| Links                               | Extracted with their target URL |
| Images                              | Represented by their alt text   |
| Navigation, footers, scripts, forms | **Skipped entirely**            |

Confluence-specific macros (tabs, panels, etc.) are converted to standard HTML before processing.
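The mapping in the table above can be sketched with a stdlib HTML parser. This is a simplified stand-in for the real tree builder — class and field names are hypothetical — but it shows the core behavior: headings become section titles, skipped elements are dropped wholesale, and images are reduced to their alt text:

```python
from html.parser import HTMLParser

SKIPPED = {"nav", "footer", "script", "form", "style"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class DocTreeBuilder(HTMLParser):
    """Collect (kind, text) pairs, skipping navigation/footer/script/form."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0              # >0 while inside a skipped element
        self.in_heading = False
        self.sections: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            if tag in HEADINGS:
                self.in_heading = True
            elif tag == "img":
                alt = dict(attrs).get("alt")
                if alt:  # images are represented by their alt text
                    self.sections.append(("image", alt))

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        kind = "title" if self.in_heading else "text"
        self.sections.append((kind, data.strip()))
```

A production parser would additionally preserve list, table, and code-block structure rather than flattening everything to text.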

### Other File Types

* **PowerPoint (PPTX)** — Each slide is treated as its own unit.
* **Word documents (DOCX)** — Extracted with heading structure preserved.
* **Plain text** — Processed directly without structural parsing.

## Stage 2: Chunking into Snippets

Once the document is parsed into a structured representation, it is divided into snippets — the units that get indexed and returned in search results. Two chunking strategies are available.

### Strategy A: Fixed-Size Chunking

*Used for PDFs, plain text, and knowledge base articles.*

Text is split using a **cascading hierarchy of splitters**, from coarse to fine. A splitter is only applied if the current chunk still exceeds the token limit after the previous level:

| **Level**     | **Splits on**                 | **Applied when**          |
| ------------- | ----------------------------- | ------------------------- |
| 1 — Paragraph | Blank lines / double newlines | Chunk exceeds limit       |
| 2 — Sentence  | Sentence boundaries           | Chunk still exceeds limit |
| 3 — Line      | Single newlines               | Chunk still exceeds limit |
| 4 — Word      | Word boundaries               | Chunk still exceeds limit |
| 5 — Character | Hard character cutoff         | Last resort only          |

**Token limit:** **200 tokens** per chunk by default (configurable). Token counting uses the same tokenizer as GPT-3.5 Turbo.

Segments are then **greedily packed** — consecutive segments are joined together until the next one would push the chunk over the limit.
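The cascade and the greedy packing can be sketched as follows. Word count stands in for the real token counter (in practice a GPT-3.5-compatible tokenizer would be used), and all names here are illustrative:

```python
import re

SPLITTERS = [
    r"\n\s*\n",        # 1 — paragraph: blank lines / double newlines
    r"(?<=[.!?])\s+",  # 2 — sentence boundaries
    r"\n",             # 3 — single newlines
    r"\s+",            # 4 — word boundaries
]

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def cascade_split(text: str, limit: int, level: int = 0) -> list[str]:
    """Apply finer splitters only while a segment still exceeds the limit."""
    if count_tokens(text) <= limit:
        return [text]
    if level >= len(SPLITTERS):
        # 5 — character: hard cutoff as a last resort
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    pieces = [p for p in re.split(SPLITTERS[level], text) if p.strip()]
    out: list[str] = []
    for piece in pieces:
        out.extend(cascade_split(piece, limit, level + 1))
    return out

def greedy_pack(segments: list[str], limit: int) -> list[str]:
    """Join consecutive segments until the next would exceed the limit."""
    chunks, current = [], ""
    for seg in segments:
        candidate = f"{current} {seg}".strip()
        if current and count_tokens(candidate) > limit:
            chunks.append(current)
            current = seg
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting fine and then packing greedily means chunks stay as close to the limit as possible without breaking mid-word except at the character-level last resort.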

### Strategy B: Structure-Aware Dynamic Chunking

*Used for HTML documents when structure-aware mode is enabled.*

Instead of splitting blindly by token count, this strategy uses the **document's own structure** to find natural chunk boundaries, prioritized in tiers:

| **Priority** | **Boundary Type**                    | **Examples**                |
| ------------ | ------------------------------------ | --------------------------- |
| Highest      | Headings, horizontal rules, sections | `<h2>`, `<hr>`, `<section>` |
| High         | Generic containers                   | `<div>`                     |
| Medium       | Paragraphs                           | `<p>`                       |
| Lower        | Tables, lists, code blocks           | `<table>`, `<ul>`, `<pre>`  |

**Token thresholds:**

* Minimum chunk size: **8 tokens** (smaller chunks are merged)
* Target chunk size: **256 tokens**
* Hard maximum: **512 tokens** (tables and lists: **1,024 tokens**)

If a structural block still exceeds the hard maximum, it is recursively split further.
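The threshold logic above — merge under-sized chunks, recursively split over-sized ones — can be sketched like this. Word count again stands in for real tokens, and the function names are illustrative:

```python
MIN_TOKENS = 8     # smaller chunks are merged into the previous chunk
MAX_TOKENS = 512   # hard maximum (pass 1024 for tables and lists)

def count_tokens(text: str) -> int:
    return len(text.split())  # crude stand-in for a real tokenizer

def enforce_thresholds(chunks: list[str],
                       max_tokens: int = MAX_TOKENS) -> list[str]:
    out: list[str] = []
    for chunk in chunks:
        n = count_tokens(chunk)
        if n > max_tokens:
            # Recursively halve blocks that exceed the hard maximum.
            words = chunk.split()
            mid = len(words) // 2
            halves = [" ".join(words[:mid]), " ".join(words[mid:])]
            out.extend(enforce_thresholds(halves, max_tokens))
        elif out and n < MIN_TOKENS:
            # Merge too-small chunks into the previous one.
            out[-1] = f"{out[-1]} {chunk}"
        else:
            out.append(chunk)
    return out
```

A real implementation would split at structural boundaries (headings, then divs, then paragraphs, per the priority table) rather than halving by word count; the halving here only illustrates the recursion.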

## What Happens After Chunking

| **Optional Step**   | **What It Does**                                                                          |
| ------------------- | ----------------------------------------------------------------------------------------- |
| Language detection  | Identifies the language of the document for multilingual search routing                   |
| Sentence annotation | Each snippet is further annotated with individual sentences for more precise highlighting |

## Guardrails

* A **per-request timeout** is enforced throughout the pipeline. If parsing or chunking takes too long, the request fails gracefully rather than hanging.
* Empty documents, password-protected PDFs, and oversized files all return specific error codes rather than silently producing empty results.
* The strategy that produces the **most snippets wins** when multiple strategies are eligible — maximizing coverage of your document's content.
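The "most snippets wins" rule amounts to running every eligible strategy and keeping the largest result. A toy illustration (the strategy callables are stand-ins, not real API names):

```python
from typing import Callable

def pick_strategy_output(document: str,
                         strategies: list[Callable[[str], list[str]]]) -> list[str]:
    """Run each eligible strategy; keep whichever yields the most snippets."""
    results = [strategy(document) for strategy in strategies]
    return max(results, key=len)
```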
