Document Store Connector

The DocumentStoreConnector ingests PDF, DOCX, Markdown, and text manuals and exposes them to the agent via semantic search. It is designed for technical documentation — equipment manuals, troubleshooting guides, work procedures, datasheets — where a single chunk is often less useful than the surrounding section.

The connector is layered so each capability is opt-in. The base path (keyword fallback) has no heavy dependencies; dense vector search, layout-aware parsing, hybrid retrieval, and cross-encoder reranking are pulled in via extras when the use case justifies the install cost.

| Capability | Extra | Default | Use when |
|---|---|---|---|
| Keyword fallback | none | ✅ | smoke tests, quickstart |
| Dense vector search | `[docs-rag]` | ✅ when installed | semantic retrieval |
| Hybrid BM25 + dense (RRF) | `[docs-rag-hybrid]` | ✅ when installed | technical IDs (`SKF 6310-2RS`), part numbers, acronyms |
| Cross-encoder reranker | `[docs-rag-rerank]` | opt-in (`reranker_model=`) | precision-critical retrieval |
| Layout-aware parsing (Docling) | `[docs-rag-parsing]` | ✅ when installed | PDFs with tables, multi-column layouts, figure captions |

Install everything with pip install "machina-ai[docs-rag-pro]" (quoting the requirement keeps shells such as zsh from treating the brackets as a glob pattern).

Quickstart

import asyncio

from machina.connectors.docs import DocumentStoreConnector


async def main() -> None:
    docs = DocumentStoreConnector(paths=["manuals/", "procedures/"])
    await docs.connect()
    # asset_id narrows the search to chunks tagged with that asset
    results = await docs.search("bearing replacement", asset_id="P-201")
    for chunk in results:
        print(f"[{chunk.source} p.{chunk.page}] § {chunk.section_title}")
        print(chunk.content[:200])


asyncio.run(main())

When langchain and chromadb are not installed, the connector silently falls back to an in-memory keyword search, so the quickstart works without a heavy install.

Metadata schema

Every chunk carries a small, structured metadata record that powers pre-retrieval filtering. Metadata can come from three sources, applied in order:

Sidecar file — manual.pdf next to manual.pdf.meta.yaml
YAML frontmatter — at the top of .md / .txt files
Path inference — best-effort guess from filename / parent directory

# manual.pdf.meta.yaml
asset_id: P-201
equipment_class_code: PU      # ISO 14224 Annex A
doc_type: manual              # manual | procedure | datasheet | troubleshooting | other
section_title: Bearing Replacement  # optional default

Frontmatter format inside Markdown / text:

---
asset_id: P-201
doc_type: procedure
---
# Bearing Replacement Procedure
...
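
As a rough illustration of how frontmatter can be separated from the document body, the sketch below splits a text on the `---` delimiters and reads flat `key: value` pairs. The function name `split_frontmatter` is hypothetical and not part of the connector's API; a real implementation would more likely hand the block to a YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Return (metadata, body) for a document that may start with frontmatter.

    Simplified on purpose: only flat `key: value` lines are understood.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no frontmatter block at the top
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        return {}, text  # opening delimiter was never closed
    metadata: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep and key.strip():
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return metadata, body


doc = "---\nasset_id: P-201\ndoc_type: procedure\n---\n# Bearing Replacement Procedure\n"
meta, body = split_frontmatter(doc)
print(meta)
print(body)
```

Files without a leading `---` line, or with an unclosed block, come back with empty metadata and their text unchanged.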

Filter by any indexed field at query time:

await docs.search(
    "torque",
    filters={"asset_id": "P-201", "doc_type": "procedure"},
)

asset_id= is a shortcut for filters={"asset_id": ...}.
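
The shortcut can be pictured as a merge step followed by an exact-match test on each chunk's metadata. This is only a sketch: `merge_filters` and `matches` are hypothetical helpers, and letting `asset_id=` override an `asset_id` key already present in `filters` is an assumption, not documented behavior.

```python
from typing import Optional


def merge_filters(
    filters: Optional[dict[str, str]] = None,
    asset_id: Optional[str] = None,
) -> dict[str, str]:
    """Fold the asset_id shortcut into a single filter dict."""
    merged = dict(filters or {})
    if asset_id is not None:
        merged["asset_id"] = asset_id  # assumption: the shortcut wins on conflict
    return merged


def matches(metadata: dict[str, str], filters: dict[str, str]) -> bool:
    """True when every filter key is present in the metadata with the same value."""
    return all(metadata.get(key) == value for key, value in filters.items())


chunks = [
    {"asset_id": "P-201", "doc_type": "procedure"},
    {"asset_id": "P-201", "doc_type": "manual"},
    {"asset_id": "P-305", "doc_type": "procedure"},
]
active = merge_filters({"doc_type": "procedure"}, asset_id="P-201")
print([c for c in chunks if matches(c, active)])
```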

Section-aware chunking

The splitter detects sections from three signals and keeps each section's body as a single parent, while indexing smaller match chunks for embedding / BM25 / rerank.

Markdown — ATX headings (#, ##, ...), fence-aware so code samples don't spawn phantom sections
Numbered headings — 1. Introduction, 2.1 Bearing Replacement — only when set off by a blank line, to avoid mistaking body list items for headings
ALL-CAPS headings — BEARING REPLACEMENT PROCEDURE, also with blank-line context
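
The three signals above can be approximated with a few regular expressions. The sketch below is not the connector's splitter: the function names are hypothetical, and it assumes that "set off by a blank line" means a blank line (or document edge) on both sides of a numbered or ALL-CAPS candidate.

```python
import re

ATX = re.compile(r"^#{1,6}\s+\S")
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def is_all_caps(line: str) -> bool:
    """True when the line has at least three letters and all of them are uppercase."""
    letters = [c for c in line if c.isalpha()]
    return len(letters) >= 3 and all(c.isupper() for c in letters)


def find_headings(text: str) -> list[tuple[int, str]]:
    """Return (line_index, heading_text) pairs for detected headings."""
    lines = text.splitlines()
    headings: list[tuple[int, str]] = []
    in_fence = False
    for i, raw in enumerate(lines):
        line = raw.strip()
        if line.startswith("```"):
            in_fence = not in_fence  # entering or leaving a code fence
            continue
        if in_fence or not line:
            continue
        blank_before = i == 0 or not lines[i - 1].strip()
        blank_after = i == len(lines) - 1 or not lines[i + 1].strip()
        if ATX.match(line):
            headings.append((i, line))
        elif blank_before and blank_after and (NUMBERED.match(line) or is_all_caps(line)):
            headings.append((i, line))
    return headings


sample = (
    "# Pump Manual\n\n1. Introduction\n\nSteps:\n1. Remove guard\n2. Loosen bolts\n\n"
    "```\n# not a heading\n```\n\nBEARING REPLACEMENT PROCEDURE\n\nBody text."
)
for index, title in find_headings(sample):
    print(index, title)
```

The list items under "Steps:" are skipped because they are not surrounded by blank lines, and the `#` line inside the fence is ignored.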

At query time, the retrieved match chunk is replaced by its parent section before being handed to the LLM, so a query for "Step 3" returns the entire multi-step procedure, not a partial chunk. Sections that exceed max_parent_chars (default 8000) are windowed around the matched offset.
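
One plausible way to window an oversized section is to centre a fixed-size slice on the match and clamp it to the section boundaries. The helper below is a hypothetical sketch under that assumption; only the `max_parent_chars` name and its default of 8000 come from the text above.

```python
def parent_context(
    section: str,
    match_start: int,
    match_len: int,
    max_parent_chars: int = 8000,
) -> str:
    """Return the whole section, or a window around the match if it is too long."""
    if len(section) <= max_parent_chars:
        return section
    center = match_start + match_len // 2
    start = max(0, center - max_parent_chars // 2)
    end = min(len(section), start + max_parent_chars)
    start = max(0, end - max_parent_chars)  # re-anchor when clipped at the end
    return section[start:end]


section = "a" * 50 + "MATCH" + "b" * 50
window = parent_context(section, match_start=50, match_len=5, max_parent_chars=21)
print(len(window), "MATCH" in window)
```

Clamping at both ends keeps the window at the full `max_parent_chars` length even when the match sits near the start or end of the section.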

Layout-aware parsing

Install machina-ai[docs-rag-parsing] to enable Docling-based parsing. PDFs and DOCX files are parsed into a structured ParsedDocument with:

Sections — heading title, level, body text, page range
Tables — rendered as Markdown and indexed as one atomic chunk that retrieval can never split mid-row

Layout-aware parsing is best-effort: if Docling isn't installed, or if it raises on a specific file, the connector logs a warning and falls back to PyPDFLoader / Docx2txtLoader. A single bad PDF never blocks the rest of a corpus.
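
The per-file fallback can be sketched as two nested guards: one around the layout-aware parser, one around the whole file. Everything here is illustrative; `parse_with_fallback`, `ingest`, and the parser callables are hypothetical stand-ins, not the connector's internals.

```python
import logging
from typing import Callable

logger = logging.getLogger("docs_ingest_sketch")

Parser = Callable[[str], str]


def parse_with_fallback(path: str, primary: Parser, fallback: Parser) -> str:
    """Try the layout-aware parser first, then the plain loader."""
    try:
        return primary(path)
    except Exception as exc:
        logger.warning("layout parser failed on %s (%s); using fallback", path, exc)
        return fallback(path)


def ingest(paths: list[str], primary: Parser, fallback: Parser) -> dict[str, str]:
    """Parse every path; a file that fails both parsers is skipped, not fatal."""
    parsed: dict[str, str] = {}
    for path in paths:
        try:
            parsed[path] = parse_with_fallback(path, primary, fallback)
        except Exception as exc:
            logger.warning("skipping %s: %s", path, exc)
    return parsed


def fake_layout_parser(path: str) -> str:
    if "scan" in path:
        raise ValueError("unsupported layout")
    return f"structured:{path}"


def fake_plain_loader(path: str) -> str:
    if "corrupt" in path:
        raise OSError("unreadable file")
    return f"plain:{path}"


print(ingest(["a.pdf", "scan.pdf", "scan_corrupt.pdf"], fake_layout_parser, fake_plain_loader))
```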

Hybrid retrieval

With [docs-rag-hybrid] installed, every dense Chroma query is paired with a BM25 query over the same chunk corpus and the two rankings are fused with Reciprocal Rank Fusion (k=60). Hybrid retrieval recovers technical identifiers — SKF 6310-2RS, WO-2026-0087, ISO codes — that dense embeddings often miss while keeping semantic recall for prose queries.
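
Reciprocal Rank Fusion scores each chunk as the sum of 1 / (k + rank) over every ranking it appears in, with ranks starting at 1. The sketch below shows that formula with k=60; the function name and the chunk ids are made up for illustration, and breaking ties by id is an assumption added to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c1", "c2", "c3"]  # order returned by the vector index
bm25 = ["c3", "c1", "c4"]   # order returned by BM25
for chunk_id, score in reciprocal_rank_fusion([dense, bm25]):
    print(chunk_id, round(score, 5))
```

Because only ranks are used, the dense similarity scores and the BM25 scores never need to be normalised onto a common scale.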

Cross-encoder reranking

Pass reranker_model= to wrap the fused candidate set in a cross-encoder for a final reordering pass:

DocumentStoreConnector(
    paths=["manuals/"],
    reranker_model="BAAI/bge-reranker-base",
)

Requires [docs-rag-rerank]. The model is loaded lazily on first query so connect time stays fast; if loading fails the connector keeps the RRF order rather than returning an empty result.
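
The lazy-load-with-fallback behavior can be sketched with a small wrapper that defers model loading to the first call and returns the input order if loading fails. `LazyReranker` and the `load_model` callable are hypothetical; remembering a failed load instead of retrying on every query is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence

# A scorer takes (query, candidate texts) and returns one relevance score per text.
Scorer = Callable[[str, Sequence[str]], list[float]]


class LazyReranker:
    def __init__(self, load_model: Callable[[], Scorer]) -> None:
        self._load_model = load_model
        self._scorer: Optional[Scorer] = None
        self._failed = False

    def rerank(self, query: str, candidates: list[str]) -> list[str]:
        if self._scorer is None and not self._failed:
            try:
                self._scorer = self._load_model()  # first query pays the load cost
            except Exception:
                self._failed = True  # assumption: do not retry on later queries
        if self._scorer is None:
            return list(candidates)  # keep the fused (RRF) order
        scores = self._scorer(query, candidates)
        # sorted() is stable, so equal scores keep their fused order.
        order = sorted(range(len(candidates)), key=lambda i: -scores[i])
        return [candidates[i] for i in order]


def broken_loader() -> Scorer:
    raise RuntimeError("model not available")


print(LazyReranker(broken_loader).rerank("torque", ["first", "second"]))
```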

Swappable embedder

Pass embedder= to use a custom sentence-transformers model for the dense index instead of Chroma's default:

DocumentStoreConnector(
    paths=["manuals/"],
    embedder="BAAI/bge-m3",
)

bge-m3 works well for multilingual technical content (Italian, English, German manuals in the same corpus). Requires [docs-rag-rerank] for the sentence-transformers runtime. If the model fails to load — extra missing, model not downloaded, GPU / CPU mismatch — the connector falls back to Chroma's default embedder so ingest is never blocked.

Citation contract

Search results carry stable identifiers the agent can cite:

from dataclasses import dataclass


@dataclass
class DocumentChunk:
    content: str
    source: str            # file path
    page: int              # 1-based, or 0 if unknown
    chunk_id: str          # deterministic across runs
    parent_id: str         # join key for the parent section
    section_title: str     # detected by splitter or sidecar
    asset_id: str
    equipment_class_code: str
    doc_type: str
    score: float

The agent runtime registers retrieved chunk_ids per turn so the LLM can reference them in its answer; the parser validates each citation against the registry to reject hallucinated chunk ids.
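
A minimal sketch of the registry idea follows. It makes several assumptions that are not part of the documented contract: the `[chunk:<id>]` citation syntax, the `CitationRegistry` class, and the hash-based `make_chunk_id` recipe are all hypothetical and only show how deterministic ids and per-turn validation can fit together.

```python
import hashlib
import re

CITATION = re.compile(r"\[chunk:([0-9a-f]+)\]")  # hypothetical citation syntax


def make_chunk_id(source: str, page: int, start_offset: int) -> str:
    """Derive an id from stable inputs so re-indexing yields the same value."""
    key = f"{source}|{page}|{start_offset}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


class CitationRegistry:
    """Tracks the chunk ids retrieved during one agent turn."""

    def __init__(self) -> None:
        self._ids: set[str] = set()

    def register(self, chunk_ids: list[str]) -> None:
        self._ids.update(chunk_ids)

    def validate(self, answer: str) -> tuple[list[str], list[str]]:
        """Split cited ids into (known, rejected) lists, in order of appearance."""
        known: list[str] = []
        rejected: list[str] = []
        for chunk_id in CITATION.findall(answer):
            (known if chunk_id in self._ids else rejected).append(chunk_id)
        return known, rejected


real_id = make_chunk_id("manuals/pump.pdf", 12, 3400)
registry = CitationRegistry()
registry.register([real_id])
answer = f"Torque the cap screws in a star pattern [chunk:{real_id}] [chunk:deadbeef]."
print(registry.validate(answer))
```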

Failure modes

| Failure | Behavior |
|---|---|
| `langchain` / `chromadb` not installed | Falls back to in-memory keyword search |
| `rank_bm25` not installed | Dense-only retrieval |
| Reranker model fails to load | Returns RRF order (no rerank) |
| Layout parser fails on a file | Logs warning, falls back to `PyPDFLoader` / `Docx2txtLoader` for that file |
| Embedder model fails to load | Falls back to Chroma's default embedder |
| Sidecar YAML missing or malformed | Logs warning, indexes with empty metadata |
| File extension unsupported | Skipped silently |

Configuration reference

| Param | Default | Notes |
|---|---|---|
| `paths` | `[]` | Files or directories. Directories are walked recursively. |
| `collection_name` | `"machina_docs"` | ChromaDB collection name |
| `chunk_size` | `1000` | Target match-chunk size in characters |
| `chunk_overlap` | `200` | Overlap between consecutive match chunks |
| `reranker_model` | `None` | `sentence-transformers` cross-encoder name |
| `embedder` | `None` | `sentence-transformers` embedding model name |
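
To see how `chunk_size` and `chunk_overlap` interact, the sketch below computes plain character windows with the default values. It is a deliberately naive model: the hypothetical `match_chunk_spans` ignores section and sentence boundaries, which the connector's splitter presumably respects since `chunk_size` is described as a target.

```python
def match_chunk_spans(
    length: int,
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> list[tuple[int, int]]:
    """Return (start, end) character spans covering a text of the given length."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the last
    spans: list[tuple[int, int]] = []
    start = 0
    while start < length:
        end = min(start + chunk_size, length)
        spans.append((start, end))
        if end == length:
            break
        start += step
    return spans


print(match_chunk_spans(2500))
```

With the defaults, each window starts 800 characters after the previous one, so a 2500-character section yields three match chunks.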