Docling
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation, including document layout, tables, and more, making them ready for generative AI workflows like RAG.
Docling Loader, presented in this notebook, seamlessly integrates Docling into LangChain, enabling you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich representation for advanced, document-native grounding.
In the sections below, we showcase Docling Loader's usage, covering document loading specifics and demonstrating an end-to-end RAG pipeline.
Setup
%pip install -qU docling langchain-community langchain langchain-text-splitters langchain-huggingface langchain-milvus pip
Note: you may need to restart the kernel to use updated packages.
from langchain_community.document_loaders import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
def clip_text(text, threshold=100):
    return f"{text[:threshold]}[...]" if len(text) > threshold else text

def print_docs(docs):
    for d in docs:
        print(f"metadata={d.metadata}, page_content={repr(clip_text(d.page_content))}")
Document loading
Docling Loader can be used in two different modes, based on the export type:
- Markdown mode: for each input doc, outputs a LangChain Document with the Markdown representation of the input doc
- Doc-chunks mode: for each input doc, outputs the doc chunks (using Docling layout-aware chunking) as LangChain Documents
Using Markdown mode
The Markdown mode (the default) returns the Markdown export of the input documents.
For customizing the Markdown export, the user can pass the Docling Markdown export kwargs (via keyword argument md_export_kwargs).
Advanced tip: for customizing the conversion initialization and/or execution, the user can pass a Docling DocumentConverter object (via keyword argument converter) and/or the conversion kwargs (via keyword argument convert_kwargs), respectively.
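As a rough sketch of such a customization (the specific pipeline options and export kwargs below are illustrative assumptions, not required settings):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Illustrative PDF pipeline: OCR off, table structure recognition on.
pipeline_options = PdfPipelineOptions(do_ocr=False, do_table_structure=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

loader = DoclingLoader(
    file_path=FILE_PATH,
    converter=converter,
    convert_kwargs={"max_num_pages": 10},  # forwarded to the Docling conversion call
    md_export_kwargs={"image_placeholder": ""},  # forwarded to the Markdown export (assumed kwarg)
)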
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.MARKDOWN,
)
docs = loader.load()
print_docs(docs)
metadata={'source': 'https://arxiv.org/pdf/2408.09869'}, page_content='## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nik[...]'
Now that the docs have been loaded, any built-in (or custom) LangChain splitter can be used to split them.
To illustrate one option, below we show a possible splitting with a MarkdownHeaderTextSplitter:
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")],
)
md_splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
print_docs(md_splits[:3])
metadata={'Header_2': 'Docling Technical Report'}, page_content='Version 1.0 \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagen[...]'
metadata={'Header_2': 'Abstract'}, page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source p[...]'
metadata={'Header_2': '1 Introduction'}, page_content='Converting PDF documents back into a machine-processable format has been a major challenge for decad[...]'
Using doc-chunks mode
The doc-chunks mode directly returns the document chunks, including rich metadata such as the page number and bounding box info.
For custom chunking, the user can pass a Docling BaseChunker object (via keyword argument chunker).
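For instance, a Docling HybridChunker could be plugged in as sketched below (a minimal sketch; the tokenizer choice here simply mirrors the embedding model used in the RAG section further down):
from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
    chunker=HybridChunker(tokenizer="BAAI/bge-small-en-v1.5"),  # align chunk sizes with the embedding model
)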
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
)
doc_splits = loader.load()
print_docs(doc_splits[:3])
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'page_header', 'prov': [{'page_no': 1, 'bbox': {'l': 17.088111877441406, 't': 583.2296752929688, 'r': 36.339778900146484, 'b': 231.99996948242188, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 38]}]}], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='arXiv:2408.09869v3 [cs.CL] 30 Aug 2024'
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 282.772216796875, 't': 512.7218017578125, 'r': 328.8624572753906, 'b': 503.340087890625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='Version 1.0'
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/3', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 113.4512939453125, 't': 482.4101257324219, 'r': 498.396728515625, 'b': 439.45928955078125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 295]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berro[...]'
RAG
In this section we put together a demo RAG pipeline and run it using the documents loaded above.
import json
import os
from pathlib import Path
from tempfile import mkdtemp
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
embedding = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
llm = HuggingFaceEndpoint(repo_id=HF_LLM_MODEL_ID)
def run_rag(documents, embedding, llm, question, prompt):
    milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
    vectorstore = Milvus.from_documents(
        documents,
        embedding,
        connection_args={"uri": milvus_uri},
        drop_old=True,
    )
    retriever = vectorstore.as_retriever()
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": question})
    answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{json.dumps(answer)}")
    for i, doc in enumerate(resp_dict["context"]):
        print()
        print(f"Source {i+1}:")
        print(f" text: {json.dumps(clip_text(doc.page_content, threshold=200))}")
        for key in doc.metadata:
            if key != "pk":
                val = doc.metadata.get(key)
                clipped_val = clip_text(val) if isinstance(val, str) else val
                print(f" {key}: {clipped_val}")
Using Markdown mode
Below we run the RAG pipeline, passing it the output of the Markdown mode (after splitting):
run_rag(
    documents=md_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are a layout analysis model called DocLayNet and a table structure recognition model called TableFormer. DocLayNet is an accurate object-detector for page elements, while[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
Header_2: 3.2 AI models
Source 2:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
Header_2: Abstract
Source 3:
text: "Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed e[...]"
Header_2: 5 Applications
Source 4:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
Header_2: 6 Future work and contributions
Using doc-chunks mode
Below we run the RAG pipeline, passing it the output of the doc-chunks mode.
Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
run_rag(
    documents=doc_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are:\n\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model."
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 2:
text: "With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 3:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/60', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 106.92281341552734, 't': 323.5386657714844, 'r': 504.00347900390625, 'b': 258.76641845703125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 4:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 142.92593383789062, 't': 364.814697265625, 'r': 468.3847351074219, 'b': 300.651123046875, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 431]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
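The grounding shown in the sources above can also be read programmatically from each document's metadata. Below is a minimal sketch (assuming the dl_meta structure printed above; get_page_numbers is a hypothetical helper, not part of Docling Loader):
def get_page_numbers(doc):
    # Collect the page numbers recorded in the chunk's Docling provenance metadata.
    doc_items = doc.metadata.get("dl_meta", {}).get("doc_items", [])
    return sorted({prov["page_no"] for item in doc_items for prov in item.get("prov", [])})

for doc in doc_splits[:3]:
    print(get_page_numbers(doc))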
Related
- Document loader conceptual guide
- Document loader how-to guides