Docling
Docling parses PDF, DOCX, PPTX, HTML, and other formats into a rich unified representation, including document layout, tables, and more, making them ready for generative AI workflows like RAG.
Docling Loader, presented in this notebook, seamlessly integrates Docling into LangChain, enabling you to:
- use various document types in your LLM applications with ease and speed, and
- leverage Docling's rich representation for advanced, document-native grounding.
In the sections below, we showcase Docling Loader's usage, covering document loading specifics and demonstrating an end-to-end RAG pipeline.
Setup
%pip install -qU docling langchain-community langchain langchain-text-splitters langchain-huggingface langchain-milvus pip
Note: you may need to restart the kernel to use updated packages.
from langchain_community.document_loaders import DoclingLoader
FILE_PATH = "https://arxiv.org/pdf/2408.09869"
def clip_text(text, threshold=100):
    return f"{text[:threshold]}[...]" if len(text) > threshold else text

def print_docs(docs):
    for d in docs:
        print(f"metadata={d.metadata}, page_content={repr(clip_text(d.page_content))}")
Document loading
Docling Loader can be used in two different modes, based on the export type:
- Markdown mode: for each input doc, outputs a LangChain Document with the Markdown representation of the input doc
- Doc-chunks mode: for each input doc, outputs the doc chunks (using Docling layout-aware chunking) as LangChain Documents
Using Markdown mode
The Markdown mode (the default) returns the Markdown export of the input documents.
For customizing the Markdown export, the user can pass the Docling Markdown export kwargs (via keyword argument md_export_kwargs).
Advanced tip: for customizing the conversion initialization and/or execution, the user can pass a Docling DocumentConverter object (via keyword argument converter) and/or the conversion kwargs (via keyword argument convert_kwargs), respectively.
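As a rough sketch of such a customization (the specific pipeline options and export kwargs below are illustrative assumptions, not required settings):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Illustrative PDF pipeline: OCR off, table structure recognition on.
pipeline_options = PdfPipelineOptions(do_ocr=False, do_table_structure=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

loader = DoclingLoader(
    file_path=FILE_PATH,
    converter=converter,
    convert_kwargs={"max_num_pages": 10},  # forwarded to the Docling conversion call
    md_export_kwargs={"image_placeholder": ""},  # forwarded to the Markdown export (assumed kwarg)
)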
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.MARKDOWN,
)
docs = loader.load()
print_docs(docs)
metadata={'source': 'https://arxiv.org/pdf/2408.09869'}, page_content='## Docling Technical Report\n\nVersion 1.0\n\nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nik[...]'
Now that the docs have been loaded, any built-in (or custom) LangChain splitter can be used to split them.
To illustrate one option, below we show a possible splitting with a MarkdownHeaderTextSplitter:
from langchain_text_splitters import MarkdownHeaderTextSplitter
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")],
)
md_splits = [split for doc in docs for split in splitter.split_text(doc.page_content)]
print_docs(md_splits[:3])
metadata={'Header_2': 'Docling Technical Report'}, page_content='Version 1.0 \nChristoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagen[...]'
metadata={'Header_2': 'Abstract'}, page_content='This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source p[...]'
metadata={'Header_2': '1 Introduction'}, page_content='Converting PDF documents back into a machine-processable format has been a major challenge for decad[...]'
Using doc-chunks mode
The doc-chunks mode directly returns the document chunks, including rich metadata such as the page number and bounding box info.
For custom chunking, the user can pass a Docling BaseChunker object (via keyword argument chunker).
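For instance, a Docling HybridChunker could be plugged in as sketched below (a minimal sketch; the tokenizer choice here simply mirrors the embedding model used in the RAG section further down):
from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
    chunker=HybridChunker(tokenizer="BAAI/bge-small-en-v1.5"),  # align chunk sizes with the embedding model
)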
loader = DoclingLoader(
    file_path=FILE_PATH,
    export_type=DoclingLoader.ExportType.DOC_CHUNKS,
)
doc_splits = loader.load()
print_docs(doc_splits[:3])
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/0', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'page_header', 'prov': [{'page_no': 1, 'bbox': {'l': 17.088111877441406, 't': 583.2296752929688, 'r': 36.339778900146484, 'b': 231.99996948242188, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 38]}]}], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='arXiv:2408.09869v3 [cs.CL] 30 Aug 2024'
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/2', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 282.772216796875, 't': 512.7218017578125, 'r': 328.8624572753906, 'b': 503.340087890625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 11]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='Version 1.0'
metadata={'source': 'https://arxiv.org/pdf/2408.09869', 'dl_meta': {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/3', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 113.4512939453125, 't': 482.4101257324219, 'r': 498.396728515625, 'b': 439.45928955078125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 295]}]}], 'headings': ['Docling Technical Report'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}}, page_content='Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Nikolaos Livathinos Panos Vagenas Cesar Berro[...]'
RAG
In this section we put together a demo RAG pipeline and run it using the documents loaded above.
import json
import os
from pathlib import Path
from tempfile import mkdtemp
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_milvus import Milvus
# https://github.com/huggingface/transformers/issues/5486:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
QUESTION = "Which are the main AI models in Docling?"
PROMPT = PromptTemplate.from_template(
"Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {input}\nAnswer:\n",
)
HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
HF_LLM_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"
embedding = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)
llm = HuggingFaceEndpoint(repo_id=HF_LLM_MODEL_ID)
def run_rag(documents, embedding, llm, question, prompt):
    milvus_uri = str(Path(mkdtemp()) / "docling.db")  # or set as needed
    vectorstore = Milvus.from_documents(
        documents,
        embedding,
        connection_args={"uri": milvus_uri},
        drop_old=True,
    )
    retriever = vectorstore.as_retriever()
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)
    resp_dict = rag_chain.invoke({"input": question})
    answer = clip_text(resp_dict["answer"], threshold=200)
    print(f"Question:\n{resp_dict['input']}\n\nAnswer:\n{json.dumps(answer)}")
    for i, doc in enumerate(resp_dict["context"]):
        print()
        print(f"Source {i+1}:")
        print(f" text: {json.dumps(clip_text(doc.page_content, threshold=200))}")
        for key in doc.metadata:
            if key != "pk":
                val = doc.metadata.get(key)
                clipped_val = clip_text(val) if isinstance(val, str) else val
                print(f" {key}: {clipped_val}")
Using Markdown mode
Below we run the RAG pipeline, passing it the output of the Markdown mode (after splitting):
run_rag(
    documents=md_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are a layout analysis model called DocLayNet and a table structure recognition model called TableFormer. DocLayNet is an accurate object-detector for page elements, while[...]"
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
Header_2: 3.2 AI models
Source 2:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
Header_2: Abstract
Source 3:
text: "Thanks to the high-quality, richly structured document conversion achieved by Docling, its output qualifies for numerous downstream applications. For example, Docling can provide a base for detailed e[...]"
Header_2: 5 Applications
Source 4:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
Header_2: 6 Future work and contributions
Using doc-chunks mode
Below we run the RAG pipeline, passing it the output of the doc-chunks mode.
Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):
run_rag(
    documents=doc_splits,
    embedding=embedding,
    llm=llm,
    question=QUESTION,
    prompt=PROMPT,
)
Question:
Which are the main AI models in Docling?
Answer:
"The main AI models in Docling are:\n\n1. A layout analysis model, an accurate object-detector for page elements.\n2. TableFormer, a state-of-the-art table structure recognition model."
Source 1:
text: "As part of Docling, we initially release two highly capable AI models to the open-source community, which have been developed and published recently by our team. The first model is a layout analysis m[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/34', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 3, 'bbox': {'l': 107.07593536376953, 't': 406.1695251464844, 'r': 504.1148681640625, 'b': 330.2677307128906, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 608]}]}], 'headings': ['3.2 AI models'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 2:
text: "With Docling , we open-source a very capable and efficient document conversion tool which builds on the powerful, specialized AI models and datasets for layout analysis and table structure recognition[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/9', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 107.0031967163086, 't': 136.7283935546875, 'r': 504.04998779296875, 'b': 83.30133056640625, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 488]}]}], 'headings': ['1 Introduction'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 3:
text: "Docling is designed to allow easy extension of the model library and pipelines. In the future, we plan to extend Docling with several more models, such as a figure-classifier model, an equationrecogni[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/60', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 5, 'bbox': {'l': 106.92281341552734, 't': 323.5386657714844, 'r': 504.00347900390625, 'b': 258.76641845703125, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 543]}]}], 'headings': ['6 Future work and contributions'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
Source 4:
text: "This technical report introduces Docling , an easy to use, self-contained, MITlicensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layo[...]"
dl_meta: {'schema_name': 'docling_core.transforms.chunker.DocMeta', 'version': '1.0.0', 'doc_items': [{'self_ref': '#/texts/6', 'parent': {'$ref': '#/body'}, 'children': [], 'label': 'text', 'prov': [{'page_no': 1, 'bbox': {'l': 142.92593383789062, 't': 364.814697265625, 'r': 468.3847351074219, 'b': 300.651123046875, 'coord_origin': 'BOTTOMLEFT'}, 'charspan': [0, 431]}]}], 'headings': ['Abstract'], 'origin': {'mimetype': 'application/pdf', 'binary_hash': 14981478401387673002, 'filename': '2408.09869v3.pdf'}}
source: https://arxiv.org/pdf/2408.09869
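The grounding shown in the sources above can also be read programmatically from each document's metadata. Below is a minimal sketch (assuming the dl_meta structure printed above; get_page_numbers is a hypothetical helper, not part of Docling Loader):
def get_page_numbers(doc):
    # Collect the page numbers recorded in the chunk's Docling provenance metadata.
    doc_items = doc.metadata.get("dl_meta", {}).get("doc_items", [])
    return sorted({prov["page_no"] for item in doc_items for prov in item.get("prov", [])})

for doc in doc_splits[:3]:
    print(get_page_numbers(doc))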
Related
- Document loader conceptual guide
- Document loader how-to guides