In this tutorial, we build a workflow for using Docling Parse (https://github.com/docling-project/docling-parse) to analyze PDF documents at a detailed structural level. We start by preparing a stable Python environment, handling common Colab dependency issues, and generating a custom multi-page PDF with text, columns, table-like content, vector shapes, and an embedded image. We then use Docling Parse to extract words, characters, and lines with page-level coordinates, render visual overlays, and save the results into structured JSON and CSV files. Through this workflow, we see how low-level PDF parsing can support document AI tasks such as layout analysis, reading-order reconstruction, table-aware processing, and retrieval-ready document preparation.
As of 2026, enterprise documents keep growing in complexity, and downstream systems increasingly need fine-grained extraction rather than flat text. Docling Parse addresses this by exposing text cells together with their page coordinates. This tutorial covers the foundational steps for integrating it into a document intelligence pipeline.
Setting Up the Docling Parse Environment and Dependencies
We begin by setting up the environment for Docling Parse. This step is crucial for ensuring compatibility and avoiding common pitfalls, especially when using Colab or similar platforms.
Prerequisites
- Python 3.8 or higher
- pip package manager
- Access to a virtual environment (recommended)
Step 1: Create a Virtual Environment (Optional but Recommended)
python -m venv docling_env
source docling_env/bin/activate # On Windows: docling_env\Scripts\activate
Step 2: Install Docling Parse
pip install docling-parse
Note: In Colab, pre-installed libraries can conflict with the versions Docling Parse requires. If the plain install fails, upgrade pip and, as a last resort, install the package without its dependencies so that pip leaves the pre-installed libraries untouched:
pip install --upgrade pip
pip install docling-parse --no-deps
Then install any missing dependencies manually, guided by the import errors you see (e.g., docling-core, pydantic, pillow).
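Before reinstalling anything, it can help to see which of the relevant packages are already present. The sketch below uses only the standard library. The package list is an assumption based on the tools this tutorial uses later, not an official dependency list.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed list: packages used later in this tutorial, not an official dependency list
PACKAGES = ["docling-parse", "docling-core", "reportlab", "pillow", "pymupdf", "pandas"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:15s} {ver or 'MISSING'}")
```

Anything reported as MISSING can then be installed individually with pip.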
Step 3: Verify Installation
from importlib.metadata import version
import docling_parse  # raises ImportError if the installation failed
print(version("docling-parse"))
This should print the installed version number.
Generating a Custom Multi-Page PDF for Testing
To test the parsing pipeline, we create a synthetic multi-page PDF containing diverse elements: text, multiple columns, table-like structures, vector shapes, and an embedded image. This helps validate Docling Parse's ability to handle complex layouts.
Step 1: Install PDF Generation Library
pip install reportlab
Step 2: Generate the PDF
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.lib.utils import ImageReader
# Create a PDF with two pages
c = canvas.Canvas("test_document.pdf", pagesize=letter)
# Page 1: Text, columns, and a table
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Docling Parse Test Document - Page 1")
c.drawString(50, 700, "This is a sample text with multiple styles.")
# Add a table-like structure
c.setStrokeColor("black")
c.rect(50, 600, 500, 100, fill=0)
c.drawString(60, 650, "Column 1")
c.drawString(200, 650, "Column 2")
c.drawString(350, 650, "Column 3")
# Add vector shapes
c.setFillColor("blue")
c.circle(400, 400, 50, fill=1, stroke=1)
# Page 2: Image and more text
c.showPage()
c.setFont("Helvetica", 10)
c.drawString(50, 750, "Page 2 - Embedded Image Example")
# Embed an image only if 'sample_image.png' exists, so the script still runs without it
import os
if os.path.exists("sample_image.png"):
    img = ImageReader("sample_image.png")
    c.drawImage(img, 50, 500, width=200, height=200)
c.save()
Tip: Replace "sample_image.png" with any test image. If you don't have one, skip the image addition and proceed with text and shapes.
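If no test image is at hand, one can be generated without extra dependencies. The following is a minimal sketch that writes an uncompressed-filter RGB gradient as a PNG using only the standard library. The helper names png_chunk and make_gradient_png are hypothetical and introduced here for illustration.

```python
import struct
import zlib


def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def make_gradient_png(width: int = 200, height: int = 200) -> bytes:
    """Return the bytes of an 8-bit RGB PNG with a red/green gradient."""
    rows = bytearray()
    for y in range(height):
        rows.append(0)  # filter type 0 (none) starts every scanline
        for x in range(width):
            red = x * 255 // max(width - 1, 1)
            green = y * 255 // max(height - 1, 1)
            rows += bytes((red, green, 128))
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then three zero flags
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", header)
        + png_chunk(b"IDAT", zlib.compress(bytes(rows)))
        + png_chunk(b"IEND", b"")
    )


if __name__ == "__main__":
    with open("sample_image.png", "wb") as f:
        f.write(make_gradient_png())
```

Running this once before the PDF generation script creates a 200 x 200 sample_image.png in the working directory.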
Parsing the PDF with Docling Parse
Now we use Docling Parse to extract low-level elements: words, characters, and lines, along with their page-level coordinates.
Step 1: Initialize the Parser
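from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser

# Class and method names follow the docling-parse README; check them against your installed version
parser = DoclingPdfParser()
doc = parser.load(path_or_stream="test_document.pdf")
Step 2: Extract Words and Coordinates
# iterate_pages() yields (page number, parsed page) pairs
for page_no, page in doc.iterate_pages():
    print(f"Processing Page {page_no}")
    for word in page.iterate_cells(unit_type=TextCellUnit.WORD):
        print(f"  Word: {word.text}, Rect: {word.rect}")
Each cell carries its text and a rect in page coordinates. Calling word.rect.to_bounding_box() gives an axis-aligned box whose coord_origin field states whether y is measured from the bottom-left of the page (the PDF convention) or from the top-left.
Step 3: Extract Characters and Lines
for page_no, page in doc.iterate_pages():
    # Characters
    for char in page.iterate_cells(unit_type=TextCellUnit.CHAR):
        print(f"  Char: {char.text!r}, Rect: {char.rect}")
    # Lines (text lines)
    for line in page.iterate_cells(unit_type=TextCellUnit.LINE):
        print(f"  Line: {line.text!r}, Rect: {line.rect}")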
Step 4: Render Visual Overlays (Optional)
To visually verify the extraction, we can render the page with PyMuPDF and draw the bounding boxes on it with Pillow (pip install pymupdf pillow).
from PIL import Image, ImageDraw
import fitz  # PyMuPDF, used to render the page as an image
from docling_core.types.doc.page import TextCellUnit

# Render page 1 at the default resolution, where one PDF point maps to one pixel
pdf_document = fitz.open("test_document.pdf")
pix = pdf_document[0].get_pixmap()
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
# Draw word boxes, converting to the image's top-left origin where needed
draw = ImageDraw.Draw(img)
_, first_page = next(iter(doc.iterate_pages()))
for word in first_page.iterate_cells(unit_type=TextCellUnit.WORD):
    box = word.rect.to_bounding_box().to_top_left_origin(page_height=pix.height)
    draw.rectangle([box.l, box.t, box.r, box.b], outline="red", width=2)
img.show()
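Note: PDF coordinates have their origin at the bottom-left of the page, while image coordinates start at the top-left, so y values must be flipped before drawing. If the page is rendered at a zoom factor other than 1, the coordinates must be scaled as well.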
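The coordinate transformation mentioned in the note can be written as a small pure-Python helper. This is a sketch under the assumption that boxes arrive as (x0, y0, x1, y1) tuples in PDF points with a bottom-left origin. The function name pdf_to_image_box is hypothetical.

```python
def pdf_to_image_box(bbox, page_height, zoom=1.0):
    """Convert (x0, y0, x1, y1) in PDF points, bottom-left origin,
    to (left, top, right, bottom) in pixels, top-left origin."""
    x0, y0, x1, y1 = bbox
    left, right = sorted((x0 * zoom, x1 * zoom))
    # Flipping the y-axis swaps which edge is on top, so sort after flipping
    top, bottom = sorted(((page_height - y0) * zoom, (page_height - y1) * zoom))
    return (left, top, right, bottom)


# A word near the top of a US Letter page (612 x 792 points)
print(pdf_to_image_box((50, 750, 120, 762), page_height=792))
# -> (50.0, 30.0, 120.0, 42.0)
print(pdf_to_image_box((50, 750, 120, 762), page_height=792, zoom=2.0))
# -> (100.0, 60.0, 240.0, 84.0)
```

When the page is rendered at a higher resolution, for example with page.get_pixmap(matrix=fitz.Matrix(2, 2)) in PyMuPDF, the same factor is passed as zoom.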
Saving Results to Structured Files
For downstream tasks, we save the extracted data into JSON and CSV formats.
Save to JSON
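import json
from docling_core.types.doc.page import TextCellUnit

def cells_as_dicts(page, unit):
    # as_tuple() yields plain floats (x0, y0, x1, y1), which JSON can serialize
    return [
        {"text": cell.text, "bbox": list(cell.rect.to_bounding_box().as_tuple())}
        for cell in page.iterate_cells(unit_type=unit)
    ]

output = {}
for page_no, page in doc.iterate_pages():
    output[f"page_{page_no}"] = {
        "words": cells_as_dicts(page, TextCellUnit.WORD),
        "lines": cells_as_dicts(page, TextCellUnit.LINE),
        "characters": cells_as_dicts(page, TextCellUnit.CHAR),
    }
with open("parsed_output.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2, ensure_ascii=False)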
Save to CSV
import pandas as pd
from docling_core.types.doc.page import TextCellUnit

rows = []
for page_no, page in doc.iterate_pages():
    for word in page.iterate_cells(unit_type=TextCellUnit.WORD):
        x0, y0, x1, y1 = word.rect.to_bounding_box().as_tuple()
        rows.append({
            "page": page_no,
            "type": "word",
            "text": word.text,
            "x0": x0,
            "y0": y0,
            "x1": x1,
            "y1": y1,
        })
df = pd.DataFrame(rows)
df.to_csv("parsed_output.csv", index=False)
These structured outputs are ready for use in layout analysis, machine learning pipelines, or retrieval-augmented generation (RAG) systems.
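As one illustration of downstream use, the word rows can be grouped into per-page text records suitable for indexing. The sketch below relies only on the standard library and runs on a small hypothetical sample that mirrors the column layout of parsed_output.csv. The function name page_chunks and the record shape are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the same column layout as parsed_output.csv
SAMPLE_CSV = """page,type,text,x0,y0,x1,y1
1,word,Docling,50,750,95,762
1,word,Parse,100,750,130,762
2,word,Page,50,750,75,760
2,word,2,80,750,86,760
"""


def page_chunks(csv_file):
    """Group word rows by page and return one text record per page."""
    words = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if row["type"] == "word":
            words[int(row["page"])].append(row)
    chunks = []
    for page in sorted(words):
        rows = words[page]
        chunks.append({
            "page": page,
            "text": " ".join(r["text"] for r in rows),  # keeps file order
            # Union of all word boxes on the page
            "bbox": (
                min(float(r["x0"]) for r in rows),
                min(float(r["y0"]) for r in rows),
                max(float(r["x1"]) for r in rows),
                max(float(r["y1"]) for r in rows),
            ),
        })
    return chunks


for chunk in page_chunks(io.StringIO(SAMPLE_CSV)):
    print(chunk)
```

To run it on the real output, replace the io.StringIO sample with open("parsed_output.csv", newline="", encoding="utf-8"). One record per page is a deliberately coarse choice; smaller chunks usually retrieve better.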
Applications and Next Steps
This pipeline demonstrates how Docling Parse enables layout-aware document intelligence. Key applications include:
- Layout Analysis: Identify and classify regions (text, tables, images) for document understanding.
- Reading-Order Reconstruction: Use word-level coordinates to infer the correct reading order, essential for accessibility and NLP.
- Table-Aware Processing: Use word bounding boxes to recover the rows and columns of table-like regions for precise data extraction.
- Retrieval-Ready Preparation: Convert PDFs into structured formats (JSON/CSV) that can be indexed for semantic search or LLM prompts.
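To make the reading-order item concrete, here is a minimal sketch of one simple heuristic: cluster words into lines by vertical position, then sort each line left to right. It assumes single-column text and boxes with a bottom-left origin. The function name reading_order and the line_tol parameter are hypothetical, and this is not how Docling Parse itself orders text.

```python
def reading_order(words, line_tol=3.0):
    """Order words top-to-bottom, then left-to-right.

    words: list of (text, x0, y0, x1, y1) with a bottom-left origin.
    A word joins the current line when its vertical center lies within
    line_tol points of the center of that line's first word.
    """
    # Higher y means closer to the top of the page in PDF coordinates
    by_height = sorted(words, key=lambda w: -(w[2] + w[4]) / 2)
    lines = []
    for word in by_height:
        center = (word[2] + word[4]) / 2
        if lines and abs(lines[-1]["center"] - center) <= line_tol:
            lines[-1]["words"].append(word)
        else:
            lines.append({"center": center, "words": [word]})
    return [
        " ".join(w[0] for w in sorted(line["words"], key=lambda w: w[1]))
        for line in lines
    ]


# Hypothetical word boxes, deliberately shuffled
sample = [
    ("Parse", 100, 750, 130, 762),
    ("Column", 60, 650, 100, 662),
    ("Docling", 50, 750, 95, 762),
    ("1", 105, 650, 111, 662),
]
print(reading_order(sample))
# -> ['Docling Parse', 'Column 1']
```

Multi-column pages need an extra step that splits words into columns by their x positions before this line clustering is applied to each column.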
Looking Ahead in 2026
As document AI evolves, tools like Docling Parse play a critical role in handling diverse document types (scanned PDFs, invoices, reports) with high accuracy. Future enhancements may include native integration with OCR engines, support for PDF/A compliance, and optimized performance for large-scale batch processing. By building on this foundation, developers can create advanced pipelines for intelligent document processing.
via MarkTechPost
