How to Build a Parsing Pipeline with Docling Parse for Layout-Aware Document Intelligence

In this tutorial, we build a workflow for using Docling Parse (https://github.com/docling-project/docling-parse) to analyze PDF documents at a detailed structural level. We start by preparing a stable Python environment, handling common Colab dependency issues, and generating a custom multi-page PDF with text, columns, table-like content, vector shapes, and an embedded image. We then use Docling Parse to extract words, characters, and lines with page-level coordinates, render visual overlays, and save the results into structured JSON and CSV files. Through this workflow, we see how low-level PDF parsing can support document AI tasks such as layout analysis, reading-order reconstruction, table-aware processing, and retrieval-ready document preparation.


Enterprise documents keep growing in complexity, and many downstream systems need fine-grained, coordinate-level extraction rather than plain text. Docling Parse provides the low-level building blocks for that kind of pipeline, and this tutorial covers the foundational steps for integrating it into your AI infrastructure.


Setting Up the Docling Parse Environment and Dependencies


We begin by setting up the environment for Docling Parse. This step is crucial for ensuring compatibility and avoiding common pitfalls, especially when using Colab or similar platforms.


Prerequisites

  • Python 3.9 or higher (recent releases may require a newer version; check the package's PyPI page)
  • pip package manager
  • Access to a virtual environment (recommended)

Step 1: Create a Virtual Environment (Optional but Recommended)


python -m venv docling_env
source docling_env/bin/activate  # On Windows: docling_env\Scripts\activate

Step 2: Install Docling Parse


pip install docling-parse

Note: In Colab, you may encounter dependency conflicts with pre-installed libraries. If so, try installing in a specific order or using the --no-deps flag for certain packages. For example:


pip install --upgrade pip
pip install docling-parse --no-deps

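Then install any missing dependencies manually. Running pip show docling-parse lists what the package itself requires; the later steps of this tutorial also use reportlab, PyMuPDF, Pillow, and pandas.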
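
Before installing anything by hand, it can help to see which modules are actually missing. The following is a minimal sketch that uses only the standard library; the list of module names is an assumption based on the packages this tutorial uses, and note that import names can differ from pip package names.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used in this tutorial (assumed list). "fitz" is provided by
# the pymupdf package and "PIL" by the pillow package.
CANDIDATES = ["docling_parse", "reportlab", "fitz", "PIL", "pandas"]

print("Missing modules:", missing_modules(CANDIDATES))
```

Anything reported as missing can then be installed individually with pip.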


Step 3: Verify Installation


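import docling_parse  # confirms that the package imports cleanly
from importlib.metadata import version

# Read the version from the installed package metadata
print(version("docling-parse"))

This should print the version number of the installed release.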


Generating a Custom Multi-Page PDF for Testing


To test the parsing pipeline, we create a synthetic multi-page PDF containing diverse elements: text, multiple columns, table-like structures, vector shapes, and an embedded image. This helps validate Docling Parse's ability to handle complex layouts.


Step 1: Install PDF Generation Library


pip install reportlab

Step 2: Generate the PDF


from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
from reportlab.lib.utils import ImageReader

# Create a PDF with two pages
c = canvas.Canvas("test_document.pdf", pagesize=letter)

# Page 1: Text, columns, and a table
c.setFont("Helvetica", 12)
c.drawString(50, 750, "Docling Parse Test Document - Page 1")
c.drawString(50, 700, "This is a sample text with multiple styles.")
# Add a table-like structure
c.setStrokeColor("black")
c.rect(50, 600, 500, 100, fill=0)
c.drawString(60, 650, "Column 1")
c.drawString(200, 650, "Column 2")
c.drawString(350, 650, "Column 3")
# Add vector shapes
c.setFillColor("blue")
c.circle(400, 400, 50, fill=1, stroke=1)

# Page 2: Image and more text
c.showPage()
c.setFont("Helvetica", 10)
c.drawString(50, 750, "Page 2 - Embedded Image Example")
# Embed an image (ensure 'sample_image.png' exists)
img = ImageReader("sample_image.png")
c.drawImage(img, 50, 500, width=200, height=200)

c.save()

Tip: Replace "sample_image.png" with any test image. If you don't have one, skip the image addition and proceed with text and shapes.
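
If no test image is at hand, one can be generated. The sketch below writes a small solid-color PNG using only the standard library; the function name write_solid_png and the chosen color are illustrative, not part of any library. ReportLab typically relies on Pillow to read PNG files, so Pillow should be installed for the embedding step.

```python
import struct
import zlib


def write_solid_png(path, width, height, rgb):
    """Write a solid-color 8-bit RGB PNG using only the standard library."""

    def chunk(tag, data):
        # A PNG chunk is: data length, tag, data, CRC32 over tag + data
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # IHDR: width, height, bit depth 8, color type 2 (RGB), then the
    # compression, filter, and interlace methods, all 0
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Every scanline starts with filter byte 0, followed by the pixel bytes
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)

    with open(path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n")  # PNG signature
        f.write(chunk(b"IHDR", ihdr))
        f.write(chunk(b"IDAT", idat))
        f.write(chunk(b"IEND", b""))


# Create the file expected by the PDF generation script
write_solid_png("sample_image.png", 200, 200, (30, 144, 255))
```

Run this before the PDF generation script so that sample_image.png exists in the working directory.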


Parsing the PDF with Docling Parse


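Now we use Docling Parse to extract low-level elements: words, characters, and lines, along with their page-level coordinates. The class, method, and attribute names in the snippets below (PdfParser, parse_pdf, page.words, bbox) may not match the release you have installed, because the Docling Parse API has changed between versions. Recent releases document a DoclingPdfParser class with a load() method and per-page cell iteration in the project README, so check it against your installed version before running the code.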


Step 1: Initialize the Parser


from docling_parse import PdfParser

parser = PdfParser()
doc = parser.parse_pdf("test_document.pdf")

Step 2: Extract Words and Coordinates


# Iterate over pages
for page_num, page in enumerate(doc.pages):
    print(f"Processing Page {page_num + 1}")
    for word in page.words:
        print(f"  Word: {word.text}, Bounding Box: {word.bbox}")

word.bbox returns a tuple (x0, y0, x1, y1) in PDF coordinates (origin at bottom-left).


Step 3: Extract Characters and Lines


for page_num, page in enumerate(doc.pages):
    # Characters
    for char in page.characters:
        print(f"  Char: '{char.text}', BBox: {char.bbox}")
    
    # Lines (text lines)
    for line in page.lines:
        print(f"  Line: '{line.text}', BBox: {line.bbox}")
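
The three levels nest: characters make up words, and words make up lines. One sanity check that follows from this is that a word's box should roughly equal the union of its character boxes. The sketch below shows that union for plain (x0, y0, x1, y1) tuples; the helper name union_bbox and the sample coordinates are made up for illustration.

```python
def union_bbox(boxes):
    """Return the smallest (x0, y0, x1, y1) box that covers all input boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Hypothetical character boxes for a three-letter word
char_boxes = [(50, 750, 57, 762), (57, 750, 63, 762), (63, 750, 69, 762)]
print(union_bbox(char_boxes))
```

The same helper can merge word boxes into a line box.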

Step 4: Render Visual Overlays (Optional)


To visually verify the extraction, we can render a page to an image with PyMuPDF and draw the bounding boxes on it with Pillow (pip install pymupdf pillow).


from PIL import Image, ImageDraw
import fitz  # PyMuPDF, used to render the page as an image

# Render the first page at the default 72 dpi, so 1 PDF point = 1 pixel
pdf_document = fitz.open("test_document.pdf")
page = pdf_document[0]
pix = page.get_pixmap()
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

# Draw bounding boxes, flipping y because PDF coordinates start at the
# bottom-left while image coordinates start at the top-left
draw = ImageDraw.Draw(img)
page_height = pix.height
for word in doc.pages[0].words:
    x0, y0, x1, y1 = word.bbox
    draw.rectangle([x0, page_height - y1, x1, page_height - y0], outline="red", width=2)

img.show()

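Note: PDF coordinates place the origin at the bottom-left of the page, whereas image coordinates place it at the top-left, so y values must be flipped before drawing. If the page is rendered at a zoom other than the default 72 dpi, the boxes must also be scaled by the same factor.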
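
The coordinate transformation can be sketched as a small helper. It assumes boxes are plain (x0, y0, x1, y1) tuples in PDF points with a bottom-left origin; the function name pdf_bbox_to_image is hypothetical.

```python
def pdf_bbox_to_image(bbox, page_height, scale=1.0):
    """Map a bottom-left-origin PDF box to a top-left-origin pixel box."""
    x0, y0, x1, y1 = bbox
    # Scale points to pixels; sorting keeps left <= right
    left, right = sorted((x0 * scale, x1 * scale))
    # Flip y around the page height, then scale; sorting keeps top <= bottom
    top, bottom = sorted(((page_height - y0) * scale, (page_height - y1) * scale))
    return (left, top, right, bottom)


# A word near the top of a US Letter page, which is 792 points tall
print(pdf_bbox_to_image((50, 750, 120, 762), page_height=792))
```

If the page is rendered with a zoom, for example page.get_pixmap(matrix=fitz.Matrix(2, 2)) in PyMuPDF, pass scale=2 so the boxes line up with the larger image.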


Saving Results to Structured Files


For downstream tasks, we save the extracted data into JSON and CSV formats.


Save to JSON


import json

output = {}
for page_num, page in enumerate(doc.pages):
    page_data = {
        "words": [{"text": w.text, "bbox": w.bbox} for w in page.words],
        "lines": [{"text": l.text, "bbox": l.bbox} for l in page.lines],
        "characters": [{"text": c.text, "bbox": c.bbox} for c in page.characters]
    }
    output[f"page_{page_num + 1}"] = page_data

with open("parsed_output.json", "w") as f:
    json.dump(output, f, indent=2)
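
A quick round-trip check can confirm that the saved structure is what downstream code expects. The sketch below uses a made-up one-word sample in place of real parser output; the helper name summarize_parsed is illustrative. It also shows that JSON has no tuple type, so bounding boxes come back as lists.

```python
import json


def summarize_parsed(output):
    """Count the items of each kind (words, lines, characters) per page."""
    return {
        page_key: {kind: len(items) for kind, items in page_data.items()}
        for page_key, page_data in output.items()
    }


# Hypothetical sample with the same shape as the output dict
sample = {
    "page_1": {
        "words": [{"text": "Docling", "bbox": (50, 750, 95, 762)}],
        "lines": [],
        "characters": [],
    }
}

round_tripped = json.loads(json.dumps(sample))
print(summarize_parsed(round_tripped))
print(round_tripped["page_1"]["words"][0]["bbox"])  # a list, not a tuple
```

If the parser returns bounding boxes as objects rather than tuples, they need to be converted to plain numbers before json.dump is called.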

Save to CSV


import pandas as pd

rows = []
for page_num, page in enumerate(doc.pages):
    for word in page.words:
        rows.append({
            "page": page_num + 1,
            "type": "word",
            "text": word.text,
            "x0": word.bbox[0],
            "y0": word.bbox[1],
            "x1": word.bbox[2],
            "y1": word.bbox[3]
        })

df = pd.DataFrame(rows)
df.to_csv("parsed_output.csv", index=False)

These structured outputs are ready for use in layout analysis, machine learning pipelines, or retrieval-augmented generation (RAG) systems.
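
As one example of retrieval-ready preparation, the word rows in the CSV can be joined back into one text string per page, which can then be chunked or embedded. This is a minimal sketch using the standard csv module and the column names from the CSV step; the sample rows are invented, and the function name page_texts_from_csv is hypothetical.

```python
import csv
import io
from collections import defaultdict


def page_texts_from_csv(csv_file):
    """Join word rows into one text string per page, in file order."""
    pages = defaultdict(list)
    for row in csv.DictReader(csv_file):
        pages[int(row["page"])].append(row["text"])
    return {page: " ".join(words) for page, words in sorted(pages.items())}


# Invented rows in the same layout as parsed_output.csv
sample_csv = io.StringIO(
    "page,type,text,x0,y0,x1,y1\n"
    "1,word,Docling,50,750,95,762\n"
    "1,word,Parse,100,750,135,762\n"
    "2,word,Page,50,750,80,760\n"
)
print(page_texts_from_csv(sample_csv))
```

For the real file, pass an open file handle instead, for example open("parsed_output.csv", newline="").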


Applications and Next Steps


This pipeline demonstrates how Docling Parse enables layout-aware document intelligence. Key applications include:

  • Layout Analysis: Identify and classify regions (text, tables, images) for document understanding.
  • Reading-Order Reconstruction: Use word-level coordinates to infer the correct reading order, essential for accessibility and NLP.
  • Table-Aware Processing: Extract table structures with bounding boxes for precise data extraction.
  • Retrieval-Ready Preparation: Convert PDFs into structured formats (JSON/CSV) that can be indexed for semantic search or LLM prompts.
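
Reading-order reconstruction, for instance, can start from a simple heuristic: group words whose baselines are close into lines, order the lines top to bottom, and order the words within each line left to right. The sketch below assumes (text, bbox) pairs with bottom-left-origin boxes and a single-column layout; the function name and the tolerance value are illustrative, and multi-column pages would need column detection first.

```python
def reading_order(words, line_tolerance=3.0):
    """Order (text, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right."""
    # In bottom-left-origin coordinates, a larger y0 is closer to the page top
    ordered = sorted(words, key=lambda w: -w[1][1])
    lines = []
    for word in ordered:
        # Stay on the current line while y0 is within the tolerance of the
        # line's first word; otherwise start a new line
        if lines and abs(lines[-1][0][1][1] - word[1][1]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, sort words by their left edge
    return [w[0] for line in lines for w in sorted(line, key=lambda w: w[1][0])]


# Invented words, deliberately out of order
words = [
    ("Column 2", (200, 650, 250, 662)),
    ("Title", (50, 750, 90, 762)),
    ("Column 1", (60, 650.5, 110, 662.5)),
]
print(reading_order(words))
```

The tolerance absorbs small baseline differences between words on the same visual line; it would need tuning for documents with tight line spacing.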

Looking Ahead in 2026


As document AI evolves, low-level parsers like Docling Parse remain a useful foundation for born-digital documents such as invoices and reports. Scanned PDFs without a text layer still need an OCR step before coordinate-level text extraction is possible. Future enhancements may include native integration with OCR engines, support for PDF/A compliance, and optimized performance for large-scale batch processing. By building on this foundation, developers can create more advanced pipelines for intelligent document processing.

via MarkTechPost
