Agent Skill · Databricks

databricks-unstructured-pdf-generation

Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets.

View SKILL.md on GitHub → Source repository Provider profile

Provider: Databricks Path in repo: experimental/databricks-unstructured-pdf-generation/SKILL.md

Skill body

PDF Generation from HTML

Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.

Workflow

Write HTML files to ./raw_data/html/ (write multiple files in parallel for speed)
Convert HTML → PDF using <SKILL_ROOT>/scripts/pdf_generator.py (parallel conversion)
Upload PDFs to Unity Catalog volume using databricks fs cp
Generate doc_questions.json with test questions for each document

Path convention: <SKILL_ROOT> below = the directory containing this SKILL.md. Resolve to the absolute install path (e.g. ~/.claude/skills/databricks-unstructured-pdf-generation). ./raw_data/... paths are relative to your own project cwd.

Dependencies

uv pip install plutoprint

Step 1: Write HTML Files

mkdir -p ./raw_data/html

Write HTML documents to ./raw_data/html/filename.html. Use subdirectories to organize (structure is preserved).

Step 2: Convert to PDF

# Convert entire folder (parallel, 4 workers)
python <SKILL_ROOT>/scripts/pdf_generator.py convert --input ./raw_data/html --output ./raw_data/pdf

Skips files where PDF exists and is newer than HTML. Use --force to reconvert all.

Step 3: Upload to Volume

databricks fs requires the dbfs: scheme prefix even for UC Volume paths. -r copies the contents of the source directory into the target (the source directory name is not preserved), so files land directly under raw_data/.

databricks fs cp -r --overwrite ./raw_data/pdf dbfs:/Volumes/my_catalog/my_schema/raw_data

Step 4: Generate Test Questions

Create ./raw_data/pdf/pdf_eval_questions.json with questions for Knowledge Assistant evaluation or MAS:

{
  "api_errors_guide.pdf": {
    "question": "What is the solution for error ERR-4521?",
    "expected_fact": "Call /api/v2/auth/refresh with refresh_token before the 3600s TTL expires"
  },
  "installation_manual.pdf": {
    "question": "What port does the service use by default?",
    "expected_fact": "Port 8443 for HTTPS, configurable via CONFIG_PORT environment variable"
  }
}

This JSON can be used to build KA test cases and validate retrieval accuracy.

Document Content Guidelines

When generating documents for Knowledge Assistant testing or demos:

Multi-page documents: Each PDF should be several pages with substantial content
Specific error codes and solutions: Include product-specific error codes, causes, and resolution steps
Technical details: API endpoints, configuration parameters, version numbers, specific commands
Simple CSS: Keep styling minimal for fast HTML creation and reliable PDF conversion
Queryable facts: Include details a KA must read the document to answer (not general knowledge)

Good document types:

Product user manuals with troubleshooting sections
API error reference guides (error codes, causes, solutions)
Installation/configuration guides with specific steps
Technical specifications with version-specific details

Example content: Instead of generic “Connection failed” errors, write:

“Error ERR-4521: OAuth token expired. Cause: Token TTL exceeded 3600s default. Solution: Call /api/v2/auth/refresh with your refresh_token before expiration. See Section 4.2 for token lifecycle management.”

CLI Reference

python <SKILL_ROOT>/scripts/pdf_generator.py convert [OPTIONS]

  --input, -i     Input HTML file or folder (required)
  --output, -o    Output folder for PDFs (required)
  --force, -f     Force reconvert (ignore timestamps)
  --workers, -w   Parallel workers (default: 4)

Folder Structure

Subfolder structure is preserved:

./raw_data/html/                    ./raw_data/pdf/
├── report.html             →       ├── report.pdf
├── quarterly/                      ├── quarterly/
│   └── q1.html             →       │   └── q1.pdf
└── legal/                          └── legal/
    └── terms.html          →           └── terms.pdf

Troubleshooting

Issue	Solution
“plutoprint not installed”	`uv pip install plutoprint`
PDF looks wrong	Check HTML/CSS syntax
“Volume does not exist”	`databricks volumes create CATALOG SCHEMA VOLUME_NAME MANAGED` (four separate positional args, not `catalog.schema.volume`)