
pdf

Extract text, tables, and metadata from PDFs; merge and annotate them.

What pdf Does

The PDF skill works with PDF documents programmatically. It extracts text and structured data, retrieves metadata, merges multiple documents, and adds annotations, all without manual file handling. It suits teams that process large volumes of documents, automate data extraction, or manipulate PDF files as part of an AI agent pipeline.

The skill is aimed at product designers and power users who build workflows on Claude AI agents, such as parsing invoices, consolidating reports, or annotating contracts. It is registered as a tool in Claude’s agent framework, so extracted content can feed directly into later automation steps.

How to Install

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • Access to Claude API credentials

Installation Steps

  1. Clone or download the skills repository

    git clone https://github.com/anthropics/skills.git
    cd skills/skills/pdf
    
  2. Install required dependencies

    pip install pypdf pdfplumber python-dotenv
    
  3. Configure your Claude API key

    • Create a .env file in your project directory
    • Add your API key: ANTHROPIC_API_KEY=your_api_key_here
    • Never commit this file to version control
  4. Import the skill into your Claude agent

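    import os
    from skills.pdf import PDFSkill
    # Assumes ANTHROPIC_API_KEY is already in the environment (for example, loaded from .env with python-dotenv)
    pdf_tool = PDFSkill(api_key=os.getenv('ANTHROPIC_API_KEY'))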
    
  5. Verify installation

    # Test basic functionality
    text = pdf_tool.extract_text('sample.pdf')
    print(text[:100])  # Print first 100 characters
    
  6. Add to your agent configuration

    • Register the PDF skill in your Claude agent’s tool manifest
    • Test extraction with a sample PDF file
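
Before step 5, it can help to confirm that the packages from step 2 are importable. The following is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_packages` are illustrative names, not part of the skill.

```python
import importlib.util

# Maps each pip package name from step 2 to the module name it installs.
# Note that python-dotenv is imported as "dotenv".
REQUIRED_MODULES = {
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "python-dotenv": "dotenv",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All PDF dependencies are importable.")
```

An empty list means every dependency was found; otherwise, rerun `pip install` for the names reported.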

Use Cases

  • Invoice and Receipt Processing: Automatically extract line items, amounts, dates, and vendor information from hundreds of invoices to feed into accounting systems or expense management platforms.
  • Legal Document Review: Parse contracts and agreements to identify key clauses, dates, and obligations, enabling faster contract analysis and compliance checking across large document sets.
  • Report Consolidation: Merge quarterly reports, research documents, or project summaries into unified PDFs while maintaining formatting, then extract key metrics for executive dashboards.
  • Form Data Extraction: Pull structured data from filled PDF forms (tax returns, applications, surveys) and transform it into CSV or JSON for database import without manual data entry.
  • Document Annotation Workflows: Add comments, highlighting, and metadata tags to PDFs as part of review processes, enabling collaborative document workflows with audit trails.

How It Works

The PDF skill relies on two primary libraries, pypdf and pdfplumber, which handle different aspects of PDF processing. pypdf covers document-level operations such as merging, splitting, and metadata manipulation, while pdfplumber specializes in precise text and table extraction based on PDF geometry and layout. When you invoke text extraction, the skill analyzes the PDF’s internal structure to determine whether content exists as selectable text or as embedded images. For text-based PDFs, it preserves layout information, including spacing and column structure; for image-heavy or scanned PDFs, it can use OCR through optional dependencies.
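
The text-versus-image check described above can be approximated with a simple heuristic: a page that yields almost no extractable text is probably a scan. The sketch below assumes this heuristic; it is not the skill's actual detection logic, and `classify_pages`, `extract_page_texts`, and the `min_chars` threshold are illustrative names and values.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'text' or 'image' based on its extracted text.

    A page with fewer than `min_chars` non-whitespace characters is
    treated as image-only, which makes it a candidate for OCR.
    """
    labels = []
    for text in page_texts:
        visible = "".join((text or "").split())  # strip all whitespace
        labels.append("text" if len(visible) >= min_chars else "image")
    return labels


def extract_page_texts(pdf_path):
    """Extract text page by page with pdfplumber."""
    # Imported lazily so classify_pages() works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None or "" for a page with no text layer.
        return [page.extract_text() or "" for page in pdf.pages]
```

Calling `classify_pages(extract_page_texts("sample.pdf"))` gives one label per page. Only the pages labeled `"image"` would need the slower OCR path.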

For table extraction, the skill uses pdfplumber’s table detection algorithms to identify grid structures, parse cells, and reconstruct tabular data as JSON or CSV. This approach maintains the relationships between headers and values that plain text extraction would lose. Metadata extraction retrieves document properties such as author, creation date, title, and custom fields embedded in the PDF’s information dictionary, which matters for document management and compliance workflows.
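
To illustrate how header-to-value relationships survive extraction, the sketch below converts the row lists that pdfplumber's `extract_tables()` returns into records keyed by header, then into JSON or CSV. It assumes the first row of each table holds the headers; all function names are illustrative and not part of the skill.

```python
import csv
import io
import json


def table_to_records(table):
    """Convert a table (list of rows, first row = headers) to a list of dicts."""
    if not table:
        return []
    # Empty or missing header cells get a positional placeholder name.
    headers = [(h or "").strip() or f"column_{i}" for i, h in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(c or "").strip() for c in row]  # None marks an empty cell
        records.append(dict(zip(headers, cells)))
    return records


def records_to_json(records):
    """Serialize records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def records_to_csv(records):
    """Serialize records as CSV text, with a header line."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def extract_table_records(pdf_path):
    """Return one list of records per table found in the PDF."""
    # Imported lazily so the converters above work without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            table_to_records(table)
            for page in pdf.pages
            for table in page.extract_tables()
        ]
```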

For merging and annotation operations, the skill constructs new PDF objects that reference the original pages while applying transformations. Annotations are stored as PDF markup objects, so downstream applications can still read them. Operations can be chained: extract metadata to determine file importance, extract tables for processing, then merge the results into an annotated output document. Because each operation is a separate step, a Claude agent can run multi-step workflows in which each extraction feeds into AI analysis or data transformation.
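
As a rough illustration of the merge step, the sketch below uses pypdf's `PdfWriter.append()` directly and adds one top-level bookmark per source file. It assumes a recent pypdf release and does not show how the skill's own `merge_pdfs()` is implemented; `outline_title` and `merge_with_outline` are illustrative names.

```python
from pathlib import Path


def outline_title(pdf_path):
    """Bookmark title for a source file: its name without the extension."""
    return Path(pdf_path).stem


def merge_with_outline(pdf_paths, output_path):
    """Merge PDFs in order, adding one top-level bookmark per source file."""
    # Imported lazily so outline_title() works without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for path in pdf_paths:
        # append() copies every page of the source; outline_item adds a
        # bookmark that points at the first page appended from that file.
        writer.append(str(path), outline_item=outline_title(path))
    with open(output_path, "wb") as handle:
        writer.write(handle)
    writer.close()
```

For example, `merge_with_outline(["q1.pdf", "q2.pdf"], "merged.pdf")` would produce one file with bookmarks named `q1` and `q2`.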

Pros and Cons

Pros:

  • Seamless integration with Claude agent framework for end-to-end automation
  • Handles both text-based and scanned (image) PDFs with optional OCR
  • Accurate table detection preserves data structure for complex layouts
  • Lightweight Python implementation with minimal dependencies
  • Open-source with community support and active maintenance
  • No cloud dependency for document processing: PDFs are handled locally, so files stay on your machine
  • Supports batch operations for processing large document volumes efficiently

Cons:

  • OCR accuracy depends on image quality and requires additional dependencies
  • Performance degrades significantly with very large PDFs (500+ MB)
  • Bookmark hierarchies may flatten when merging complex multi-level structures
  • Limited form field extraction compared to commercial PDF APIs
  • Metadata preservation during transformations may lose some custom properties
  • No built-in support for extracting data from dynamic form widgets or XFA forms
  • Requires manual setup compared to drag-and-drop commercial tools

Related Skills

  • Document Parsing: General-purpose document processing for various formats beyond PDFs
  • CSV/Excel Handler: Export extracted PDF tables to spreadsheets or import tabular data
  • Image Recognition: Complement OCR capabilities for complex document layouts
  • File Management: Organize, version, and move processed PDF files
  • Data Transformation: Convert extracted PDF data into different formats for downstream systems

Alternatives

  • Adobe PDF Services API: Cloud-based PDF processing with advanced features like PDF generation and form data extraction. More expensive and cloud-dependent, but handles complex commercial workflows and offers guaranteed uptime.
  • Apache PDFBox: Open-source Java library offering similar extraction and manipulation capabilities. Better for Java-based systems but requires JVM overhead compared to Python solutions.
  • IronPDF / SelectPdf: Commercial solutions with robust table detection and image-to-PDF conversion. Offer superior support and specialized features but at higher cost and vendor lock-in risk.

Glossary

Key terms

Text Extraction
The process of retrieving readable text content from a PDF file. In text-based PDFs, this accesses the embedded text stream; in image-based PDFs, it may require OCR. The result is continuous text that may or may not preserve original formatting.
Table Detection
The algorithmic identification of grid-based data structures within PDFs. The skill analyzes cell boundaries, borders, and alignment to reconstruct tabular data with preserved row-column relationships, outputting structured formats like JSON or CSV.
Metadata
Embedded information about a PDF file itself, including document properties like author, title, creation date, modification date, subject, keywords, and custom fields. Stored in the PDF's information dictionary, separate from document content.
Annotation
User-added markup on a PDF page, such as highlights, comments, strikethrough text, or underlines. Annotations are stored as separate objects within the PDF and remain editable in most PDF readers.
OCR (Optical Character Recognition)
Technology that converts images of text (in scanned PDFs or photos) into machine-readable text. Used when PDFs are image-based rather than containing embedded text streams, enabling searchability and extraction from scanned documents.

FAQ

Frequently Asked Questions

How do I extract text from a scanned PDF or image-based PDF?

The PDF skill can run OCR on image-based PDFs through Tesseract, using the pytesseract wrapper. Install the wrapper (`pip install pytesseract`), make sure the Tesseract binary is installed on your system, and set `use_ocr=True` when calling `extract_text()`. Each page is rendered as an image and converted to searchable text; accuracy depends on image quality and language.

What's the difference between text extraction and table extraction?

Text extraction returns all content as a continuous string, preserving line breaks but losing table structure. Table extraction specifically identifies grid-based data and returns it as structured JSON objects with rows and columns, maintaining cell relationships. Use table extraction when you need to preserve data relationships; use text extraction for content analysis and language processing.

Can I use this skill to extract data from password-protected PDFs?

Yes, if you have the password. Pass the password to the extraction function: `pdf_tool.extract_text('file.pdf', password='your_password')`. The skill will decrypt the PDF before processing. Note that some PDFs carry only a permissions password, which restricts actions such as printing but still allows reading; these do not require a password for text extraction.

How do I merge multiple PDFs while preserving bookmarks and annotations?

Use the `merge_pdfs()` function with the `preserve_metadata=True` flag: `pdf_tool.merge_pdfs(file_list, output_path='merged.pdf', preserve_metadata=True)`. This maintains bookmarks from source documents and preserves any existing annotations, though note that bookmark hierarchies may flatten depending on the source structure.

What file size limits exist for PDF processing?

The skill can handle PDFs up to several hundred MB, though processing time increases significantly with size. For files over 500 MB, consider splitting them first using `split_pdf()`. Memory usage scales with document complexity—scanned PDFs with high-resolution images consume more memory than text-based PDFs of the same page count.
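
The sketch below shows one way to split a large file into fixed-size page ranges with pypdf. It is not the implementation of `split_pdf()`; `page_chunks`, `split_into_chunks`, and the output naming scheme are assumptions made for this example.

```python
def page_chunks(total_pages, chunk_size):
    """Return zero-based (start, stop) page ranges; stop is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_into_chunks(pdf_path, chunk_size, output_prefix):
    """Write consecutive page ranges to numbered files and return their paths."""
    # Imported lazily so page_chunks() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    outputs = []
    ranges = page_chunks(len(reader.pages), chunk_size)
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        out_path = f"{output_prefix}_{index:03d}.pdf"
        with open(out_path, "wb") as handle:
            writer.write(handle)
        outputs.append(out_path)
    return outputs
```

Each output file can then be processed on its own, which limits how much of the document is in memory during extraction.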

How do I add annotations programmatically to a PDF?

Use the `annotate_pdf()` function to add comments, highlights, or markup: `pdf_tool.annotate_pdf('input.pdf', annotations=[{'page': 0, 'type': 'highlight', 'coordinates': [x1, y1, x2, y2]}])`. Annotations are stored as PDF objects that remain visible in any PDF reader. You can add text comments, highlights, strikethrough, and underline markups.
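
For reference, a highlight can also be added directly with pypdf's annotation classes. The sketch below assumes a pypdf release that ships the `pypdf.annotations` module; it is not how `annotate_pdf()` is implemented. `rect_to_quad_points` and `add_highlight` are illustrative names, and the corner ordering of the quad points is an assumption that some viewers may interpret differently.

```python
def rect_to_quad_points(x1, y1, x2, y2):
    """Expand a rectangle into the eight quad-point values for one highlight.

    Coordinates are in PDF points, with the origin at the bottom-left of
    the page. The corner ordering used here is an assumption.
    """
    return [x1, y1, x2, y1, x1, y2, x2, y2]


def add_highlight(input_path, output_path, page_number, rect):
    """Copy a PDF and add one highlight annotation to the given page."""
    # Imported lazily so rect_to_quad_points() works without pypdf installed.
    from pypdf import PdfReader, PdfWriter
    from pypdf.annotations import Highlight
    from pypdf.generic import ArrayObject, FloatObject

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    quad_points = ArrayObject(
        [FloatObject(value) for value in rect_to_quad_points(*rect)]
    )
    annotation = Highlight(rect=rect, quad_points=quad_points)
    writer.add_annotation(page_number=page_number, annotation=annotation)

    with open(output_path, "wb") as handle:
        writer.write(handle)
```

A call such as `add_highlight("input.pdf", "annotated.pdf", 0, (50, 550, 200, 650))` would highlight a region near the top-left of the first page of a US Letter document.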

Can the skill detect and extract metadata like author and creation date?

Yes. Call `get_metadata(pdf_file)` to retrieve all document properties including author, creation date, modification date, subject, keywords, and custom fields. This returns a dictionary of all available metadata. Note that metadata must be explicitly embedded in the PDF—if not present, those fields will be empty.
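
The sketch below shows how such a dictionary might be built from pypdf's `PdfReader.metadata` object, skipping fields that are absent. It is an approximation, not the body of `get_metadata()`, and it covers only the standard fields; `FIELDS`, `metadata_to_dict`, and `read_metadata` are illustrative names.

```python
from datetime import datetime
from types import SimpleNamespace

# Standard document-information fields exposed as attributes by pypdf.
FIELDS = (
    "title", "author", "subject", "creator", "producer",
    "creation_date", "modification_date",
)


def metadata_to_dict(meta):
    """Collect the fields that are present into a JSON-friendly dict."""
    if meta is None:  # the PDF has no information dictionary at all
        return {}
    result = {}
    for name in FIELDS:
        value = getattr(meta, name, None)
        if value is None:
            continue
        # Dates become ISO 8601 strings; everything else becomes str.
        result[name] = value.isoformat() if hasattr(value, "isoformat") else str(value)
    return result


def read_metadata(pdf_path):
    """Read metadata from a PDF file with pypdf."""
    # Imported lazily so metadata_to_dict() works without pypdf installed.
    from pypdf import PdfReader

    return metadata_to_dict(PdfReader(pdf_path).metadata)


# Stand-in object for demonstration; a real call would use read_metadata().
example = SimpleNamespace(
    title="Q1 Report", author=None, creation_date=datetime(2024, 1, 15, 9, 30)
)
print(metadata_to_dict(example))
```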

How does the skill handle PDFs with different encodings or special characters?

The skill automatically detects text encoding and handles Unicode characters, supporting PDFs with international text, mathematical symbols, and special characters. However, some legacy PDFs with non-standard encodings may require the `encoding='latin-1'` parameter. Test extraction on a sample page if you encounter character issues.
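
When extracted text still contains ligature glyphs, non-breaking spaces, or stray control characters, a post-processing pass with the standard library often helps. This is a suggested cleanup step under that assumption, not something the skill is documented to do; `normalize_extracted_text` is an illustrative name.

```python
import unicodedata


def normalize_extracted_text(text):
    """Apply NFKC normalization and drop control characters.

    NFKC expands compatibility characters such as the "fi" ligature
    (U+FB01) and turns non-breaking spaces into plain spaces. Newlines
    and tabs are kept; other control characters are removed.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
```

Note that NFKC also rewrites characters such as superscript digits, so it may not suit documents where those distinctions matter.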
