PDF Parser

The PDF parser tool extracts text and tables from PDF files accessible via URL. It returns clean, structured content that agents can analyse, summarise, or extract data from.

Tool: parse_pdf

Downloads a PDF from a URL and extracts its content as plain text and Markdown tables.

Arguments

| Argument | Type | Description |
| --- | --- | --- |
| url | string | URL of the PDF to parse |
| pages | string | Optional. Page range to extract (e.g. "1-5", "3", "10-20") |
| extract_tables | boolean | Convert tables to Markdown format (default: true) |
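
As a rough illustration of how these arguments fit together, the sketch below assembles an argument payload and checks the pages string before it is sent. The helper name build_parse_pdf_args is hypothetical, and the rules that pages are 1-based and that a range must be ascending are assumptions, not documented behaviour. How the payload is actually passed to parse_pdf depends on the agent runtime and is not shown.

```python
import re
from typing import Optional

# Accepts a single page ("3") or a range ("1-5"), matching the documented examples.
_PAGES_RE = re.compile(r"^(\d+)(?:-(\d+))?$")


def build_parse_pdf_args(url: str, pages: Optional[str] = None,
                         extract_tables: bool = True) -> dict:
    """Hypothetical helper: assemble the arguments for a parse_pdf call."""
    args = {"url": url, "extract_tables": extract_tables}
    if pages is not None:
        spec = pages.strip()
        match = _PAGES_RE.match(spec)
        if match is None:
            raise ValueError(f"unsupported page range: {pages!r}")
        start = int(match.group(1))
        end = int(match.group(2) or start)
        # Assumption: pages are 1-based and ranges run from low to high.
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {pages!r}")
        args["pages"] = spec
    return args


print(build_parse_pdf_args("https://example.com/reports/q3-2024.pdf", pages="1-5"))
```

Validating the range up front is a design choice for this sketch: it surfaces a malformed pages value before any download is attempted.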

Output

Returns the extracted text with:

  • Paragraphs: plain text, with page breaks noted
  • Tables: converted to Markdown table format
  • Page markers: --- Page N --- separators between pages
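
If the output needs to be processed page by page, the page markers can serve as split points. The following is a minimal sketch, assuming each marker appears on its own line exactly as --- Page N ---. The function name split_pages and the sample text are illustrative only.

```python
import re
from typing import Dict

# Assumption: each marker sits on its own line, exactly as "--- Page N ---".
PAGE_MARKER = re.compile(r"^--- Page (\d+) ---[ \t]*$", re.MULTILINE)


def split_pages(output: str) -> Dict[int, str]:
    """Map each page number to the text that follows its marker."""
    parts = PAGE_MARKER.split(output)
    # parts = [text before first marker, "1", page 1 text, "2", page 2 text, ...]
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


sample = (
    "--- Page 1 ---\n"
    "Revenue grew.\n"
    "--- Page 2 ---\n"
    "| Quarter | Revenue |\n"
    "| --- | --- |\n"
    "| Q3 | 10 |\n"
)
for number, body in split_pages(sample).items():
    print(number, repr(body))
```

Any text that appears before the first marker is discarded in this sketch.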

Use cases

Parse the Q3 earnings report at https://example.com/reports/q3-2024.pdf and extract all financial figures.
Read pages 5-15 of the technical specification at [URL] and summarise the API changes.
Extract all tables from the data sheet at [URL] and identify the column headers.
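
For the last use case, column headers can also be pulled out of the returned text programmatically. The sketch below assumes tables arrive as pipe-delimited Markdown with a dashed separator row under the header. The name table_headers and the sample data are hypothetical, and cells containing escaped pipes are not handled.

```python
import re
from typing import List

# A Markdown separator row such as "| --- | :---: |".
SEPARATOR = re.compile(r"^\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?$")


def table_headers(output: str) -> List[List[str]]:
    """Return the header cells of every Markdown table found in the text."""
    lines = [line.strip() for line in output.splitlines()]
    headers = []
    # A header row is any pipe-containing line directly above a separator row.
    for header, separator in zip(lines, lines[1:]):
        if "|" in header and SEPARATOR.match(separator):
            cells = header.strip("|").split("|")
            headers.append([cell.strip() for cell in cells])
    return headers


sample = (
    "--- Page 1 ---\n"
    "Electrical characteristics\n"
    "| Part | Voltage (V) |\n"
    "| --- | --- |\n"
    "| A1 | 3.3 |\n"
)
print(table_headers(sample))
```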

Local files

To parse a local PDF, first save it to the vault, then reference the vault path:

Read the PDF at vault/documents/contract.pdf and summarise the payment terms.

The agent uses vault_read to retrieve the file, then runs parse_pdf on the local path.

Limitations

  • Very large PDFs (100+ pages) may exceed context limits — use the pages parameter to extract specific sections
  • Scanned PDFs without an OCR text layer return minimal content
  • Password-protected PDFs cannot be parsed
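
One way to work around the size limit is to request a large PDF in several smaller page ranges. The sketch below generates the pages strings for such a sequence. It assumes the total page count is already known, and the chunk size of 20 is an arbitrary illustrative default. The function name page_ranges is hypothetical.

```python
from typing import List


def page_ranges(total_pages: int, chunk_size: int = 20) -> List[str]:
    """Split pages 1..total_pages into consecutive "start-end" strings."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single number, e.g. "41".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(page_ranges(45))  # ['1-20', '21-40', '41-45']
```

Each string can then be passed as the pages argument of a separate parse_pdf call, so that no single response has to hold the whole document.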