madam.pdf module

PDF processor using pypdf (metadata) and pdf2image (rasterization).

The optional pdf dependency group must be installed:

uv sync --extra pdf
class madam.pdf.PDFProcessor(config: Mapping[str, Any] | None = None)

Bases: Processor

Represents a processor that handles Portable Document Format (PDF) files.

Reading requires pypdf. Rasterization additionally requires pdf2image and a system-wide installation of poppler.

__init__(config: Mapping[str, Any] | None = None) -> None

Initializes a new PDFProcessor.

Parameters:

config – Mapping with settings.

can_read(file: IO) -> bool

Returns whether the data format of the specified file is supported by this processor.

Parameters:

file (IO) – file-like object to be tested

Returns:

whether the data format of the specified file is supported or not

Return type:

bool
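The reference above does not say whether can_read leaves the stream position unchanged. The following minimal sketch assumes nothing about that and restores the position itself. The helper name probe is hypothetical and not part of madam; the processor is passed in as an argument, so the snippet makes no claim about how PDFProcessor is constructed.

```python
from typing import IO, Any


def probe(processor: Any, file: IO[bytes]) -> bool:
    """Ask the processor whether it can read `file`, then rewind.

    `probe` is a hypothetical helper for illustration only. It assumes
    `file` is seekable and that `processor` offers the documented
    can_read(file) method.
    """
    start = file.tell()  # remember where the caller left the stream
    try:
        return bool(processor.can_read(file))
    finally:
        file.seek(start)  # restore the position even if can_read raised
```

With this wrapper, a later call to read on the same file object starts from the position the caller expects, regardless of how much data the format check consumed.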

rasterize(asset: Asset, page: int = 0, dpi: int = 72, mime_type: str = 'image/jpeg') -> Asset

Converts one page of a PDF asset into a raster image.

Requires the pdf2image package and a system poppler installation.

Parameters:
  • asset (Asset) – PDF asset to rasterize

  • page (int) – Zero-based page index

  • dpi (int) – Output resolution in dots per inch

  • mime_type (str) – MIME type of the output image ('image/jpeg' or 'image/png')

Returns:

Raster image asset

Return type:

Asset

Raises:

OperatorError – if page is out of range or rasterization fails
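Because page is zero-based while PDF viewers usually show one-based page numbers, an off-by-one mistake is easy to make. The sketch below shows one way to convert viewer page numbers before calling rasterize, and how dpi relates to pixel size. Both helper names (rasterize_pages, points_to_pixels) are hypothetical and not part of madam. The pixel estimate relies on the default PDF unit of 1/72 inch and may differ from the real output by a pixel due to rounding in the renderer.

```python
from typing import Any, List


def points_to_pixels(points: float, dpi: int) -> int:
    """Estimate the pixel length of a page dimension given in PDF points.

    One PDF point is 1/72 inch by default, so at dpi=72 one point maps
    to roughly one pixel. This is an estimate, not the renderer's
    exact rounding.
    """
    return round(points * dpi / 72)


def rasterize_pages(
    processor: Any,
    asset: Any,
    page_numbers: List[int],
    dpi: int = 150,
    mime_type: str = "image/png",
) -> List[Any]:
    """Rasterize pages given as one-based viewer page numbers.

    Hypothetical helper: `processor` is expected to offer the documented
    rasterize(asset, page, dpi, mime_type) method.
    """
    images = []
    for number in page_numbers:
        if number < 1:
            raise ValueError(f"page numbers start at 1, got {number}")
        # Convert the one-based viewer number to the zero-based index.
        images.append(
            processor.rasterize(asset, page=number - 1, dpi=dpi, mime_type=mime_type)
        )
    return images
```

For example, a US Letter page is 612 points wide, so points_to_pixels(612, 144) gives an estimated width of 1224 pixels. An out-of-range page is still reported by rasterize itself through OperatorError, as documented above.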

read(file: IO) -> Asset

Reads a PDF file and returns an Asset.

The returned asset carries a page_count metadata attribute with the number of pages in the document.

Parameters:

file (IO) – Readable binary file-like object containing PDF data

Returns:

Asset with mime_type='application/pdf' and page_count

Return type:

Asset
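A typical call sequence might look like the sketch below. It opens the file in binary mode, as the parameter description requires, and returns the asset together with its page count. The helper name read_pdf is hypothetical, and reading the metadata through attribute access (asset.page_count) is an assumption about the Asset class that this reference does not confirm.

```python
from pathlib import Path
from typing import Any, Tuple


def read_pdf(processor: Any, path: Path) -> Tuple[Any, int]:
    """Read a PDF from disk and return (asset, page_count).

    Hypothetical helper: `processor` is expected to offer the documented
    read(file) method.
    """
    # read() expects a readable *binary* file-like object.
    with path.open("rb") as file:
        asset = processor.read(file)
    # Assumption: the page_count metadata is exposed as an attribute.
    return asset, asset.page_count
```

Passing the processor in as an argument keeps the helper independent of how the processor was configured.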

property supported_mime_types: frozenset

MIME types this processor can handle (used to build the Madam index).
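To illustrate what building an index from this property could mean, the sketch below maps each MIME type to the first processor that declares it. This is an illustration under stated assumptions, not Madam's actual implementation: the function name build_index and the "first registered wins" rule are both hypothetical.

```python
from typing import Any, Dict, Iterable


def build_index(processors: Iterable[Any]) -> Dict[str, Any]:
    """Map each MIME type to the first processor that supports it.

    Hypothetical helper: each processor is expected to offer the
    documented supported_mime_types property (a frozenset of strings).
    """
    index: Dict[str, Any] = {}
    for processor in processors:
        for mime_type in processor.supported_mime_types:
            # Assumption: an earlier processor keeps precedence.
            index.setdefault(mime_type, processor)
    return index
```

A lookup such as index.get('application/pdf') then returns the processor responsible for PDF assets, or None if no processor declared that type.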