madam.pdf module

PDF processor using pypdf (metadata) and pdf2image (rasterization).

The optional pdf dependency group must be installed:

uv sync --extra pdf

class madam.pdf.PDFMetadataProcessor(config: Mapping[str, Any] | None = None)

Bases: MetadataProcessor

Reads, strips, and writes PDF document information metadata (title, author, subject, creator, producer).

Metadata is stored under the 'pdf' format key, so reading a PDF via PDFProcessor yields asset.pdf as a mapping with the available fields.

Requires the pypdf package (madam[pdf] extra).

Added in version 1.0.

__init__(config: Mapping[str, Any] | None = None) -> None

Initializes a new PDFMetadataProcessor.

combine(file: IO, metadata: Mapping) -> IO

Return a copy of the PDF with the given document information written back.

Raises:

UnsupportedFormatError – if file is not a PDF or metadata contains no 'pdf' entry.
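
A round trip through read() and combine() might look like the sketch below. The helper name retitle_pdf is hypothetical, the processor is passed in rather than constructed, and the seek(0) calls assume a seekable stream whose position read() may have moved; none of these details are confirmed by this reference.

```python
from typing import IO, Any, Mapping


def retitle_pdf(processor: Any, file: IO, title: str) -> IO:
    """Return a copy of the PDF in `file` with a new title (hypothetical helper).

    `processor` is expected to behave like PDFMetadataProcessor, i.e. to
    offer read(file) and combine(file, metadata) as documented above.
    """
    file.seek(0)
    existing: Mapping[str, Mapping] = processor.read(file)
    # read() returns {} when the document carries no info metadata,
    # so fall back to an empty mapping and copy before modifying.
    info = dict(existing.get("pdf", {}))
    info["title"] = title
    # Assumption: rewind before handing the stream to combine().
    file.seek(0)
    # combine() requires a 'pdf' entry in the metadata mapping.
    return processor.combine(file, {"pdf": info})
```

Wrapping the fields in a 'pdf' entry matters because combine() raises UnsupportedFormatError when that entry is missing.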

property formats: frozenset

The metadata formats which are supported.

Returns:

supported metadata formats

Return type:

frozenset[str]

read(file: IO) -> Mapping[str, Mapping]

Extract document information from a PDF file.

Returns:

{'pdf': {'title': ..., 'author': ..., ...}} or {} if the document carries no info metadata.

Raises:

UnsupportedFormatError – if file is not a PDF.
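
Because read() may return either a nested mapping or an empty one, callers probably want to look fields up defensively. The following is a minimal sketch; pdf_info_field is a hypothetical helper name, and the processor argument is assumed to behave like PDFMetadataProcessor.

```python
from typing import IO, Any, Optional


def pdf_info_field(processor: Any, file: IO, field: str) -> Optional[str]:
    """Look up one document information field, or None if it is absent."""
    metadata = processor.read(file)
    # Covers both documented shapes: {'pdf': {...}} and {}.
    return metadata.get("pdf", {}).get(field)
```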

strip(file: IO) -> IO

Return a copy of the PDF with document information fields cleared.

Raises:

UnsupportedFormatError – if file is not a PDF.

class madam.pdf.PDFProcessor(config: Mapping[str, Any] | None = None)

Bases: Processor

Represents a processor that handles Portable Document Format (PDF) files.

Reading requires pypdf. Rasterization additionally requires pdf2image and a system-wide installation of poppler.

Added in version 0.24.

__init__(config: Mapping[str, Any] | None = None) -> None

Initializes a new PDFProcessor.

Parameters:

config – Mapping with settings.

can_read(file: IO) -> bool

Returns whether the data format of the specified file is supported by this processor.

Parameters:

file (IO) – file-like object to be tested

Returns:

whether the data format of the specified file is supported or not

Return type:

bool
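
How can_read() detects PDF data is not stated here. One common approach, shown purely as an illustration, is to test for the "%PDF-" header that starts a PDF file. The function name looks_like_pdf is hypothetical, and the check only inspects the current stream position, so lenient parsers that tolerate leading bytes would accept more files than this sketch does.

```python
import io
from typing import IO

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(file: IO[bytes]) -> bool:
    """Check for the PDF header without consuming the stream."""
    position = file.tell()
    try:
        header = file.read(len(PDF_MAGIC))
    finally:
        # Restore the position so a later read() sees the whole file.
        file.seek(position)
    return header == PDF_MAGIC


sample = io.BytesIO(b"%PDF-1.7\n")
print(looks_like_pdf(sample))  # True
```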

execute_run(steps: list[Callable], asset_or_context: Asset | ProcessingContext) -> Asset | ProcessingContext

Execute a grouped run of consecutive operators from this processor.

The default implementation applies each step sequentially, equivalent to the old per-step behaviour. Subclasses may override this to defer encoding: accumulate each operator’s effect into a ProcessingContext and return it; Pipeline will call ProcessingContext.materialize() at the next processor boundary or at the end of the pipeline.

Parameters:
  • steps – Ordered list of tagged operator callables in this run.

  • asset_or_context – Input asset (or live context from a preceding run of the same processor).

Returns:

Processed Asset or a live ProcessingContext.

Added in version 1.0.
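
The sequential default described above can be pictured roughly as follows. This is a sketch of the idea, not the library's code: run_steps_sequentially is a hypothetical name, each step is assumed to be a callable that takes one asset and returns a new one, and the deferred ProcessingContext path is not modelled.

```python
from typing import Any, Callable, Sequence


def run_steps_sequentially(steps: Sequence[Callable[[Any], Any]], asset: Any) -> Any:
    """Apply each step to the result of the previous one, in order."""
    result = asset
    for step in steps:
        result = step(result)
    return result
```

An override that defers encoding would instead collect the effect of each step and return a context object, leaving the final encode to the pipeline.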

rasterize(asset: Asset, page: int = 0, dpi: int = 72, mime_type: str = 'image/jpeg') -> Asset

Converts one page of a PDF asset into a raster image.

Requires the pdf2image package and a system poppler installation.

Parameters:
  • asset (Asset) – PDF asset to rasterize

  • page (int) – Zero-based page index

  • dpi (int) – Output resolution in dots per inch

  • mime_type (str) – MIME type of the output image ('image/jpeg' or 'image/png')

Returns:

Raster image asset

Return type:

Asset

Raises:

OperatorError – if page is out of range or rasterization fails
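
To choose a dpi value it can help to estimate the output size in pixels. A PDF point is 1/72 inch, so a page dimension in points scales by dpi / 72. The helper raster_size below is hypothetical, and the rounding is an assumption; the actual renderer may round differently by a pixel.

```python
from __future__ import annotations


def raster_size(width_pt: float, height_pt: float, dpi: int = 72) -> tuple[int, int]:
    """Estimate the pixel size of a page rendered at the given resolution."""
    scale = dpi / 72  # 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page (about 595.276 x 841.89 pt) at 150 dpi:
print(raster_size(595.276, 841.89, dpi=150))  # (1240, 1754)
```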

read(file: IO) -> Asset

Reads a PDF file and returns an Asset.

The returned asset carries a page_count metadata attribute with the number of pages in the document.

Parameters:

file (IO) – Readable binary file-like object containing PDF data

Returns:

Asset with mime_type='application/pdf' and page_count

Return type:

Asset
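
The page_count attribute makes it straightforward to rasterize a whole document one page at a time. The sketch below assumes a processor object with the rasterize() signature documented above and an asset that exposes page_count; rasterize_all_pages is a hypothetical helper, not part of the module.

```python
from typing import Any, List


def rasterize_all_pages(
    processor: Any,
    asset: Any,
    dpi: int = 72,
    mime_type: str = "image/png",
) -> List[Any]:
    """Rasterize every page of a PDF asset, in page order."""
    return [
        # Page indices are zero-based, matching range(page_count).
        processor.rasterize(asset, page=index, dpi=dpi, mime_type=mime_type)
        for index in range(asset.page_count)
    ]
```

Each call produces a separate raster asset, so a long document at high resolution can use a lot of memory; processing pages lazily may be preferable in that case.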

property supported_mime_types: frozenset

MIME types this processor can handle (used to build the Madam index).

Added in version 0.24.

madam.pdf.combine(assets: Iterable[Asset], *, page_width: float, page_height: float) -> Asset

Combines a sequence of image assets into a multi-page PDF.

Each image is scaled to fit the page dimensions (preserving aspect ratio) and centred on a white background. The page dimensions are given in PDF points (1 pt = 1/72 inch); at 72 DPI one point equals one pixel.

Parameters:
  • assets (Iterable[Asset]) – Iterable of image assets

  • page_width (float) – Page width in PDF points (must be positive)

  • page_height (float) – Page height in PDF points (must be positive)

Returns:

Asset with mime_type='application/pdf' and page_count

Return type:

Asset

Raises:
  • ValueError – If assets is empty or dimensions are non-positive

  • OperatorError – If an asset is not an image or cannot be decoded

Added in version 1.0.
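
The scale-to-fit and centring behaviour described for combine() can be worked through with plain arithmetic. The function fit_and_center below is a hypothetical illustration, not the library's implementation, and it assumes that images smaller than the page are scaled up as well as down.

```python
from __future__ import annotations


def fit_and_center(
    image_width: float,
    image_height: float,
    page_width: float,
    page_height: float,
) -> tuple[float, float, float, float]:
    """Return (width, height, x_offset, y_offset) of the image on the page."""
    if min(image_width, image_height, page_width, page_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # The smaller ratio is the largest scale that still fits both axes,
    # which preserves the aspect ratio.
    scale = min(page_width / image_width, page_height / image_height)
    width = image_width * scale
    height = image_height * scale
    # Split the leftover space evenly to centre the image.
    x_offset = (page_width - width) / 2
    y_offset = (page_height - height) / 2
    return width, height, x_offset, y_offset


# A 200 x 100 image on a 100 x 100 pt page is halved and centred vertically.
print(fit_and_center(200, 100, 100, 100))  # (100.0, 50.0, 0.0, 25.0)
```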