`madam.pdf` module

PDF processor using pypdf (metadata) and pdf2image (rasterization).

The optional pdf dependency group must be installed:

uv sync --extra pdf

class madam.pdf.PDFMetadataProcessor(config: Mapping[str, Any] | None = None)

Bases: MetadataProcessor

Reads, strips, and writes PDF document information metadata (title, author, subject, creator, producer).

Metadata is stored under the 'pdf' format key, so reading a PDF via PDFProcessor yields asset.pdf as a mapping with the available fields.

Requires the pypdf package (madam[pdf] extra).

Added in version 1.0.

__init__(config: Mapping[str, Any] | None = None) → None

Initializes a new MetadataProcessor.

combine(file: IO, metadata: Mapping) → IO

Return a copy of the PDF with the given document information written back.

Raises: UnsupportedFormatError – if file is not a PDF or metadata contains no 'pdf' entry.

property formats: frozenset

The metadata formats which are supported.

Returns: supported metadata formats
Return type: set[str]

read(file: IO) → Mapping[str, Mapping]

Extract document information from a PDF file.

Returns: {'pdf': {'title': ..., 'author': ..., ...}} or {} if the document carries no info metadata.
Raises: UnsupportedFormatError – if file is not a PDF.

strip(file: IO) → IO

Return a copy of the PDF with document information fields cleared.

Raises: UnsupportedFormatError – if file is not a PDF.
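
A minimal sketch of a read, modify, write round trip built on the methods above. The helper retitle and the stub class are hypothetical and exist only for illustration; the sketch relies solely on the read and combine signatures and the 'pdf' key documented here. The processor is passed in as an argument, so the helper makes no assumption about how it is constructed.

```python
import io
from typing import IO, Mapping


def retitle(processor, file: IO, title: str) -> IO:
    """Return a copy of `file` whose PDF title is replaced by `title`.

    `processor` is expected to behave like PDFMetadataProcessor:
    read(file) -> {'pdf': {...}} or {}, combine(file, metadata) -> IO.
    """
    metadata = processor.read(file)
    # read() returns {} when the document carries no info metadata.
    info = dict(metadata.get('pdf', {}))
    info['title'] = title
    # Assumption: rewinding is harmless and avoids depending on whether
    # read() leaves the file position at the start.
    file.seek(0)
    return processor.combine(file, {'pdf': info})


class _StubProcessor:
    """Hypothetical stand-in so the sketch runs without pypdf installed."""

    def read(self, file: IO) -> Mapping[str, Mapping]:
        return {'pdf': {'author': 'Ada'}}

    def combine(self, file: IO, metadata: Mapping) -> IO:
        self.written = dict(metadata['pdf'])
        return io.BytesIO(file.read())


stub = _StubProcessor()
result = retitle(stub, io.BytesIO(b'%PDF-1.7'), 'Annual report')
```

With a real installation, an instance of madam.pdf.PDFMetadataProcessor would presumably take the place of the stub.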

class madam.pdf.PDFProcessor(config: Mapping[str, Any] | None = None)

Bases: Processor

Represents a processor that handles Portable Document Format (PDF) files.

Reading requires pypdf. Rasterization additionally requires pdf2image and a system-wide installation of poppler.

Added in version 0.24.

__init__(config: Mapping[str, Any] | None = None) → None

Initializes a new PDFProcessor.

Parameters: config – Mapping with settings.

can_read(file: IO) → bool

Returns whether the data format of the specified file is supported by this processor.

Parameters: file (IO) – file-like object to be tested
Returns: whether the data format of the specified file is supported or not
Return type: bool
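
One common way to test whether a stream holds PDF data is to look for the %PDF- signature at the start of the file. The sketch below illustrates that general technique; it is an assumption, not a description of how PDFProcessor.can_read is implemented, and the name looks_like_pdf is hypothetical.

```python
import io
from typing import IO


def looks_like_pdf(file: IO) -> bool:
    """Check for the '%PDF-' signature without consuming the stream."""
    position = file.tell()
    try:
        header = file.read(5)
    finally:
        # Restore the position so a later read() sees the whole file.
        file.seek(position)
    return header == b'%PDF-'


pdf_stream = io.BytesIO(b'%PDF-1.4\n')
png_stream = io.BytesIO(b'\x89PNG\r\n\x1a\n')
```

Restoring the file position matters because a check of this kind is usually followed by a full read of the same stream.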

execute_run(steps: list[Callable], asset_or_context: Asset | ProcessingContext) → Asset | ProcessingContext

Execute a grouped run of consecutive operators from this processor.

The default implementation applies each step sequentially, equivalent to the old per-step behaviour. Subclasses may override this to defer encoding: accumulate each operator’s effect into a ProcessingContext and return it; Pipeline will call ProcessingContext.materialize() at the next processor boundary or at the end of the pipeline.

Parameters:

steps – Ordered list of tagged operator callables in this run.
asset_or_context – Input asset (or live context from a preceding run of the same processor).

Returns:

Processed Asset or a live ProcessingContext.

Added in version 1.0.
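
The sequential behaviour described for the default implementation can be pictured as a simple fold over the steps. This is an illustrative sketch only: run_sequentially is a hypothetical name, plain strings stand in for assets, and deferred encoding through ProcessingContext is not modelled.

```python
from typing import Any, Callable


def run_sequentially(steps: list[Callable[[Any], Any]], asset: Any) -> Any:
    """Apply each operator to the result of the previous one, in order."""
    for step in steps:
        asset = step(asset)
    return asset


# Plain strings stand in for assets here.
result = run_sequentially([str.strip, str.upper], '  pdf  ')
```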

rasterize(asset: Asset, page: int = 0, dpi: int = 72, mime_type: str = 'image/jpeg') → Asset

Converts one page of a PDF asset into a raster image.

Requires the pdf2image package and a system poppler installation.

Parameters:

asset (Asset) – PDF asset to rasterize
page (int) – Zero-based page index
dpi (int) – Output resolution in dots per inch
mime_type (str) – MIME type of the output image ('image/jpeg' or 'image/png')

Returns:

Raster image asset

Return type:

Asset

Raises:

OperatorError – if page is out of range or rasterization fails

read(file: IO) → Asset

Reads a PDF file and returns an Asset.

The returned asset carries a page_count metadata attribute with the number of pages in the document.

Parameters: file (IO) – Readable binary file-like object containing PDF data
Returns: Asset with mime_type='application/pdf' and page_count
Return type: Asset

property supported_mime_types: frozenset: MIME types this processor can handle (used to build the Madam index).

Added in version 0.24.

madam.pdf.combine(assets: Iterable[Asset], *, page_width: float, page_height: float) → Asset

Combines a sequence of image assets into a multi-page PDF.

Each image is scaled to fit the page dimensions (preserving aspect ratio) and centred on a white background. The page dimensions are given in PDF points (1 pt = 1/72 inch); at 72 DPI one point equals one pixel.

Parameters:

assets (Iterable[Asset]) – Iterable of image assets
page_width (float) – Page width in PDF points (must be positive)
page_height (float) – Page height in PDF points (must be positive)

Returns:

Asset with mime_type='application/pdf' and page_count

Return type:

Asset

Raises:

ValueError – If assets is empty or dimensions are non-positive
OperatorError – If an asset is not an image or cannot be decoded

Added in version 1.0.
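
The fit-and-centre placement described above comes down to a little arithmetic. The sketch below works through it under stated assumptions: fit_and_centre is a hypothetical name, image sizes in pixels are treated as points (the 72 DPI case), and images smaller than the page are scaled up, which the description does not confirm.

```python
def fit_and_centre(image_width, image_height, page_width, page_height):
    """Return (x, y, width, height) of the scaled image on the page."""
    if min(image_width, image_height, page_width, page_height) <= 0:
        raise ValueError('dimensions must be positive')
    # The limiting axis determines the scale, preserving aspect ratio.
    scale = min(page_width / image_width, page_height / image_height)
    width = image_width * scale
    height = image_height * scale
    # Equal margins on both sides centre the image.
    x = (page_width - width) / 2
    y = (page_height - height) / 2
    return x, y, width, height


# A 1200 x 600 image on a 600 x 800 pt page: width is the limiting axis.
placement = fit_and_centre(1200, 600, 600, 800)
```

In this case the scale factor is 0.5, so the image fills the page width and leaves equal white margins above and below.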
