madam.pdf module
PDF processor using pypdf (metadata) and pdf2image (rasterization).
The optional pdf dependency group must be installed:
uv sync --extra pdf
- class madam.pdf.PDFMetadataProcessor(config: Mapping[str, Any] | None = None)
Bases: MetadataProcessor
Reads, strips, and writes PDF document information metadata (title, author, subject, creator, producer).
Metadata is stored under the 'pdf' format key, so reading a PDF via PDFProcessor yields asset.pdf as a mapping with the available fields.
Requires the pypdf package (madam[pdf] extra).
Added in version 1.0.
- __init__(config: Mapping[str, Any] | None = None) -> None
Initializes a new MetadataProcessor.
- combine(file: IO, metadata: Mapping) -> IO
Return a copy of the PDF with the given document information written back.
- Raises:
UnsupportedFormatError – if file is not a PDF or metadata contains no 'pdf' entry.
- read(file: IO) -> Mapping[str, Mapping]
Extract document information from a PDF file.
- Returns:
{'pdf': {'title': ..., 'author': ..., ...}} or {} if the document carries no info metadata.
- Raises:
UnsupportedFormatError – if file is not a PDF.
- strip(file: IO) -> IO
Return a copy of the PDF with document information fields cleared.
- Raises:
UnsupportedFormatError – if file is not a PDF.
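The read and combine methods above can be chained to change a single field while keeping the rest. The following is a minimal sketch that relies only on the documented signatures; the helper name with_title and the MetadataIO protocol are hypothetical and not part of madam, and rewinding the stream before combine is a precaution, since the documentation does not say where read leaves the file position.

```python
from typing import IO, Any, Mapping, Protocol


class MetadataIO(Protocol):
    """Structural type covering the two documented methods used below."""

    def read(self, file: IO) -> Mapping[str, Mapping]: ...

    def combine(self, file: IO, metadata: Mapping) -> IO: ...


def with_title(processor: MetadataIO, pdf_file: IO, title: str) -> IO:
    """Return a copy of pdf_file whose document title is set to title."""
    # read() is documented to return {} when the document carries no info
    # metadata, so fall back to an empty mapping for the 'pdf' key.
    info: dict[str, Any] = dict(processor.read(pdf_file).get("pdf", {}))
    info["title"] = title
    # Assumption: read() may have advanced the stream, so rewind it.
    pdf_file.seek(0)
    # combine() expects the fields under the 'pdf' format key.
    return processor.combine(pdf_file, {"pdf": info})
```

With the pdf extra installed, passing a PDFMetadataProcessor() instance and a binary file object opened on a PDF should satisfy the protocol, because both methods match the signatures listed above.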
- class madam.pdf.PDFProcessor(config: Mapping[str, Any] | None = None)
Bases: Processor
Represents a processor that handles Portable Document Format (PDF) files.
Reading requires pypdf. Rasterization additionally requires pdf2image and a system-wide installation of poppler.
Added in version 0.24.
- __init__(config: Mapping[str, Any] | None = None) -> None
Initializes a new PDFProcessor.
- Parameters:
config – Mapping with settings.
- can_read(file: IO) -> bool
Returns whether this processor can read the specified file.
- Parameters:
file (IO) – file-like object to be tested
- Returns:
whether the data format of the specified file is supported or not
- Return type:
bool
- execute_run(steps: list[Callable], asset_or_context: Asset | ProcessingContext) -> Asset | ProcessingContext
Execute a grouped run of consecutive operators from this processor.
The default implementation applies each step sequentially, equivalent to the old per-step behaviour. Subclasses may override this to defer encoding: accumulate each operator’s effect into a ProcessingContext and return it; Pipeline will call ProcessingContext.materialize() at the next processor boundary or at the end of the pipeline.
- Parameters:
steps – Ordered list of tagged operator callables in this run.
asset_or_context – Input asset (or live context from a preceding run of the same processor).
- Returns:
Processed Asset or a live ProcessingContext.
Added in version 1.0.
- rasterize(asset: Asset, page: int = 0, dpi: int = 72, mime_type: str = 'image/jpeg') -> Asset
Converts one page of a PDF asset into a raster image.
Requires the pdf2image package and a system poppler installation.
- Parameters:
- Returns:
Raster image asset
- Return type:
Asset
- Raises:
OperatorError – if page is out of range or rasterization fails
- read(file: IO) -> Asset
Reads a PDF file and returns an Asset.
The returned asset carries a page_count metadata attribute with the number of pages in the document.
- Parameters:
file (IO) – Readable binary file-like object containing PDF data
- Returns:
Asset with mime_type='application/pdf' and page_count
- Return type:
Asset
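When choosing a dpi value for rasterize, it can help to estimate the pixel size of the resulting image beforehand. The sketch below uses only the standard relation 1 pt = 1/72 inch; the helper raster_size is hypothetical and not part of madam, and the renderer may round differently by a pixel.

```python
POINTS_PER_INCH = 72  # PDF unit: 1 pt = 1/72 inch


def raster_size(width_pt: float, height_pt: float, dpi: int = 72) -> tuple[int, int]:
    """Estimate the pixel dimensions of a page rendered at the given DPI."""
    if width_pt <= 0 or height_pt <= 0 or dpi <= 0:
        raise ValueError("page dimensions and dpi must be positive")
    # Pixels per point: at 72 DPI this is exactly 1.
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
print(raster_size(612, 792))           # (612, 792)
print(raster_size(612, 792, dpi=150))  # (1275, 1650)
```

The page size in points is not exposed by the API documented here, so it has to come from elsewhere (for example a known paper format). Only page_count is documented as available on the asset returned by read.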
- madam.pdf.combine(assets: Iterable[Asset], *, page_width: float, page_height: float) -> Asset
Combines a sequence of image assets into a multi-page PDF.
Each image is scaled to fit the page dimensions (preserving aspect ratio) and centred on a white background. The page dimensions are given in PDF points (1 pt = 1/72 inch); at 72 DPI one point equals one pixel.
- Parameters:
- Returns:
Asset with mime_type='application/pdf' and page_count
- Return type:
Asset
- Raises:
ValueError – If assets is empty or dimensions are non-positive
OperatorError – If an asset is not an image or cannot be decoded
Added in version 1.0.
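The scale-to-fit and centring rule described for combine can be illustrated with a little arithmetic. This is a minimal sketch of the geometry only, not madam's implementation; the names Placement and fit_and_centre are hypothetical, and the sketch assumes that images smaller than the page are enlarged, which the documentation does not state.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    width: float   # scaled image width in points
    height: float  # scaled image height in points
    x: float       # horizontal offset of the image on the page
    y: float       # vertical offset of the image on the page


def fit_and_centre(image_w: float, image_h: float,
                   page_w: float, page_h: float) -> Placement:
    """Scale an image to fit a page, keep its aspect ratio, and centre it."""
    if min(image_w, image_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The smaller of the two ratios keeps the whole image on the page.
    scale = min(page_w / image_w, page_h / image_h)
    width, height = image_w * scale, image_h * scale
    # Split the leftover space evenly on both sides.
    return Placement(width, height, (page_w - width) / 2, (page_h - height) / 2)


# A 200 x 100 image on a 400 x 400 pt page fills the width and is
# centred vertically.
print(fit_and_centre(200, 100, 400, 400))
# Placement(width=400.0, height=200.0, x=0.0, y=100.0)
```

Taking the minimum of the two ratios is what preserves the aspect ratio: the limiting dimension fills the page and the other leaves margins that show the background.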