madam.pdf module
PDF processor using pypdf (metadata) and pdf2image (rasterization).
The optional pdf dependency group must be installed:
uv sync --extra pdf
- class madam.pdf.PDFProcessor(config: Mapping[str, Any] | None = None)
Bases: Processor
Represents a processor that handles Portable Document Format (PDF) files.
Reading requires pypdf. Rasterization additionally requires pdf2image and a system-wide installation of poppler.
- __init__(config: Mapping[str, Any] | None = None) → None
Initializes a new PDFProcessor.
- Parameters:
config – Mapping with settings.
- can_read(file: IO) → bool
Returns whether the data format of the specified file is supported by this processor.
- Parameters:
file (IO) – file-like object to be tested
- Returns:
whether the data format of the specified file is supported or not
- Return type:
bool
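As an illustration of what a format check on a file-like object can involve, the sketch below tests for the `%PDF-` signature that PDF files start with and then restores the stream position. The names `looks_like_pdf` and `PDF_MAGIC` are hypothetical, and this is not necessarily how `PDFProcessor.can_read` is implemented; it also ignores files that carry junk bytes before the header, which some readers tolerate.

```python
import io
from typing import IO

# Signature found at the start of a PDF file, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(file: IO[bytes]) -> bool:
    """Return True if the stream starts with the PDF header signature."""
    start = file.tell()  # remember where the caller left the stream
    try:
        header = file.read(len(PDF_MAGIC))
    finally:
        file.seek(start)  # restore the position so a later read sees all data
    return header == PDF_MAGIC


print(looks_like_pdf(io.BytesIO(b"%PDF-1.7\n...")))  # True
print(looks_like_pdf(io.BytesIO(b"\x89PNG\r\n")))    # False
```

Restoring the position matters when the same file object is handed to `read` afterwards.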
- rasterize(asset: Asset, page: int = 0, dpi: int = 72, mime_type: str = 'image/jpeg') → Asset
Converts one page of a PDF asset into a raster image.
Requires the pdf2image package and a system poppler installation.
- Parameters:
- Returns:
Raster image asset
- Return type:
Asset
- Raises:
OperatorError – if page is out of range or rasterization fails
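Because `rasterize` converts one page per call, rendering a whole document needs a loop over the page indices. The helper below is a minimal sketch, not part of the library: `rasterize_all_pages` is a hypothetical name, and it assumes that pages are indexed from zero (suggested by the default `page=0`) and that the page count is reachable as `asset.page_count`.

```python
from typing import Any, List


def rasterize_all_pages(processor: Any, asset: Any, dpi: int = 72,
                        mime_type: str = "image/jpeg") -> List[Any]:
    """Rasterize every page of a PDF asset via processor.rasterize().

    Assumption: page indices start at zero and the asset exposes its
    page count as ``asset.page_count``.
    """
    return [
        processor.rasterize(asset, page=index, dpi=dpi, mime_type=mime_type)
        for index in range(asset.page_count)
    ]
```

With the real library this would presumably be called as `rasterize_all_pages(PDFProcessor(), asset)`, where `asset` was obtained from `PDFProcessor.read`. The processor is passed in rather than imported so the helper can be exercised without pdf2image or poppler installed.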
- read(file: IO) → Asset
Reads a PDF file and returns an Asset.
The returned asset carries a page_count metadata attribute with the number of pages in the document.
- Parameters:
file (IO) – Readable binary file-like object containing PDF data
- Returns:
Asset with mime_type='application/pdf' and page_count
- Return type:
Asset
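The following sketch shows how `can_read` and `read` might be combined. The function name `read_pdf_info` is hypothetical, and the code assumes that the metadata is reachable as the attributes `asset.mime_type` and `asset.page_count` and that the PDF data begins at offset 0 of the stream.

```python
from typing import IO, Any, Tuple


def read_pdf_info(processor: Any, file: IO[bytes]) -> Tuple[str, int]:
    """Read a PDF through the processor and return (mime_type, page_count)."""
    if not processor.can_read(file):
        raise ValueError("processor does not support this file")
    # Defensive rewind: it is an assumption that can_read() may leave the
    # stream position somewhere other than the start.
    file.seek(0)
    asset = processor.read(file)
    return asset.mime_type, asset.page_count
```

With the pdf extra installed, a call could look like `read_pdf_info(PDFProcessor(), open("document.pdf", "rb"))`, where `document.pdf` is a placeholder path.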