segment

This module provides utilities to segment a document into components.

class aws_textract_pipeline.segment.SegmentPdfResult(page_pdf_list: typing.List[pymupdf.Document] = <factory>, page_image_list: typing.List[pymupdf.Pixmap] = <factory>)

The object returned by segment_pdf().

To save a fitz.Document object (pymupdf.Document in the signature above; fitz is PyMuPDF's legacy import name) to a local file, use the following code:

>>> res = SegmentPdfResult(...)
>>> page = res.page_pdf_list[0]
>>> page.save("/path/to/save/page.pdf")

To save a fitz.Pixmap object to a local file, use the following code:

>>> res = SegmentPdfResult(...)
>>> pixmap = res.page_image_list[0]
>>> pixmap.save("/path/to/save/image.png", output="png")

To get the width and height of the image in pixels, use the following code:

>>> pixmap.width
>>> pixmap.height
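
The two save calls above can be combined to write every page to disk. The following is a minimal sketch, not part of the library: save_segments is a hypothetical helper, the page-NNNN file naming is an arbitrary choice, and it assumes that page_pdf_list and page_image_list are aligned by page index.

```python
from pathlib import Path


def save_segments(res, out_dir):
    """Save every page PDF and page image of a SegmentPdfResult-like object.

    Hypothetical helper: assumes ``res.page_pdf_list[i]`` and
    ``res.page_image_list[i]`` describe the same page.
    """
    if len(res.page_pdf_list) != len(res.page_image_list):
        raise ValueError("page_pdf_list and page_image_list differ in length")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    pairs = zip(res.page_pdf_list, res.page_image_list)
    for number, (page, pixmap) in enumerate(pairs, start=1):
        pdf_path = out / f"page-{number:04d}.pdf"
        png_path = out / f"page-{number:04d}.png"
        page.save(str(pdf_path))                  # same call as the Document example
        pixmap.save(str(png_path), output="png")  # same call as the Pixmap example
        paths.append((pdf_path, png_path))
    return paths
```

Zero-padded page numbers keep the files in page order when a directory listing is sorted by name.
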

aws_textract_pipeline.segment.segment_pdf(pdf_content: bytes, dpi: int = 200) → SegmentPdfResult

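Segment a PDF into pages. The returned SegmentPdfResult holds the pages as PDF documents (page_pdf_list) and as rendered images (page_image_list).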

Parameters:
  • pdf_content – PDF content in bytes.

  • dpi – Resolution, in dots per inch, of the rendered page images. Defaults to 200.
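
To get a feel for how dpi affects image size, the sketch below estimates the pixel dimensions of a rendered page. It assumes the page is rendered at a scale of dpi / 72 (PDF page sizes are expressed in points, 72 per inch), which is how PyMuPDF interprets a dpi value. Whether segment_pdf() renders exactly this way is an assumption, and the real pixmap.width and pixmap.height may differ by a pixel due to rounding. expected_pixel_size is a hypothetical helper, not part of the library.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are measured in points


def expected_pixel_size(width_pt, height_pt, dpi=200):
    """Estimate the rendered image size in pixels for a page given in points.

    Hypothetical helper: assumes rendering at a scale of dpi / 72.
    """
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter (612 x 792 pt) at the default dpi of 200
letter_px = expected_pixel_size(612, 792)

# A4 (about 595 x 842 pt) at 300 dpi
a4_px = expected_pixel_size(595, 842, dpi=300)
```

Image area grows with the square of dpi, so raising dpi from 200 to 300 increases the pixel count by a factor of 2.25.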