doc_type

See:

DocTypeEnum
S3ContentTypeEnum

class aws_textract_pipeline.doc_type.DocTypeEnum(value)

Enumeration for document types.

Each intake document is classified as one of these types. The document type determines which processing logic the downstream steps apply.

For example:

aws_textract_pipeline.landing.get_doc_md5(): This function uses the document type to determine how to calculate a unique identifier for the document.
aws_textract_pipeline.tracker.BaseTracker.raw_to_component(): This method uses the document type to determine how to segment the document.
aws_textract_pipeline.tracker.BaseTracker.component_to_textract_output(): This method uses the document type to determine how to process the Textract output.

classmethod detect_doc_type(filename: str) → str

Detect document type based on file name.

Parameters: filename – file name with extension, for example “example.pdf”

aws_textract_pipeline.doc_type.ext_to_doc_type_mapper = {'bmp': 'bmp', 'csv': 'csv', 'doc': 'word', 'docx': 'word', 'gif': 'gif', 'jpeg': 'jpg', 'jpg': 'jpg', 'json': 'json', 'pdf': 'pdf', 'png': 'png', 'ppt': 'ppt', 'pptx': 'ppt', 'tiff': 'tiff', 'tsv': 'tsv', 'txt': 'text', 'xls': 'excel', 'xlsx': 'excel'}

Mapping from file extension to DocTypeEnum.
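As an illustration of how an extension-based lookup like this could work, here is a minimal sketch that copies the mapping shown above. The function name detect_doc_type_sketch is hypothetical and is not the library's implementation. Lowercasing the extension and falling back to 'unknown' for unmapped extensions are both assumptions; 'unknown' is borrowed from the keys of doc_type_to_content_type_mapper.

```python
from pathlib import Path

# Copied from the documented value of ext_to_doc_type_mapper.
EXT_TO_DOC_TYPE = {
    "bmp": "bmp", "csv": "csv", "doc": "word", "docx": "word", "gif": "gif",
    "jpeg": "jpg", "jpg": "jpg", "json": "json", "pdf": "pdf", "png": "png",
    "ppt": "ppt", "pptx": "ppt", "tiff": "tiff", "tsv": "tsv", "txt": "text",
    "xls": "excel", "xlsx": "excel",
}


def detect_doc_type_sketch(filename: str) -> str:
    """Hypothetical stand-in for DocTypeEnum.detect_doc_type."""
    # Path.suffix yields the last extension with its dot, e.g. ".pdf";
    # it is an empty string when the name has no extension.
    ext = Path(filename).suffix.lstrip(".").lower()
    # Assumption: extensions missing from the mapping resolve to "unknown".
    return EXT_TO_DOC_TYPE.get(ext, "unknown")
```

Only the last extension is considered in this sketch, so a name such as "archive.tar.gz" is looked up as "gz" and resolves to the fallback.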

class aws_textract_pipeline.doc_type.S3ContentTypeEnum(value)

AWS S3 content type. Setting the proper content type allows you to open the S3 object in a web browser without downloading it.

Ref:

https://www.ibm.com/docs/en/aspera-on-cloud?topic=SS5W4X/dita/content/aws_s3_content_types.htm

aws_textract_pipeline.doc_type.doc_type_to_content_type_mapper: Dict[str, Optional[str]] = {'bmp': 'image/bmp', 'csv': 'text/csv', 'excel': 'application/x-msexcel', 'gif': 'image/gif', 'jpg': 'image/jpeg', 'json': 'application/json', 'pdf': '\tapplication/pdf', 'png': 'image/png', 'ppt': 'application/mspowerpoint', 'text': 'text/plain', 'tiff': 'image/tiff', 'tsv': 'text/csv', 'unknown': None, 'word': 'application/msword'}

Mapping from DocTypeEnum to S3ContentTypeEnum.
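One way such a mapping might be consumed when preparing an S3 upload is sketched below. The helper content_type_kwargs is hypothetical and not part of the library. The dictionary is copied from the documented value, including the leading tab in the 'pdf' entry, which the sketch strips defensively. ContentType is the parameter name used by the S3 PutObject operation; omitting it when the mapped value is None is an assumption.

```python
from typing import Dict, Optional

# Copied from the documented value of doc_type_to_content_type_mapper.
DOC_TYPE_TO_CONTENT_TYPE: Dict[str, Optional[str]] = {
    "bmp": "image/bmp", "csv": "text/csv", "excel": "application/x-msexcel",
    "gif": "image/gif", "jpg": "image/jpeg", "json": "application/json",
    "pdf": "\tapplication/pdf", "png": "image/png",
    "ppt": "application/mspowerpoint", "text": "text/plain",
    "tiff": "image/tiff", "tsv": "text/csv", "unknown": None,
    "word": "application/msword",
}


def content_type_kwargs(doc_type: str) -> Dict[str, str]:
    """Hypothetical helper: build upload keyword arguments for a doc type."""
    content_type = DOC_TYPE_TO_CONTENT_TYPE.get(doc_type)
    if content_type is None:
        # Assumption: with no known content type, leave ContentType unset.
        return {}
    # The documented 'pdf' value carries a leading tab; strip whitespace
    # so the resulting header value is clean.
    return {"ContentType": content_type.strip()}
```

The returned dictionary can be unpacked into an upload call that accepts a ContentType keyword, and an empty dictionary leaves the storage service's default in place.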