doc_type
See:
- class aws_textract_pipeline.doc_type.DocTypeEnum(value)
Enumeration for document types.
Each intake document will be classified into one of these types. This value is critical for identifying the appropriate processing logic to apply in the downstream process.
For example:
  - aws_textract_pipeline.landing.get_doc_md5(): This function uses the document type to determine how to calculate a unique identifier for the document.
  - aws_textract_pipeline.tracker.BaseTracker.raw_to_component(): This method uses the document type to determine how to segment the document.
  - aws_textract_pipeline.tracker.BaseTracker.component_to_textract_output(): This method uses the document type to determine how to process the Textract output.
- aws_textract_pipeline.doc_type.ext_to_doc_type_mapper = {'bmp': 'bmp', 'csv': 'csv', 'doc': 'word', 'docx': 'word', 'gif': 'gif', 'jpeg': 'jpg', 'jpg': 'jpg', 'json': 'json', 'pdf': 'pdf', 'png': 'png', 'ppt': 'ppt', 'pptx': 'ppt', 'tiff': 'tiff', 'tsv': 'tsv', 'txt': 'text', 'xls': 'excel', 'xlsx': 'excel'}
Mapping from file extension to DocTypeEnum.
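As a rough illustration of how such a mapping might be used, the sketch below copies the documented dictionary into a local constant and looks up a document type from a file name. The helper `guess_doc_type` and the constant name `EXT_TO_DOC_TYPE` are hypothetical and not part of the library. The fallback to `'unknown'` is an assumption, based only on `'unknown'` appearing as a key in `doc_type_to_content_type_mapper`.

```python
from pathlib import Path

# Local copy of the documented value of ext_to_doc_type_mapper.
EXT_TO_DOC_TYPE = {
    "bmp": "bmp", "csv": "csv", "doc": "word", "docx": "word",
    "gif": "gif", "jpeg": "jpg", "jpg": "jpg", "json": "json",
    "pdf": "pdf", "png": "png", "ppt": "ppt", "pptx": "ppt",
    "tiff": "tiff", "tsv": "tsv", "txt": "text",
    "xls": "excel", "xlsx": "excel",
}


def guess_doc_type(filename: str) -> str:
    """Hypothetical helper: map a file name to a document type string.

    The extension is lower-cased and stripped of its leading dot before
    the lookup. Unmapped or missing extensions fall back to "unknown"
    (an assumption, not documented library behavior).
    """
    ext = Path(filename).suffix.lower().lstrip(".")
    return EXT_TO_DOC_TYPE.get(ext, "unknown")


if __name__ == "__main__":
    for name in ("report.DOCX", "scan.jpeg", "archive.zip"):
        print(name, "->", guess_doc_type(name))
```

Note that several extensions collapse into one type (`doc` and `docx` into `word`, `jpeg` and `jpg` into `jpg`), so the mapping is many-to-one and cannot be inverted to recover the original extension.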
- class aws_textract_pipeline.doc_type.S3ContentTypeEnum(value)
AWS S3 content type. Setting the proper content type allows you to open the S3 object in a web browser without downloading it.
Ref:
- aws_textract_pipeline.doc_type.doc_type_to_content_type_mapper: Dict[str, Optional[str]] = {'bmp': 'image/bmp', 'csv': 'text/csv', 'excel': 'application/x-msexcel', 'gif': 'image/gif', 'jpg': 'image/jpeg', 'json': 'application/json', 'pdf': '\tapplication/pdf', 'png': 'image/png', 'ppt': 'application/mspowerpoint', 'text': 'text/plain', 'tiff': 'image/tiff', 'tsv': 'text/csv', 'unknown': None, 'word': 'application/msword'}
Mapping from DocTypeEnum to S3ContentTypeEnum.
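The following sketch shows one way a content type lookup could feed an S3 upload. It copies the documented dictionary into a local constant and builds a dictionary in the shape that boto3's `upload_file` accepts as `ExtraArgs`. The helper `s3_extra_args` and the constant name are hypothetical and not part of the library. Because the documented value for `'pdf'` carries a leading tab character, the sketch strips whitespace defensively. Whether the library does the same is not stated here.

```python
from typing import Dict, Optional

# Local copy of the documented value of doc_type_to_content_type_mapper,
# including the leading tab in the 'pdf' entry.
DOC_TYPE_TO_CONTENT_TYPE: Dict[str, Optional[str]] = {
    "bmp": "image/bmp", "csv": "text/csv",
    "excel": "application/x-msexcel", "gif": "image/gif",
    "jpg": "image/jpeg", "json": "application/json",
    "pdf": "\tapplication/pdf", "png": "image/png",
    "ppt": "application/mspowerpoint", "text": "text/plain",
    "tiff": "image/tiff", "tsv": "text/csv",
    "unknown": None, "word": "application/msword",
}


def s3_extra_args(doc_type: str) -> Dict[str, str]:
    """Hypothetical helper: build upload arguments for a document type.

    Returns {"ContentType": ...} when a content type is known, and an
    empty dict when the type maps to None or is not in the mapping, so
    that S3 applies its own default.
    """
    content_type = DOC_TYPE_TO_CONTENT_TYPE.get(doc_type)
    if content_type is None:
        return {}
    # Strip stray whitespace such as the tab in the 'pdf' entry.
    return {"ContentType": content_type.strip()}


if __name__ == "__main__":
    for doc_type in ("pdf", "jpg", "unknown"):
        print(doc_type, "->", s3_extra_args(doc_type))
```

The resulting dictionary could be passed as `ExtraArgs` to a boto3 `upload_file` call. Returning an empty dict for the `'unknown'` type avoids sending an explicit `None` content type.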