doc_type

See:

class aws_textract_pipeline.doc_type.DocTypeEnum(value)

Enumeration for document types.

Each intake document is classified into one of these types. The type determines which processing logic is applied to the document downstream.

For example:

classmethod detect_doc_type(filename: str) → str

Detect document type based on file name.

Parameters:

filename – file name including its extension, for example “example.pdf”

aws_textract_pipeline.doc_type.ext_to_doc_type_mapper = {'bmp': 'bmp', 'csv': 'csv', 'doc': 'word', 'docx': 'word', 'gif': 'gif', 'jpeg': 'jpg', 'jpg': 'jpg', 'json': 'json', 'pdf': 'pdf', 'png': 'png', 'ppt': 'ppt', 'pptx': 'ppt', 'tiff': 'tiff', 'tsv': 'tsv', 'txt': 'text', 'xls': 'excel', 'xlsx': 'excel'}

Mapping from file extension to DocTypeEnum.
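The relationship between detect_doc_type and this mapping can be illustrated with a minimal sketch. It is not the library's implementation: the function name detect_doc_type_sketch is hypothetical, and both the case-insensitive matching and the 'unknown' fallback for unmapped extensions are assumptions (the fallback is suggested only by the 'unknown' key in the content type mapping).

```python
import os

# Copied from the documented value of ext_to_doc_type_mapper.
ext_to_doc_type_mapper = {
    "bmp": "bmp", "csv": "csv", "doc": "word", "docx": "word",
    "gif": "gif", "jpeg": "jpg", "jpg": "jpg", "json": "json",
    "pdf": "pdf", "png": "png", "ppt": "ppt", "pptx": "ppt",
    "tiff": "tiff", "tsv": "tsv", "txt": "text",
    "xls": "excel", "xlsx": "excel",
}


def detect_doc_type_sketch(filename: str) -> str:
    """Hypothetical stand-in for DocTypeEnum.detect_doc_type."""
    # os.path.splitext("Report.DOCX") returns ("Report", ".DOCX").
    # Dropping the dot and lowercasing is an assumption, not documented behavior.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    # Assumption: extensions missing from the mapping resolve to "unknown".
    return ext_to_doc_type_mapper.get(ext, "unknown")


if __name__ == "__main__":
    for name in ("example.pdf", "Report.DOCX", "notes"):
        print(name, "->", detect_doc_type_sketch(name))
```

Several extensions collapse to one document type ('doc' and 'docx' both map to 'word'), so downstream logic can branch on the document type rather than on the raw extension.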

class aws_textract_pipeline.doc_type.S3ContentTypeEnum(value)

AWS S3 content type. Setting a proper content type allows you to open the S3 object in a web browser without downloading it.

Ref:

aws_textract_pipeline.doc_type.doc_type_to_content_type_mapper: Dict[str, Optional[str]] = {'bmp': 'image/bmp', 'csv': 'text/csv', 'excel': 'application/x-msexcel', 'gif': 'image/gif', 'jpg': 'image/jpeg', 'json': 'application/json', 'pdf': '\tapplication/pdf', 'png': 'image/png', 'ppt': 'application/mspowerpoint', 'text': 'text/plain', 'tiff': 'image/tiff', 'tsv': 'text/csv', 'unknown': None, 'word': 'application/msword'}

Mapping from DocTypeEnum to S3ContentTypeEnum.
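As a rough illustration of how this mapping might be used when uploading to S3, the sketch below resolves a content type and builds keyword arguments in the shape accepted by boto3's S3 client put_object call (Bucket, Key, Body, ContentType). The helper names content_type_for and put_object_kwargs are hypothetical and not part of this library. The documented value for 'pdf' appears to carry a leading tab character, so the sketch strips whitespace defensively; that is a choice made here, not documented behavior.

```python
from typing import Any, Dict, Optional

# Copied from the documented value of doc_type_to_content_type_mapper,
# including the leading tab in the 'pdf' entry.
doc_type_to_content_type_mapper: Dict[str, Optional[str]] = {
    "bmp": "image/bmp", "csv": "text/csv", "excel": "application/x-msexcel",
    "gif": "image/gif", "jpg": "image/jpeg", "json": "application/json",
    "pdf": "\tapplication/pdf", "png": "image/png",
    "ppt": "application/mspowerpoint", "text": "text/plain",
    "tiff": "image/tiff", "tsv": "text/csv", "unknown": None,
    "word": "application/msword",
}


def content_type_for(doc_type: str) -> Optional[str]:
    """Hypothetical helper: look up the content type for a document type."""
    content_type = doc_type_to_content_type_mapper.get(doc_type)
    # Strip stray whitespace such as the tab in the 'pdf' entry.
    return content_type.strip() if content_type is not None else None


def put_object_kwargs(bucket: str, key: str, body: bytes, doc_type: str) -> Dict[str, Any]:
    """Hypothetical helper: build keyword arguments for an S3 put_object call."""
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Key": key, "Body": body}
    content_type = content_type_for(doc_type)
    # Omit ContentType when the type is unknown so S3 applies its default.
    if content_type is not None:
        kwargs["ContentType"] = content_type
    return kwargs


if __name__ == "__main__":
    print(put_object_kwargs("my-bucket", "docs/example.pdf", b"%PDF", "pdf"))
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as s3_client.put_object(**kwargs); the bucket and key above are placeholders.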