landing

The landing bucket is where intake documents are stored at the beginning of the pipeline.

class aws_textract_pipeline.landing.MetadataKeyEnum(value)

An enumeration.

class aws_textract_pipeline.landing.LandingDocument(s3uri: str, doc_type: str, features: List[str])

Represents a document in the landing zone. Each document in the landing zone is a single S3 object, and the metadata of that S3 object should include the following information:

{
    "landing_s3uri": "s3://bucket/key", # the S3 URI of the document in the landing zone
    "doc_type": "pdf|word|excel|ppt|image|...", # the type of the document
    "features": ["TABLES"|"FORMS"|"QUERIES"|"SIGNATURES"|"LAYOUT", ...]
}

classmethod load(bsm: BotoSesManager, s3path: S3Path)

Load a LandingDocument object from an S3 object.

Parameters:
  • bsm – the boto_session_manager.BotoSesManager object.

  • s3path – the S3 path of the document in landing zone.
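As a minimal sketch of what loading might involve, the snippet below rebuilds a document record from the user-defined metadata of an S3 object. It is not the library's implementation: the `LandingDocumentSketch` class and its `from_metadata` helper are hypothetical, and storing `features` as a JSON-encoded string is an assumption made here because S3 user-defined metadata values are plain strings.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LandingDocumentSketch:
    """Hypothetical stand-in for LandingDocument, for illustration only."""

    s3uri: str
    doc_type: str
    features: List[str]

    @classmethod
    def from_metadata(cls, metadata: Dict[str, str]) -> "LandingDocumentSketch":
        # Assumption: "features" is stored as a JSON-encoded list, because
        # S3 user-defined metadata values are plain strings.
        return cls(
            s3uri=metadata["landing_s3uri"],
            doc_type=metadata["doc_type"],
            features=json.loads(metadata["features"]),
        )


# With boto3, the metadata dict could come from the "Metadata" field of a
# head_object response for the landing object.
doc = LandingDocumentSketch.from_metadata(
    {
        "landing_s3uri": "s3://bucket/key",
        "doc_type": "pdf",
        "features": '["TABLES", "FORMS"]',
    }
)
```

Keeping the parsing step separate from the S3 call makes it easy to check without AWS credentials.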

dump(bsm: BotoSesManager, body: bytes) → S3Path

Dump the LandingDocument object to an S3 object.

This method is used in the ingestion pipeline (prior to the Textract pipeline) to dump the document to the landing zone.

Parameters:
  • bsm – the boto_session_manager.BotoSesManager object.

  • body – the binary content of the document.
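The following sketch illustrates how an ingestion step could assemble an upload that carries the required metadata. The `build_put_object_kwargs` helper is hypothetical and not part of this package; it only prepares keyword arguments in the shape accepted by boto3's `put_object`, and the JSON encoding of `features` is an assumption.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlparse


def build_put_object_kwargs(
    s3uri: str, doc_type: str, features: List[str], body: bytes
) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments for boto3's put_object."""
    # "s3://bucket/key" parses to netloc="bucket" and path="/key"
    parsed = urlparse(s3uri)
    return {
        "Bucket": parsed.netloc,
        "Key": parsed.path.lstrip("/"),
        "Body": body,
        # Assumption: features is JSON-encoded since metadata values are strings.
        "Metadata": {
            "landing_s3uri": s3uri,
            "doc_type": doc_type,
            "features": json.dumps(features),
        },
    }


kwargs = build_put_object_kwargs(
    "s3://bucket/landing/invoice.pdf", "pdf", ["TABLES", "FORMS"], b"%PDF-1.7"
)
# A real upload would then be: boto3.client("s3").put_object(**kwargs)
```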

aws_textract_pipeline.landing.get_md5_of_bytes(b: bytes) → str

Get md5 of a binary object.
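A function with this signature can plausibly be written with the standard library alone. The sketch below uses a different name, `md5_of_bytes_sketch`, to make clear that it is an illustration and not the package's source.

```python
import hashlib


def md5_of_bytes_sketch(b: bytes) -> str:
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(b).hexdigest()


digest = md5_of_bytes_sketch(b"hello")
```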

aws_textract_pipeline.landing.get_tar_file_md5(bsm: BotoSesManager, s3path: S3Path) → str

Get md5 of all files in a tar file on S3. This md5 is deterministic. This md5 value is used as the content-based unique id of a document.
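One way to make such a digest deterministic is to hash only member names and contents, in sorted order, so that archive ordering and header fields such as modification times do not affect the result. The sketch below works on in-memory bytes instead of S3 and is an assumption about the approach, not the package's actual algorithm; `tar_md5_sketch` and `make_tar` are hypothetical names.

```python
import hashlib
import io
import tarfile
from typing import Dict


def tar_md5_sketch(tar_bytes: bytes) -> str:
    """Hash member names and contents in sorted order, ignoring tar headers."""
    md5 = hashlib.md5()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        members = sorted(
            (m for m in tar.getmembers() if m.isfile()),
            key=lambda m: m.name,
        )
        for member in members:
            md5.update(member.name.encode("utf-8"))
            md5.update(tar.extractfile(member).read())
    return md5.hexdigest()


def make_tar(files: Dict[str, bytes], mtime: int) -> bytes:
    """Build an in-memory tar archive (helper for this sketch only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


# Same files, different member order and different timestamps
tar_a = make_tar({"a.txt": b"alpha", "b.txt": b"beta"}, mtime=0)
tar_b = make_tar({"b.txt": b"beta", "a.txt": b"alpha"}, mtime=1700000000)
```

Hashing the raw archive bytes would give different digests for `tar_a` and `tar_b`, even though they contain the same files.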

aws_textract_pipeline.landing.get_doc_md5(bsm: BotoSesManager, s3path: S3Path, doc_type: str) → str

Get the md5 of the document based on its content. In the landing zone, we may use the file name as the S3 object key. However, file names are not unique: two different documents can share the same name. The md5 of the content is a better value for the S3 object key.
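To illustrate why a content-based key avoids name collisions, the sketch below derives an S3 URI from the MD5 of the document body. The `content_based_s3uri` helper, the bucket name, and the `landing` prefix are all hypothetical, and the key layout is an assumption.

```python
import hashlib


def content_based_s3uri(bucket: str, prefix: str, body: bytes) -> str:
    """Hypothetical helper: derive the S3 URI from the content's MD5."""
    md5 = hashlib.md5(body).hexdigest()
    return f"s3://{bucket}/{prefix}/{md5}"


# Two uploads that might share the file name "report.pdf" but differ in content
uri_v1 = content_based_s3uri("my-bucket", "landing", b"first version")
uri_v2 = content_based_s3uri("my-bucket", "landing", b"second version")
# The same content uploaded again maps to the same key
uri_v1_again = content_based_s3uri("my-bucket", "landing", b"first version")
```

Different content yields different keys regardless of the file name, while re-uploading identical content resolves to the same object.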