Processing
The document processing pipeline processes each document using a combination of NLP services, LLM services, and open-source libraries. It runs the following substeps:
Text and images are extracted from each document, and the raw text and images are stored in new locations in object storage buckets (see the extraction sketch after these steps).
Optionally, PHI and PII data is redacted from the text and images before they are stored (see the redaction sketch below).
The text is parsed to extract structured data from each document. Structured data extraction involves multiple steps: entity extraction, form field extraction, and extraction of data in tables (see the structured-extraction sketch below).
The text is split into smaller yet complete chunks, i.e. a chunk is never cut in a way that loses its central idea and degrades its semantic meaning (see the chunking sketch below).
The structured data extraction and chunking steps both require domain knowledge to be ingested into the pipeline.
Processed data is stored in a variety of databases: relational databases, vector databases, and search indexing platforms (see the storage sketch below).
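A minimal sketch of the extraction substep, assuming PDF inputs and the PyMuPDF library; the real pipeline may use different extractors and uploads the results to object storage buckets rather than the local paths shown here.

```python
import pathlib
import fitz  # PyMuPDF


def extract_text_and_images(pdf_path: str, out_dir: str) -> str:
    """Extract raw text and embedded images from a single document."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    doc = fitz.open(pdf_path)
    pages = []
    for page_no, page in enumerate(doc):
        pages.append(page.get_text())
        # Pull every embedded image on the page and write it out.
        for img_no, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            info = doc.extract_image(xref)
            img_file = out / f"page{page_no}_img{img_no}.{info['ext']}"
            img_file.write_bytes(info["image"])

    raw_text = "\n".join(pages)
    # In the pipeline this would be written to a bucket, not a local file.
    (out / "raw_text.txt").write_text(raw_text, encoding="utf-8")
    return raw_text
```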
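One possible implementation of the optional PHI/PII redaction substep for text, sketched here with Microsoft Presidio; the actual redaction service, entity list, and handling of images are assumptions and not specified by this page.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine


def redact_pii(text: str) -> str:
    """Detect PII/PHI entities in text and replace them with placeholders."""
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    # Detect entities such as names, phone numbers, and email addresses.
    results = analyzer.analyze(text=text, language="en")
    # Replace each detected span with a placeholder like <PERSON>.
    redacted = anonymizer.anonymize(text=text, analyzer_results=results)
    return redacted.text
```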
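A simplified sketch of the structured-data extraction substep, showing entity extraction with spaCy and table extraction with pdfplumber; the library choices are assumptions, and the form-field and LLM-based extraction steps are omitted here.

```python
import pdfplumber
import spacy


def extract_structured_data(pdf_path: str, raw_text: str) -> dict:
    """Pull entities and tables out of a document into a plain dict."""
    # Requires the small English model: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    # Entity extraction: names, dates, organizations, amounts, etc.
    entities = [
        {"text": ent.text, "label": ent.label_}
        for ent in nlp(raw_text).ents
    ]

    # Table extraction: each table becomes a list of rows (lists of cells).
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())

    return {"entities": entities, "tables": tables}
```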
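A minimal sketch of the chunking substep: chunks are built from whole paragraphs so a chunk is never cut mid-idea. The size limit and the paragraph-splitting rule are assumptions; the pipeline's actual chunker may also use sentence boundaries or domain-specific rules.

```python
import re


def chunk_text(text: str, max_chars: int = 1500) -> list[str]:
    """Group paragraphs into chunks without splitting a paragraph in two."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk rather than splitting the paragraph mid-idea.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```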
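A sketch of the storage fan-out, with SQLite standing in for the relational store; the vector-DB and search-index writes are shown as hypothetical placeholder functions because the actual platforms are not named here.

```python
import json
import sqlite3


def store_relational(db_path: str, doc_id: str, structured: dict) -> None:
    """Persist the structured extraction results to a relational table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS documents (doc_id TEXT PRIMARY KEY, data TEXT)"
    )
    con.execute(
        "INSERT OR REPLACE INTO documents VALUES (?, ?)",
        (doc_id, json.dumps(structured)),
    )
    con.commit()
    con.close()


def store_vectors(doc_id: str, chunks: list[str]) -> None:
    """Placeholder: embed each chunk and upsert it into the vector DB."""
    ...


def index_for_search(doc_id: str, chunks: list[str]) -> None:
    """Placeholder: push chunk text into the search indexing platform."""
    ...
```

Writing the same document to several stores lets each downstream consumer query the form it needs: structured fields from the relational database, semantic matches from the vector database, and keyword matches from the search index.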