Doc Management Service

This chapter describes the Trillo Workbench Doc Management Service. It processes documents using Document AI and generative AI for structured data extraction, semantic matching, summarization, and Q&A.

Architecture

Trillo Doc AI is a separate Trillo product built on top of Trillo Workbench. It uses the Doc Management Service APIs and is described in a separate document. This chapter provides an overview of the Doc Management Service, its component architecture (shown below), and its document processing pipeline.

  1. Trillo Doc Service is built on top of Trillo Workbench like any other application, using metadata for data models and serverless functions for APIs and workflows.

  2. Using the service, a workflow processes files available in cloud storage buckets (the processing pipeline is described below). These files can be transferred from the source through multiple channels. One of these channels is SFTP, which is provided by another Trillo product called Trillo File Manager.

  3. Trillo Doc Service organizes files into logical folders. A file may appear in multiple folder hierarchies, depending on the application's needs.

  4. A processing pipeline built using Doc Service processes each file to extract its content and structured data using PDF libraries, Document AI, and Generative AI APIs.

  5. It provides APIs to create embeddings and summaries of documents.

  6. It provides APIs to store embeddings in one of several vector databases. Trillo Workbench integrates with several vector databases, such as i) pgvector, ii) AlloyDB, and iii) Chroma.

  7. Doc Service implements role-based access control (RBAC) on documents. RBAC is discussed in a separate chapter.
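
To illustrate item 3, the following is a minimal sketch of how a single file can be visible under more than one logical folder without being copied. It is an in-memory toy model, not the Doc Service implementation: the class name, method names, folder paths, and file IDs are all hypothetical.

```python
from collections import defaultdict


class LogicalFolders:
    """Toy in-memory index of logical folders (hypothetical, for illustration)."""

    def __init__(self):
        self._files_by_folder = defaultdict(set)
        self._folders_by_file = defaultdict(set)

    def link(self, folder_path, file_id):
        """Make file_id visible under folder_path without copying the file."""
        self._files_by_folder[folder_path].add(file_id)
        self._folders_by_file[file_id].add(folder_path)

    def unlink(self, folder_path, file_id):
        """Remove the file from one folder; other folders are unaffected."""
        self._files_by_folder[folder_path].discard(file_id)
        self._folders_by_file[file_id].discard(folder_path)

    def list_files(self, folder_path):
        return sorted(self._files_by_folder.get(folder_path, ()))

    def folders_of(self, file_id):
        return sorted(self._folders_by_file.get(file_id, ()))


folders = LogicalFolders()
# The same invoice is filed by vendor and by accounting period.
folders.link("/vendors/acme/invoices", "file-001")
folders.link("/finance/2024/q1", "file-001")
folders.link("/finance/2024/q1", "file-002")

print(folders.folders_of("file-001"))
print(folders.list_files("/finance/2024/q1"))
```

Because a folder entry is only a link to the stored file, removing the file from one hierarchy leaves its other placements intact.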

APIs of Doc Management Service

Doc Management Service APIs fall into the following categories.

  1. File Transfer APIs: Similar to the File Management Service discussed in the previous chapter, it provides APIs to upload and download files.

  2. Organize Folders and Files: Also similar to the File Management Service, it provides APIs to organize folders and files using operations such as copy, rename, move, and delete.

  3. Classification: It provides APIs to classify documents by type, such as purchase order, invoice, article, or medical transcript. It also provides APIs to classify the individual pages of a document.

  4. Extraction: These APIs are built using Trillo functions. They extract the content of a file as text or structured data using libraries and other APIs, such as i) PDF libraries, ii) Google Document AI APIs, and iii) Generative AI APIs.

  5. Summarization: It provides APIs to summarize document pages.

  6. Embedding: It provides APIs for generating vectors based on default or custom chunking.

  7. Storage: Using Data Service APIs, it stores raw content and summaries in relational databases or BigQuery. It stores embeddings in vector databases for semantic search.

  8. Preview Image, Image Extraction, Thumbnail: It provides APIs, built using Trillo functions, to generate a preview of a page, extract embedded images, and generate thumbnails.

  9. RBAC: Enterprise-class RBAC (role, group, rule, and attribute based) can be overlaid on top of the folder hierarchy. (This is under development.)

  10. Search and QA: It provides APIs for semantic search and Q&A.
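
The embedding, storage, and search categories above can be pictured with a small, self-contained sketch. It does not use the Trillo APIs: the function names are hypothetical, the "embedding" is a plain bag-of-words count standing in for a real embedding model, and a Python list stands in for a vector database. It shows a fixed-size chunker with overlap (one plausible default), a paragraph-based custom chunker, and cosine-similarity search over the stored chunks.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Fixed-size word windows with overlap (assumed 'default' chunking)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap  # requires overlap < max_words
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, stop, step)]


def chunk_by_paragraph(text):
    """Example of custom chunking: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text):
    """Toy embedding: sparse term-count vector. A real system calls a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, index, top_k=1):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), chunk) for chunk, vec in index]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


document = (
    "Invoice 1042 is due on March 31.\n\n"
    "The purchase order covers twelve laptops."
)

# "Store" each chunk with its vector; a list stands in for the vector database.
index = [(chunk, embed(chunk)) for chunk in chunk_by_paragraph(document)]

best_score, best_chunk = search("When is the invoice due?", index)[0]
print(best_chunk)
```

In a Q&A flow, the retrieved chunks would typically be passed to a generative model as context for answering the question.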

These APIs are described in the following documents.

  1. Trillo Workbench Restful APIs

  2. Trillo Workbench SDK

  3. API Playground: In addition, all APIs are available in the Trillo Workbench UI along with their documentation. You can exercise an API if its preconditions are met.

Document Processing Pipeline

The following diagram shows a document processing pipeline. Trillo Workbench can distribute the pipeline across multiple concurrent threads and nodes to process a large number of documents.
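
As a rough sketch of the concurrency idea on a single node, the example below runs each document through a fixed sequence of stages on a thread pool. It is not the Trillo Workbench scheduler: the stage functions are placeholders, and the names `extract`, `classify`, `summarize`, and `process_all` are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(doc):
    # Placeholder for extraction via PDF libraries or Document AI.
    return {**doc, "text": doc["raw"].strip()}


def classify(doc):
    # Placeholder rule; a real stage would call a classification API.
    label = "invoice" if "invoice" in doc["text"].lower() else "other"
    return {**doc, "label": label}


def summarize(doc):
    # Placeholder summary: the first five words of the text.
    return {**doc, "summary": " ".join(doc["text"].split()[:5])}


STAGES = [extract, classify, summarize]


def process(doc):
    """Run one document through every stage in order."""
    for stage in STAGES:
        doc = stage(doc)
    return doc


def process_all(docs, max_workers=4):
    """Process documents concurrently; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, docs))


docs = [
    {"id": "d1", "raw": "  Invoice 1042 from Acme for twelve laptops  "},
    {"id": "d2", "raw": "Meeting notes for the quarterly review"},
]
results = process_all(docs)
for result in results:
    print(result["id"], result["label"], result["summary"])
```

Threads suit stages that mostly wait on external APIs. Spreading work across nodes would need a queue or a similar coordination mechanism, which this sketch does not cover.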

Customization of Pipeline

Trillo Doc AI is implemented using metadata-driven schemas and serverless functions, so it is easy to customize. In an enterprise use case, the pipeline is expected to include steps that integrate with enterprise application APIs and access control policies.
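
One way to picture this kind of customization is a pipeline defined as a list of step names, where each name resolves to a registered function. The sketch below is only an illustration of that pattern and does not reflect the actual Trillo metadata format: the registry, the step names, and the access policy rule are all hypothetical.

```python
STEP_REGISTRY = {}


def step(name):
    """Register a function under a name that a pipeline definition can use."""
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator


@step("extract")
def extract(doc):
    # Placeholder for content extraction.
    return {**doc, "text": doc["raw"].strip()}


@step("summarize")
def summarize(doc):
    # Placeholder summary: the first 20 characters of the text.
    return {**doc, "summary": doc["text"][:20]}


@step("apply_access_policy")
def apply_access_policy(doc):
    # Hypothetical enterprise rule: documents mentioning "salary" are HR-only.
    roles = ["hr"] if "salary" in doc["text"].lower() else ["everyone"]
    return {**doc, "allowed_roles": roles}


def run_pipeline(step_names, doc):
    """Apply the named steps in order and return the enriched document."""
    for name in step_names:
        doc = STEP_REGISTRY[name](doc)
    return doc


# The pipeline definition is plain data, so customizing it means editing a list.
default_pipeline = ["extract", "summarize"]
custom_pipeline = ["extract", "apply_access_policy", "summarize"]

doc = {"id": "d1", "raw": " Salary review for 2024 "}
result = run_pipeline(custom_pipeline, doc)
print(result["allowed_roles"])
```

Keeping the step list separate from the step code lets an enterprise-specific step be added or removed without touching the existing steps.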
