Write a workflow (pipeline for data processing)

In this lesson, we demonstrate a workflow implementation that reads data from BigQuery and writes it as NDJSON (newline-delimited JSON) to a bucket. It uses BucketOp to write the file to the bucket.

Code: /lessons/Write_a_workflow/WriteWorkflow.java
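The lesson's actual implementation uses Trillo Workbench's BucketOp, which is not shown here. As a rough illustration of the same read-query-write-NDJSON flow, the following sketch uses the plain Google Cloud Java client libraries directly; the project, dataset, table, column, and bucket names are placeholder assumptions.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

import java.nio.charset.StandardCharsets;

public class BigQueryToNdjsonSketch {

  public static void main(String[] args) throws Exception {
    BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Placeholder query; the real lesson reads its own dataset.
    String sql = "SELECT id, name FROM `my_project.my_dataset.my_table` LIMIT 1000";
    TableResult result = bigQuery.query(QueryJobConfiguration.newBuilder(sql).build());

    // NDJSON: one JSON object per line.
    StringBuilder ndjson = new StringBuilder();
    for (FieldValueList row : result.iterateAll()) {
      ndjson.append("{\"id\":").append(row.get("id").getLongValue())
            .append(",\"name\":\"").append(row.get("name").getStringValue())
            .append("\"}\n");
    }

    // Write the result as a single object to a placeholder bucket.
    BlobInfo blobInfo = BlobInfo.newBuilder("my-bucket", "export/my_table.ndjson")
        .setContentType("application/x-ndjson")
        .build();
    storage.create(blobInfo, ndjson.toString().getBytes(StandardCharsets.UTF_8));
  }
}
```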

In Trillo Workbench, any function can run as a background process or as a workflow. When it runs in the background, a unique task id is assigned to it, and all of its logs are stored in the database with the taskId as one of the fields.
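Trillo Workbench assigns the task id and persists the logs itself; the snippet below is only a conceptual sketch, with hypothetical names, of a taskId-tagged log record, so that all rows for one background run can later be queried together.

```java
import java.time.Instant;
import java.util.UUID;

public class TaskLogSketch {

  /** One hypothetical log row; every record carries the run's taskId. */
  record TaskLogEntry(String taskId, Instant at, String level, String message) {}

  public static void main(String[] args) {
    // Unique id for this background run (Trillo Workbench generates its own).
    String taskId = UUID.randomUUID().toString();

    TaskLogEntry entry = new TaskLogEntry(taskId, Instant.now(), "INFO",
        "Exported rows to bucket");

    // A real implementation would insert this row into the database;
    // printing it stands in for that here.
    System.out.println(entry);
  }
}
```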

A workflow is generally a long-running process that handles a large amount of data, such as a large file, a large dataset in BigQuery, or thousands of invoices received this week as PDFs at an email address. Therefore, workflow code is written as steps that process a chunk of data in each iteration. Some of the steps may run concurrently using Java threads, and a workflow may farm out work to multiple sub-workflows running on a cluster of machines. The workflow loop periodically checks whether the workflow has been canceled by another process or user. Trillo Workbench handles this complexity transparently, using "Op" subclasses for concurrent processing.
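The chunked loop, concurrent steps, and cancellation check can be pictured with the generic Java sketch below. It does not use the Trillo "Op" API; fetchNextChunk, processItem, and the cancellation flag are hypothetical stand-ins for what the platform provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

public class ChunkedWorkflowSketch {

  private static final int CHUNK_SIZE = 500;
  private final AtomicBoolean canceled = new AtomicBoolean(false);
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  public void run() throws InterruptedException {
    int offset = 0;
    while (!canceled.get()) {                  // periodic cancellation check
      List<String> chunk = fetchNextChunk(offset, CHUNK_SIZE);
      if (chunk.isEmpty()) {
        break;                                 // no more data: the workflow is done
      }
      // Farm the chunk out to worker threads, then wait before the next iteration.
      List<Future<?>> futures = new ArrayList<>();
      for (String item : chunk) {
        futures.add(pool.submit(() -> processItem(item)));
      }
      for (Future<?> f : futures) {
        try {
          f.get();
        } catch (ExecutionException e) {
          // A real workflow would log the failure against its taskId.
        }
      }
      offset += chunk.size();
    }
    pool.shutdown();
  }

  /** Called by another process or a user to cancel the workflow. */
  public void cancel() {
    canceled.set(true);
  }

  private List<String> fetchNextChunk(int offset, int size) {
    return List.of();                          // placeholder data source
  }

  private void processItem(String item) {
    // Placeholder: transform or write one record.
  }
}
```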

Steps

  1. Create a new function called Workflow.java.

  2. ...
