Functional system overview

Data processing framework

Kylo provides support for the following pipeline functions:

  • pipeline definition
  • pipeline deployment
  • pipeline execution
  • pipeline management
  • pipeline monitoring

Microservices

The microservices are realized as a set of containerized executables. These services are generated, developed, and orchestrated through Kylo (Kylo + Apache NiFi). Each must integrate with a Provenance Service.

  • Provenance:
    • All services are responsible for sending provenance or trace messages to an external/pluggable provenance system. Provenance messages are a step above basic logging in that they are more precisely defined (what should be logged) and they are collected by a central provenance system.
    • The Provenance Service collects provenance messages posted by the ingestion microservices and persists them for traceability and for obtaining the lineage of a message through the various services. The Provenance Service consists of two components: Provenance Persistence and the Provenance Query Service.
    • The Provenance Persistence Service is a microservice that reads the messages posted by the ingestion microservices and persists them to Elasticsearch.
    • The Provenance Query Service is a microservice that responds to user queries on data traceability by searching Elasticsearch.
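As a concrete illustration of the kind of message such a provenance system might handle, here is a minimal sketch in Python. The `ProvenanceMessage` fields and the `to_es_document` helper are assumptions for illustration, not the actual Kylo schema:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceMessage:
    """One trace event emitted by a microservice (hypothetical schema)."""
    service: str          # name of the emitting microservice
    event: str            # e.g. "RECEIVED", "STORED", "FAILED"
    payload_id: str       # identifier used to stitch lineage together
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def to_es_document(msg: ProvenanceMessage) -> str:
    """Serialize a message as the JSON body an Elasticsearch index call would take."""
    return json.dumps(asdict(msg))

doc = to_es_document(ProvenanceMessage("producer", "RECEIVED", "file-0001"))
```

The Provenance Persistence Service would index documents like this one; the Query Service would search the same index by `payload_id` to reconstruct lineage.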

Ingestion framework

The Ingestion framework supports reception and storage of incoming data files to a “landing zone” from which it is available for processing. The framework consists of several microservices:

  • Producer:
    • Source data systems are the entry point for all data to be ingested. Source producers are developed to extract data sets and to post them, and/or requests to transfer data sets. Producers receive data in a variety of formats (for example: XML, CSV, binary, or by URI reference) through a variety of protocols (for example: HTTP/REST, SFTP, Kafka/JSON).
    • In addition to extracting payload information from the source systems to the Landing Zone, each producer instance extracts metadata relating to the request. Data type and validation classification information is also extracted if it is part of the request URI.
    • Each producer instance logs each request to “some enterprise provenance service”. On completion of processing, if a failure occurred (for example: the payload is missing information), the request is logged to the Exception Service. This is part of the process of ensuring full tracking of successful and failed processing.
    • Messages successfully processed are put into a standard Ingestion Framework message format for downstream processing. These messages are output to a configurable message queue, which is read by the next microservice in the ingestion workflow.
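The standard Ingestion Framework message format is project-defined; the sketch below only illustrates the idea of wrapping a payload reference and its metadata into a uniform envelope and placing it on a queue. All field names here are hypothetical, and `queue.Queue` stands in for the configurable message queue:

```python
import queue

def build_imf_message(source: str, payload_uri: str, metadata: dict) -> dict:
    """Wrap an ingested payload in a uniform Ingestion Framework message.

    Field names are illustrative; the real IMF schema is project-defined.
    """
    return {
        "source": source,           # originating source system
        "payloadUri": payload_uri,  # where the payload landed in the landing zone
        "metadata": metadata,       # request metadata extracted by the producer
        "status": "RECEIVED",
    }

outbound = queue.Queue()  # stand-in for the configurable outbound queue
outbound.put(build_imf_message("crm", "lz://2024/01/orders.csv", {"format": "CSV"}))
```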
  • Consumer:
    • Consumer is a simple, configurable, message-driven microservice for transferring data from point A to point B.
    • It stores data to long-term, durable storage for subsequent processing (the Staging Zone).
    • The Consumer Service (a NiFi processor) listens for requests on an inbound queue. Each message represents a request to copy a blob (payload of data) from a source location to a target location. The source and target locations are configurable. The specific process for determining the source blob name and the destination blob name from the metadata request is also configurable via the transfer service plugins.
    • After successful completion of the copy from the source location to the staging zone and the archive zone, the Consumer Service posts a new message on the configured outbound queue as a notification to any interested parties that the resource is available in the new location.
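A minimal sketch of the consumer's copy-then-notify loop, using local files and in-process queues as stand-ins for the configurable source/target locations and message queues:

```python
import queue
import shutil
import tempfile
from pathlib import Path

def consume(inbound: queue.Queue, outbound: queue.Queue) -> None:
    """Process one transfer request: copy the blob, then notify downstream."""
    msg = inbound.get()
    shutil.copy(msg["source"], msg["target"])     # point A -> point B
    outbound.put({**msg, "status": "AVAILABLE"})  # resource is in its new location

# Usage with throwaway files standing in for the landing and staging zones.
workdir = tempfile.mkdtemp()
src = Path(workdir, "blob.bin")
src.write_bytes(b"payload")
dst = Path(workdir, "staged.bin")
inbound, outbound = queue.Queue(), queue.Queue()
inbound.put({"source": str(src), "target": str(dst)})
consume(inbound, outbound)
```

In the real service the source/target naming and the notification queue would come from the transfer service plugins rather than being hard-coded.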
  • Checkpoint:
    • The Checkpoint service is a microservice that updates and conveys the outcome of processing an Ingestion Framework message (IMF) to the Provenance service. Both successful and failed processing notify the Checkpoint service via the IMF. Checkpoint updates the IMF (see Checkpoint IMF classification below) and posts the updated message to the Provenance service's outbound message queue.
    • Checkpoint service currently supports configurations for the ingestion pipeline and for the Data Processing Framework (DPF).
    • For ingestion pipeline processing Checkpoint service listens for messages on an inbound message queue and posts modifications to the IMF to the outbound Provenance and Regulator message queues.
    • For the Data Processing Framework (DPF) configuration, messages are posted by Kylo to Checkpoint's secure HTTPS endpoint, and modifications are posted to the aforementioned outbound queues.
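The checkpoint fan-out described above can be sketched as follows. The `checkpoint` field and the two queues are illustrative stand-ins, not the real IMF classification or queue names:

```python
import queue

def checkpoint(imf: dict, succeeded: bool,
               provenance_q: queue.Queue, regulator_q: queue.Queue) -> dict:
    """Classify the IMF outcome and fan the updated message out to both queues."""
    updated = {**imf, "checkpoint": "SUCCESS" if succeeded else "FAILURE"}
    provenance_q.put(updated)  # traceability/lineage record
    regulator_q.put(updated)   # downstream regulator notification
    return updated
```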

Processing Framework

This framework is responsible for validating the data, parsing and converting it to a relational format, and adding a Hive schema to it.

  • Validate:
    • Validation determines whether data has any exceptions and pushes validated data to the core zone.
    • The Exception service is a microservice that records that an error occurred and conveys the error to the Checkpoint and Provenance services. When an ingestion step fails, the message is posted on the inbound Exception service queue, and the Exception service then posts the update.
    • The Exception service currently supports configurations for the ingestion pipeline.
    • For ingestion pipeline processing, the Exception service listens for messages on an inbound message queue and posts modifications to the Provenance message queues as well as to the secure HTTPS endpoint for Kylo.
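A minimal sketch of the validate-or-route-to-exceptions step, assuming a simple required-fields check; the real validation rules are pipeline-specific, and the queues are in-process stand-ins:

```python
import queue

def validate(record: dict, required: tuple = ("id", "payload")) -> list:
    """Return a list of validation exceptions (empty means the record is clean)."""
    return [f"missing field: {f}" for f in required if f not in record]

def route(record: dict, core_q: queue.Queue, exception_q: queue.Queue) -> None:
    """Push clean records to the core zone; send failures to the Exception service."""
    errors = validate(record)
    if errors:
        exception_q.put({"record": record, "errors": errors})
    else:
        core_q.put(record)
```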
  • Flatten & Schema:
    • The flattening process parses the data (XML unbundling, or mapping of text fields and keys) and puts the fields into Hive columns with Hive data types (because Spark reads Hive tables faster).
    • This creates the new Hive schema.
    • If exceptions occur, interfacing is with the same Exception service identified in Validate, and the same processes are followed.
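The flattening and schema-derivation step can be sketched as below. The type mapping is a simplified assumption about how values might map to Hive column types, not Kylo's actual inference logic:

```python
# Simplified, assumed mapping from Python value types to Hive column types.
HIVE_TYPES = {int: "BIGINT", float: "DOUBLE", bool: "BOOLEAN", str: "STRING"}

def flatten(record: dict, prefix: str = "") -> dict:
    """Unbundle nested structures into flat column names (nesting -> underscores)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}_"))
        else:
            flat[name] = value
    return flat

def hive_schema(flat: dict) -> dict:
    """Derive a Hive column -> type mapping from a flattened record."""
    return {col: HIVE_TYPES.get(type(val), "STRING") for col, val in flat.items()}

flat = flatten({"order": {"id": 7, "total": 9.5}, "valid": True})
schema = hive_schema(flat)
```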

Access Framework

This framework is responsible for providing controlled access to the processed data, through transformations and application-specific views.

  • Transform:
    • Transformation services in this example are reserved for future use and are not in scope beyond the processing framework identified above.
    • Conceptually, data mappings can be generated with NiFi and executed and monitored by Kylo.
  • Application Views:
    • Hive/Presto views should be created to provide specific data access protections in addition to the other security measures being put in place (for example: encrypted files in flight, Kerberos, Ranger policies, Vormetric Transparent Encryption, and encrypted files at rest). This limits what can be retrieved by individual users.
    • Semantic mapping to application-specific requirements can also be generated here, representing logical mapping that occurs during access rather than during the traditional ETL phase of processing.
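A restricted view of the kind described can be generated as a simple DDL string. This sketch assumes hypothetical view, table, and column names; the actual protections would also rely on the platform's security policies, not on the view alone:

```python
def restricted_view_ddl(view: str, table: str, columns: list, predicate: str) -> str:
    """Build a CREATE VIEW statement exposing only approved columns and rows."""
    cols = ", ".join(columns)
    return (f"CREATE VIEW {view} AS "
            f"SELECT {cols} FROM {table} WHERE {predicate}")

# Hypothetical application view limiting both columns and rows.
ddl = restricted_view_ddl("app_orders_v", "core.orders",
                          ["order_id", "amount"], "region = 'EU'")
```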
