Data Extraction and Migration from Multiple Data Sources

The purpose of this document is to show how to retrieve data from different data sources in order to transform and merge them later. In the following example, data records are retrieved from an RDBMS and from a CSV file.

The first data source is a database table that contains basic customer data such as name, address, phone number, and investable assets.

The second data source is a CSV file representing a dump of an Excel spreadsheet that contains additional data about these customers. The file could be provided by a sales representative dealing with these customers.

The goal is to retrieve the customer records from both data sources, filter them by area code and investable assets, and merge them into one output record per customer.

Data Extraction Setup

To run this example, you have to install the lifecycle management and integration runtime components.

The SQL script below shows how to create the customer table.

drop table if exists customer;
create table customer 
( 
customerId bigint(20) NOT NULL AUTO_INCREMENT,
entryDate datetime DEFAULT NULL,
firstName varchar(255) NOT NULL,
lastName varchar(255) NOT NULL,
ssn varchar(255) NOT NULL,
street varchar(255) NOT NULL,
city varchar(255) NOT NULL,
postalCode varchar(255) NOT NULL,
state varchar(255) NOT NULL,
phoneNumber varchar(255) NOT NULL,
areaCode varchar(255) NOT NULL,
investableAssets bigint(20) NOT NULL,
PRIMARY KEY (customerId)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=latin1;
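
To try the example, the table can be populated with a few test rows; the values below are purely illustrative and not part of the original sample:

-- illustrative test data only; customerId is filled by AUTO_INCREMENT
insert into customer (entryDate, firstName, lastName, ssn, street, city, postalCode, state, phoneNumber, areaCode, investableAssets)
values ('2012-01-15 10:00:00', 'Jane', 'Doe', '123-45-6789', '1 Main St', 'Springfield', '12345', 'IL', '555-0100', '217', 250000);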

Data Extraction Process

A file trigger is used to start the data extraction process. The process retrieves the data records from the CSV file, puts the result into a structured data element (dataFile), and sends it to the Data Base Extraction and Merge Process, as shown below.

File Extraction Process

Data Base Extraction and Merge Process

Initially, the process waits for data from the File Extraction Process in order to be started. Then the user has to enter two parameters for filtering: the areaCode and the investableAssetsLimit.

The SQL query is built based on those parameters:
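
A minimal sketch of such a query, assuming exact matching on the area code and treating investableAssetsLimit as an upper bound (the comparison direction is an assumption, and the literal values stand in for the user-entered parameters):

-- areaCode and investableAssetsLimit are substituted from the user-entered parameters
select * from customer
where areaCode = '217'
and investableAssets <= 100000;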

The database will be invoked with a Camel application type (find more details in the Connectivity to JDBC section). The output will be saved in structured data (dataDB) and then merged with dataFile using a Message Transformation application.

Definition of the Business Service

The Business Service is created as a Process Definition in a Process Model of your Infinity Service Bus Lifecycle Portal.

Definition of the Input Data Structure

The Input Data Structure to define the query predicates is created using Primitive Data Types in a Process Model of your Infinity Service Bus Lifecycle Portal.

Find more details on Primitive Data in the Stardust documentation.

Area Code

Definition of the RDBMS Record

The structure of the RDBMS record, for which a set is to be retrieved from the RDBMS, is created as a Structured Data Type in a Process Model of your Infinity Service Bus Lifecycle Portal.

Customer DataBase Structure
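
Structured Data Types in Stardust are backed by XSD schemas. A minimal sketch of such a schema for the customer database record (abbreviated; the type name is hypothetical, and the field names follow the table columns above):

<xsd:complexType name="CustomerDBRecord">
	<xsd:sequence>
		<xsd:element name="customerId" type="xsd:long"/>
		<xsd:element name="firstName" type="xsd:string"/>
		<xsd:element name="lastName" type="xsd:string"/>
		<xsd:element name="areaCode" type="xsd:string"/>
		<xsd:element name="investableAssets" type="xsd:long"/>
		<!-- the remaining columns are declared analogously -->
	</xsd:sequence>
</xsd:complexType>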

Definition of the CSV Record Structure

The structure of the CSV record is created as a Structured Data Type in a Process Model of your Infinity Service Bus Lifecycle Portal.

Customer File Structure

Definition of the Output Record Structure (Data Merge)

The structure of the Output Record, for which a set is to be returned, is created as a Structured Data Type in a Process Model of your Infinity Service Bus Lifecycle Portal.

Customer Merge Structure

Connectivity to JDBC

Connectivity to the RDBMS is achieved with a Camel Application with an Endpoint and Route as follows:

Dynamic configuration variables related to the JDBC connectivity (host, port, dbName, user, password) are also used. The screenshot below shows how to set the default values of the different variables.
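
A minimal sketch of such a route, assuming the camel-jdbc component and a datasource bean (hypothetically named customerDataSource) whose settings are resolved from those configuration variables:

<bean id="customerDataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
	<property name="driverClassName" value="com.mysql.jdbc.Driver"/>
	<property name="url" value="jdbc:mysql://${host}:${port}/${dbName}"/>
	<property name="username" value="${user}"/>
	<property name="password" value="${password}"/>
</bean>

<route>
	<from uri="direct:retrieveCustomers"/>
	<!-- the message body carries the SQL query built from the filter parameters -->
	<to uri="jdbc:customerDataSource"/>
</route>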

Connectivity to CSV

The connectivity to the CSV file is achieved with a Camel Trigger Endpoint and Route as shown below:
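
A minimal sketch of such a route, assuming the camel-file component polls a drop directory (directory and file names are illustrative) and the camel-csv data format parses the file content:

<route>
	<from uri="file://data/inbox?fileName=customers.csv&amp;noop=true"/>
	<!-- parse the CSV lines into a list of records -->
	<unmarshal>
		<csv/>
	</unmarshal>
	<to uri="direct:fileExtraction"/>
</route>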

Data Transformation and Merge via Message Transformation

A Message Transformation application is used to perform the required transformations between the input data (records from the database and the CSV file) and the output data (merged data). The following screenshot describes how the records are merged.

Example

The following screenshots are taken from the data extraction example. As you can see, we have two data sources:

Records from database

The following is an example of the CSV file:

Customerid,Salary,Role
1,2600,HR
2,2920,Software engineer
3,3100,QA
4,1960,IT administrator
5,2380,Data Manager

The process is started by the file trigger (reading the file); then the user enters areaCode and investableAssetsLimit for filtering.

Once the above activity has been completed, the merged data is shown below.

As expected, the Salary and Role fields come from the CSV file for customers with the same id.

Exposing the Service via REST

You can use the Process Interface facility of Stardust to expose services based on processes via REST or as Web Services.

Invoking the Service Periodically

In addition to invoking the Data Extraction Service via REST calls, it should be invoked once a day and the result should be sent to the sales manager via e-mail.

This can easily be achieved by defining another Process Definition, DailyCustomerReport.

A Quartz Timer Start Event would be defined for this Process Definition with the following URI:

<from uri="quartz://myTimer?trigger.repeatInterval=2000&trigger.repeatCount=-1"/>

DailyCustomerReport would invoke the Customer Query Service and a Mail Application, which would send the retrieved data to the sales manager.
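
A minimal sketch of the mail endpoint, assuming the camel-mail component with illustrative host and recipient values:

<!-- mailhost and the recipient address are placeholders -->
<to uri="smtp://mailhost?to=sales.manager@example.com"/>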

Processing of Record Sets

If you intend to process sets of records in your data processing, e.g. to merge all CSV records with their corresponding database records, you first need to model Structured Data Types with list elements to carry the results of the data retrieval, e.g. csvCustomerRecords and rdbmsCustomerRecords in the example above.

To map all entries from the CSV records to the corresponding RDBMS records, add

for (var n in csvCustomerRecords)
{
	for (var m in rdbmsCustomerRecords)
	{
		// match the CSV record to the database record by customer id
		if (rdbmsCustomerRecords[m].CUSTOMERID == csvCustomerRecords[n].ID)
		{
			outputRecord.push({
				customerId: rdbmsCustomerRecords[m].CUSTOMERID,
				salary: csvCustomerRecords[n].Salary,
				role: csvCustomerRecords[n].Role
				// ... the remaining output fields are mapped analogously
			});

			// stop scanning the database records once a match is found
			break;
		}
	}
}

to the Advanced Mapping field of the corresponding Message Transformation.