Data challenges and solutions

Data Migration

Abstract

Have you ever had a legacy system that no longer met your business requirements? Have you ever been tempted by a brand new, modern and shiny system much better suited to business requirements that have evolved significantly over the last couple of years? Have you also been afraid that moving from a legacy system to a new one might be a real disaster for your business, and that this was reason enough to stay with your old system forever? We all have. Migration always looks like a painful project, and with good reason: a huge number of projects fail because the migration from a legacy system does not go well.

But it does not need to be a painful experience if properly planned and managed. In this article, we show what a data migration project is and what challenges hide behind it. We will focus in particular on the technical side of data migration and leave the business aspects for another article.

Data migration

A data migration project is about moving data from one existing system to another. For the sake of clarity, they will be referred to as source systems and target systems in this article. The target system might be a brand new system that we have just developed or purchased. It may also be another existing system that we would like to integrate with the first one. The approaches are very similar and are based on an ETL process: Extraction, Transformation, Loading. Let's see what those steps mean in more detail.
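
Before diving into each step, here is a minimal Python sketch of the three steps wired together; the function bodies and field names are illustrative placeholders only, not a prescribed API.

    # A minimal sketch of the three ETL steps wired together.

    def extract(source_records):
        """Pull raw records out of the source system (here: a plain list)."""
        return list(source_records)

    def transform(records):
        """Convert source records into the target data model."""
        return [{"customer_name": r["name"].strip().title()} for r in records]

    def load(records, target):
        """Push the transformed records into the target system."""
        target.extend(records)

    target_system = []
    load(transform(extract([{"name": "  ada lovelace "}])), target_system)
    print(target_system)  # [{'customer_name': 'Ada Lovelace'}]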

Extraction

As the first step, we need to extract the data that we would like to see in the target system. In order to design the data extraction process, we first need to identify:

  • What data we need

This analysis needs to be based on the target system that we have. We need to understand how this system works and what data it needs in order to provide all required business functionality. As an example, the target system may need information about our existing customers or about the products that we offer.

  • Where the required data is sitting

We need to identify all systems, data silos and other places where we can find our source data. The source data can be stored in various formats, be it relational databases, document repositories, web pages or even printouts. We need to find all of them and make sure we can easily access them. Following the previous examples, customer data may be sitting in our old CRM system, and product information may be in our Product Catalogue or even in PDF files stored on a shared drive.

Once we know what data we need and where to extract it from, we are ready to design the data extraction process. Let's see what we need to do here:

  • Data cleansing

Data in source systems may not be exactly what we would like it to be, and we need to process it accordingly. Data may be:

  • Outdated or obsolete – so we need to ignore it and not consider it for migration if it is not required in the target system,
  • Invalid or irrelevant – as above, we do not need this data in the target system,
  • Corrupted – so we need to fix it before we can use it.
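
As a rough illustration of these three rules, here is a minimal Python filter; the field names, the cut-off date and the corruption fix are all hypothetical.

    from datetime import date

    CUTOFF = date(2015, 1, 1)  # assumed business rule: anything older is obsolete

    def cleanse(record):
        """Return a cleaned copy of the record, or None if it should be dropped."""
        if record["last_active"] < CUTOFF:      # outdated/obsolete: ignore
            return None
        if not record["name"]:                  # invalid/irrelevant: drop
            return None
        cleaned = dict(record)
        cleaned["name"] = cleaned["name"].replace("\x00", "")  # corrupted: fix
        return cleaned

    customers = [
        {"id": 1, "name": "Alice",     "last_active": date(2019, 5, 1)},
        {"id": 2, "name": "",          "last_active": date(2021, 3, 9)},   # invalid
        {"id": 3, "name": "Bob",       "last_active": date(2012, 1, 15)},  # outdated
        {"id": 4, "name": "Car\x00ol", "last_active": date(2020, 7, 2)},   # corrupted
    ]

    cleaned = [c for c in (cleanse(r) for r in customers) if c is not None]
    print([c["name"] for c in cleaned])  # ['Alice', 'Carol']
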
  • Data extraction techniques

We have two approaches here:

  • Custom development – we can extract the data from the source systems by developing our own custom tools. They can be developed from scratch (e.g. in Python, Perl or Java) or can be based on tools delivered by the source systems' vendors (e.g. using SQL*Plus from Oracle or CQL COPY from Apache Cassandra). The advantage of this approach is the flexibility we can achieve when the extraction needs to be really complex. However, we need experienced migration developers to do it.
  • 3rd party extraction tools – there are many on the market, and they can be chosen based on particular needs and budget. They might be easier to start with and to use, but flexibility and performance may be compromised, and some of them may not deliver all the required functionality.
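
To give a flavour of the custom-development route, here is a minimal Python sketch that extracts selected fields into a flat staging file; the in-memory SQLite database, table and column names stand in for a real source system.

    import csv
    import sqlite3

    # In-memory stand-in for the legacy source database; a real project would
    # connect to the CRM's database with the vendor's driver instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, country TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, "Alice", "Poland"), (2, "Bob", "Germany")])

    # Extract only the fields the target system needs into a flat staging file.
    with open("customers_extract.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "name", "country"])
        for row in conn.execute("SELECT id, name, country FROM customers ORDER BY id"):
            writer.writerow(row)
    conn.close()
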
  • Data quality validation

Along with the data extraction, we need to make sure the extracted data meets the required quality standards. All extracted data should be validated at least against:

  • Data format – e.g. dates should be in a date format and all numerical fields should contain only numbers,
  • Acceptable values – e.g. a country name column should only contain valid country names,
  • Business rules – e.g. all extracted bank accounts should also have their customers extracted, so we should not see an account without an owner.
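
A rough sketch of these three kinds of checks in Python is shown below; the field names, the reference country list and the error messages are illustrative assumptions.

    import re
    from datetime import datetime

    VALID_COUNTRIES = {"Poland", "Germany", "France"}  # illustrative reference list

    def validate(record, known_customer_ids):
        """Return a list of validation errors for one extracted record."""
        errors = []
        # Data format: the date field must parse and the balance must be numeric.
        try:
            datetime.strptime(record["opened"], "%Y-%m-%d")
        except ValueError:
            errors.append("opened is not a valid date")
        if not re.fullmatch(r"\d+", record["balance"]):
            errors.append("balance is not numeric")
        # Acceptable values: the country must come from the reference list.
        if record["country"] not in VALID_COUNTRIES:
            errors.append("unknown country")
        # Business rule: every account must belong to an extracted customer.
        if record["customer_id"] not in known_customer_ids:
            errors.append("account has no owner")
        return errors

    print(validate({"opened": "2020-13-01", "balance": "100",
                    "country": "Poland", "customer_id": 7},
                   known_customer_ids={1, 2, 3}))
    # ['opened is not a valid date', 'account has no owner']
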
  • Data reconciliation

We need to extract all data that is required in the target system and available in the source system, ensuring that no data is lost during the extraction process. Both systems should end up with the same amount of data; e.g. 1000 valid customers from the source system need to be migrated into the target system. The reconciliation process helps here: we measure the amount of data in the source system and compare it to the amount of data we extracted. As an example, we may count all customers in our CRM system and then in the files that the extraction process produced.
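
As a minimal illustration, the helper below counts the data rows in an extraction output file so the result can be compared with a count taken from the source system; the file layout (one header row, one row per record) is an assumption.

    import csv

    def extracted_count(path):
        """Count data rows in an extraction output file (header excluded)."""
        with open(path, newline="") as f:
            return sum(1 for _ in csv.reader(f)) - 1

    # source_total would come from e.g. SELECT COUNT(*) run against the
    # source CRM database:
    # assert source_total == extracted_count("customers_extract.csv"), \
    #     "reconciliation failed: records were lost during extraction"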

Once the data is extracted, it is kept in a staging area. This might be a relational database with tables or a folder with flat files, depending on the chosen technology. In the staging area the data waits for the following step, which is transformation.

Transformation

The data is extracted, but it is still the source data, in the original format and based on the original data model. The next step is to convert the data into the data model required by the target system. As examples, we may need the following transformations to happen:

  • Simple transformation

In order to create a new target object, we may need to change the formats of some fields, discard some parts of the original data, or look up new values from other sources (e.g. from lookup tables or from other systems). This kind of transformation is usually done within one data object.
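
A minimal sketch of such a transformation, assuming hypothetical source and target field names and an illustrative lookup table:

    # Illustrative lookup table mapping source country names to target codes.
    COUNTRY_CODES = {"Poland": "PL", "Germany": "DE"}

    def transform_customer(src):
        """Reshape one source row into the target model."""
        return {
            "customer_name": src["name"].strip().upper(),   # format change
            "country_code": COUNTRY_CODES[src["country"]],  # lookup
            # src["legacy_flags"] is intentionally discarded
        }

    print(transform_customer(
        {"name": " alice ", "country": "Poland", "legacy_flags": "XYZ"}))
    # {'customer_name': 'ALICE', 'country_code': 'PL'}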

  • Splitting data objects

Some objects may be split into multiple different objects. In a source CRM system, an address can be represented as an attribute of a customer, while in the target system we may have two different objects: one being a customer and the other being an address linked to that customer. As a result, the transformation process needs to create two different objects and a link between them (new primary and foreign keys). It may happen that one source object is used to create multiple target objects.
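
A minimal sketch of such a split, with hypothetical field names and a simple surrogate-key generator standing in for the target system's real key allocation:

    import itertools

    # Simple surrogate-key generator for the new address objects.
    address_ids = itertools.count(start=1)

    def split_customer(src):
        """Split one source record into a customer and a linked address."""
        address = {"address_id": next(address_ids), "street": src["address"]}
        customer = {"customer_id": src["id"], "name": src["name"],
                    "address_id": address["address_id"]}  # new foreign key
        return customer, address

    print(split_customer({"id": 7, "name": "Alice", "address": "1 Main St"}))
    # ({'customer_id': 7, 'name': 'Alice', 'address_id': 1},
    #  {'address_id': 1, 'street': '1 Main St'})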

  • Merging data objects

This situation is the opposite of splitting: multiple different objects from the source system are used to create one object in the target system.
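
A matching sketch of a merge, again with hypothetical field names, combining a CRM record and a billing record that describe the same customer:

    def merge_customer(crm_row, billing_row):
        """Combine two source objects describing the same customer."""
        assert crm_row["customer_id"] == billing_row["customer_id"]
        return {
            "customer_id": crm_row["customer_id"],
            "name": crm_row["name"],                # taken from the CRM
            "payment_terms": billing_row["terms"],  # taken from billing
        }

    print(merge_customer({"customer_id": 7, "name": "Alice"},
                         {"customer_id": 7, "terms": "NET30"}))
    # {'customer_id': 7, 'name': 'Alice', 'payment_terms': 'NET30'}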

  • Multiplying objects

In the source system, an object may have an attribute showing the quantity of objects it represents; e.g. a car object may have a quantity of three, telling us that a customer has three cars of the same type. The target data model, however, may require creating three separate car objects, one for each car that the customer possesses. In this case the final quantity of target objects will differ from the quantity seen in the source system, and this needs to be taken into account when reconciling.
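
A minimal sketch of such a multiplication, with hypothetical field names; note that one source row becomes several target objects:

    def multiply_cars(src):
        """Expand one source row with a quantity into separate target objects."""
        return [{"owner_id": src["owner_id"], "model": src["model"], "seq": i}
                for i in range(1, src["quantity"] + 1)]

    cars = multiply_cars({"owner_id": 7, "model": "Sedan", "quantity": 3})
    print(len(cars))  # 3 target objects from 1 source row: account for this
                      # when reconciling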

  • Aggregating objects

This situation is the opposite of multiplying: multiple instances of objects of the same type are used to create one object in the target system.

  • Complex transformation

In some cases we may need to implement a complex, custom transformation to create target objects. This may happen when the target object does not exist in the source systems at all. In another scenario, we may need multiple different source objects together with complex business rules in order to create a target object.

When it comes to transformation tools, we may decide to develop custom software in-house or use existing 3rd party software. We need to consider and evaluate all the pros and cons of both approaches.

Just as during the extraction process, we need to make sure we neither lose any data nor create redundant, unnecessary data during the transformation process. The reconciliation process needs to ensure that we have the same amount of data before and after the transformation. This might be challenging when we intentionally create or discard some data during the transformation, so the reconciliation process needs to be designed carefully, with attention to all the details.

The output data coming from the transformation process may be stored in the same way as the input data, e.g. in a relational database or in files. Once the data is transformed and stored, it is ready for the next step: loading.

Loading

We now have the data in the target format, conforming to the target data model. In the loading phase we interact directly with one or more target systems. Each of them may be based on a different technology and may require different loading methods and tools. As examples, we may have:

  • Databases

The target system may be a database, relational or NoSQL. Depending on the vendor, we may be able to use their own tools (e.g. sqlldr for Oracle or CQL COPY for Apache Cassandra), or we may prefer to develop our own tools in Python, Java or any other programming language. At this stage, the loader should be very simple and should only load the already prepared data. As such, in most cases the benefits of custom development might be insignificant and not worth the investment.
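
As an illustration of how simple such a loader can be, the sketch below bulk-inserts already transformed rows from a staging file; the in-memory SQLite database and column names stand in for a real target system.

    import csv
    import sqlite3

    # Stand-in for the target database; a real project would use the target
    # vendor's driver or its bulk loader (e.g. sqlldr, CQL COPY) instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")

    def load_customers(conn, path):
        """Bulk-insert already transformed rows from a staging file."""
        with open(path, newline="") as f:
            rows = [(int(r["customer_id"]), r["name"]) for r in csv.DictReader(f)]
        with conn:  # one transaction for the whole batch
            conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
        return len(rows)

    # loaded = load_customers(conn, "customers_transformed.csv")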

  • Black box systems

In another scenario, our target system may not expose its internal database for us to load the data into. Instead, it may expose a SOAP interface, a RESTful web service or its own API. In that case, we will need to use specific tools to handle that kind of interface or to develop a tool of our own.
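
A minimal sketch of pushing one record through a RESTful interface, using only the Python standard library; the endpoint URL, payload shape and expected status code are assumptions about a hypothetical target system.

    import json
    from urllib.request import Request, urlopen

    def load_customer(record, endpoint="https://target.example.com/api/customers"):
        """POST one transformed record to the target system's REST interface."""
        req = Request(endpoint,
                      data=json.dumps(record).encode("utf-8"),
                      headers={"Content-Type": "application/json"},
                      method="POST")
        with urlopen(req) as resp:
            return resp.status  # e.g. expect 201 Created on success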

  • Files

Some target systems may be able to consume data from files. In that case, we may only need to produce files formatted to the given requirements, and the target system will load them itself using its own mechanism. The file format might be CSV, XML, JSON or even a proprietary format. Depending on that format, we may use existing tools or develop a custom loader that meets our specific requirements.
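
A minimal sketch of producing such a load file, assuming a hypothetical JSON layout consumed by the target system's own import mechanism:

    import json

    def write_load_file(records, path="customers_load.json"):
        """Write transformed records in the format the target system imports."""
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    write_load_file([{"customer_id": 7, "name": "Alice"}])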

Whatever the target system is and whatever loading method we use, the reconciliation process comes back into the picture after the load is finished. We need to make sure that the target system contains all the data we wanted it to have. In our end-to-end reconciliation process we also need to check the data at every step, be it extraction, transformation or load.

Summary

In this short introduction to the data migration process, we briefly discussed when this process may be used and what it takes to complete it successfully. We described the three most important steps (data extraction, transformation and load), highlighted how important data validation and data quality are, and showed how reconciliation can help. In the following articles we will focus on each step separately and present hands-on examples.

 
