Don't let your "data lake" turn into a swamp

By Gaurav Roy, Director

HP Inc. Personal Systems Cloud Software R&D

The data economy is booming. Artificial Intelligence, Machine Learning, and Big Data power more data-driven analysis and automation than ever.

At HP Inc.’s Personal Systems, we have rolled out a Device-as-a-Service (DaaS) program that is a centerpiece of HP’s transformation from selling hardware to offering subscription-based solutions. Subscribers benefit because HP continuously monitors the performance and health of their HP systems, spotting issues before they become breakdowns that disrupt work. Running continuous health checks on millions of devices generates an immense volume and variety of data.

As we all move from data warehouses to data lakes, the quality and structure of data become central to all our AI implementations. We need to develop our ability to absorb unstructured data in myriad formats. There’s a tendency to dump everything into the lake: device data, usage data, financial data and, as homes get smarter, even kitchen-sink data. A data lake can quickly become enormous – woo-hoo! – but if it’s all garbage in there, it isn’t a lake.

It’s a swamp.


Characteristics of a data swamp

To know whether your data lake is turning into a swamp, ask the following questions:

  1. Is it hard to get data out?
  2. Is it hard to correlate between different data points or types of data?
  3. Is the quality and continuity of the data poor? Are there lots of missing fields or missing days?

If any of the above conditions are true, then you are in a swamp – or a swamp in the making.


How to keep your data in order in the lake

From HP DaaS, we have learned three key principles for keeping our data lake clean and useful. They’re simple, but they work!

1. Structure the Data

  1. Collect data in any format.
  2. Transform it to a structured format. (We call these “data classes.”)
    The structured format has a data section and a metadata section.
    • The data section holds the actual data.
    • The metadata section has information regarding the data class, version, source, ingestion time, and common internal identifiers like deviceID, userID, and companyID.

This converts all your data into a known, standardized data structure. By structuring your data in an automated fashion every day, you will always be able to account for all your data.
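
To make this concrete, here is a minimal Python sketch of what one of these structured “data class” records might look like. The wrapper function, the device_health class name, and the example values are illustrative assumptions rather than HP’s actual schema; only the data/metadata split and the metadata fields listed above come from the approach described here.

```python
import json
from datetime import datetime, timezone

def to_data_class(raw_payload: dict, data_class: str, version: str,
                  source: str, ids: dict) -> dict:
    """Wrap a raw payload in the standardized structure: a data section
    plus a metadata section describing class, version, source, ingestion
    time, and common internal identifiers."""
    return {
        "metadata": {
            "dataClass": data_class,          # e.g. "device_health" (illustrative)
            "version": version,
            "source": source,
            "ingestionTime": datetime.now(timezone.utc).isoformat(),
            "deviceID": ids.get("deviceID"),
            "userID": ids.get("userID"),
            "companyID": ids.get("companyID"),
        },
        "data": raw_payload,                  # the actual data, untouched
    }

# Example: wrapping a raw telemetry reading in the standardized structure.
record = to_data_class(
    {"batteryHealth": 0.87, "cpuTempC": 61},
    data_class="device_health", version="1.0", source="daas-agent",
    ids={"deviceID": "dev-123", "userID": "u-456", "companyID": "c-789"},
)
print(json.dumps(record, indent=2))
```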

2. Cleanse the Data

Once the data is in a standard format:

  1. Pass it through a validator. (A sketch of one possible validator follows this list.)
  2. Flag anomalies and fill in defaults so incomplete records remain usable.
  3. Feed the findings back to your data collector to improve quality at the source.
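
Here is one way such a validator might look in Python. The required-field list and the "UNKNOWN" default are assumptions for illustration; the point is the pattern: check, flag, fill, and feed the flags back.

```python
REQUIRED_FIELDS = {"dataClass", "version", "source", "ingestionTime", "deviceID"}  # illustrative

def validate(record: dict) -> dict:
    """Check a standardized record, flag anomalies, and fill defaults
    so downstream jobs never see missing fields."""
    meta = record.setdefault("metadata", {})
    anomalies = [field for field in REQUIRED_FIELDS if not meta.get(field)]

    # Fill defaults so the record is still usable downstream.
    for field in anomalies:
        meta.setdefault(field, "UNKNOWN")

    # Flag rather than silently drop; the flags feed back to the
    # data-collection team so capture quality improves at the source.
    meta["anomalies"] = anomalies
    meta["isClean"] = not anomalies
    return record
```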

3. Create "swim lanes"

  1. Partition stored data into different folders, based on the application it came from and the privacy jurisdiction that applies.
  2. Below that high-level partition, create a structured folder per class of data (see the path sketch below).
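
A small sketch of how such a partitioned path could be built; the application, jurisdiction, and folder layout names are hypothetical.

```python
from datetime import date

def swim_lane_key(application: str, jurisdiction: str, data_class: str,
                  day: date, filename: str) -> str:
    """Build the partitioned storage path: application and privacy
    jurisdiction at the top, then one folder per data class, then date."""
    return (f"{application}/{jurisdiction}/{data_class}/"
            f"{day:%Y/%m/%d}/{filename}")

# e.g. "daas-analytics/eu/device_health/2019/07/01/part-0001.json"
print(swim_lane_key("daas-analytics", "eu", "device_health",
                    date(2019, 7, 1), "part-0001.json"))
```

Because application and jurisdiction sit at the top of the path, a whole jurisdiction or a deprecated application can be listed, exported, or deleted by prefix alone.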

How do these steps help?

Your data scientist will love this because...
...they spend less time on format conversion, the identifiers needed for event correlation are already present in the data, and anomalies come pre-flagged.

Your privacy engineer will love this because...
...this approach controls what Personally Identifiable Information (PII) is stored. It makes it easy to purge data from deprecated applications and to honor the right to erasure. The data lake gets built-in data segregation for privacy.

Your data collection engineers will love this because...
...they get feedback from the cleansing and classification stages that helps them improve capture quality. Everyone loves structure.


How to make this happen?

Below is a strategy to do this using the Amazon Web Services cloud. (The same can be done using Microsoft Azure or Google Cloud Platform.)


Collection and Ingestion

Data Collection and Data Ingestion can happen using various mechanisms and from an astronomically large number of sources.

It’s best to impose structure at the collection point and do format conversion at the source. Streaming is one way to pull this structured data. Amazon Kinesis is an option for pulling in large streams of data in real time.
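
As an illustration, a producer that has already imposed structure at the source could push records onto a Kinesis stream roughly like this, using the standard boto3 client. The stream name, region, and record shape are assumptions for the sketch.

```python
import json
import boto3

# Assumes a Kinesis stream named "daas-telemetry" already exists (illustrative name).
kinesis = boto3.client("kinesis", region_name="us-west-2")

def send_structured_record(record: dict) -> None:
    """Push one already-structured record onto the stream; partitioning
    by deviceID keeps a given device's events together on one shard."""
    kinesis.put_record(
        StreamName="daas-telemetry",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["metadata"]["deviceID"],
    )
```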

In cases where bulk data transfer happens, structure is hard to impose. Rather than letting someone dump their raw data right into your lake, give them a separate “holding pond” that serves as a staging area for incoming data. S3 or a similar raw file store is good for this. In the temporary data pond, Big Data jobs (typically run on Big Data cluster technologies) can convert the data dump to your standardized format. Doing this at scale can be a 24-by-7 job; doing it at least daily is a good idea. Once the data has been ingested, the staging data can be archived or deleted, draining the pond to make room for the next data dump.
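
A rough sketch of “draining the pond” once a dump has been converted and ingested. The bucket and prefix names are hypothetical, and whether you archive or simply delete depends on your retention rules.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "daas-data-lake"     # illustrative bucket
STAGING = "staging/"          # the temporary "holding pond"
ARCHIVE = "archive/"          # where drained dumps are parked

def drain_holding_pond() -> None:
    """After a bulk dump has been converted and ingested, move the raw
    staging objects to an archive prefix so the pond is empty for the
    next dump (delete instead of copy if retention is not required)."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=STAGING):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(Bucket=BUCKET,
                           CopySource={"Bucket": BUCKET, "Key": key},
                           Key=ARCHIVE + key[len(STAGING):])
            s3.delete_object(Bucket=BUCKET, Key=key)
```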

Structuring the data also facilitates removal of PII.


Cleansing and Classification

Data Cleansing and Classification is one of the most important stages. It needs to validate incoming data and write it out, in bulk, to the corresponding folders for each application and data class. This stage needs to be massively scalable because it actually reads every piece of data and validates every field. To plough through this large set of files or streams at scale, Apache Spark and its competitors are good choices.
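
A condensed PySpark sketch of this stage, under the assumption that ingested records already follow the data/metadata structure described earlier; the paths, field names, and the single validation check shown are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleanse-and-classify").getOrCreate()

# Read the ingested, standardized records (path is illustrative).
raw = spark.read.json("s3://daas-data-lake/ingested/")

cleansed = (
    raw
    # Promote the field we partition on out of the nested metadata section.
    .withColumn("dataClass", F.col("metadata.dataClass"))
    # One simple validation example: flag records missing a device identifier.
    .withColumn("isClean", F.col("metadata.deviceID").isNotNull())
)

# Write one swim lane (folder) per data class; append today's batch.
(cleansed.write
    .partitionBy("dataClass")
    .mode("append")
    .parquet("s3://daas-data-lake/cleansed/"))
```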

For deprecated applications and data formats, having the data saved into individual folders makes cleanup straightforward. It’s also a good idea to keep the validation rules in a centralized configuration utility from which this stage can pick them up.


Essential to unlock AI

Keeping your data lake clean and structured can be a massive undertaking, but it is essential for harnessing the promise of AI. By following these simple principles, you can maintain a good, clean, structured data lake – for easy correlation by data scientists and secure, performant processes for data engineers and DevOps. The real dividend will be the insights you derive.



Gaurav Roy, a director of cloud software R&D for HP Inc., considers his work at HP to be “the adventure of a lifetime.” He is leading the transformation of the Personal Systems Software Division with innovations in cloud, web, and Artificial Intelligence. “HP is changing the game on its devices from the cloud - instilling intelligence and AI into them.” 

Gaurav has played key technical and executive leadership roles in large companies (Vodafone, Motorola, Samsung) and successful startups (Azingo, MobiliYa, and Nevis Networks). He is also proud of technical product achievements as a part of large teams: HP Device-as-a-Service, Tizen OS (which powers most Samsung televisions), Samsung Knox, and Vodafone Webbox. Gaurav is an avid board gamer and loves racquet sports like badminton and tennis. If you visit Houston, Texas, feel free to hit him up for a tech discussion or a game.


