Thursday, May 8, 2014

The Chicken or Egg Debate of Data




What came first, the chicken or the egg? As a vegetarian, I would prefer to avoid answering this question. From an evolutionary perspective, the chicken became a chicken slowly. There was probably never a line in the sand that the “pre-chicken” crossed to declare itself the modern chicken, nor a single moment when the egg could be said to have arrived. So it goes with data at most companies. How does data, a company’s most valued asset of the future, evolve and grow?

For most companies born before the dawn of the data/internet age, data is stuck in silos and is not nurtured into anything of meaningful use. However, traditional companies generate massive amounts of data exhaust that can be leveraged for actionable insights. Deciding what to do first, build the data warehouse or determine the desired outcome, is the chicken-or-egg question many organizations currently face, and one that data scientists can help answer.

Defining Data Exhaust

A slow evolution seems to be the most effective approach for traditional companies moving toward a data-based decision-making model. To start, organizations should pick an area that has high potential for either top-line or bottom-line impact. Sales is a logical starting point for companies struggling with revenue growth; for those facing margin pressures, cost savings is a good area of focus. Either way, a tremendous amount of data exhaust is available to drive actionable insights. For example, the interactions that sales and service teams have with customers are valuable pieces of data, even if that data is not captured in a standardized format.

Data Wrangling

How do organizations begin to apply analytical tools to extract actionable insights from their data exhaust? First, data scientists must prep the data, checking the accuracy, completeness, and timeliness of the information. Most likely, a number of cleansing rules must be applied to make the data ready for analysis. Often, correcting the format of the exhaust is the most arduous task, especially when multiple systems are involved.
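As a rough illustration of that first prep step, the sketch below scores a small batch of records for completeness and timeliness. Everything in it is an assumption made for the example: the field names (`name`, `email`, `updated`), the 90-day freshness window, and the sample records. Accuracy is left out because it usually requires comparison against a trusted reference source.

```python
from datetime import date


def profile(records, required, as_of, max_age_days):
    """Return the share of records that are complete and the share that are fresh.

    complete: every field in `required` is present and non-empty
    fresh:    the (hypothetical) `updated` date is within `max_age_days` of `as_of`
    """
    total = len(records)
    if total == 0:
        return {"completeness": 0.0, "timeliness": 0.0}

    complete = sum(
        all(rec.get(field) not in (None, "") for field in required)
        for rec in records
    )
    fresh = sum(
        (as_of - rec["updated"]).days <= max_age_days
        for rec in records
        if rec.get("updated")
    )
    return {"completeness": complete / total, "timeliness": fresh / total}


# Illustrative records only; not real data.
sample = [
    {"name": "Acme", "email": "a@example.com", "updated": date(2014, 5, 1)},
    {"name": "Globex", "email": "", "updated": date(2013, 1, 15)},
    {"name": "Initech", "email": "i@example.com", "updated": date(2014, 4, 20)},
    {"name": "", "email": "u@example.com", "updated": date(2014, 3, 1)},
]

report = profile(sample, required=("name", "email"),
                 as_of=date(2014, 5, 8), max_age_days=90)
```

A report like this gives a team a baseline to track as cleansing rules are added, rather than a verdict on the data.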

For example, in some date-based data sets the month is stored first, followed by the day, while other systems do the opposite. Systems without built-in edit checks allow users to enter information according to personal preference, which can cause record duplication. As a result, organizations in which multiple users maintain a single contact may create duplicate entries through spelling errors or formatting differences. This can cause significant problems when determining simple data points such as “How many customers does the company have?” Thus, matching and merging these seemingly disparate records is a must. In short, most data scientists will end up spending sixty to eighty percent of their time just prepping the data. The best of the best are often called data wranglers.
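The date-order and duplicate-record problems described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a production matcher: the system names (`crm`, `billing`), their date formats, the record fields, and the sample records are all hypothetical. The matching key only absorbs differences in case, punctuation, and spacing; catching true spelling errors would need fuzzy matching on top of it.

```python
from datetime import datetime, date

# Hypothetical source systems and the date order each one uses.
DATE_FORMATS = {
    "crm": "%m/%d/%Y",      # month first
    "billing": "%d/%m/%Y",  # day first
}


def parse_date(raw: str, system: str) -> date:
    """Parse a raw date string using the format of the system it came from."""
    return datetime.strptime(raw.strip(), DATE_FORMATS[system]).date()


def match_key(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def merge_customers(records):
    """Merge records that share a match key, keeping the earliest first contact."""
    merged = {}
    for rec in records:
        key = match_key(rec["name"])
        first_seen = parse_date(rec["first_contact"], rec["system"])
        if key not in merged or first_seen < merged[key]["first_contact"]:
            merged[key] = {"name": rec["name"].strip(), "first_contact": first_seen}
    return merged


# Illustrative records only; not real data.
records = [
    {"name": "Acme Corp.", "first_contact": "03/04/2014", "system": "crm"},
    {"name": "ACME  corp", "first_contact": "04/03/2014", "system": "billing"},
    {"name": "Globex, Inc.", "first_contact": "12/01/2013", "system": "crm"},
]

customers = merge_customers(records)
customer_count = len(customers)
```

Here the two Acme entries collapse into one customer, and both of their dates resolve to the same day once each is read in its own system's format.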

The Data Warehouse

The key to data wrangling is to codify all data cleansing rules into a system, ensuring that cleansing becomes a repeatable exercise. This is the beginning of an organization’s data warehouse, and it can be as simple as a shared drive. The goal is to create something that anyone can use, eliminating the need to start from scratch whenever employees change. This slow and continual build-up of a rules repository is what converts a company’s exhaust into gold. This approach does not require a huge outlay on an expensive data warehouse.
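One lightweight way to picture such a rules repository is a shared Python module in which every cleansing rule is a small named function, registered once and applied in order. The rule names, the record fields, and the `rule` decorator below are illustrative assumptions, not a prescribed design.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Rule = Callable[[Record], Record]

# The shared, ordered list of cleansing rules.
RULES: List[Rule] = []


def rule(fn: Rule) -> Rule:
    """Register a cleansing rule so it runs for everyone, every time."""
    RULES.append(fn)
    return fn


@rule
def trim_whitespace(rec: Record) -> Record:
    """Strip leading and trailing spaces from every field."""
    return {key: value.strip() for key, value in rec.items()}


@rule
def uppercase_state(rec: Record) -> Record:
    """Store state codes in upper case (hypothetical `state` field)."""
    if "state" in rec:
        return {**rec, "state": rec["state"].upper()}
    return rec


def cleanse(rec: Record) -> Record:
    """Apply every registered rule, in registration order."""
    for apply_rule in RULES:
        rec = apply_rule(rec)
    return rec


cleaned = cleanse({"name": "  Acme Corp ", "state": " il"})
```

Because each rule is a named function in one shared place, a new team member can read, rerun, and extend the list instead of rediscovering the same fixes.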

Building a good data warehouse from scratch is a long, expensive journey that will test the patience of even the most ardent C-suite sponsors. Funding for data warehouse projects is often pulled because timelines get stretched or budgets get bloated. Worse, in an effort to meet timelines and pacify the sponsors, shortcuts are taken in the cleansing logic and corrupted data is brought in.

While the iterative approach is the recommended path, it is equally important to ensure that the knowledge and data cleansing rules are not generated by one person. Without the institutionalization of rules, organizations run the risk of a single person becoming a bottleneck or, in the worst case, leaving the company with all the knowledge. Therefore, it is best to have a team responsible for the rules, with an explicit mandate to ultimately import all the established rules into the data warehouse.

As an added benefit of the iterative approach, with each new iteration not only does the data improve, but the company also gains new insights from analytics that drive real value. So, when the time comes for the data science team to build the real data warehouse, the business case is self-evident and the project is more cost-effective because of the prior work. The data warehouse will be useful from day one because the organization has clean data and knows how to derive actionable insights. Thus, the evolution from egg to chicken is one that is planned, proven, and provides a greater return on investment for the company.




BY: SANDEEP SANCHETI




