10 steps to data profiling for successful data discovery: Part I
November 18, 2020 | Data Catalog

Introduction

Data profiling is where to start when data quality is a priority. This is the step that ensures that the data you have access to is legitimate and has acceptable quality. Data profiling focuses on examining and analyzing data, followed by the creation of a useful summary of that data. Effective data profiling falls into three categories:

  • Structural discovery that validates data’s consistency and correct formatting
  • Content discovery that focuses on individual records to check for errors
  • Relationship discovery to understand the relationship between parts of the data
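
As a rough illustration of the three categories, here is a minimal sketch in plain Python. The `customers` and `orders` records, the field names, and the expected date format are all made up for the example; a real profiling tool would apply far richer checks.

```python
import re

# Hypothetical sample data, invented for this illustration.
customers = [
    {"id": "C1", "email": "ann@example.com"},
    {"id": "C2", "email": ""},
]
orders = [
    {"order_id": "O1", "customer_id": "C1", "date": "2020-11-18"},
    {"order_id": "O2", "customer_id": "C9", "date": "18/11/2020"},
]

# Assumed expected format for dates: YYYY-MM-DD.
DATE_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2}")

# Structural discovery: do values follow the expected format?
bad_dates = [o["order_id"] for o in orders if not DATE_FORMAT.fullmatch(o["date"])]

# Content discovery: do individual records contain errors such as blank fields?
missing_email = [c["id"] for c in customers if not c["email"].strip()]

# Relationship discovery: does every order refer to a known customer?
known_ids = {c["id"] for c in customers}
orphan_orders = [o["order_id"] for o in orders if o["customer_id"] not in known_ids]

print(bad_dates, missing_email, orphan_orders)
```

Each list holds the identifiers of the records that failed one kind of check, which is the raw material for the summary that profiling produces.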

Data discovery is meant to provide insight into the data held in the inventory and the trends within it. Before you profile your data, consider 10 steps that will make your data discovery endeavor successful. Our platform at DQLabs does AI-driven data profiling and accepts data from multiple sources in different formats. Part I covers the first five steps:

Step 1

Identify the data domains. Gather the domains of data that you want to profile and verify that they are all credible. A clear understanding of the domains is important because it gives a picture of how data flows within the organization. It also ensures that the amount of data in focus does not overwhelm the data analyst and that time is not wasted on data that will add no value at the analysis stage.

This process involves using the data’s semantics to discover its functional meaning. To achieve this, an analyst requires a domain profile that contains the main characteristics of the data. For instance, if the data belongs to an enterprise, the first step would be to identify which characteristics of its products appear in the data. The next step is to check the specific fields or characteristics to ensure they are standard; this can be achieved by parsing the data with rules to determine whether it is trustworthy. Where the data is in a spreadsheet of rows and columns, you create the profile by analyzing the individual columns. This can be done by running the data discovery process with data rules and column name rules. Data rules filter the columns whose values meet the threshold defined by the rule. Column name rules filter the columns whose names satisfy the rule’s logic.
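
The two kinds of rule can be sketched as follows. This is a minimal illustration and not how any particular platform implements it: the sample `table`, the email patterns, and the 0.6 threshold are all assumptions chosen for the example.

```python
import re

# Hypothetical sample table: column name -> list of cell values.
table = {
    "product_id": ["P-001", "P-002", "P-003"],
    "contact_email": ["a@example.com", "b@example.com", "not-an-email"],
    "notes": ["fragile", "", "call bob@example.com"],
}

# Column name rule (assumed): the name suggests an email field.
NAME_RULE = re.compile(r"e-?mail", re.IGNORECASE)

# Data rule (assumed): a value matches if it looks like an email address;
# a column passes when the share of matching values reaches the threshold.
VALUE_RULE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
THRESHOLD = 0.6


def columns_by_name(table):
    """Return the column names that satisfy the column name rule."""
    return [name for name in table if NAME_RULE.search(name)]


def columns_by_data(table, threshold=THRESHOLD):
    """Return the columns whose share of rule-matching values meets the threshold."""
    matched = []
    for name, values in table.items():
        if not values:
            continue  # an empty column cannot meet any threshold
        share = sum(1 for v in values if VALUE_RULE.fullmatch(v)) / len(values)
        if share >= threshold:
            matched.append(name)
    return matched


print(columns_by_name(table))
print(columns_by_data(table))
```

Running both rules is useful because they fail in different ways: a column can hold email addresses under an unhelpful name, and a column named like an email field can hold something else entirely.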

Step 2

Get authorization and protect any sensitive data. Request authorization for all required domains and state exactly what data will be needed from each domain. This ensures that sensitive data that is not useful for data discovery remains safe as the process continues. It is important to understand that not all available data in each domain will be used, and the organization might be reluctant to give access to some sensitive data. In some cases, the organization can have access to its data but be prohibited from sharing it because of an agreement with a client. For instance, organizations working with military or intelligence services might be restricted from sharing specific information on previous and upcoming transactions.

Once the data has been parsed with rules, the sensitive data is highlighted and prepared for masking. Data discovery also involves taking action on the sensitive data to improve the overall health of the organization’s data. Data masking obscures the original sensitive data by replacing or altering it with other content so that it becomes unidentifiable. Going forward, the sensitive data remains hidden, which enhances the data’s privacy.
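
One simple form of masking can be sketched like this. It is only an illustration under stated assumptions: the field names are hypothetical, and replacing all but the last four characters with `*` is just one of several masking techniques (others include tokenization and shuffling).

```python
def mask_value(value, visible=4, fill="*"):
    """Replace all but the last `visible` characters with a fill character."""
    if len(value) <= visible:
        # Short values are masked completely so nothing leaks.
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def mask_record(record, sensitive_fields):
    """Return a copy of the record with the flagged fields masked."""
    return {
        key: mask_value(val) if key in sensitive_fields else val
        for key, val in record.items()
    }


# Hypothetical record; "card_number" stands in for a field flagged as sensitive.
record = {"name": "Ada", "card_number": "4111111111111111"}
masked = mask_record(record, {"card_number"})
print(masked)
```

The function returns a new dictionary rather than changing the original, so the unmasked record can stay inside the authorized environment while only the masked copy is passed on.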

Step 3

Uncover potential internal sources. Understand how the organization’s data is generated: where it is generated, how it is generated, and how it is shared. If the organization has online platforms, understand which data they generate and whether it mixes with data generated from its offices. This will help in organizing the data logically, which makes the profiling process faster and more effective. This is a crucial step because it allows the analysts to decide how to structure their profiling process.

The discovered data should be categorized based on possible usage. For instance, the data can be categorized into quantitative and qualitative data. Qualitative data requires added context for successful profiling. Examples of qualitative data include employee satisfaction gathered from feedback and customers’ complaints, among others. Quantitative data, on the other hand, is numeric and requires no further action for successful profiling. Many analysts make the mistake of ignoring qualitative data and instead focus on quantitative data whose numbers are easy to analyze, such as revenue, number of customers, and other easy-to-understand figures. This can lead to incomplete reports, because qualitative data provides context for major changes in the quantitative data. For instance, a major drop in quantitative data such as sales can be explained by a qualitative analysis of how easy customers find a new online platform to use.
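
A first-pass split into the two categories can be sketched as below. The column names and values are invented, and the rule used here (a column is quantitative when every non-blank value parses as a number) is a simplifying assumption; numeric codes such as postal codes would need extra handling.

```python
def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def categorize(columns):
    """Split columns into quantitative and qualitative by their values."""
    result = {"quantitative": [], "qualitative": []}
    for name, values in columns.items():
        present = [v for v in values if v.strip()]  # ignore blank cells
        numeric = bool(present) and all(is_number(v) for v in present)
        result["quantitative" if numeric else "qualitative"].append(name)
    return result


# Hypothetical columns, invented for this illustration.
columns = {
    "revenue": ["1200.50", "980", ""],
    "customer_feedback": ["checkout is confusing", "great support"],
    "customers": ["40", "38", "41"],
}
categories = categorize(columns)
print(categories)
```

Keeping the qualitative columns in their own list, rather than dropping them, makes it harder to overlook them when the report is written.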

Step 4

Uncover potential external sources. Identify which external sources will provide data that genuinely enriches the analysis. This step includes vetting the reliability of the external sources and analyzing their relationship to the organization. External sources allow the analyst to understand the organization’s operations better, so that data profiling decisions are not made in isolation from the industry’s standards. By using external sources, an analyst gains an edge in understanding the internal data, especially the outliers. Understanding these sources also makes the profiling process faster, because the analyst will already know where to refer.

External data provides a good comparator for the conclusions reached from the internal data. However, there is a quality risk associated with external sources, because the organization may not have control over some of them. For instance, industry performance data extracted from external sources requires the extra step of the analyst vetting the source. The analyst should have a clear idea of the external data that will be needed. External data such as the number of vendors and the number of active customers should be updated regularly so that it matches the internal data sources. While uncovering potential external sources, the analyst also has to narrow the focus to what directly impacts the organization and the analysis being undertaken.

Step 5

Prioritize candidates of source data. After uncovering all the internal and external sources and obtaining authorization to access them, the next step is setting priorities on the source data. Setting priorities makes the profiling process flow smoothly and leads to more insight during data discovery. Failing to set priorities can mean more time is consumed by data sets that end up making little to no impact on the analysis results. Like every other activity within an organization, data profiling has to be optimized to minimize the time from the start of data analysis to the publishing of the final analysis.

By creating a list of source data with the priorities set, the analyst is able to map the way forward. The priority set determines the time and resources allocated to gathering the data. For instance, high-priority data would require thorough profiling to ensure that it meets the quality and content threshold that matches its position in the priorities list. This also gives the analyst an opportunity to optimize the source data discovery process in terms of cost and time. As with any other business activity, the resources spent on data discovery must match the value derived from the process in order to make economic sense.
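
A priorities list of this kind can be sketched as a simple ranking. Everything in the example is hypothetical: the source names, the 1 to 5 scoring scales, and the choice to rank by value per unit of effort are assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    value: int   # expected analytical value, 1 (low) to 5 (high); assumed scale
    effort: int  # expected profiling effort, 1 (low) to 5 (high); assumed scale


def prioritize(sources):
    """Sort by value per unit of effort, highest first; break ties by name."""
    return sorted(sources, key=lambda s: (-s.value / s.effort, s.name))


# Hypothetical candidate sources with made-up scores.
sources = [
    Source("crm_customers", value=5, effort=2),
    Source("industry_benchmarks", value=3, effort=3),
    Source("legacy_archive", value=2, effort=5),
    Source("web_orders", value=4, effort=1),
]
ranked = [s.name for s in prioritize(sources)]
print(ranked)
```

Scoring effort alongside value is what ties the ranking back to cost and time: a source with modest value but very low effort can reasonably be profiled before a valuable source that is expensive to reach.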

Continue reading Part II – 10 steps to data profiling for successful data discovery.