The data profiling allows a intuitive ways to clean, transform, and understand data in Power Query Editor, which helps you identifying the nuances of the data such as determining anomalies, examining and developing the underlying data structures, and querying data statistics such as row counts, value distributions, minimum and maximum values, averages, errors and so on.
Data profiling includes following options:
- Column quality
- Column distribution
- Column profile
Lets see how to use data profiling in Power BI step by step.
First load a sample data set in Power BI desktop, here we have used a dataset financials that comes by default within Power BI desktop (October 2020 and later version).
Once you load data in Power BI Desktop, lets click on Transform data tab in ribbon bar.
Once you click on Transform data, a Power Query Editor opens as shown below.
Next in Power Query Editor, go to View tab, there you will see the Column Quality, Column Distribution, and Column Profile options inside thee Data Preview Group.
Note that By default, Power Query will perform this data profiling over the first 1,000 rows of your data in dataset. To perform data profiling on entire dataset, you can check the lower left corner of Power query editor window to change how column profiling is performed.
Lets see understand about these option one by one.
Column quality shows you the percentages of data that is Valid shown in green, in Error shown in red, and Empty shown in dark grey.
When you enable the Column quality option, you will see a percentage of data in terms of Valid, Error and Empty are displayed directly underneath the name of the column shown below.
When you hover over any of the columns, you will get the information about the numerical distribution of the quality of values throughout the column. Also, selecting the ellipsis (…) opens some quick action buttons for operations on the values as shown below.
Column distribution shows you the distribution of the data within the column and the counts of distinct and unique values.
Distinct values are all values in a column, including duplicates and null values while unique values do not include duplicates or nulls. Therefore, distinct in this table tells you the total count of how many values are present, while unique tells you how many of those values are not duplicates or nulls.
When you enable Column distribution option you will see a small bar chart underneath the names of the columns which shows the the frequency and distribution of the values in each of the column, and data in these small chart is sorted in descending order from the value with the highest frequency.
When you hover over any of the columns, you will get the information about the overall data in the column in terms of distinct count and unique values. Also, selecting the ellipsis (…) opens some quick action buttons for operations on the values as shown below.
Column profile shows you a more in-depth look into the statistics within the column such as different values distinct and unique values, empty string, including the count of rows which is important when verifying whether the importing of your data was successful.
When you enable this option, you will see the information of column statistics and a chart that shows a value distribution underneath the data preview section.
In below screenshot you can see, the column statistic and value distribution for segment column. similarly, you can select other column to see the information.
Filter by value
You can also interact with the value distribution chart on the right side and select any of the bars by hovering over the parts of the chart as shown below.
now either you can click on ellipsis (…) or simply right click on bar, it gives you some filter options that you can perform as per your requirement.
You can also copy the data for column statistics and value distribution chart, In the upper-right corner of both the column statistics and value distribution sections, you can select the ellipsis (…) to display a Copy shortcut menu. Select it to copy the data displayed in either section to the clipboard, notepad or excel.
Group by value
You can also perform grouping based on value and text length, when you select the ellipsis (…) in the upper-right corner of the value distribution chart, in addition to Copy you can select Group by. This feature groups the values in your chart by a set of available options.
By default you can see grouping is done values, you can also change that to Group by Text length.
Lets change the grouping to Text Length, and you can see, the column distribution is done now based on text length.
Enabling column profiling for entire dataset
BY default, Power Query perform the data profiling over the first 1,000 rows of your data in dataset as shown below.
In case, if you want to perform data profiling on entire dataset, you can check the lower left corner of Power query editor window to change how column profiling is performed.
Lets click on that button, and you will see there it gives you one more option that is column profiling based on entire data set, just click on that.