Why you should NOT replace blanks with 0 in Power BI!

Just recently, I was watching Jeffrey Wang as a live stream guest with Reid Havens, and one of the dozen wonderful things that Jeffrey shared with the audience was the list of optimizations that the DAX engine performs when creating an optimal query plan for our measures.

And, the one that caught my attention was regarding so-called “Sparse measures”:

To make it simple, once you define the measure, Formula Engine in VertiPaq will add an implicit NonEmpty filter to the query, which should enable the optimizer to avoid full cross-join of dimension tables and scan only those rows where records for the combination of your dimension attributes really exist. For folks coming from the MDX world, the NonEmpty function may look familiar, but let’s see how it works in DAX.

The thing that most resonated with me was when Jeffrey advised against replacing BLANKs with zeroes (or whatever explicit values) in Power BI calculations. I’ve already written how you can handle BLANKs and replace them with zeroes, but in this article, I want to focus on the possible performance implications of this decision.

Setting the stage

Before we start, one important disclaimer: recommendation not to replace BLANK with 0 is just that – recommendation. If the business request is to display 0 instead of BLANK, it doesn’t necessarily mean that you should refuse to do it. In most scenarios, you will probably not even notice a performance decrease, but it will depend on multiple different factors…

Let’s start by writing our simple DAX measure:

Sales Amt 364 Products =
CALCULATE (
    [Sales Amt],
    FILTER ( ALL ( 'Product'[ProductKey] ), 'Product'[ProductKey] = 364 )
)

Using this measure, I want to calculate the total sales amount for the product with ProductKey = 364. And, if I put the value of this measure in the Card visual, and turn on Performance Analyzer to check the times for handling this query, I’m getting the following results:

DAX query took only 11ms to execute, and once I switch to DAX Studio, the xmSQL generated by the Formula Engine is quite simple:

And, if I take a look into the Query plan (physical), I can see that Storage Engine found only 1 existing combination of values to return our data:

Adding more ingredients…

However, let’s say that the business request is to analyze data for Product Key 364 on a daily level. Let’s go and add dates to our report:

This was again very fast! I will now check the metrics within the DAX Studio:

This time, the query was expanded to include Dates table, which affected the work Storage Engine needed to do, as instead of finding only 1 row, this time the number is different:

Of course, you will not notice any difference in performance between these two scenarios, as the difference is only a few milliseconds.

But, this is just the beginning, we are just warming up our DAX engine:)…In both of these cases, as you may see, we are seeing only “filled” values – that combination of rows where both of our requirements are satisfied – product key is 364 and only those dates where we had sales for this product – if you look thoroughly in the illustration above, dates are not contiguous and some are missing, such as January 12th, January 14th to January 21st and so on.

This is because Formula Engine was smart enough to eliminate the dates where product 364 had no sales using the NonEmpty filter, and that’s why the number of records is 58: we have 58 distinct dates where sales of product 364 were not blank:

Now, let’s say that business users also want to see those dates in-between, where product 364 hadn’t made any sales. So, the idea is to display 0$ amount for all those dates. As already described in the previous article, there are multiple different ways to replace the BLANKs with zeroes, and I will use COALESCE() function:

Sales Amt 364 Products with 0 = COALESCE([Sales Amt 364 Products],0)

Basically, what COALESCE function will do, is to check all the arguments provided (in my case there is only one argument), and replace the first BLANK value with the value you specified. Simply said, it will check if the value of the Sales Amt 364 Products is BLANK – if no, it will display the calculated value – otherwise, it will replace BLANK with 0.

Wait, what?! Why am I seeing all the products, when I filtered everything out, except product 364? Let alone that, my table now took more than 2 seconds to render! Let’s check what happened in the background.

Instead of generating one single query, now we have 3 of them. The first one is exactly the same as in the previous case (58 rows). However, the remaining queries target Product and Dates tables, pulling all the rows from both tables (Product table contains 2517 rows, while Dates table has 1826). Not just that, take a look at the query plan:

4.6 million records?! Why on Earth does it happen?! Let me do the math for you: 2.517 * 1.826 = 4.596.042…So, here we had a full cross-join between Product and Dates tables, forcing every single tuple (combination of date-product) to be checked! That happened because we forced the engine to return 0 for every single tuple that would otherwise return blank (and consequentially being excluded from scanning)!

This is a simplistic overview of what happened:

Believe it or not, there is an elegant solution to show blank values out-of-the-box (but, not with 0 instead of BLANK). You can just simply click on the Date field, and choose to Show items with no data:

This will display the blank cells too, but without performing a full cross-join between the Product and Dates tables:

We can now see all the cells (even blanks) and this query took half the time of the previous one! Let’s check the query plan generated by the Formula Engine:

Not all scenarios are catastrophic!

Truth to be said, we could’ve rewritten our measure to exclude some unwanted records, but it would still not be an optimal way for the engine to eliminate empty records.

Additionally, there are certain scenarios when even replacing BLANKs with zero will not cause a significant performance decrease.

Let’s examine the following situation: we are displaying data about the total sales amount for every single brand. And I will add my sales amount measure for product 364:

As you might expect, that was quite fast. But, what will happen when I add my measure that replaces BLANKs with 0, which caused havoc in the previous scenario:

Hm, looks like we didn’t have to pay any penalty in terms of performance. Let’s check the query plan for this DAX query:

Conclusion

As Jeffrey Wang suggested, you should stay away from replacing blanks with zeroes (or with any other explicit values), as this will significantly affect the query optimizer’s ability to eliminate unnecessary data scanning. However, if for any reason you need to substitute blank with some meaningful value, be careful when and how to do it.

As usual, it depends on many different aspects – for columns with low cardinality, or when you’re not displaying data from multiple different tables (like in our example, when we needed to combine data from Product and Dates tables), or visual types that don’t need to display a large number of distinct values (i.e. card visual) – you can get away without paying the performance price. On the other hand, if you use tables/matrices/bar charts that show a lot of distinct values, make sure to check the metrics and query plans before you deploy that report to a production environment.

Thanks for reading!

Last Updated on August 5, 2021 by Nikola

Spread the music:

2 Comments

Ilya Rzhevskiy

September 5, 2021at9:20 am


Wow, I didn’t know this. However in Data Science its always advisable to clean up tables, replacing Blanks with O’s. Because that could enable more complex algorithm calculations later on. So it depends on your usage of data.
1. Nikola
  
  September 6, 2021at4:05 pm
  
  
  Yes, it depends on the use case, but it’s good to be aware of the possible consequences.

Archives

Categories

Meta