Recently, I’ve got a request from business user within my company, to create a report which shows number of consecutive days where money turnover was greater than X value. When I started to prepare the data from our SQL Server database, there were few tricky parts to solve.

Sample Data

In order to demonstrate how I solved this, I will use Stack Overflow sample database. Let’s translate business request into following: find number of consecutive days where there were more than 50 posts and every of these posts had more than 50.000 views.

Table dbo.Posts contains around 17.2 million rows. First, let’s check how many posts with more than 50.000 views were per every single day:

SELECT convert(date,CreationDate) as CreationDate
	,count(Id) as TotalPosts
FROM Posts 
WHERE CreationDate >= '20080731'
	AND CreationDate < '20140101'
        AND ViewCount > 50000
GROUP BY convert(date,CreationDate)
ORDER BY CreationDate

When I run this query, I’m getting following results:

In order to better cope with running queries, I’ve created non-clustered index on CreationDate column, with ViewCount column included. Now, when I run the query to check dates where there were more than 50 posts with 50.000+ views, I am getting following results:

;WITH CTE_POSTS as(
select convert(date,CreationDate) CreationDate
	,count(Id) totalPosts
from Posts 
where CreationDate >= '20080731'
	and CreationDate < '20140101'
	and ViewCount > 50000
group by convert(date,CreationDate)
)

SELECT * 
FROM CTE_POSTS
WHERE totalPosts > 50
order by CreationDate

As you can notice, between September 15th 2008 and September 19th, all dates are included. Then, 20th and 21st are missing…This problem is well known as “Gaps and Islands” problem. Here, first bucket of dates (15th – 19th) represents “island”, while 20th and 21st represent “gap”.

Welcome to the island…

Let’s first identify our “islands” (buckets of consecutive dates which satisfies the criteria: in our case, dates with more than 50 posts which have 50.000+ views):

; WITH CTE_POSTS as(
SELECT CONVERT(DATE,CreationDate) CreationDate
	,COUNT(Id) totalPosts
from Posts 
WHERE CreationDate >= '20080731'
	AND CreationDate < '20140101'
	AND ViewCount > 50000
GROUP BY convert(date,CreationDate)
HAVING COUNT(Id) > 50
)


,starting as (
SELECT CreationDate
	,ROW_NUMBER() OVER(ORDER BY CreationDate) rnm
FROM CTE_POSTS as CTE1
WHERE NOT EXISTS (
	SELECT * 
	FROM CTE_POSTS as CTE2
	WHERE DATEDIFF(dd,CTE2.CreationDate,CTE1.CreationDate) = 1)
	)

,ending as (
SELECT CreationDate
	,ROW_NUMBER() OVER(ORDER BY CreationDate) rnm
FROM CTE_POSTS as CTE1
WHERE NOT EXISTS (
	SELECT * 
	FROM CTE_POSTS as CTE2
	WHERE DATEDIFF(dd,CTE2.CreationDate,CTE1.CreationDate) = -1)
	)

SELECT s.CreationDate as startingDate
	,e.CreationDate as endingDate 
        ,DATEDIFF(dd,s.CreationDate,e.CreationDate) as ConsecutiveDays
FROM starting s
INNER JOIN ending e on s.rnm = e.rnm

This way, we identified our “islands”, as you can see in the following screenshot:

Take advantage of LEAD/LAG

If you run version of SQL Server which is 2012 or later, you can take advantage of time offset functions LEAD and LAG. By default, LEAD will return next value within the sequence, but the offset can be also adjusted with optional parameter, while latter will do exactly the opposite – return previous value in sequence (by default).

Let’s see how LEAD works in our scenario:

; WITH CTE_POSTS as(
select CONVERT(DATE,CreationDate) CreationDate
	,COUNT(Id) totalPosts
FROM Posts 
WHERE CreationDate >= '20080731'
	AND CreationDate < '20140101'
	AND ViewCount > 50000
GROUP BY CONVERT(DATE,CreationDate)
HAVING COUNT(Id) > 50
)

,CTE_LEAD AS
(
SELECT CreationDate
	, DATEADD(day, -1 * DENSE_RANK() OVER(ORDER BY CreationDate), CreationDate) AS DateGroup
FROM CTE_POSTS
)

SELECT MIN(CreationDate) AS start_range
	,MAX(CreationDate) AS end_range
	,DATEDIFF(dd,MIN(CreationDate),MAX(CreationDate)) AS consecutiveDays
FROM CTE_LEAD
GROUP BY DateGroup;

This solution is much more intuitive, since you avoid complexities with self -joins, but what is more important, this code runs much more faster.

Here, I started with ROW_NUMBER() function instead of DENSE_RANK(). But, then I found nice explanation from Itzik Ben-Gan in his book “Microsoft SQL Server 2012 High Performance T-SQL Using Window Functions”, where he clarifies that DENSE_RANK() guarantees that code will work properly even if we had duplicate values in our sequence, while ROW_NUMBER() won’t.

Results are, as expected, the same:

Finding Gaps

In some scenarios, you want to check ranges where your business request wasn’t met. For example, you would like to check dates when your sales were below some threshold, or when your system didn’t work properly.

In my example, I will show dates when there were not more than 50 posts with 50.000+ views.

Traditional T-SQL solution

Opposite from finding “islands”, here we need to check consecutive values and detect where are the “gaps” in between.

; WITH CTE_POSTS as(
select CONVERT(DATE,CreationDate) CreationDate
	,COUNT(Id) totalPosts
FROM Posts 
WHERE CreationDate >= '20080731'
	AND CreationDate < '20140101'
	AND ViewCount > 50000
GROUP BY CONVERT(DATE,CreationDate)
HAVING COUNT(Id) > 50
)

SELECT DATEADD(dd,1,CreationDate) AS startDate
	,DATEADD(dd,-1,(SELECT MIN(t2.CreationDate)
	 FROM CTE_POSTS t2
	 WHERE t2.CreationDate > t1.CreationDate)) AS endDate
FROM CTE_POSTS t1
WHERE NOT EXISTS (
        SELECT *
	FROM CTE_POSTS t2
	WHERE t2.CreationDate = DATEADD(dd,1,t1.CreationDate)
	       AND CreationDate < (SELECT MAX(CreationDate) from CTE_POSTS))

Get sophisticated with Window function

As for “islands” problem, similar logic can be implemented to find “gaps”. So, as in previous example, we will use LEAD function to achieve our goal:

; WITH CTE_POSTS as(
select CONVERT(DATE,CreationDate) CreationDate
	,COUNT(Id) totalPosts
FROM Posts 
WHERE CreationDate >= '20080731'
	AND CreationDate < '20140101'
	AND ViewCount > 50000
GROUP BY CONVERT(DATE,CreationDate)
HAVING COUNT(Id) > 50
)
,CTE_LEAD AS
(
SELECT CreationDate AS startDate
	, LEAD(CreationDate) OVER(ORDER BY CreationDate) AS endDate
FROM CTE_POSTS
)
SELECT DATEADD(dd, 1, startDate) AS startDate
	,DATEADD(dd, -1, endDate) endDate
FROM CTE_LEAD
WHERE DATEDIFF(dd, startDate, endDate) > 1;

Again, with LEAD, the code is much more intuitive and human readable. Here, we are just checking range between consecutive dates and if it’s greater than 1, that means that we have a gap in our sequence. When that criteria is met, we are just adding one day to last “island” date in the sequence to get first “gap” date, while subtracting one from the first “island” date in next sequence to get the last date of the “gap”.

Last Updated on May 19, 2020 by Nikola

Spread the music: