Today’s post is not only technical but also a reminder to always think about what you deliver when you are tasked with something. Do not just solve it technically; make sure it is logically sound.
Imagine your superior comes to you and tasks you with the following.
“We need to do a cohort retention analysis of our sales staff. Please provide me with our yearly cohorts over the last full five years. I need to see how the retention develops in buckets of, say, 180 days – ideally, make it dynamic for me.”
“Sure,” you think. “Not a problem.” And you get to work.
You define your cohorts by the year of the staff’s starting date.
You create the bins by dividing the sales staff’s experience (i.e., their days with the company) by a parameter you created (after all, your boss wants it dynamic, right?):
//Experience bins
INT([Experience] / [p.bin_size(days)]) * [p.bin_size(days)]
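The [Experience] field itself is not spelled out in this post. A minimal sketch of how it could be defined – assuming the start date lives in the [Login] field and the reference date is the [p.fake_today] parameter, both of which appear further below – would be:
//Experience (assumed definition)
DATEDIFF('day', [Login], [p.fake_today])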
Next, for the retention rate you write a quick table calculation and set it to compute Table (down).
//Retention - Basic calc
ZN(COUNTD([Id])) / LOOKUP(ZN(COUNTD([Id])), FIRST())
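FIRST() points the LOOKUP at the first row of the partition, i.e. the 0 days bin, so each bin’s distinct headcount is divided by the cohort’s starting headcount. Taking the 2023 cohort numbers that show up later in this post as an example: 15 salespeople appear in the 180 days bin and 48 in the 0 days bin, so this naive calculation reports 15 / 48 ≈ 31% retention for that bin.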
Now all that is left to do is to put your cohorts on columns, the bins on rows, your retention rate on label and color, and …

You are done. Or… aren’t you?
What you just provided to your superior is an inaccurate assessment of the company’s retention rates.
Even though the numbers look sensible at first glance (declining over time) and maybe a bit worrisome, we failed to consider that the cohorts pretend that everyone who joined during a year is basically the same.
However, imagine one person joined on January 1st, 2023, another colleague in April, and the last one at the end of December.
It is clear that the one who joined in December had no chance to even make it to the 180 days bin as of the day of this writing (February 12, ’24).
Neither could the one from April have made it to the 360 days bin, though at least there was a chance for him to reach 180 days.
Only one out of three even had the chance to make it through all bins – something the previous analysis does not consider. It treats any staff member who does not show up in a certain bin as churned / no longer retained, which is incorrect.
Adjust for the time difference within your cohorts
To account for that, we first must define the maximum bin each salesman could have achieved.
//achievable max bin per salesman
{ FIXED [Id]:
    INT(MAX(DATEDIFF('day', [Login], [p.fake_today])) / [p.bin_size(days)]) * [p.bin_size(days)]
}
For every Id, i.e. salesman, we take the maximum of the difference, measured in days, between the person’s login date and a parameter-defined date (p.fake_today – you could also use TODAY(), but since my example fake data is static, I faked today) and then attribute that to the corresponding bin. We basically calculate the bin that we would see in the data if the person had not churned (and if they haven’t, it is simply their max bin).
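To make this concrete with the three joiners from above and a 180-day bin size (the bucket size the boss asked for): the end-of-December joiner sits at roughly 45 days between login and February 12, 2024, so INT(45 / 180) * 180 = 0; the April joiner sits at roughly 315 days, so INT(315 / 180) * 180 = 180; and the January 1st joiner is at about 407 days, so INT(407 / 180) * 180 = 360.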
Here is what that would give us for some example Ids:

Next, we will create three more calculations:
//dist cnt id | TC
WINDOW_SUM(COUNTD([Id]))
//dist cnt id | excl | TC
WINDOW_SUM(SUM({ EXCLUDE [Experience Bins]: COUNTD([Id]) }))
//fair retention pct | TC
[dist cnt id | TC] / [dist cnt id | excl | TC]
The first calculation will just give us the number of salesmen that actually made it into a specific bin.
The second one gives us the number of salesmen that COULD have made it, i.e. it establishes our fair comparison base.
The final calculation then simply divides the former by the latter, returning the fair retention percentage.
But how does this work? In the following screenshot, I have put the first of the calculations just mentioned in a table, plus – for clarification purposes – the inner EXCLUDE part of the second calculation as a stand-alone calculation and the pure distinct count of the Ids.

How to read this?
The first column is our exclude calculation, which disregards any granularity imposed by the experience bins. It tells us the number of salesmen within each achievable max bin. So, six people could only have made it to the 0 (to 89) days bin, 16 could have made the next bin, and so on.
The second column gives us the plain distinct count of Ids in each row. If you sum it up across all rows of a given experience bin, you get the number of agents that have factually made that bin.
Consider the zero bin, for example: we have 6 in the first row, 16 in the second, then 5 in the 180 -> 0 row, 6 in the 270 -> 0 row and 15 in the 360 -> 0 row. That is 48 agents in all.
For the 180 days bin we see 3 + 1 + 11 = 15.
These numbers can be found again in the third column, which lists 48 for the 0 days bin, 15 for the 180 days bin, and so on.
Note that this is a table calculation, and we have set it to compute along the achievable max bin. Therefore, it will be partitioned by (i.e., it creates sub-tables for) the combination of cohort (which is 2023 every time) and the experience bin.

Our [dist cnt id | excl | TC] calculation computes the same way, but since it excludes the experience bins, it effectively just calculates, within the 2023 cohort, the window sum along the achievable max bin per salesman. Thus, as we can see for the 90 days bin, we get 42, which is the result of 16 + 5 + 6 + 15.
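Following the same logic for the 180 days bin: only salespeople whose achievable max bin is 180, 270 or 360 can appear there, so the exclude calculation sums 5 + 6 + 15 = 26. Combined with the 15 agents that factually made that bin, the fair retention comes out at 15 / 26 ≈ 58%, instead of the 15 / 48 ≈ 31% the naive approach reports.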
Put it into action
The last thing we need to do now is to turn this into our retention table.
Obviously, we do not want the achievable max bin on rows or columns, but since we need it, it has to be put onto Detail.
Build this:

The [last] calculation needs to be added to the filters, set to compute along [achievable max bin per salesman] and then filtered to 0, since otherwise we would show multiple marks, given that we put [achievable max bin per salesman] onto Detail.
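In case the [last] calculation is not obvious: it is presumably just the standard Tableau pattern for keeping a single mark per partition,
//last
LAST()
Filtering it to 0 keeps only the last mark along [achievable max bin per salesman] in each partition; since the window sums return the same total for every mark in a partition anyway, this simply prevents identical marks from being drawn on top of each other.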
Set all other calculations that carry “ | TC ” in their name to compute correspondingly as well.
Beautify your tooltip a bit and there you have it:
Let’s put the results side by side to clearly see the effects:

Quite a difference towards the longer end, isn’t it?
And with that, we are done.
I hope you found that useful.
As always, I appreciate your feedback.
Until next time.
Steffen
