10k @ 1s

1k @ 5s

10k @ 1s

So the 5s spike values are about 5% of total values (1k/21k, roughly 4.8%). That makes the 95%/5% edge something like 1s/5s. But the spike values aren’t all 5s; some are 2, 3, or 4. So the edge is more like 1s/2s. Also (I should probably have made this explicit in the post), 95% here is actually “95% of values less than *or equal to* X”, whereas technically a percentile should be just “less than X”. Using <= is inherited from mk-query-digest, but perhaps I should re-examine whether that's correct.
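To make the <= versus < distinction concrete, here is a minimal Python sketch. It is not the mk-query-digest code. The function names are hypothetical, and the even split of the spike across 2s, 3s, 4s, and 5s is an assumption made only for illustration.

```python
from bisect import bisect_left, bisect_right

def percentile_inclusive(sorted_vals, pct):
    """Smallest X such that at least pct% of values are <= X."""
    n = len(sorted_vals)
    for x in sorted(set(sorted_vals)):
        # bisect_right counts the values <= x
        if bisect_right(sorted_vals, x) * 100 >= pct * n:
            return x
    return sorted_vals[-1]

def percentile_strict(sorted_vals, pct):
    """Smallest X such that at least pct% of values are < X."""
    n = len(sorted_vals)
    for x in sorted(set(sorted_vals)):
        # bisect_left counts the values strictly < x
        if bisect_left(sorted_vals, x) * 100 >= pct * n:
            return x
    return sorted_vals[-1]  # no X qualifies: fall back to the maximum

# 10k values at 1s, a 1k spike, then 10k values at 1s again.
# ASSUMPTION: the spike is split evenly across 2s, 3s, 4s, and 5s.
spike = [2] * 250 + [3] * 250 + [4] * 250 + [5] * 250
values = sorted([1] * 10_000 + spike + [1] * 10_000)

print(percentile_inclusive(values, 95))  # the "<=" definition
print(percentile_strict(values, 95))     # the "<" definition
```

With this made-up spike, the inclusive definition stays at 1s while the strict one moves to 2s, which is the 1s/2s edge described above.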

wow – this is great, very interesting stuff :)

I’m now downloading your code to check out the algorithm – I’m really curious how it works.

Oh, to get a quick grasp of the result, I calculated the correlation between (REAL_95, OLD_95) and (REAL_95, NEW_95), and the results are 0.9998926355 and 0.9999572588 respectively. I think it’s safe to say both are accurate enough, and it seems the new one is even a bit more accurate than the old one (for this dataset).
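For anyone who wants to repeat that check, a correlation like this can be sketched in a few lines of Python. This assumes the Pearson coefficient was meant, which the comment does not state. The `pearson` helper is hypothetical, and the sample lists are placeholders, not the data from the post.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, without the 1/n factor (it cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# PLACEHOLDER numbers, invented for illustration only.
real_95 = [1.0, 2.0, 4.0, 8.0, 16.0]
old_95 = [1.1, 2.3, 4.2, 8.9, 17.5]
new_95 = [1.0, 2.1, 4.1, 8.2, 16.4]

print(pearson(real_95, old_95))
print(pearson(real_95, new_95))
```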

Exciting :)
