Scott Hanselman

The code worked differently when the moon was full

September 21, 2021 Comment on this post [7] Posted in Bugs | Musings
Sponsored By

Full Moon

I love a good bug, especially ones that are initially hard to explain but then turn into forehead slapping moments - of course!

There's a bug over on Github called Hysteresis effect on threadpool hill-climbing that is a super interesting read.

Hill climbing is an algorithmic technique when you have a problem (a hill) and then you improve and improve (climb) until you have reached some maximum acceptable solution (reach the top).

Sébastien, the OP of the bug, says there's a "Hysteresis effect" on the threadpool. "Hysteresis is the dependence of the state of a system on its history." Something weird is happening now because something happened before...but what was it?

Sawtooth up and down graphs aren't THAT interesting...but look at the x-axis. This isn't showing minute by minute or even millisecond by millisecond ups and downs like you may have seen before. This x-axis uses months as its unit of measure. Read that again and drink it in.

A sawtooth graph going up and down month to month

Things are cool in February until they are bad for weeks and then cool in March and it moves along a cycle. Not strictly the cycle of the moon but close.

He's seeing the number of cores being used changing from month to month when using PortableThreadPool

We have noticed a periodic pattern on the threadpool hill-climbing logic, which uses either n-cores or n-cores + 20 with an hysteresis effect that switches every 3-4 weeks.

Did you know (I know because I'm old) that Windows 95, for a time, was unable to run longer than 49.7 days of runtime? If you ran it that long it would eventually crash! That's because there's 86M ms in a day, i.e. 1000 * 60 * 60 * 24 = 86,400,000 and 32 bits is 4,294,967,296 so 4,294,967,296 / 86,400,000 = 49.7102696 days!

Kevin in the Github issues notes this as well:

The whole period of square wave sounds an awful lot like it is around 49.7 days. That is how long it takes GetTickCount() to wrap around. On POSIX platforms the platform abstraction layer implements this, and the value returned for that is based not on uptime but on wall clock time, which matches with all machines changing on the same day.

This 49.7 days number is well known as it's how long it take before GetTickCount() overflows/wraps. Kevin goes on to actually give the changeover dates which correspond to the graph!

  • Thu Jan 14 2021
  • Sun Feb 07 2021
  • Thu Mar 04 2021
  • Mon Mar 29 2021
  • Fri Apr 23 2021

He then finds the code in PortableThreadPool.cs that explains the issue:

private bool ShouldAdjustMaxWorkersActive(int currentTimeMs) 
{
// We need to subtract by prior time because Environment.TickCount can wrap around, making a comparison of absolute times unreliable.
int priorTime = Volatile.Read(ref _separated.priorCompletedWorkRequestsTime);
int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;
if (elapsedInterval >= requiredInterval)
{
...

He says, and this is all Kevin:

currentTimeMs is Environment.TickCount, which in this case happens to be negative.

The if clause controls if the hill climbing is even run.

_separated.priorCompletedWorkRequestsTime and _separated.nextCompletedWorkRequestsTime start out as zero on process start, and only get updated if the hill climbing code is run.

Therefore, requiredInterval = 0 - 0 and elapsedInterval = negativeNumber - 0. This causes the if statement to become
if (negativeNumber - 0 >= 0 - 0) which returns false, so the hill climbing code is not run, and therefore the variables never get updated, and remain zero. The native version of the thread pool code does all math with unsigned numbers which would avoid such a bug, and it's equivalent part is not even quite the same math in the first place.

The easy fix here is probably to use unsigned arithmetic, but alternatively having the two fields get initialized to Environment.TickCount probably also work

Back to me. Fabulous. The fix is to cast the results to unsigned ints via (uint).

Before:

int requiredInterval = _separated.nextCompletedWorkRequestsTime - priorTime;
int elapsedInterval = currentTimeMs - priorTime;

After:

uint requiredInterval = (uint)(_separated.nextCompletedWorkRequestsTime - priorTime);
uint elapsedInterval = (uint)(currentTimeMs - priorTime);

What an interesting and insidious bug! Bugs based on a time calculations can often show themselves later when view through a longer lens and scope of time...sometimes WAY longer than you'd expect.


Sponsor: YugabyteDB is a distributed SQL database designed for resilience and scale. It is 100% open source, PostgreSQL-compatible, enterprise-grade, and runs across all clouds. Sign up and get a free t-shirt.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook twitter subscribe
About   Newsletter
Hosting By
Hosted in an Azure App Service
September 28, 2021 23:45
Good read, it reminded me of Claude Shannon's Information Theory whereby your proximity to something dictates what sense it makes to you. Like it might take a few years to see the seasons are a pattern, but you need to be incredibly fast to see subatomic particles forming.
September 29, 2021 1:22
Really a nice article. Thanks!
September 29, 2021 5:46
Awesome write up! But note the bug author is Sébastien.
Ben
September 29, 2021 17:51
Great post.
September 29, 2021 18:29
Great bug. I'm currently involved in an online art project that is intended to run for 500 years, we just encountered a bug that we could only find after the application was running for a year. Well that or with a good unit test, which we did not have but will soon!

Time is tricky.
September 29, 2021 19:34
@Conner
Time is tricky.

That's an understatement.

After reading some of Jon Skeet's writings about Nodatime, I've become very aware of anything I do with dates and times. This post only goes to reinforce that.
September 30, 2021 15:12
(typo) cast, not case. The fix is to cast the results.

Comments are closed.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.