Scott Hanselman

Stop, think, research, debug

July 14, '15 Comments [19] Posted in Musings

I got this great letter from a listener/reader recently. They listened to a recent show on Debugging Windows and it helped them debug a problem at work, but not in a specific technical way. Instead, it changed how they thought about their approach to the topic.

By the way, I've had some amazing guests on lately. If you haven't subscribed, it's a good time to join us. Explore the archives, check out our diverse topics and voices, and subscribe. Also, there's a new This Developer's Life out, so check that out as well.

I've been doing Hanselminutes: Fresh Air for Developers for almost 500 episodes over darn-near 10 years. Getting emails like this is so meaningful, especially when I think about taking breaks or stopping. Sometimes a few shows will go by with no comments and I'll wonder if anyone listens, and then I hear from a Real Live Human who was helped by all this free content I put out and it keeps me going. So first, thanks to all of you for this, if you've ever emailed or donated to fight diabetes.

US Navy Cryptanalytic Bombe by brewbooks, used under CC

Here's what this particular listener said, with emphasis mine.

After listening to your podcast with Mario Hewardt earlier this week on Windows Debugging, I had some of the things you were talking about running through my head. Though I've always come away from your podcasts feeling enriched and excited to tackle new and interesting problems, this was the first time that it had a direct impact on my work so soon after listening.

I work at a big data company that does a lot of social network analysis. We use ElasticSearch in our stack, and we are consistently processing millions of documents using complicated, user-generated queries. A release we put out late last week allowed for more numerous, larger, and even more complicated user queries, which in turn led to a substantial slowdown of our product. Though the code only existed in our staging environment, we are on a deadline for release early next week. As it became obvious that the application was spending a LOT of time in the code my team was responsible for, we were tasked with "fixing" it ASAP.

I took the first shift, and though my brain immediately started coming up with ways to improve our code, something about your podcast regarding "know the tools your tools are built on" was stuck in my head. Instead of jumping in and optimizing what I was already comfortable with, I spent an hour researching the internals of the ElasticSearch functionality we were relying on.

Not sure how familiar you are with ES, but it distinguishes between searches that simply return the set of documents matching a query, much the way traditional SQL databases do, and searches that also compute how well each document matches, for ranking purposes. As it turned out, we were inadvertently using one of the latter. When we provided X giant queries in an OR block, which we expected to short-circuit as soon as it hit a TRUE condition, it instead processed all X queries to determine how well each document matched. My big-O notation is a bit rusty, but suffice it to say, it was one of the bad ones.

Instead of a gigantic fire drill app optimization over a weekend, it turned out to be an hour of research followed by switching the word "bool" to the word "or". It's remarkable how the most efficient coding you can do is often stopping and thinking about the problem for a while!

Anyway, thanks to both you and Mario for saving me and my team a bunch of time!
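The listener's actual queries aren't shown, but the one-word fix might have looked something like this as Elasticsearch 1.x request-body dicts. Field names and clauses here are invented for illustration; only the "bool" to "or" swap is the point.

```python
# Hypothetical reconstruction of the fix described in the letter
# (Elasticsearch 1.x filter DSL; field names are made up).

# Before: a bool filter. Every clause is evaluated so the engine can
# combine per-clause bitsets -- X giant clauses means X evaluations.
slow_filter = {
    "bool": {
        "should": [
            {"term": {"network": "twitter"}},
            {"term": {"network": "facebook"}},
            # ... many more giant user-generated clauses ...
        ]
    }
}

# After: an or filter. It checks clauses document-by-document and can
# short-circuit as soon as one clause matches -- the behavior the team
# expected from an OR in the first place.
fast_filter = {
    "or": [
        {"term": {"network": "twitter"}},
        {"term": {"network": "facebook"}},
        # ... same clauses ...
    ]
}
```

A commenter further down explains when each form is actually appropriate; in ES 2.0 the distinction goes away entirely.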

This was a great reminder to me as well. Research is hard. It's not as dynamic as interactive debugging but it can often save you many wasted hours. Truly successful debugging means doing whatever it takes to understand the problem domain and the code paths.

Do you have any tales of debugging where taking the time to really understand the problem domain saved you time? Or perhaps the opposite, where you just dove in and poked at some code until it worked? Don't be ashamed, I think we've all been on both sides.

Sound off in the comments!

Sponsor: Big thanks to Infragistics for sponsoring the feed this week. Responsive web design on any browser, any platform and any device with Infragistics jQuery/HTML5 Controls.  Get super-charged performance with the world’s fastest HTML5 Grid - Download for free now!


About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Tuesday, 14 July 2015 09:41:13 UTC
The best story I heard came from an Oracle consultant. He was assigned to fix a client's database which was always slow when it was raining on Tuesdays. When the weather was fine, the db was fine too. When it rained on any other day, the performance was good. It was only slow when it rained on a Tuesday.
They rebooted their database servers every Monday night. When the database was slow on Tuesday, they had to reboot them again to get everything running fine.

After looking at tons of logs and analyzing execution plans, it turned out that there was one employee who did not complain. The queries this employee started every morning were very complicated and looked different from everybody else's. And this employee came to work by bicycle -- except when it was raining. Then he came by car and was in the office a bit earlier.
An Oracle database optimizes its execution plans based on the first queries it sees after a reboot. When that employee came to the office by car, his were the first queries; his queries ran fine and everybody else's were slow. When he came by bicycle, somebody else hit the db first and it optimized for the more typical queries.
Tuesday, 14 July 2015 10:41:28 UTC
My problem is that too many times an error or situation is just a search-engine answer away. It's rare that I have a problem that requires real research, one whose solution isn't simply floating in the ether.

Sometimes, while researching the base product and learning about it, I will come across someone who had the same issue I had -- and their solution.

Mind you, I'm not complaining about having the answer at my fingertips, but going back to a question you posed once upon a time -- am I a programmer -- or just a good Googler?

I'm surprised the error list that shows after a compile isn't just hotlinks to a Google search.

Much thanks for This Developer's Life and Hanselminutes, you have one of the few "developers are human" podcasts available.

@Michael Rumpler - awesome debug ;-)
Tuesday, 14 July 2015 12:17:53 UTC
Do you have any tales of debugging where taking the time to really understand the problem domain saved you time?

When it comes to memory dumps you are forced to do this - you can't poke at code and do quick tests because you no longer have a live process (or even the environment the process was executing on). You have to learn the minutiae of any frameworks/runtimes you are working against.

Next time you have to debug something see if you can manage it without a live process. You'll multiply your knowledge in the process: even knowledge about your own code.
Tuesday, 14 July 2015 15:23:45 UTC
Writing, maintaining, and debugging multi-threaded code is also one of those cases where you absolutely cannot rely on just poking around until it works. Without understanding the internals of the threading mechanisms you're using, you're left in a situation where all kinds of Heisenbugs will pop up and you have no way of figuring out why or what you did to cause them.
Chris Hannon
Tuesday, 14 July 2015 19:09:06 UTC
I'm always in the weeds when it comes to debugging. Sometimes it takes both research and a bit of poking around. I worked on what seemed like a show stopper bug (at the time) for xUnit testing and .NET vNext. It took a couple of days to fully comprehend the issue and sometimes I found I did things the hard way unnecessarily.

Debugging - vNext, Mono, xUnit ... On Linux - Part 1

Debugging - vNext, Mono, xUnit ... On Linux - Part 2
Tuesday, 14 July 2015 20:34:30 UTC
Just a heads up that your blog does not display correctly in IE11 Version 11.0.9600.17842.

The inserts which use the blockquote tag are losing the end of the line, including the section containing your listener's reply in the "Stop, think, research, debug" article.

Works fine in Firefox.
Tuesday, 14 July 2015 22:21:41 UTC
In 2009, when WPF was new... One of the largest healthcare companies on earth hired us to build a WPF app for customers of their critical line of business. We were only a few weeks from launch and concerns over memory consumption were getting intense. For weeks we couldn't figure out why our WPF application allocated more and more memory over time, without its user actually doing anything... We figured out that it must have something to do with the text box controls we were using because if you left your cursor blinking in a text box field, memory allocated to the process would increase. You could start the app, put your cursor in a text box, and just watch the perfmon counter creep up until an OutOfMemoryException was thrown. From then on, we called it "the blinking cursor memory leak".

We finally asked for help. We were put in touch with the person that led the WPF performance team in Redmond. We emailed him memory dumps. We went over all of his known issues on different hardware/OS combinations. Finally, one of our engineers decided to just try one of the solutions having to do with the render thread and the *first* Window object instantiated in the app (subsequent Window objects would not have this problem), even though we had ruled this out as a root cause, because it was only supposed to happen on WinXP-SP3, and we had recreated it on Vista... Turns out there was an error in the blog post. It *did* happen in Vista. In fact, the memory leak occurred no matter *what* was being rendered, even a blinking cursor.

One line of code to instantiate a new Window(), and do absolutely nothing with it, plugged the leak. We made graphs of memory allocated over time, when the app was sitting idle, both without the fix and with the fix. One of them showed a stair-step pattern. One of them showed a flat line. We emailed this to the CTO who had voiced concerns, the launch went forward, and now it is running successfully in hundreds of hospitals around the world.
Wednesday, 15 July 2015 12:28:10 UTC
Don't stop the show(s). We do listen. We just don't comment because we don't have anything to add :)
Wednesday, 15 July 2015 13:29:20 UTC
A mentor of mine always used to say "Challenge your assumptions. Is the damn thing plugged in?" Best piece of advice I've ever gotten.

How many times have we started a problem investigation too far down in the weeds ("I could make sure my service is running but maybe I'll just turn on kernel pool tagging first").

This bit of wisdom is what binds us to every investigative profession from auto mechanic to brain surgeon.
Brian Hartung
Wednesday, 15 July 2015 13:49:09 UTC
The podcast helps me out a lot.
Together with .NET Rocks, it's my main technical podcast. It strikes the balance between information and inspiration and gives me a window into things beyond those I use in my day-to-day job.

One of the reasons comments are so scarce is that podcasts are consumed at times when commenting is just impractical (biking, running, working out, general multitasking).
boris callens
Wednesday, 15 July 2015 18:42:54 UTC
"Have you check your printer cables?" --IT Support Guy
Emmanuel Rosado
Thursday, 16 July 2015 04:28:42 UTC
QA raised a bug saying closed captions were not shown on the TV. Devs checked and found it working fine, so we set it to low priority as no longer reproducible. QA came back saying it was always reproducible. After a couple of mails and a live demonstration, we realized that QA does a full-spectrum scan that installs all the programs, whereas the developers just scan one frequency to get enough programs in the database to test the captions. The captions work fine after scanning a single frequency, but not after a full scan. Interesting.

Eventually I narrowed it down to a frequency around 60% of the way through the full frequency range. Captions would work fine if you stopped scanning before that frequency; if you scanned beyond it, captions wouldn't work. Curiouser and curiouser.

Further debugging showed that the caption_enabled_flag stored in a global static structure printed different values (true and false) before and after this frequency. Yeah, memory corruption.

A couple of hours of objdumps and valgrinds later, we found a static array in the scanning module that stored the progress information for each frequency during a full scan. One function used it through a uint32_t pointer instead of a uint8_t pointer, writing past the end of the array. At the said frequency, this overflow started spilling into the next elements in the memory layout. One of the things it did was set our caption_enabled_flag to false.
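The arithmetic behind that kind of bug is easy to sketch. Sizes and names below are invented; the snippet only computes the offsets with ctypes rather than actually performing the out-of-bounds writes, which would corrupt memory just as it did in the story.

```python
import ctypes

# Toy reconstruction: a per-frequency progress array declared as uint8_t,
# but indexed through a uint32_t pointer in one function.
N_FREQUENCIES = 100
progress = (ctypes.c_uint8 * N_FREQUENCIES)()        # 100 bytes reserved

stride = ctypes.sizeof(ctypes.c_uint32)              # 4 bytes per element
# Through a uint32_t*, entry i lands at byte offset i * 4, so only the
# first sizeof(array) / 4 entries stay in bounds.
first_bad_index = ctypes.sizeof(progress) // stride  # index 25 of 100
bytes_past_end = N_FREQUENCIES * stride - ctypes.sizeof(progress)

assert first_bad_index == 25
assert bytes_past_end == 300   # 300 bytes spill into whatever follows
                               # in memory, e.g. a caption_enabled_flag
```

Past which scan index the corruption becomes visible depends entirely on what happens to sit after the array in the real layout, which is why the symptom surfaced at one particular frequency.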

We made static analysis mandatory after this.
Thursday, 16 July 2015 07:44:27 UTC
I see a lack of thought, a lot.
When a problem arises, folks rushing to do *some*thing, rather than thinking a moment and doing the *right* thing.

I think it's because they don't want to admit they 'don't-just-know' the solution.

I 'just-know' very little (my brain's been full since the early 90s,) but I know I can find a solution with a wee bit of thought and research.

Thursday, 16 July 2015 16:08:02 UTC
No one specific story, but just how I tackle situations.

Very early on in my career I worked for a European company whose software was written in a scripting language; to compile the code you needed a specialized editor, which required a hardware dongle (as you would expect, the vendor made money on the dongles). The employer never gave us the dongles, which meant we read the code, made changes, and sent them five thousand miles away to HQ to compile. What this taught me was:

1. Write less code!
2. Spend time staring at screens and taking walks before touching the keyboard.
3. Run the code in my head.

I'm no gifted developer but whatever success I have found is because of these things I learnt early on.

So when Marketing comes running, complaining everything is slow on the website, and Ops and devs say to ramp up the hardware on the SQL box, I can, after a little poking around, point out that we shouldn't encrypt a table column we run LIKE queries against (well, don't encrypt columns, period). Or other times, find missing caching headers on expensive assets, which alone helps a lot.
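Why encrypting a column kills LIKE queries can be shown with a toy sketch. The XOR "cipher" below is NOT real crypto; it just stands in for encryption with a per-row IV. The point is that once values are encrypted, no index can serve the predicate, so every single row must be decrypted and scanned.

```python
import hashlib

def toy_encrypt(s: str, salt: bytes) -> str:
    # Per-row salt, like a real cipher's random IV: equal plaintext
    # prefixes produce unrelated ciphertexts (illustration only).
    key = hashlib.sha256(salt).digest()
    ct = bytes(b ^ key[i % 32] for i, b in enumerate(s.encode()))
    return salt.hex() + ct.hex()

decrypt_calls = 0

def toy_decrypt(token: str) -> str:
    global decrypt_calls
    decrypt_calls += 1
    salt, ct = bytes.fromhex(token[:8]), bytes.fromhex(token[8:])
    key = hashlib.sha256(salt).digest()
    return bytes(b ^ key[i % 32] for i, b in enumerate(ct)).decode()

names = ["alan", "alice", "bob", "carol"]
cipher_column = [toy_encrypt(n, i.to_bytes(4, "big"))
                 for i, n in enumerate(names)]

# Plaintext: LIKE 'al%' could seek an index straight to the 'al' range.
# Encrypted: the only way to answer it is decrypt-then-match, row by row.
hits = [p for p in map(toy_decrypt, cipher_column) if p.startswith("al")]

assert hits == ["alan", "alice"]
assert decrypt_calls == len(names)   # every row had to be decrypted
```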

Thanks for your podcasts and blog posts over the years. Appreciate it.
Thursday, 16 July 2015 18:27:22 UTC
Once upon a time, software was about to be released again. Long-running stress tests showed that the long run terminated after 2h with an OutOfMemoryException. Either we needed to redefine what "long" meant, fix the memory leak, or let the customers decide if 2h was long enough.

The interesting thing was that this happened only on specific Windows XP and Windows 7 machines with an OS hotfix which made some changes to GDI. Windows 8/8.1 had this issue right from the start.

I finally found that it had something to do with GDI bitmaps, which were allocating 192KB of address space even for 48x48, 600-byte icons. If you have a few thousand bitmaps around, your memory soon becomes severely fragmented. This was related to a change in GDI where the Windows Imaging Component (WIC) was used to decode bitmaps after applying a specific hotfix.

OK, file mapping objects are the root cause, so let's track the acquire and release calls with Windbg to find where the unreleased bitmaps are hiding. After setting the breakpoints in Windbg and adding some logging of the call stacks, our application became very slow and ran into timeout errors because of its newfound slowness.

This failure caused me to learn new things, like using a hooking library (EasyHook) to intercept some OS calls, log the method parameters, and finally write the data out as an ETW event. The best thing about ETW is that you can tell the OS to collect the call stacks for you. That approach proved to be ultra fast and resulted in very small ETL files even after running the application for hours. Then it was easy to parse the ETL file and filter out all unbalanced mapping calls that related to call stacks in our code.
This enabled us to predict which code sections would benefit most from optimizations (e.g. bitmap reuse, static bitmaps, ...) so we could make isolated, low-impact changes even late in the release cycle.
My colleagues did not believe that I could predict the outcome of future changes from these measurements. They were really stunned to find that my predictions turned out to be correct with an error of <3%.

That was one of my best experiences of how scientific troubleshooting can make a huge difference.

Did I forget something? Yes! This was my first bug report that forced MS to release Windows hotfixes for everything from XP to Windows 8.1. It turned out that the problematic hotfix was a backport from Windows 8, where the file mapping objects were allocated for no reason.
Monday, 20 July 2015 16:28:55 UTC
I did this exact thing a couple weeks ago. We had a bug that I wasn't sure how to fix. I knew that the bug had something to do with JavaScript promises.

I found this post talking about promises and bookmarked it. I went back later and read the post (slowly, so that I understood it). After that, I was able to make one change to the code and run it, and it worked the first time.

Making the code run correctly the first time felt good. I'm still feeling good.
Tuesday, 21 July 2015 17:17:24 UTC
I once worked for a company where we had data processing services that handled millions of database records and thousands of new records were being added daily. Our apps needed to be able to search that database in real time in a variety of ways at the whim of the end user. The obvious problem was that as the database grew, the queries would slow to the point of timeouts. Over a period of months, we tried several things to manage the symptoms including but not limited to capping the results and adding tons of RAM to SQL Server. We even started deleting old data to keep the query times just short enough to prevent timeouts.

We didn't have a DBA, so I finally got fed up one day and decided to figure out the root cause. I bought a book on SQL index design and spent a couple of evenings having my eyes opened to how SQL Server makes things fast. I also learned how to measure SQL performance using SQL Server Management Studio and get suggestions from the built-in query/index tuner. (I'd be willing to bet not many programmers even know that feature exists.) It turns out the few indexes we had were terrible or incorrect, and we were also missing a few. After updating the indexes and creating nightly jobs to rebuild the indexes and stats (if needed), the queries went from 30 seconds to less than a second. All because I researched the right way to index tables.
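The before/after effect of the right index is easy to see in any query planner. A minimal sketch using SQLite from Python's standard library rather than SQL Server, with invented table and column names: EXPLAIN QUERY PLAN shows the planner switching from a full table scan to an index seek once the index exists.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records "
           "(id INTEGER PRIMARY KEY, customer_id INTEGER, payload TEXT)")
db.executemany("INSERT INTO records (customer_id, payload) VALUES (?, ?)",
               [(i % 1000, "x") for i in range(10_000)])

def plan(sql: str) -> str:
    # EXPLAIN QUERY PLAN rows carry the human-readable detail last.
    return " ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM records WHERE customer_id = 42"
before = plan(query)                    # full table scan

db.execute("CREATE INDEX ix_records_customer ON records (customer_id)")
db.execute("ANALYZE")                   # refresh planner statistics,
                                        # like the nightly stats job
after = plan(query)                     # index seek

assert "SCAN" in before
assert "USING INDEX" in after
```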

Dan Bailiff
Thursday, 23 July 2015 07:24:28 UTC
Your listener was on the right track differentiating between filters and queries. Filters can operate over bitsets, and most of them can and will be cached automatically. Queries deal with scoring and are therefore not cached, though most queries are still plenty fast because they operate on an inverted index that might be in the filesystem cache anyway.

However, the `or` filter your listener switched to is a special wrapper filter for filters that cannot utilize (cached) bitsets: the script, numeric_range, and geo-related filters. These need to operate on all the documents anyway, and this `or` wrapper allows you to short-circuit them more efficiently than the bool filter. Other types of filters, however, typically belong inside a bool filter.

This blog post explains this in depth.

With Elasticsearch 2.0 the story around filters and queries is simplified greatly. We no longer differentiate between filter and query constructs in the DSL; both have been merged into queries. In addition, the or/and/not filters have been deprecated, and you are safe to use the bool query in all contexts.

This blog post outlines all the improvements to the query execution in 2.0

It's good to reiterate, though, that an Elasticsearch search has two phases:

1. query, which does scoring, aggregations, etcetera.
2. post_filter, which further filters the documents after the query phase. This is handy if you want your aggregations to be unaffected by, e.g., filters a user chose to narrow down the documents.

However, you can still use filters inside the query phase: be sure to wrap them in a filtered query if you are using Elasticsearch 1.x, or put them in the filter clause of the bool query if you are using Elasticsearch 2.0. This is often a better choice than executing the filters in the post_filter phase, depending on your use case.
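The two forms just described can be written out as request-body dicts (field names invented for illustration):

```python
# Elasticsearch 1.x: non-scoring clauses are wrapped in a `filtered` query.
es1_body = {
    "query": {
        "filtered": {
            "query":  {"match": {"text": "debugging"}},   # scored
            "filter": {"term":  {"lang": "en"}},          # cached, not scored
        }
    }
}

# Elasticsearch 2.0: `filtered` is deprecated along with or/and/not;
# the bool query's `filter` clause takes over the non-scoring role.
es2_body = {
    "query": {
        "bool": {
            "must":   [{"match": {"text": "debugging"}}],  # scored
            "filter": [{"term":  {"lang": "en"}}],         # not scored
        }
    }
}
```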
Wednesday, 02 September 2015 21:46:05 UTC
Why, just today I ran into a problem where your advice helped.

I've been working on a large-scale refactoring of our crusty old codebase (it's in Struts 1, if that gives you an idea), and after getting all code and JSPs deployed, there was still something with an invalid reference that wasn't switched. I was able to determine that it was a combination of an in-house CMS which hadn't rebuilt some of our pages from the database components they're made of, and the fact that Tomcat wasn't recompiling JSPs automatically into our Catalina work directory. Deleting the work files and re-running our CMS page generator made everything hunky-dory.

I remembered your advice while I was doing this, and marveled at the fact that a few months ago, I wouldn't have thought to check these things and would have gotten caught up for at least a day.
Comments are closed.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.