Stop, think, research, debug

July 13, 2015 Comment on this post [19] Posted in Musings

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

About Newsletter

Hosting By

Hosted on Linux using .NET in an Azure App Service

Comment on this post [19]

Share on BlueSky or use the Permalink and post anywhere!

July 14, 2015 1:41

The best story I heard came from an Oracle consultant. He was assigned to fix a clients database which was always slow when it was raining on Tuesdays. When the weather was fine, the db was too. When it rained on any other day, the performance was good. It was only slow, when it rained on Tuesday.
They rebooted their database servers every Monday night. When the database was slow on Tuesday, then they had to reboot them again to make everything running fine again.

After looking at tons of logs and analyzing execution plans it turned out that there was one employee who did not complain. The queries which this employee started every morning were very complicated and looked different from everybody elses. And this employee came to work by bicycle - except when it was raining. Then he came by car and was in the office a bit earlier.
An Oracle database optimizes its execution plans based on the first queries it sees after a reboot. When that employee came to the office by car, then his were the first queries, his queries ran ok and those of everybody else were slow. When he came by bicycle, then somebody else tackled the db first and it optimized for the more average queries.

Michael Rumpler

July 14, 2015 2:41

My problem is that too many times an error or situation is a search-engine answer away. It's rare that I have a problem that lets me research to find the solution that isn't simply floating in the ether.

Sometimes trying to do research into the base product and learning about it, I will come across someone with the same issue I had -- solution.

Mind you, I'm not complaining about having the answer at my fingertips, but going back to a question you posed once upon a time -- am I a programmer -- or just a good Googler?

I'm surprised the error list that shows after a compile aren't just hotlinks to Google search.

Much thanks for This Developer's Life and Hanselminutes, you have one of the few "developers are human" podcasts available.

@Michael Rumpler - awesome debug ;-)

Chris

July 14, 2015 4:17

Do you have any tales of debugging where taking the time to really understand the problem domain saved you time?

When it comes to memory dumps you are forced to do this - you can't poke at code and do quick tests because you no longer have a live process (or even the environment the process was executing on). You have to learn the minutiae of any frameworks/runtimes you are working against.

Next time you have to debug something see if you can manage it without a live process. You'll multiply your knowledge in the process: even knowledge about your own code.

Jonathan Dickinson

July 14, 2015 7:23

Writing, maintaining, and debugging multi-threaded code is also one of those cases where you absolutely cannot rely on just poking around until it works. Without understanding the internals of the threading mechanisms you're using, you're left in a situation where all kinds of Heisenbugs will pop up and you have no way of figuring out why or what you did to cause them.

Chris Hannon

July 14, 2015 11:09

I'm always in the weeds when it comes to debugging. Sometimes it takes both research and a bit of poking around. I worked on what seemed like a show stopper bug (at the time) for xUnit testing and .NET vNext. It took a couple of days to fully comprehend the issue and sometimes I found I did things the hard way unnecessarily.

Debugging - vNext, Mono, xUnit ... On Linux - Part 1

Debugging - vNext, Mono, xUnit ... On Linux - Part 2

James

July 14, 2015 12:34

Just a heads up on how your blog does not display correctly in IE11 Version 11.0.9600.17842.

The inserts which use the blockquote tag are losing the end of the line, this includes the section containing your listeners reply in the "Stop, think, research, debug" article.

Works fine in Firefox.

Colin

July 14, 2015 14:21

In 2009, when WPF was new... One of the largest healthcare companies on earth hired us to build a WPF app for customers of their critical line of business. We were only a few weeks from launch and concerns over memory consumption were getting intense. For weeks we couldn't figure out why our WPF application allocated more and more memory over time, without its user actually doing anything... We figured out that it must have something to do with the text box controls we were using because if you left your cursor blinking in a text box field, memory allocated to the process would increase. You could start the app, put your cursor in a text box, and just watch the perfmon counter creep up until an OutOfMemoryException was thrown. From then on, we called it "the blinking cursor memory leak".

We finally asked for help. We were put in touch with the person that led the WPF performance team in Redmond. We emailed him memory dumps. We went over all of his known issues on different hardware/OS combinations. Finally, one of our engineers decided to just try one of the solutions having to do with the render thread and the *first* Window object instantiated in the app (subsequent Window objects would not have this problem), even though we had ruled this out as a root cause, because it was only supposed to happen on WinXP-SP3, and we had recreated it on Vista... Turns out there was an error in the blog post. It *did* happen in Vista. In fact, the memory leak occurred no matter *what* was being rendered, even a blinking cursor.

One line of code to instantiate a new Window(), and do absolutely nothing with it, plugged the leak. We made graphs of memory allocated over time, when the app was sitting idle, both without the fix and with the fix. One of them showed a stair-step pattern. One of them showed a flat line. We emailed this to the CTO who had voiced concerns, the launch went forward, and now it is running successfully in hundreds of hospitals around the world.

Mike

July 15, 2015 4:28

Dont stop the show(s). We do listen. We just don't comment because we don't have anything to add :)

Kristijan

July 15, 2015 5:29

A mentor of mine always used to say "Challenge your assumptions. Is the damn thing plugged in?" Best piece of advice I've ever gotten.

How many times have we started a problem investigation too far down in the weeds ("I could make sure my service is running but maybe I'll just turn on kernel pool tagging first").

This bit of wisdom is what binds us to every investigative profession from auto mechanic to brain surgeon.

Brian Hartung

July 15, 2015 5:49

The podcast helps me out a lot.
Together with .net rocks, it's my main technical podcast. It strikes the balance between information and inspiration and gives me a window to the things besides the ones I use in my day to day job.

One of the reasons comments are so scarce is that they are consumed on times where commenting is just impractical (biking, running, workout, general multitasking).

boris callens

July 15, 2015 10:42

"Have you check your printer cables?" --IT Support Guy

Emmanuel Rosado

July 15, 2015 20:28

QA raised a bug saying closed captions are not shown on the TV. Devs checked and found it working fine, so we put it on low priority saying not reproducible any more. QA comes back saying it is always reproducible. After a couple of mails and a live demonstration, we realized that QA does a full spectrum scan that installs all the programs, whereas the developers just scan one frequency to get enough programs in the database to test the captions. The captions work fine after scanning a single frequency, but not after a full scan. Interesting.

Eventually I narrowed it down to a frequency around 60% of the full frequency range. Captions will work fine if you stop scanning before that frequency; if you scan beyond that frequency, captions won't work. Curiouser and curiouser.

Further debugging showed that the caption_enabled_flag stored in a global static structure printed different values (true and false) before and after this frequency. Yeah, memory corruption.

Couple of hours of objdumps and valgrinds later, we found a static array in the scanning module that stored the progress information for each frequency during a full scan. This array was used as a uint32_t pointer instead of uint8_t pointer in one of the functions, causing overflow. At the said frequency, this overflow starts spilling into next elements in the memory layout. One of the things it did was to set our caption_enabled_flag to false.

We made static analyzer mandatory after this.

Amarghosh

July 15, 2015 23:44

I see a lack of thought, a lot.
When a problem arises, folks rushing to do *some*thing, rather than thinking a moment and doing the *right* thing.

I think it's because they don't want to admit they 'don't-just-know' the solution.

I 'just-know' very little (my brain's been full since the early 90s,) but I know I can find a solution with a wee bit of thought and research.

rowland

July 16, 2015 8:08

No one specific story, but just how I tackle situations.

Very early on in my career I worked for an European company who had a software written in a scripting language and to compile the code you need a specialized editor, which would need a hardware dongle (as you would expect, vendor made money on the dongle). So the employer, never gave us the dongles, which meant we read the code and make changes and send them 5 thousands miles away to the HQ, to compile. What this tought me was

1. Write less code!
2. Spend time staring at screens and taking walkd before touching the keyboard
3. Run the code in my head.

I'm no gifted developer but whatever success I have found is because of these things I learnt early on.

So when Marketting come running complaining everything is slow on the website, And Ops + devs say, ramp up the hardware on SQL box. I can, after little poking around, point out not to encrypt table column, on which we are running LIKE queries (well, don't encrypt columns period) Or other times, find missing caching headers on expensive assets, which alone helps a lot.

Thanks for your podcasts and blog posts over the years. Appreciate it.

Sameer

July 16, 2015 10:27

Once upon a time software was about to be released again. Long running stress tests showed that the long run did terminate after 2h with an OutOfMemoryException. Either we needed to redefine what long meant or fix the memory leak. Or let the customers decide if 2h was long enough.

The interesting thing was that this happened only on specific Windows XP and Windows 7 machines with an OS hotfix which did some changes to GDI. Windows 8/8.1 did have this issue right from the start.

I finally found that it had something to do with GDI Bitmaps which were allocating 192kb of address space even for 48x48 600 byte Icons. If you have a few thousands of bitmaps around your memory soon becomes severly fragmented. This was related to a change in GDI where the Windows Imaging Component (WIC) was used to decode bitmaps after applying a specific hotfix.

Ok file mapping objects are the root cause so lets track with Windbg the aquire and release calls to find where not released Bitmaps are hiding. After setting the breakpoints in Windbg and adding some logging of the callstacks our application would become very slow and run into timeout errors because of ist new found slowness.

This failure caused me to learn new things like using a hooking library (EasyHook) to intercept some OS calls and log the method parameters and finally write the data out as ETW event. The best thing about ETW is that you can tell the OS to collect the call stacks for you. That approach proved to be ultra fast and resulted in very small ETL files even after running the application for hours. Then it was easy to parse the ETL file and filter out all unbalanced mapping calls which did relate to call stacks in our code.
This enabled us to predict which code sections would benefit most of optimizations (e.g. bitmap reuse, static Bitmaps, ...) so we could do isolated low impact changes even late in the release cycle.
My colleagues did not believe me that I could predict with these exact measurements the outcome of future changes. They were really stunned to find that my predictions turned out to be correct with an error <3%.

That was one of my best experiences that scientific troubleshooting can make a huge difference.

Did I forget something? Yes! This was my first bug report that forced MS to release Windows hot fixes ranging from XP-Windows 8.1. It turned out that the problematic hotfix was a backport from Windows 8 where the file mapping objects were allocated for no reason.

Alois Kraus

July 20, 2015 8:28

I did this exact thing a couple weeks ago. we had a bug that I wasn't sure how to fix. I knew that the bug had something to do with javascript promises.

I found this post talking about promises and I bookmarked it. I went back later and read the post (slowly so that I understood it). After that, I was able to make one change the the code and run, and it worked the first time.

Making the code run correctly the first time felt good. I'm still feeling good.

paul

July 21, 2015 9:17

I once worked for a company where we had data processing services that handled millions of database records and thousands of new records were being added daily. Our apps needed to be able to search that database in real time in a variety of ways at the whim of the end user. The obvious problem was that as the database grew, the queries would slow to the point of timeouts. Over a period of months, we tried several things to manage the symptoms including but not limited to capping the results and adding tons of RAM to SQL Server. We even started deleting old data to keep the query times just short enough to prevent timeouts.

We didn't have a DBA, so I finally got fed up one day and decided to figure out the root cause. I bought a book SQL Index design and spent a couple of evenings having my eyes opened to how SQL makes things fast. I also learned how to measure SQL performance using SQL Management Studio and get suggestions from the built-in query/index tuner. (I'd be willing to bet not many programmers even know that feature exists.) It turns out the few indexes we had were terrible or incorrect, and we were also missing a few. After updating the indexes and creating nightly jobs to rebuild the indexes and stats (if needed), the queries went from 30 seconds to less than a second. All because I researched the right way to index tables.

Dan Bailiff

July 22, 2015 23:24

Your listener was on the right track differentiating between filters and queries. Filters can operate over bitsets and most of them can and will be cached automatically. Queries deal with scoring and are therefor not cached, granted though that most queries are still plenty fast because they operate on an inverted index that might be in the filesystem cache anyway.

However the `or` filter your listener switched to is a special wrapper filter for when you are using filters that can not utilize (cached) bitsets, these are the script, numeric_range and geo related filters. These need to operate on all the documents anyway and this or filter wrapper allows you shortcircuit these more efficiently then the bool query. However other types of filters typically need to be placed inside a bool filter.

This blog post explains this in-deptly

With Elasticsearch 2.0 the story around filters and queries is simplified greatly. We no longer differentiate between filters an query constructs in the DSL. Both of them have been merged to queries. In addition to this the or/and/not filters have been deprecated and you are safe to use the bool query in all contexts.

This blog post outlines all the improvements to the query execution in 2.0

its good to reiterate though that an elasticsearch search has two phases

1. query, does scoring, aggregations etcetera.
2. post_filter, further filters down the documents after the query phase. This is handy if you want your aggregations to be unaffected by e.g filters a user chose to filter down the documents.

However you can still use filters inside the query phase, be sure to wrap them in a filtered query if you are using Elasticsearch 1.x. Or in the filter clause of the bool query if you are using Elasticsearch 2.0. This is often a better choice then executing the filters inside the post_filter phase depending on your use-case.

Martijn Laarman

September 02, 2015 13:46

Why, just today I ran into a problem where your advice helped.

I've been working on a large-scale refactoring of our crusty old codebase (it's in Struts 1, if that gives you an idea), and after getting all code and JSPs deployed, there was still something with an invalid reference that wasn't switched. I was able to determine that it was a combination of an in-house CMS which hadn't rebuilt some of our pages from the database components they're made of, and the fact that Tomcat wasn't recompiling JSPs automatically into our Catalina work directory. Deleting the work files and re-running our CMS page generator made everything hunky-dory.

I remembered your advice while I was doing this, and marveled at the fact that a few months ago, I wouldn't have thought to check these things and would have gotten caught up for at least a day.

Shotgun Ninja

Comments are closed.