Saturday, February 4, 2012

Big end failure

Recently I did a project that involved debugging a mixed .NET/C++ application that was failing on the 64-bit build. Now, this was a complicated, non-trivial app that used the Microsoft DirectShow multimedia framework, and one of the reasons I was hired was that I have pretty decent experience with it.

My first line of attack was obviously to check the DirectShow stuff, and while the code wasn't really to my taste, it seemed to work just fine in 32-bit. The 64-bit build simply refused to work. I immediately jumped to the conclusion that it had something to do with the interaction between a 64-bit process and 32-bit filters. I checked whether a DShow filter works out of the box after rebuilding for 64-bit, and it did. So I had nothing.

Then I had no choice but to run both versions in separate debugger instances simultaneously, stepping to the point where something differed. The debugging was made quite annoying by the fact that, for some reason, the 32-bit app would throw an access violation every time I single-stepped; perhaps the environment was broken (I was doing all this over a remote connection), or perhaps attaching to a .NET app calling a C++ DLL was messing up the debugger. The only way I could step was to add new breakpoints and resume execution each time! It finally led me to a decryption function, with a lot of bit twiddling going on and some typedef'd types. Once again I thought I'd found the issue - I made sure every instance of int in that code was replaced with __int32, so that the code would not be wrong in assuming every int was 32 bits. But the problem remained.
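(For the curious, the kind of change I mean is just pinning the integer widths explicitly. This is a hypothetical snippet to show the idea, with a made-up type name, not the actual project code:)

    /* before: assumes plain int is always 32 bits wide */
    typedef unsigned int crypto_word;   /* crypto_word is a made-up name */

    /* after: width pinned explicitly (MSVC __int32; uint32_t from <stdint.h> would also do) */
    typedef unsigned __int32 crypto_word;

In hindsight this was never going to fix anything: Windows uses the LLP64 model, so int is 32 bits even in a 64-bit build.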

Once again I did the side-by-side debugging and painstakingly expanded the macros in the crypto code to compare intermediate values of expressions, and suddenly the light shone through - a macro that returned the last byte of a DWORD gave different results on x86 and x64. In other words, the x64 build was taking the big-endian code path. Looking through a few headers, it became obvious why. The code comes from a very old C program written in the 90s: though it checked for various architectures, it somehow assumed that anything that wasn't x86 was a big-endian platform (which may have been true for the platforms it was originally intended for). That assumption made it treat x64 as big-endian too, and so we had our mess-up.
So in the end it was just a line or two of code that actually changed, but who would have "thunk it", eh?
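To give a flavour of the bug, here is a reconstruction of the kind of thing the headers were doing; the macro names (LITTLE_ENDIAN_ARCH, LAST_BYTE) are made up and this is not the original source:

    /* made-up names, a sketch of the old header's logic */
    #if defined(_M_IX86) || defined(__i386__)
    #define LITTLE_ENDIAN_ARCH 1
    #else
    /* anything that isn't x86 is assumed to be big-endian, which is wrong for x64 */
    #define LITTLE_ENDIAN_ARCH 0
    #endif

    /* most significant ("last") byte of a 32-bit DWORD, read through a byte pointer */
    #if LITTLE_ENDIAN_ARCH
    #define LAST_BYTE(dw) (((const unsigned char *)&(dw))[3])
    #else
    #define LAST_BYTE(dw) (((const unsigned char *)&(dw))[0])
    #endif

On x64 the #else branch gets picked, so LAST_BYTE reads byte 0; but the machine is still little-endian, so byte 0 is the least significant byte, and that is why the intermediate values in the crypto code differed. Presumably the line or two that changed was just adding the x64 case (_M_X64 / __x86_64__) to the little-endian branch.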

PS: The title of this post is a pun!
