Saturday, December 20, 2008

Christmas mystery

We spent the past few days tracking down quite an interesting problem …

We have this service that accepts connections from multiple computers and reports some status back. It does that via SOAP, and the SOAP implementation is our own, written in-house.

So far so good: we are using this approach in quite a few programs deployed in many places. Everything is working well.

Everything, except one installation.

That customer was reporting weird things. Mysterious things. Sometimes the client on one computer would for a moment display the same data as the client on another computer. And then the display would flip back to the correct data. Mysterious.

As expected, the guy who wrote the service pointed his finger at the guy who wrote the SOAP layer (me). And, as expected, I did the reverse. We were each pretty sure that the problem was caused by buggy code written by the other guy. We were both wrong.

Of course we first spent about a working day logging various parts of the service and the SOAP server. You know the drill: add logging, compile, connect to the VPN, upload the build to the customer, start two clients on two machines at the remote site via RDP, wait for the problem, download logs, analyze. Booooring. At the end, we were none the wiser. We only knew that the server always returned correct data to all clients.

Then we switched our attention to the client. Who would have guessed: the client always received the correct data, too. Except that it sometimes displayed wrong data, and that wrong data had never arrived via the SOAP layer. As I said before, a mystery.

Of course, once we got to that point we knew that the problem lay inside the client software, so we only had to dig in that direction. And, of course, we got the answer. And boy was it a surprising one!

You see, our client is somewhat baroque. Layers upon layers of code that were developed over the years. It’s the fate of all successful applications and this one is quite successful, at least in the vertical market we are working in. It is therefore not very surprising that the client is using temporary files in a somewhat strange arrangement. Instead of creating temporary files with unique names, it creates a temporary folder inside %temp% and stores files there. This folder is guaranteed to be unique on the computer, as it uses a per-computer counter as a part of the folder name. IOW, the first client instance on a computer stores data in %temp%\data_1, the second in %temp%\data_2. So the client on the first computer stored remote data (returned from the service) in %temp%\data_1\list.txt and the client on the second computer also used %temp%\data_1\list.txt. They were both using the ‘data_1’ subfolder as each was the first (and only) client instance on its machine. A recipe for disaster? Not really, as the %temp% folder is local to the computer.

Except that it was not.

Somebody at the customer’s site had a brilliant idea and configured the temp folders for all domain clients to point to the domain controller. I don’t know who and I surely don’t know why, but as a result %temp% pointed to the same folder (on the domain controller) on all client computers in the domain. So the first client downloaded its data to %temp% (on the domain controller), the second client downloaded its data to the same location, and then the first client displayed that data, which had already been overwritten and belonged to the second client. A typical race condition if I ever saw one.
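To make the collision concrete, here is a minimal Python sketch of what happened. It is not our client code: the function names are made up for the illustration, and a single throwaway directory stands in for the %temp% that was redirected to the domain controller. Only the data_1 folder and the list.txt file name come from the real setup.

```python
import os
import tempfile


def client_data_path(temp_root: str, instance_no: int) -> str:
    # Hypothetical helper mimicking the counter-based scheme:
    # <temp>/data_<n>/list.txt, where <n> is counted per machine.
    folder = os.path.join(temp_root, f"data_{instance_no}")
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, "list.txt")


def simulate_shared_temp() -> str:
    # One directory plays the role of the shared %temp% on the server.
    with tempfile.TemporaryDirectory() as shared_temp:
        # Each machine runs a single client, so each one believes it is
        # instance number 1 and picks the very same path.
        path_a = client_data_path(shared_temp, 1)  # client on machine A
        path_b = client_data_path(shared_temp, 1)  # client on machine B

        with open(path_a, "w") as f:
            f.write("data for machine A")
        # Machine B downloads its data before A gets around to reading.
        with open(path_b, "w") as f:
            f.write("data for machine B")

        # Machine A now reads "its" file and displays B's data.
        with open(path_a) as f:
            return f.read()


shown_on_a = simulate_shared_temp()
print(shown_on_a)  # prints "data for machine B"
```

The counter is only unique per machine, so the scheme silently depends on %temp% being local. Once that assumption breaks, whichever client writes last wins.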

The solution is, of course, to always use GetTempFileName. But this time we surely (re)learnt it in an interesting way.
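For illustration, here is the same idea sketched in Python. GetTempFileName is a Win32 call; tempfile.mkstemp from the Python standard library is used here as an analogous facility, not as the code we actually shipped. The save_remote_data function and the "data_" prefix are made up for the example.

```python
import os
import tempfile


def save_remote_data(payload: str) -> str:
    # mkstemp creates the file itself and fails rather than reuse an
    # existing name, so two callers never end up with the same file,
    # even if they share one temp directory.
    fd, path = tempfile.mkstemp(prefix="data_", suffix=".txt")
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return path


path_a = save_remote_data("data for machine A")
path_b = save_remote_data("data for machine B")

with open(path_a) as f:
    read_back_a = f.read()

print(path_a != path_b)  # prints "True"
print(read_back_a)       # prints "data for machine A"

# The caller owns the files and is responsible for removing them.
os.remove(path_a)
os.remove(path_b)
```

The point is that uniqueness is decided at the moment the file is created, in the directory where it is created, instead of being derived from a counter that is only meaningful on one machine.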