Today, the topic is the law of unintended consequences as it applies to modifying software.
Since we’ve now
revealed what Fog Creek Copilot is, I can talk in a lot more detail
about exactly what I’m up to. At the moment, I’m wrapping up some GUI
improvements that have been the source of two weeks of misery for me.
As you all know by now, we’re using TightVNC as our code base. We’re
diverging more and more from that starting point (we stopped complying
with VNC’s protocol weeks ago and will break a lot more in about a week
when I change the compression mechanism), but the fact it’s our base
means that certain design decisions made for TightVNC affect the design
of Aardvark.
One of the design
decisions that vncviewer (the TightVNC client) made was an extreme
usability headache. Specifically, there is no way to quit vncviewer
until the connection has fully succeeded or fully failed—i.e., you
cannot abort the connection during handshaking. This manifests itself
in the UI by giving you a tiny window that pops up that says
“Connecting…” and has a button marked “Hide.” The “Hide” button does
exactly that: it hides the window. You may think that
vncviewer has quit, since it no longer has any open windows or taskbar
icons, but it’ll actually still continue trying to make the connection.
This results either in a window popping up unexpectedly several seconds
later or in a seemingly random message box telling you that a
connection couldn’t be established. Either one is surprising and
confusing if you thought that vncviewer had been closed for several
minutes. We found this behavior totally unacceptable, so the spec
called for displaying a normal window immediately and allowing the user
to quit at any time.
Unfortunately,
vncviewer had some design decisions that made fixing this extremely
difficult. All of the negotiations occurred in an independent thread
that could not be stopped once it was launched. No useful feedback came
out of the thread (it updated some text, but didn't pass any messages).
The viewing window assumed that, if it were displayed, it had a valid
connection, which meant that it would complain when it tried to send
data to a nonexistent socket.
Just under two weeks
ago, I started the process of tackling these issues. I broke up most of
the worker thread into chunks handled by Windows messages, and I
modified what was left of the negotiation thread to use message passing
and flags to let the user quit cleanly. That solved about 50% of the
problem. There were a tremendous number of bugs, but I was able to fix
the overwhelming majority of them over the next few days without any
major problems.
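As a rough illustration of the "message passing and flags" idea (not the actual Aardvark code, which is Win32 C++), here is a minimal Python sketch. The names `negotiate`, `cancel`, and `outbox` are hypothetical, and the handshake steps are stand-ins. The worker checks a cancel flag between steps and reports progress through a queue rather than running unstoppably.

```python
import queue
import threading

def negotiate(steps, cancel, outbox):
    """Run handshake steps in order, checking the cancel flag before each one."""
    for name, step in steps:
        if cancel.is_set():
            # The user asked to quit: report it and stop cleanly.
            outbox.put(("cancelled", name))
            return
        step()
        outbox.put(("done", name))
    outbox.put(("connected", None))

cancel = threading.Event()
outbox = queue.Queue()

# Stand-in steps; the "auth" step simulates the user quitting mid-handshake.
steps = [
    ("version", lambda: None),
    ("auth", cancel.set),
    ("init", lambda: None),
]

worker = threading.Thread(target=negotiate, args=(steps, cancel, outbox))
worker.start()
worker.join()

# Drain the messages the UI side would have received.
messages = []
while not outbox.empty():
    messages.append(outbox.get())
```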
Except one. This
heavily modified program, which was for the first time called Aardvark
Helper in our internal emails instead of vncviewer, had a tendency to
close the connection during the negotiation cycle. A lot. The problem
was that it wasn't easily reproducible. It was certainly pervasive (it
happened constantly), but no fixed sequence of steps would reliably
cause the bug to surface. Usually, launching Aardvark Helper and then
WinVNC caused a crash, but sometimes it didn’t, and occasionally
launching WinVNC and then Aardvark Helper also crashed. Frequently both
orders would be fine several tries in a row. Most frustratingly,
stepping through handshaking in the debugger fixed the problem. Doing
that was the one surefire way to guarantee that everything would work
just peachy.
I scoured the way I’d
broken up handshaking looking for any clue of what was going wrong.
Inconsistent crashing during code that you know is threaded is usually
a sign of a threading issue, so at first I thought that the handshaking
wasn’t proceeding linearly or was restarting itself at some point. Yet
a collection of printfs quickly assured me that my refactored
negotiation code was running just fine, even when the program was
crashing, which removed that possibility from the table. Unfortunately,
this happened running up to CFUNITED, so I couldn't stay focused on
just this one bug. In fact, I couldn't even focus on just this one code
base: we had forked everything a few days earlier for the show and had
a policy of fixing bugs in those builds by any means necessary (which
at one point meant adding a line in code that basically read "if you
can't open the file, pretend you could and wing it"). So I had a very
hard time actually focusing on any individual bug for very long.
Literally the day
before CFUNITED, I discovered yet another bug: Aardvark Helper wouldn’t
send mouse movements unless the window lost focus and then got it back.
I was too busy to deal with this quirk at that point, so I simply filed
it in FogBUGZ and figured I’d attack it later.
Yesterday, I got back
to attacking the crash-during-negotiation bug. As I tried for the
umpteenth time to reproduce it reliably, I was also
scanning open issues in FogBUGZ, and suddenly, I realized something.
Aardvark Helper wasn’t sending messages unless the window lost and
regained focus. The crash normally happened if I launched Aardvark
Helper, then WinVNC. When I did that, I usually, but not always, then
reselected Aardvark Helper, which meant…
Which meant that
reselecting the Aardvark Helper window and then moving the mouse would
cause the program to send VNC data during handshaking. Which, in turn,
makes both programs quit, because they receive corrupt data mid-handshake.
I’d thought to ensure
that Aardvark Helper’s window wouldn’t send data when the socket was
invalid, but I’d neglected to make a distinction between a valid
VNC-like connection and a simple socket connection in the middle of
handshaking. Oops.
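To make the missing distinction concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the real 25-line fix: `ConnState`, `Viewer`, and `on_mouse_move` are hypothetical names. The point is that "the socket is open" and "the protocol is ready for input events" are tracked as separate states, and input is only forwarded in the latter.

```python
from enum import Enum, auto

class ConnState(Enum):
    DISCONNECTED = auto()  # no socket at all
    HANDSHAKING = auto()   # socket is open, but negotiation is still running
    READY = auto()         # handshake finished; protocol traffic is allowed

class Viewer:
    def __init__(self):
        self.state = ConnState.DISCONNECTED
        self.sent = []  # stand-in for bytes written to the socket

    def on_mouse_move(self, x, y):
        # Checking only "is the socket valid?" would wrongly pass during
        # HANDSHAKING; require a fully negotiated connection instead.
        if self.state is not ConnState.READY:
            return False
        self.sent.append(("pointer", x, y))
        return True
```

With this guard, a mouse move that arrives while `state` is `HANDSHAKING` is dropped rather than written into the middle of the negotiation.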
A mere 25 lines of
code later, the bleeding-edge version of Aardvark Helper now has the
stability of the CFUNITED build, in single-window form. So now the
applications run smoothly and stably, and for the first time in weeks,
I can honestly say that neither Aardvark Helper nor Aardvark Host has
any known operational bugs. It was a headache getting here, but at
least I’m pretty confident now that when we start the private beta in a
few days, the clients will be ready.