Two Days In A Row

Fri, 13 May 2016, 10:48 PM (-06:00)

1. Thursday

We had tested the logic. We made some changes and tested again. And when we deployed into production, we tested again, pushing a few small datasets thru the route to make sure that everything worked as expected, which it did.

“Let me give you a complete data dump now,” Vyas said.

And we ran it thru the system.

“The route is processing the input files,” I reported. And then a few moments later, “It’s generating the output files.” And then finally, after a few more moments, “The output files are being picked up by the listener.”

After all the output files were picked up, we waited a moment, and then he confirmed, “I got the data in our system.”

But a few minutes later someone chimed in on one of our Skype channels, “We’re getting a bunch of bogus messages without a time tag.” And soon after that, there was a cascade of automated notifications and alarms sent out by email.

Although in our post mortem we weren’t so sure that those alarms were related to our errors, and although the root cause of the problem was the format of the input files, it’s indisputable that it was the execution of my code that unleashed those furies.

“Sorry guys,” I later said.

“Tomorrow,” Vyas said, “we’ll turn the system on.”

2. Friday

The next morning, I Skyped Vyas my plan. “I’ll manually process a few of the oldest files. If they run ok, then we can turn the system on.”

“Awesome,” he said.

Moments later I was again reporting the progress.

“The route is processing the input files,” I said. And then, “It’s generating the output files.” And finally, “The output files are being picked up by the listener.” (Sound familiar?)

This time, I could see the results showing up in the output queue. The message count kept rising. I kept watching. The curve kept going up. As the count reached 1200, it was clear that the worker was not pulling anything out of the queue.

“Hmm…” Vyas said. “I’m seeing empty payloads.”

There was again an error of some sort, but worse, this one was blocking all incoming data from any customers.

It was Friday afternoon. The room was dark and mostly empty. Being relatively new to the team, I was woefully unequipped to debug the problem. Fortunately, there were a few generous souls still hanging around.

After about two hours of spelunking, we came up with a workaround. An hour later, sitting alone in the darkness and quiet, I put the final touches on a trouble ticket for the outage. I had also come up with a credible explanation of the root cause, which, again, absolved my code of responsibility.

But absolved or not, the indisputable fact remains: my stuff broke things in production two days in a row.

TGIF!

© jumpingfish by David Hasan is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License