Wednesday, November 5, 2008

Use of Excel-generated HTML Considered Harmful

This was one of the most strange bugs that I had to resolve.

While I was writing the blog post about computing electoral correlations across states using prediction markets, I wanted to include a table with some results, to illustrate how different states are correlated.

So, I prepared the table in Excel, and then copied and pasted it on Blogger.

Then a strange thing happened: My Feedburner feed stopped working. Nobody received any updates, and suddenly the number of subscribers fell to zero.

Trying to figure out that was wrong, I got a message that my feed was bigger than 512Kb. Admittedly, my table was kind of big, with more than 300 entries. So, I decided to trim it down, to 30-50 rows.

After that fix my feed started working again.

I was still puzzled though why the problem did not appear earlier, given that I have written some pretty long posts (e.g., Why People Participate on Mechanical Turk?) and I never exceeded the 512Kb limit.

Well, the problem was not over. Even though my feed was working, the post about computing electoral correlations across states using prediction markets did not appear in Google Reader, and in other readers. However, the reader on my cell phone was displaying the post. Very very strange.

I followed all the troubleshooting steps on Feedburner, nothing.

So, I decided to take a closer look at the HTML source. I was in for a surprise! The table that I copied and pasted from Excel, had a seriously fat, ugly, and problematic HTML code.

As an example, instead of having a table cell written as "<td>NTH.DAKOTA</td>" it had the following code instead:
<td class="xl63" style="border-style: none none solid; border-color: -moz-use-text-color -moz-use-text-color rgb(149, 179, 215); border-width: medium medium 0.5pt; background: rgb(219, 229, 241) none repeat scroll 0% 0%; font-size: 10pt; color: black; font-weight: 400; text-decoration: none; font-family: Calibri; -moz-background-clip: -moz-initial; -moz-background-origin: -moz-initial; -moz-background-inline-policy: -moz-initial;">NTH.DAKOTA</td>
This not only resulted in a seriously padded HTML, it was also generating validation problems, causing Google Reader to reject the post and not display it at all.

Solution? Nuking by regular expression. I replaced all the "<td [^>]+>" instances with "<td>", and I had a seriously trimmed table from 116Kb (!) to 7Kb. After that, Google picked the post within seconds....

Lesson? Never, ever use an Excel-generated table in Blogger. Or if you need to do that, make sure to remove all the fat...