A while back, people discovered that many RSS readers and all online RSS aggregators didn't sandbox content from different sites, so malicious HTML could cause cross-site scripting (XSS) attacks and general nastiness. As a result, most feed readers filter HTML through a seriously restrictive whitelist, which includes removing all CSS information. I've reached the point where I've simply had enough of this. CSS is a vital part of the internet, and if feeds are going to be useful, we need them to work with CSS properly. So let's take a look at what's really going wrong:
Taking The Easy Option Instead Of The Right Option
Filtering down to the most basic HTML standards is easy, but it doesn't deliver the experience that end users should be expecting. There's no reason you can't support the vast majority of the HTML standard and still protect your users from XSS attacks. You do have to invest a lot more time: fully parse the HTML into a model that only allows valid options, drop anything invalid, unrecognized or unsafe, and then serialize the result back out. That takes a lot more work than just applying string parsing to strip things out, but it is far less likely to result in data loss (formatting is data too).
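To make the parse, filter and serialize idea concrete, here is a minimal sketch using Python's standard html.parser module. The allowlists, the AllowlistSanitizer class and the sanitize function are all hypothetical names and choices made for illustration; a real feed reader would need a far larger allowlist and a parser that also repairs unbalanced markup.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists, chosen only for illustration.
ALLOWED_TAGS = {"p", "a", "em", "strong", "ins", "del", "ul", "ol", "li", "br"}
VOID_TAGS = {"br"}  # elements that never get a closing tag
ALLOWED_ATTRS = {"a": {"href", "title"}}
SAFE_URL_PREFIXES = ("http://", "https://")
# Elements whose text content is dropped along with the tags themselves.
DROP_CONTENT_TAGS = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-serializes only the tags and attributes named in the allowlists."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown or unsafe tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, set()):
                continue
            # Only absolute http(s) links survive; javascript: URLs are dropped.
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser has already decoded entities, so re-escape on output.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

Because nothing reaches the output unless the serializer writes it deliberately, an unrecognized construct is dropped by default rather than slipping past a pattern match. This sketch does not balance tags, so a stray closing tag from the allowlist would still be emitted as-is.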
Put Security In The Right Place
What About Browser Bugs?
If you split the content into separate browser views there's still the possibility that a browser bug will open a security hole. Of course, if that's the case your feed reader is only a small part of the problem – any website you visit with your browser could also cause you issues. Even if your feed reader sanitizes HTML, it would only take a little bit of social engineering to make you open the entry in your browser, where the exploit would occur. The simplest way to achieve this is to design the entry content to make a lot more sense when viewed with CSS styles (or just add a note to the top to say that).
Why Do You Need Formatting Anyway?
Formatting is data. In many cases the way content looks is just as important as what the content says. The specific example that's driving me crazy at the moment is the recent-changes feed from our internal wiki. In NetNewsWire it's the most fantastic way to follow a wiki I've ever come across – each change turns up in the feed reader with the complete page content and the changes highlighted in green or red so you can quickly see what changed in context. The markup is actually nice and semantic – the changes are marked up using ins and del tags – but the default rendering of these tags is next to useless because it doesn't make them stand out enough from the rest of the content. So for those poor people using FeedBurner (and any other reader that strips the background-color style on the ins and del tags) the feed is nearly useless because it's too hard to see what changed. The formatting of the content is critical to being able to extract the information – falling back to just the semantic information renders the content almost useless.
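A reader wouldn't have to accept arbitrary CSS to keep that highlighting. As a rough sketch, assuming a small allowlist of presentational properties and a deliberately conservative pattern for values (both hypothetical, as is the filter_style name), an inline style attribute could be filtered declaration by declaration:

```python
import re

# Hypothetical allowlist of low-risk presentational properties.
ALLOWED_PROPERTIES = {
    "color", "background-color", "font-weight", "font-style", "text-decoration",
}

# Conservative value pattern: a hex colour, a bare keyword, or a simple rgb() call.
# Anything else, including url(...) and expression(...), is rejected.
SAFE_VALUE = re.compile(
    r"#[0-9a-fA-F]{3,8}"
    r"|[a-zA-Z-]+"
    r"|rgb\(\s*\d{1,3}\s*,\s*\d{1,3}\s*,\s*\d{1,3}\s*\)"
)


def filter_style(style):
    """Return only the declarations whose property and value both look safe."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not sep or name not in ALLOWED_PROPERTIES:
            continue  # malformed declaration or property outside the allowlist
        if SAFE_VALUE.fullmatch(value):
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


# The wiki-style highlight survives; positioning and script-like values do not.
print(filter_style("background-color: #cfc; position: fixed; color: expression(alert(1))"))
```

With a filter along these lines, the background-color on ins and del tags would come through while anything that can fetch a URL, run script or reposition content is discarded.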
The great irony in this situation is that font tags would have worked perfectly even though they don't include the right semantic information and make the feed content far less accessible.
There has been a huge push in recent years to move away from the old habits of early HTML and to leverage CSS for presentation – the fact that it doesn't work in feed readers is a major pain for people trying to do the right thing. It's good that we identified a security threat and dealt with it quickly – but it's not acceptable to stop there. We need to work to get back the functionality we used to have without reintroducing the security risks. It's not simple, but it is important.
Let's stop neutering the web.