On Ampersands And Standards
Byron commented on ampersand redux:
Yes, an ampersand is valid as part of an attribute value (as represented in an HTML document) where that ampersand is part of an entity reference. An ampersand that is not part of an entity reference is not valid in an attribute value, in an HTML document. Serialization has nothing to do with it, since an HTML document is not the serialization of a DOM tree, although it can be viewed as such. I did not mean to say anything about serializing attribute values, I meant to say that an attribute value in an HTML document cannot legally have an ampersand that is not part of an entity reference. If your document does have such an ampersand, it will not validate. It might work in current browsers, but down the road it might not. Don’t do it. If a browser gets it wrong, file a bug against the browser or avoid ampersands entirely, don’t force every other author of HTML parsers to work around your markup’s faults.
I still disagree with the first part – ampersands are perfectly valid in HTML comments but when serialized they must be escaped as entities. It is critical to consider entities as equivalent to the character they represent, otherwise &eacute; wouldn’t be the same as é which is clearly ludicrous. Regardless, the point is entirely academic so I’ll leave it at that. The last part however is crazy. If a browser has a bug and you need to support that browser, you should do whatever it takes to make your application work with that browser – standards be damned. It is in no way acceptable for a software developer to skip requirements just because it would mean conflicting with a standard. If adhering to the standard was also a requirement then the higher priority requirement should wind up being implemented and the other one revised to not be in conflict. If you can get the browser vendor to fix the issue and you consider it acceptable to make all your clients upgrade to the fixed version then by all means follow the standard – otherwise throw it out.
Standards are designed to enhance interoperability; if they reduce interoperability in areas that are important to your project they are completely worthless and should be ignored. The comment about forcing every other HTML parser to work around the markup problems is a red herring as well – HTML parsers already have to deal with that kind of thing and that’s not going to change. XML parsers on the other hand do not have to handle invalid markup and most don’t, which is precisely why I pointed out that you should always escape ampersands correctly in XHTML despite the fact that most if not all browsers will get it right either way. Software development is about achieving the project’s requirements. It’s not about politics, it’s not about standards and it’s not about making yourself feel good. If you can meet your requirements and do any of that, then great, but the requirements are the only thing that has to be achieved and they override anything else. That said, any of those things could be made a requirement of the project, but it’s quite rare that they would actually be requirements, let alone high priority ones.
Ampersand Redux
It seems I wasn’t clear enough with my ampersand related comments. I’m not talking about standards here, the standards are very clear – & should always be escaped as &amp;, no ifs no buts. However, we live in the real world and many things don’t follow standards correctly. So while David is correct that the validator will complain if you don’t escape ampersands in HTML documents, some browsers will get it wrong if you do escape them in some cases (it’s exotic and the actual test cases are at work not here unfortunately). In XHTML however, you really seriously have to escape them because a) browsers get it right when kicked into XHTML mode, and b) XML parsers barf if you don’t. Byron also chimes in with a comment:
Odd Bits Of HTML Behaviour
If you wanted to create a hyperlink to a file called “Me & You”, which of the following should you use?
<a href="Me & You"> or
<a href="Me &amp; You"> In other words, should you escape the ampersand or not? It depends. If you create a plain HTML page, you must not escape the ampersand or it won't work (browser dependent obviously), however if you leave it unescaped it will work in every browser. If, however, you create an XHTML document you should escape the ampersand; otherwise XML parsers will break when parsing the document. Browsers will get the escaped link right as long as they are kicked into XHTML mode by the appropriate declaration at the top of the file. If you want to test this try linking to a URL with
&amp;
in it (ie: the file name literally includes the HTML entity for ampersand). Better yet, don’t put stupid characters in your URLs.
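For the XML half of that claim, here is a rough Java sketch. It assumes the JDK's built-in DOM parser is a fair stand-in for "an XML parser", and the class and method names (AmpersandDemo, escapeAttribute, hrefOrNull) are made up for illustration. It only shows that the unescaped form is rejected outright while the escaped form comes back as a plain ampersand; it says nothing about what any particular browser does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any product mentioned in the post.
class AmpersandDemo {

    /** Escapes a raw string for use inside a double-quoted XML attribute value. */
    static String escapeAttribute(String raw) {
        // The ampersand must be replaced first, or the other entities get double-escaped.
        return raw.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace("\"", "&quot;");
    }

    /** Returns the root element's href, or null if the markup is not well-formed XML. */
    static String hrefOrNull(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            return doc.getDocumentElement().getAttribute("href");
        } catch (SAXException e) {
            return null; // the parser refused the document
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unescaped ampersand: not well-formed, so there is no href to read.
        System.out.println(hrefOrNull("<a href=\"Me & You\"/>"));
        // Escaped ampersand: parses, and the attribute value is the plain character.
        System.out.println(hrefOrNull("<a href=\"" + escapeAttribute("Me & You") + "\"/>"));
        // A file name that literally contains the entity has to be escaped once more.
        System.out.println(escapeAttribute("Me &amp; You"));
    }
}
```

The last call is the test case mentioned above: a name that literally contains the entity only survives the round trip if its ampersand is escaped again.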
Greg, Im Well Aware of When Its Appropriate To Use An Apostrophe
This blog is a very informal place for me – I write what I want, when I want, how I want. Thats why I have a blog. Recently, Greg Black took issue with my (admittedly fairly regular) misuse of the apostrophe. I am actually quite familiar with the rules of when to use an apostrophe and when not to, however since this is informal writing, I tend not to proof read me comments and also tend to think much faster than my fingers type. If this were formal writing not only would I proof read my comments to ensure that apostrophe’s were used in the correctly locations but also rewrite those long sentences with many sub-sentences in brackets (I have a bad habit of doing that – it matches my though patterns), and of course the over use of dash’s to tack on additional points. I would probably even go so far as using paragraph’s to delimit separate points instead of using them whenever I feel like some white space is required. Heck, I might even run a speeling checker over it. Its also worth responding to Gregs comment:
Excitement
(In case you haven’t noticed, it seems to be a very work oriented evening this evening.) I’m often a little envious of people who get to work in cool places developing brand new technology and speaking at conferences (and more importantly having more than a handful of people actually care about that area of development). I’ve been in the same job for about 3 years now and while we are and always have been on the very cutting edge of content related technologies (see, not even a cool name for it…), it’s a little bit old hat to me. I’ve beaten my head against all kinds of standards, HTTP, HTML, CSS, XML, Namespaces, XPath, XSD, XSLT, Word – if there’s content written in it, I’ve probably had to deal with it at some point and if it’s at all web related I’ve probably had a lot to do with it. Now I don’t mean to say that I know everything and I certainly don’t want to imply that I’m any more knowledgeable than anyone else – quite the opposite, I still have a lot to learn and there are a lot of people that I’m constantly learning from. What I am trying to say however, is that as a product becomes more mature, the coolness factor of its development tends to wear off. 5 years ago, the ability to replace a standard text area on a HTML page with a WYSIWYG HTML editor was nothing short of astounding. These days most browsers have (very) primitive WYSIWYG editing modes built in. In the past week or two however, I’ve gotten my teeth sunk into some awesome new features that once again have me really excited about the technology space I’m in. With the features I’ve put in during the last week or two and a couple of the features that will go in this week and next, our boring little editor is rocketing forward in usability. When Ephox started making its editors the general practice was to look at Word, FrontPage and DreamWeaver to see what they did and how they did it then try to find a way to make that possible within the confines of a browser.
Now I’m looking at how those programs handle things and finding them lacking and quite buggy. I used to think their behavior was how it was supposed to work and that working out what the user meant was just difficult – I’m really excited to have discovered that’s not the case: most programs are just really buggy and make it look hard. Now I’m not about to suggest that our product is completely bug free, it’s not – it has a lot of room for improvement (and I’m excited that we’re really focussing on making those improvements happen), but when you look back over the past few years and see the journey that Ephox has taken to get here and follow the progress of the entire content management industry, it’s really quite exciting that we’ve gotten here. So keep an eye out for our 4.0 release (or whatever the heck marketing decide to call it) when it comes out (waiting on management and marketing to decide what the best time for a release would be and what features we want in it). It won’t be announced on Slashdot and seeing as we don’t really sell to end users and most people don’t read the kind of news sites we do get mentioned on, you’ll probably never notice the release, but I’m telling you – it’ll be awesome. I’m excited.
Java Is Now Officially Fast
I don’t use our product outside of debug mode often enough apparently. Having played around with the wiki system I mentioned previously (you do read your planets from bottom to top right?), I suddenly noticed that the progress bar our editor applet shows while it’s starting up wasn’t displaying. Turns out the applet was loaded and ready instantly so the screen wasn’t getting a chance to repaint. Awesome! This by the way is only on a 1.4GHz AMD machine with 512MB RAM so it’s about the average for corporate desktops these days, maybe a little above, and it is running the Java 1.5 beta which provided some massive improvements in start up time. Still, our ActiveX based editor doesn’t load this fast. Even better, I’m using an old version of the applet since I didn’t have a recent copy on my laptop when I came home tonight and couldn’t be bothered downloading a new version. It should be faster still with the performance improvements we’ve put in.
I Love Regex, I Hate Regex
I’ve been playing around with writing a mini-wiki that uses the full complement of HTML as its syntax (instead of forcing me to learn yet another markup language) and uses EditLive! for Java as the editor – eating one’s own dog food and all that. Frankly, that’s the way a wiki should work, no messing around with mark up at all, just simple, easy to use WYSIWYG markup. Anyway, I wrote the back end in PHP since we don’t have any PHP examples in our SDK and I couldn’t be bothered working out why Perl refused to install the MySQL drivers. Loading and saving from the database is simple enough, and I settled in to make the CamelCase words hyperlinks. The obvious answer: regex. The obvious problem: working out which regex to use (I don’t use regex often since I usually live in the land of custom automatons instead). I’ve wound up with:
The Default Namespace
Byron complains about what he calls a limitation of XPath. It’s not actually a limitation of XPath at all but rather a very common mistake people make when working with XML namespaces. Let’s take a tour into the dark depths of XML namespaces to discover what’s really going on. Originally, XML didn’t have namespaces at all, every element was identified purely by its name. So the element <html> was known as html and the world was simple. Then people discovered that they wanted to combine XML documents and that quite often they’d wind up with two elements called html that had completely different meanings and uses – ie: they were actually two different elements. To solve this problem, the clever folk over at the W3C changed the way XML elements and attributes were named. Now instead of a simple string, elements would be named using a QName (short for Qualified Name). A QName is a compound data object consisting of a URI and the regular name we’re used to. Now the important thing to note in this distinction is that it is impossible to refer to an element only by its local-name, you have to use a QName to refer to it and that QName must have a namespace attribute (all QNames do). It is however possible to assign nil to the namespace attribute of a QName, which is referred to as an element in the nil namespace.
Now, one of the important things about adding namespaces to XML was backwards compatibility so there had to be some way to assign a namespace to all those elements which were previously referred to only by their local-name. This is where the default namespace comes in (it also happens to be quite convenient). By default in an XML document, any element that doesn’t have a prefix to declare which namespace it is in, is assigned the “default namespace”. The default namespace however isn’t an actual namespace, it’s just a default value for the namespace attribute. By default, the default namespace is the nil namespace. Now, in XML you can specify what the default namespace is by adding an xmlns
attribute, eg: xmlns="http://www.w3.org/1999/xhtml"
. The new value for the default namespace is then in effect for that element and any element under it (unless it’s changed again). It is important to note here that an XML document doesn’t have a default namespace, but rather each element has a value which it inherits (attributes never ever use the default namespace). There can therefore be as many different values for the default namespace as there are elements. So if we were to add Byron’s idea of matching any element in the default namespace, it would never match anything because every element (and attribute) would have an explicit namespace. Worse still, the default namespace would change depending on which element we were in and what the specific representation of the XML was (maybe the default namespace was left as nil and every element used a prefix or maybe no elements were prefixed and the default namespace was changed all through the document). Depending on the representation of XML instead of the actual data it represents is very bad practice and will cause problems. This leads right into the other common mistake in Byron’s post: //*[name() == 'foo']
will do nothing particularly useful. What it will do is match any element with a local-name of foo, in any namespace as long as the element was represented without a prefix. It will not match <my:foo xmlns:my="http://www.intencha.com/my" /> because name() for my:foo will return my:foo. Elements are the same if they have the same QName regardless of whether or not a prefix was used or if different prefixes were used. The correct way to select any element with a local-name of foo in any namespace regardless of what prefix was used is: //*[local-name() = 'foo']
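To see the difference between name() and local-name() in practice, here is a small Java sketch using the JDK's own DOM and XPath support. The sample document, the http://example.com/a namespace and the class name LocalNameDemo are invented for the illustration; only the my:foo element comes from the text above. Note that XPath 1.0 spells equality with a single =.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the document below is made up for illustration.
class LocalNameDemo {

    // Three elements with a local-name of foo: one in the nil namespace,
    // one in a default namespace, and one using a prefix.
    static final String XML =
          "<root>"
        + "<foo/>"
        + "<foo xmlns=\"http://example.com/a\"/>"
        + "<my:foo xmlns:my=\"http://www.intencha.com/my\"/>"
        + "</root>";

    /** Counts the nodes the given XPath 1.0 expression selects in the sample document. */
    static int count(String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; required for namespace-uri() to work
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // name() depends on how the element was written, so my:foo is missed: prints 2
        System.out.println(count("//*[name() = 'foo']"));
        // local-name() ignores the prefix entirely: prints 3
        System.out.println(count("//*[local-name() = 'foo']"));
        // Pinning the namespace as well selects exactly one element: prints 1
        System.out.println(count(
            "//*[local-name() = 'foo' and namespace-uri() = 'http://www.intencha.com/my']"));
    }
}
```

Comparing on local-name() and namespace-uri() together is the representation-independent way to ask for one specific element, since neither depends on which prefix (if any) the document happened to use.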
If I were to take a shot at selecting any node which had a namespace-uri the same as the default namespace in effect in the original representation of the XML document’s root node it would be: //*[string(/*/namespace::*[name() = ""]) = namespace-uri()]
Which is to say:
The Curse Of Testing Text
One of the major challenges in my job is testing our product. Now most people think that testing is reasonably easy but requires discipline. That is not true if the product you write happens to be a styled text editor, and it's nearly impossible to do really well if you're working with something as flexibly defined as HTML.
The problems start at the unit testing level. Try taking the standard JTextPane class and writing unit tests for it. How do you test that it can render a HTML list correctly? You could write a test to make sure that the list numbering at least comes out in order and in the right format but that still won't guarantee that the list numbers actually paint correctly. For instance, we recently had a bug where the list numbers painted correctly if the list fit on screen, but if it required scrolling, whatever item was at the top of the screen was numbered 1 even if it was actually the third item in the list. Our list numbering unit tests had no way of picking up on that because it was the actual rendering code (which we didn't write) that was wrong.
Object.equals()
Andrae Muys provides some excellent advice on implementing Object.equals()
however I do have to correct one thing. Andrae presents the code:
class A {
    int n1;

    public boolean equals(Object o) {
        if (!(o instanceof A)) {
            return false;
        } else {
            A rhs = (A)o;
            return this.n1 == rhs.n1;
        }
    }
}
and suggests that it is incorrect. It is not. This is absolutely, 100% the correct way to implement equals for that class. The alternative he presents on the other hand is incorrect:
Riverfire
I was fortunate enough to be invited out to a friend’s place to see the fireworks last night. They happen to own the penthouse apartment on the river front with an awesome view of nearly all the fireworks. The fireworks are launched from numerous sites along the river so being able to see all of them is really quite unusual. Better yet though, one of the launching barges is positioned directly in front of the balcony as if it were a private show just for us. That particular barge this year had some issues launching its fireworks. There were about three sections where all the other barges launched fireworks in unison but “our barge” sat there doing nothing. Apparently it was unplanned because at the end our barge suddenly started firing off every firework it had missed all at once. A display that was intended to take about 10 minutes was sent up in about a minute flat. Extremely impressive! Also impressive was the dump and burn which was close enough to feel the heat hit you and felt like you could just reach out and catch the plane. I’ve never been overly impressed by fly-bys when standing at ground level at South Bank but from the penthouse it really is quite an experience. Oh, and Iain has photos.
When Marketing Goes Wrong
I’m currently wearing one of the shirts that James Gosling hurled into the crowd by various means which depicts Duke aiming a rocket launcher at a weird looking demon with four arms labeled complexity. Earlier today I was accosted by a very young man who asked what that was on my shirt. When I pointed to Duke and said “this guy’s called Duke” the response was: “he’s crap. I like this guy ‘cause he has four arms”. I’m not sure that’s the reaction the designer was after….