April 5th, 2010


good news for searching math and science work

W3C¹ has just announced a Recommendation² regularizing the math and science character set for XML³, which should filter down to HTML. For example, there seem to be five ways to refer to the Greek letter epsilon (ε), and having rules sorts them out nicely. It will be much easier to search for equations and formulae, which are used in everything from financial calculations to architecture.

Anything that pins down textual representation of concepts is always going to be a good thing for search. That's why search people are so enamored with Unicode. Most modern search engines convert from old-fashioned system-specific character sets to Unicode before indexing. When the search terms get the same treatment, they will match the index terms and the search will be successful.
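Here is a minimal sketch of that "same treatment" idea in Python. The function names are made up for illustration, and the choice of NFKC normalization plus case folding is my assumption about a reasonable pipeline, not something the W3C Recommendation prescribes.

```python
import unicodedata


def to_index_form(raw: bytes, encoding: str) -> str:
    """Decode bytes from a legacy character set and normalize for indexing."""
    text = raw.decode(encoding)
    # NFKC folds compatibility variants (such as the lunate epsilon symbol)
    # into their ordinary letters; casefold() removes case differences.
    return unicodedata.normalize("NFKC", text).casefold()


def normalize_query(query: str) -> str:
    """Give search terms the same treatment as the indexed text."""
    return unicodedata.normalize("NFKC", query).casefold()


# Byte 0xE5 is epsilon in the ISO 8859-7 (Greek) character set.
indexed = to_index_form(b"\xe5", "iso8859_7")
# U+03F5 is the lunate epsilon symbol, a variant often used in math.
searched = normalize_query("\u03f5")
print(indexed == searched)
```

Because both sides pass through the same normalization, a query typed with the math variant of epsilon still lines up with a document that was stored in an old Greek encoding.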

So basically, I think this is a very good thing. Search engine indexers will still need to log and report trouble with character sets, because there are so many messy ones out there, and indexing files as random glyphs is a bad thing. But this will make it a bit easier.
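As a rough illustration of the logging point, an indexer could refuse to index text it cannot decode and record why. This is only a sketch: the logger name, function name, and the decision to skip the file outright are my assumptions, not how any particular search engine does it.

```python
import logging
from typing import Optional

logger = logging.getLogger("indexer")  # hypothetical logger name


def decode_for_index(raw: bytes, declared: str, source: str) -> Optional[str]:
    """Decode with the declared character set, or log the problem and skip."""
    try:
        return raw.decode(declared)
    except (UnicodeDecodeError, LookupError) as exc:
        # UnicodeDecodeError: the bytes do not fit the declared character set.
        # LookupError: Python does not know the declared character set at all.
        logger.warning("not indexing %s: cannot decode as %s (%s)",
                       source, declared, exc)
        return None
```

Returning None instead of guessing keeps random glyphs out of the index, and the log entry gives whoever runs the indexer something to act on.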

¹ W3C is the World Wide Web Consortium, the closest thing to a governing body for the web. Their standards define the protocols, so most browsers can view most web sites.

² Recommendations are the W3C name for standards. IIRC they use different terminology because the official International Organization for Standardization (ISO), which can take decades to get things done, was territorial about the word "standard".

³ I'm looking at you, Microsoft, with your special XML files for Office programs.
