Resist Bad Data, Part I: The Horrid Pain of Incomplete XHTML Entities and Encodings

The avant-garde of the software world may have migrated to greener pastures, munching on lusher hardware and grazing on more dynamic software. However, if you work on software in a more “established” industry like retail or publishing (or, as in my case, the intersection of the two), then you’re accustomed to the entrenched practices of institutions. A scant few are worse than others, refusing to give up their fax machines in exchange for scanners and PDFs. Most of these ancients do move, though, albeit at a slower pace. In that sense, it will be many more days before XML is discarded as the standard format for data exchange. Take ONIX, for example.

And since I support the consumption of this XML standard, I must anticipate the various issues that might be encountered with it. For those of you who don’t deal with this type of madness, the rest of this post probably means nothing to you. For those of you who do, however…you are my brothers and sisters, my comrades in the trenches. You have my full empathy. And because of our shared bond, I am compelled to help you.

Of course, when you receive an XML file, you want to validate its structure (and, in some cases, its content). “But who would send improperly formatted data, especially if you have a business relationship? Surely they would have validated it before releasing it onto the world?” Oh, how I wish that were true. On the plus side, most providers of XML data do get the basics down. For example, they have opening and closing tags, and they know how to spell the name of their own company in the comments. On the negative side, they may not understand the XML standard completely, and since they don’t run an XML package to validate their own files, the content (i.e., the inner text within the tags) can cause the whole file to be invalid in the eyes of an XML parser. I’m sure you know what I’m talking about, my comrades.

Take for example the following ONIX XML:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - &#Xae Ni&#x000F1;os! - Ni&#x000os!</TitleText>
<Subtitle>Yay&#9996Yay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D &#99 with My &#9996 Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; &#</Subtitle>

There are a few incomplete encodings here (like “&#Xae” and “&#9996” and “&#”) that will cause the file to fail validation. (And, no, “&sum;” would not fail here, since it’s valid in the eyes of the ONIX DTD.) And since I don’t like manually combing through a 600 MB file and fixing each grotesque instance, we should create an automated solution, using something dangerously powerful. Yes…I am talking about regular expressions. Of course, this issue isn’t exactly a new one, since developers have been talking about it again and again for a while. However, most of the solutions don’t address all of the issues at once, like the ones presented above.
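If you want to watch the failure happen for yourself, a quick check with a stock XML parser will do it. (Python here rather than Perl, purely because it’s what I had on hand; the variable names are mine.)

```python
import xml.etree.ElementTree as ET

# "&#9996" is missing its closing semicolon, so the character reference is
# malformed and the whole document is rejected as not well-formed.
sample = '<Subtitle>Yay&#9996Yay</Subtitle>'
try:
    ET.fromstring(sample)
    result = 'parsed'
except ET.ParseError as err:
    result = 'rejected: ' + str(err)
print(result)
```

One dangling semicolon, and the parser throws out the entire document, no matter how clean the other 599.99 MB are.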

So, after spending the good part of a day desperately trying to remember the idiosyncrasies of regular expressions (capture groups, etc.), I came up with a more encompassing solution:

# Keep complete references (captured in $1); strip the bare "&#"
# (plus any trailing digits) of an incomplete numeric reference.
$line =~ s/(&#?x?[A-Za-z0-9]+;)|&#\d*/defined $1 ? $1 : ''/ge;

If applied via Perl to the sample XML mentioned above, it results in the following:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - Xae Ni&#x000F1;os! - Nix000os!</TitleText>
<Subtitle>YayYay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D with My Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; </Subtitle>
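For any comrades without Perl at hand, the same substitution can be sketched in Python. (This is my translation, not part of the original solution; the function name is mine.)

```python
import re

# Same pattern as the Perl one-liner: group 1 captures complete references
# like "&#9996;" or "&#x000F1;" and keeps them; an unmatched bare "&#"
# plus any trailing digits is dropped.
PATTERN = re.compile(r'(&#?x?[A-Za-z0-9]+;)|&#\d*')

def fix_entities(line):
    # When the second branch matched, group 1 is None, so substitute "".
    return PATTERN.sub(lambda m: m.group(1) or '', line)

print(fix_entities('Yay&#9996Yay'))                   # → YayYay
print(fix_entities('&#9996; A&#x000F1;os - &#Xae!'))  # → &#9996; A&#x000F1;os - Xae!
```

Note that `m.group(1) or ''` plays the role of Perl’s `$1`, quietly turning the unset capture into an empty string instead of a warning.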

And voilà! Your validation issues are all gone, and the rest of your data has not been mauled or decimated. Well, not terribly, anyway. Plus, it’s pretty darn fast. (Unless, of course, you’re running Perl on Windows. Then you might as well take a long lunch and a nice nap before it’s finished.) Now, in my case, I wanted to remove only the numeric encodings via the second branch of the expression (i.e., &#\d*), and I wanted to keep the hex encodings (like “&#x000”) and alphabetic encodings (like “&#Xae”) for further analysis. So, you may want to modify the expression if you want to handle the latter two in a different way. Also, it should be noted that it does not handle incomplete HTML entities. For example, if the provider gives you something like “&gt” where it’s missing the semicolon, this expression will not help you. Likewise, if the provider gives you a reference to an invalid character (like “&#x000;”, which is well formed but points at the forbidden NUL character), it definitely won’t help you. However, you could modify it or use it as a template for an expression that targets those problems specifically.
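As one example of using it as a template: a follow-up pass can escape any ampersand that doesn’t begin a well-formed reference, which sweeps up the stray “D&D” and “D & D” cases. (This sketch is mine, not part of the solution above, and the names are hypothetical.)

```python
import re

# Escape any "&" that does not start a well-formed named entity,
# decimal reference, or hex reference, so leftovers like "D&D" or a
# bare "&gt" no longer trip the parser.
BARE_AMP = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')

def escape_bare_ampersands(line):
    return BARE_AMP.sub('&amp;', line)

print(escape_bare_ampersands('Play D&D and D & D'))
# → Play D&amp;D and D &amp; D
```

The trade-off: a dangling “&gt” becomes “&amp;gt”, which is well formed but leaves the literal text “&gt” in your data. Whether that’s acceptable depends on how picky your downstream systems are.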

In future posts, I’ll talk about other options for this kind of situation, making use of either C# or Java. Hopefully, this post will save you the hours that I had to spend. And if you have any useful advice on how to address such bad data, I’d be glad to hear it. Since data providers will always issue bad data, we’ll always need more tools at our disposal. Unfortunately, though, I am forced to ignore any tips on arson or demolition, since violence is never the answer.

At least, that’s what I’ve been told.
