Category Archives: ONIX

Resist Bad Data, Part 2: How to Filter Incomplete XHTML Entities and Encodings

So, as stated in the previous piece, dealing with poorly formed XML is a necessary (though infuriating) part of the day for some people, including me. In that article, I described which regular expression to use (with Perl) in order to cleanse your files of such garbage. Which is fine, especially when you’re scripting…but these days, I suspect that most systems are less about batch jobs and more about application code (distributed architectures, microservices, etc.). So it’d be nice to have a solution that could live inside a platform built on a more robust programming language.

So, how do we do it in Java? Easy enough. We can reuse the same regular expression mentioned last time, since Java’s String.replaceAll() works much like the substitution operator in other languages (like Perl):


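java.nio.file.Path wiki_path = java.nio.file.Paths.get("C:/onix_test_data/test_files", "test_file_ONIX.xml");
java.nio.charset.Charset charset = java.nio.charset.Charset.forName("ISO-8859-1");

try {
    // Read the whole file into a single string.
    StringBuilder FileContents = new StringBuilder();
    java.util.List<String> lines = java.nio.file.Files.readAllLines(wiki_path, charset);
    for (String line : lines) {
        FileContents.append(line).append("\n");
    }
    String sAllFileContents = FileContents.toString();

    // Complete entities are captured in group 1 and written back unchanged;
    // a dangling "&#" (plus any digits after it) matches the second alternative and is dropped.
    String sUpdatedFileContents = sAllFileContents.replaceAll("(&#?x?[A-Za-z0-9]+;)|&#\\d*", "$1");

    // Assumption: the writer class was lost from the original listing; PrintWriter fits the println() call.
    try (java.io.PrintWriter out = new java.io.PrintWriter("C:/onix_test_data/output/test_file_ONIX.filtered.xml", charset.name())) {
        out.println(sUpdatedFileContents);
    }
} catch (java.io.IOException e) {
    System.err.println("Could not filter the file: " + e.getMessage());
}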
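
If the file is large, pulling all of it into one string may cost more memory than you’d like. Here is a minimal sketch of a line-by-line variant that uses the same expression. The class name OnixEntityFilter and its method names are my own placeholders, and it assumes that an incomplete encoding never straddles a line break:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class OnixEntityFilter {

    // Same expression as above, compiled once instead of once per line.
    private static final Pattern BAD_ENCODING =
            Pattern.compile("(&#?x?[A-Za-z0-9]+;)|&#\\d*");

    // Keeps complete entities (group 1); drops a dangling "&#" and any digits after it.
    static String filterLine(String line) {
        return BAD_ENCODING.matcher(line).replaceAll("$1");
    }

    // Streams the input to the output one line at a time.
    static void filterFile(Path in, Path out, Charset charset) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, charset);
             BufferedWriter writer = Files.newBufferedWriter(out, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(filterLine(line));
                writer.newLine(); // platform line separator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java OnixEntityFilter <input.xml> <output.xml>
        filterFile(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.ISO_8859_1);
    }
}
```

The trade-off is that line endings get normalized to whatever the platform uses, which is usually harmless for XML.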


And what about C#? Well, as I’ve pointed out again and again, the .NET platform isn’t exactly a source of inspiration when it comes to handling XML. And, of course, it has to be different when it comes to everything, including regular expressions. So, as my personal recommendation, I would use the regular expression only for pattern matching and leave the actual filtering to a callback (a MatchEvaluator delegate):


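using System;
using System.Text.RegularExpressions;

// Every "&"-led token, complete or not, is handed to FilterMatcher.
var sAlteredOutput = Regex.Replace(sOnixTestXml, @"&#?x?[A-Za-z0-9]*;?", FilterMatcher, RegexOptions.Singleline);

static public string FilterMatcher(Match m)
{
    string sResult = "";

    try
    {
        if (m.Value == "&#")
            sResult = "";                              // a bare "&#" is dropped
        else if (m.Value.StartsWith("&#") && (m.Value.Length > 2))
        {
            int nEncodingVal = 0;
            string sInsideEncoding = m.Value.Substring(2);

            if (m.Value.EndsWith(";"))
                sResult = m.Value;                     // complete encoding: keep as-is
            else if (!Int32.TryParse(sInsideEncoding, out nEncodingVal))
            {
                // Not purely numeric: skip the leading run of 'x' and digit characters.
                int nIdx;
                for (nIdx = 0; nIdx < sInsideEncoding.Length; ++nIdx)
                {
                    char cTemp = sInsideEncoding[nIdx];
                    if (!(cTemp == 'x') && !Char.IsDigit(cTemp)) break;
                }

                if (nIdx == sInsideEncoding.Length)
                    sResult = m.Value;                 // e.g. "&#x000": kept for later analysis
                else
                    sResult = m.Value.Substring(nIdx + 2);  // keep only the trailing text
            }
            else
                sResult = "";                          // incomplete numeric encoding: drop it
        }
        else if (m.Value.StartsWith("&"))
            sResult = m.Value;                         // named entity or lone ampersand: keep
    }
    catch (Exception)
    {
        sResult = m.Value;                             // on any surprise, leave the text alone
    }

    return sResult;
}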



In fact, I’d rather use this solution with callbacks since I love the idea of having more programmatic control. But don’t tell Microsoft that. 😛
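
For what it’s worth, the callback style is available in Java too. Below is a rough sketch using the long-standing Matcher.appendReplacement()/appendTail() loop. The class and method names are hypothetical, and the decision rule in decide() is a simplification of my own: it strips “&#” plus any leading digits from an incomplete encoding and keeps the rest, which is not a line-for-line port of the C# method.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the callback approach in Java.
class EntityCallbackFilter {

    // Same broad pattern as the C# call: any "&"-led token, with or without a closing ";".
    private static final Pattern CANDIDATE = Pattern.compile("&#?x?[A-Za-z0-9]*;?");

    // The "callback": decides what each matched token becomes.
    static String decide(String token) {
        if (!token.startsWith("&#")) {
            return token;                  // "&", "&D", "&gt;" and similar are left alone
        }
        if (token.endsWith(";")) {
            return token;                  // complete encoding
        }
        int idx = 2;                       // skip "&#", then any ASCII digits
        while (idx < token.length() && token.charAt(idx) >= '0' && token.charAt(idx) <= '9') {
            idx++;
        }
        return token.substring(idx);       // keep whatever text followed
    }

    static String filter(String input) {
        Matcher m = CANDIDATE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement() stops "$" and "\" in the result from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(decide(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("Yay&#9996Yay D&D &#99 &#8364; &#"));
    }
}
```

All of the policy lives in decide(), so changing how a given case is treated means touching a few lines of plain code instead of the expression itself.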

And there you go. Two solutions to help you work around the various forms of ineptitude of data suppliers…of which there seems to be no end in sight.


Resist Bad Data, Part I: The Horrid Pain of Incomplete XHTML Entities and Encodings

The avant-garde of the software world may have migrated to greener pastures, munching on more lush hardware and grazing on more dynamic software. However, if you work on software in a more “established” industry like retail or publishing (or, as in my case, the intersection of them), then you’re accustomed to the entrenched practices of institutions. A scant few are worse than others, refusing to give up their fax machines in exchange for scanners and PDFs. However, most of these ancients do move, although at a slower pace. In that sense, it will be a long while before XML is discarded as the standard format for data exchange. Take ONIX, for example.

And since I support the consumption of this XML standard, I must anticipate the various issues that might be encountered with it. For those of you who don’t deal with this type of madness, the rest of this post probably means nothing to you. For those of you who do, however…you are my brothers and sisters, my comrades in the trenches. You have my full empathy. And due to our shared bond, I am compelled to help you.

Of course, when you receive an XML file, you want to validate its structure (and, in some cases, its content). “But who would send improperly formatted data, especially if you have a business relationship? Surely they would have validated it before releasing it onto the world?” Oh, how I wish that were true. On the plus side, most providers of XML data do get the basics down. For example, they have opening and closing tags, and they know how to spell the name of their own company in the comments. On the negative side, they may not understand the XML standard completely, and since they don’t run an XML package to validate their own files, the content (i.e., the inner text within the tags) can cause the whole file to be invalid in the eyes of an XML parser. I’m sure that you know what I’m talking about, my comrades.

Take for example the following ONIX XML:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - &#Xae Ni&#x000F1;os! - Ni&#x000os!</TitleText>
<Subtitle>Yay&#9996Yay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D &#99 with My &#9996 Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; &#</Subtitle>

There are a few incomplete encodings here (like “&#Xae” and “&#9996” and “&#”) that will cause the file to fail validation. (And, no, “&sum;” would not fail here, since it’s valid in the eyes of the ONIX DTD.) And since I don’t like manually combing through a 600 MB file and fixing each grotesque instance, we should create an automated solution, using something dangerously powerful. Yes…I am talking about regular expressions. Of course, this issue isn’t exactly a new one, since developers have been talking about it again and again for a while. However, most of the solutions don’t address all the issues at once, like the ones presented above.
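
To see the failure for yourself, here is a minimal sketch that feeds small fragments to the JDK’s built-in parser. It only checks well-formedness (no DTD is loaded), so it sticks to numeric encodings; a named entity like “&sum;” needs the ONIX DTD to be declared. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; checks well-formedness only, not DTD validity.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from printing to stderr; fatal errors are still thrown.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                          // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);    // environment problem, not a data problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<TitleText>Ni&#x000F1;os &#9996;</TitleText>")); // true
        System.out.println(isWellFormed("<Subtitle>Yay&#9996Yay</Subtitle>"));            // false
        System.out.println(isWellFormed("<Subtitle>Moy Fun &#</Subtitle>"));              // false
    }
}
```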

So, after spending the good part of a day desperately trying to remember the idiosyncrasies of regular expressions (capture groups, etc.), I came up with a more encompassing solution:

$line =~ s/(&#?x?[A-Za-z0-9]+;)|&#\d*/$1/g;

If applied via Perl to the sample XML mentioned above, it results in the following:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - Xae Ni&#x000F1;os! - Nix000os!</TitleText>
<Subtitle>YayYay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D with My Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; </Subtitle>

And voilà! Your validation issues are all gone, and the rest of your data has not been mauled or decimated. Well, not terribly, anyway. Plus, it’s pretty darn fast. (Unless of course you’re running Perl on Windows. Then you might as well take a long lunch and a nice nap before it’s finished.) Now, in my case, I wanted to remove only the incomplete numeric encodings via the second alternative (i.e., &#\d*), and I wanted to keep the text of the incomplete hex encodings (like “&#x000”) and alphabetic encodings (like “&#Xae”), minus the leading “&#”, for further analysis. So, you may want to modify the expression if you want to handle the latter two in a different way. Also, it should be noted that it does not handle incomplete HTML entities. For example, if the provider gives you something like “&gt” where it’s missing the semicolon, this expression will not help you. Also, if the provider gives you a complete encoding with an invalid value (like “&#x000;”), it definitely won’t help you. However, you could modify it or use it as a template for an expression that targets those problems specifically.
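
As one example of such a template, here is a hedged sketch (in Java) of an expression that repairs a named entity that lost its semicolon. It covers only XML’s five predefined entities (amp, lt, gt, apos, quot); extending the list to the HTML names that your DTD declares is left to you. The class name is hypothetical, and I’m assuming that a name followed by another letter or digit (like “&gtx”) is not a truncated entity and should be left alone.

```java
import java.util.regex.Pattern;

// Hypothetical class name; repairs only the five predefined XML entities.
class NamedEntityRepair {

    // "&name" NOT followed by a letter, digit, or ";" is treated as missing its semicolon.
    private static final Pattern MISSING_SEMICOLON =
            Pattern.compile("&(amp|lt|gt|apos|quot)(?![A-Za-z0-9;])");

    static String repair(String text) {
        // "$1" is the captured entity name; the semicolon is appended after it.
        return MISSING_SEMICOLON.matcher(text).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        System.out.println(repair("3 &gt 2 and 2 &lt; 3")); // prints "3 &gt; 2 and 2 &lt; 3"
    }
}
```

Whether guessing at the provider’s intent like this is safe depends on your data, so it may be worth logging each repair rather than applying it silently.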

In future posts, I’ll talk about other options for this kind of situation, by making use of either C# or Java. Hopefully, this post will save you the hours that I had to spend. And if you have any useful advice for how to address such bad data, I’d welcome hearing it. Since data providers will always issue bad data, we’ll always need more tools at our disposal. Unfortunately, though, I am forced to ignore any tips on arson or demolition, since violence is never the answer.

At least, that’s what I’ve been told.

ONIX Data Library : Now Available on iOS

Well…I might be a little “tricksy” in that announcement, since it might not be exactly what you think. No, I haven’t yet ported the solution to Swift. (After playing with a few projects pulled from GitHub, I noticed how different Objective-C and Swift are from the state of iOS development 5 years ago. Seems like there might be a little bit of a learning curve there.)

However, the good news from Microsoft with .NET Core keeps coming. So, on top of delivering the port to Linux, they released the preview of Visual Studio for Mac only a few weeks ago. And the reaction seems to be generally positive! Now, all Mac-centric companies that deal with book data can make use of my ONIX Data Library. We just welcomed another 7 people to the fold!

Given to Open Source: ONIX Data Library

The ONIX standard…huh? Am I right? What…you’ve never heard of it?!?

Yeah, well…I guess that makes sense. However, if you’ve worked on any project regarding the publishing industry, then there’s a good chance that you have heard of it. Basically, it’s the international standard for representing electronic data regarding books (along with other select media formats). Titles, prices, commentaries…most of that data is passed between companies in the ONIX format. It can be frustrating to work with at times…but work with it you must.

Strangely, though, there aren’t many tools or libraries out there which focus on it. Now, you might be saying, “Of course there are no libraries or tools out there…there are more people who use Sanskrit than use this standard.” Well…that might be true; I’m not sure. However, there are enough people out there (including developers) who work with it; there should be something out there to help us brave few. And when I found nearly nothing for the .NET platform, I decided to make one of my own.

It was a little awkward at first during development, since I found a few platform issues regarding XML in my adventures. However, after a few weeks of work, I finally had something substantial. So, I am proud to introduce the world’s first open-source serialization/parser library for ONIX in C#, complete with a few pretty ribbons attached! It’s bound to be of some use to somebody…all 5 people who happen to use both ONIX and the .NET platform. Everyone else may say “blah”, but those scant few are going to be ecstatic. We’re going to throw a pizza party just for us, and everybody else is going to be soooo jealous.