Resist Bad Data, Part 2: How to Filter Incomplete XHTML Entities and Encodings

So, as stated in the previous piece, dealing with poorly formed XML is a necessary (though infuriating) part of the day for some people, including me. In that article, I described a regular expression to use (with Perl) in order to cleanse your files of such garbage. Which is fine, especially when you’re scripting…but these days, I’m fairly sure that the clear majority of systems are less about batch jobs and more about programming-oriented architectures (distributed systems, microservices, etc.). So, obviously, it’d be nice to have a solution that could be part of a platform with a more robust programming language.

So, how do we do it in Java? Easy enough. We can just make use of the same regular expression mentioned last time, since Java has a fairly straightforward approach that mimics other languages (like Perl):

—————-

java.nio.file.Path wiki_path = java.nio.file.Paths.get("C:/onix_test_data/test_files", "test_file_ONIX.xml");
java.nio.charset.Charset charset = java.nio.charset.Charset.forName("ISO-8859-1");

try {
    StringBuilder fileContents = new StringBuilder();

    // Read the whole file into memory, preserving line breaks
    java.util.List<String> lines = java.nio.file.Files.readAllLines(wiki_path, charset);
    for (String line : lines) {
        fileContents.append(line).append("\n");
    }

    String sAllFileContents = fileContents.toString();

    // Keep complete entities/encodings (capture group 1) and drop incomplete numeric ones
    String sUpdatedFileContents = sAllFileContents.replaceAll("(&#?x?[A-Za-z0-9]+;)|&#\\d*", "$1");

    // Write the filtered content back out
    try (java.io.PrintWriter out = new java.io.PrintWriter("C:/onix_test_data/output/test_file_ONIX.filtered.xml")) {
        out.println(sUpdatedFileContents);
    }
} catch (java.io.IOException e) {
    System.out.println(e);
}

—————-

And what about C#? Well, as I’ve pointed out again and again, the .NET platform isn’t exactly a source of inspiration when it comes to handling XML. And, of course, it has to be different when it comes to everything, including regular expressions. So, my personal recommendation would be to use a regular expression for the pattern matching and a callback (i.e., a MatchEvaluator delegate) for the actual filtering:

—————-

// Requires: using System; using System.Text.RegularExpressions;
var sAlteredOutput = Regex.Replace(sOnixTestXml, @"&#?x?[A-Za-z0-9]*;?", FilterMatcher, RegexOptions.Singleline);

...

public static string FilterMatcher(Match m)
{
    string sResult = "";

    try
    {
        if (m.Value == "&#")
        {
            // A bare, dangling "&#" gets dropped outright
            sResult = "";
        }
        else if (m.Value.StartsWith("&#") && (m.Value.Length > 2))
        {
            int nEncodingVal = 0;
            string sInsideEncoding = m.Value.Substring(2);

            if (m.Value.EndsWith(";"))
            {
                // Complete numeric/hex encodings (e.g., "&#9996;") are kept as-is
                sResult = m.Value;
            }
            else if (!Int32.TryParse(sInsideEncoding, out nEncodingVal))
            {
                // Not a plain number: scan for the first character that is neither 'x' nor a digit
                int nIdx;
                for (nIdx = 0; nIdx < sInsideEncoding.Length; ++nIdx)
                {
                    char cTemp = sInsideEncoding[nIdx];

                    if (!(cTemp == 'x') && !Char.IsDigit(cTemp)) break;
                }

                if (nIdx == sInsideEncoding.Length)
                    sResult = m.Value;                     // incomplete hex encoding (e.g., "&#x000") kept for later analysis
                else
                    sResult = m.Value.Substring(nIdx + 2); // strip the broken "&#..." prefix, keep the trailing text
            }
            else
            {
                // Incomplete numeric encoding (e.g., "&#9996" with no ';') gets dropped
                sResult = "";
            }
        }
        else if (m.Value.StartsWith("&"))
        {
            // Entities like "&gt;" or "&sum;" pass through untouched
            sResult = m.Value;
        }
    }
    catch (Exception)
    {
        // On any surprise, fall back to leaving the match alone
        sResult = m.Value;
    }

    return sResult;
}

—————-

In fact, I’d rather use this solution with callbacks since I love the idea of having more programmatic control. But don’t tell Microsoft that. 😛

And there you go. Two solutions to help you work around the various forms of ineptitude of data suppliers…of which there seems to be no end in sight.


Resist Bad Data, Part I: The Horrid Pain of Incomplete XHTML Entities and Encodings

The avant-garde of the software world may have migrated to greener pastures, munching on more lush hardware and grazing on more dynamic software. However, if you work on software in a more “established” industry like retail or publishing (or, as in my case, the intersection of the two), then you’re accustomed to the entrenched practices of institutions. A scant few are worse than others, refusing to give up their fax machines in exchange for scanners and PDFs. Most of these ancients do move, though, albeit at a slower pace. In that sense, there are many days left before XML is discarded as the standard format for data exchange. Take ONIX, for example.

And since I support the consumption of this XML standard, I must anticipate the various issues that might be encountered with it. For those of you who don’t deal with this type of madness, the rest of this post probably means nothing to you. For those of you that do, however…you are my brothers and sisters, my comrades in the trenches. You have my full empathy. And due to our shared bond, I am compelled to help you.

Of course, when you receive an XML file, you want to validate its structure (and, in some cases, its content). “But who would send improperly formatted data, especially if you have a business relationship? Surely they would have validated it before releasing it onto the world?” Oh, how I wish that were true. On the plus side, most providers of XML data do get the basics down. For example, they have opening and closing tags, and they know how to spell the name of their own company in the comments. On the negative side, they may not understand the XML standard completely, and since they don’t run an XML validator against their own files, the content (i.e., the inner text within the tags) can cause the whole file to be invalid in the eyes of an XML parser. I’m sure that you know what I’m talking about, my comrades.
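(For what it’s worth, here’s a minimal sketch of that first sanity check, using the JDK’s built-in SAX parser. The file path is just a placeholder, and keep in mind that entities declared in the ONIX DTD will only resolve if the parser can actually reach that DTD.)

—————-

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormednessCheck {
    public static void main(String[] args) {
        try {
            // No DTD/schema validation here, just a basic well-formedness check
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(new File("C:/onix_test_data/test_files/test_file_ONIX.xml"),
                                         new DefaultHandler());
            System.out.println("File is well-formed.");
        } catch (SAXParseException e) {
            // The parser points right at the offending spot, e.g. a dangling "&#9996"
            System.out.println("Not well-formed at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage());
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

—————-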

Take for example the following ONIX XML:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - &#Xae Ni&#x000F1;os! - Ni&#x000os!</TitleText>
<Subtitle>Yay&#9996Yay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D &#99 with My &#9996 Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; &#</Subtitle>

There are a few incomplete encodings here (like “&#Xae” and “&#9996” and “&#”) that will cause the file to fail validation. (And, no, “&sum;” would not fail here, since it’s valid in the eyes of the ONIX DTD.) And since I don’t like manually combing through a 600 MB file and fixing each grotesque instance, we should create an automated solution, using something dangerously powerful. Yes…I am talking about regular expressions. Of course, this issue isn’t exactly a new one, since developers have been talking about it again and again for a while. However, most of the existing solutions don’t address all of these issues at once, like the ones presented above.

So, after spending the good part of a day desperately trying to remember the idiosyncrasies of regular expressions (capture groups, alternation, etc.), I came up with a more encompassing solution:

$line =~ s/(&#?x?[A-Za-z0-9]+;)|&#\d*/$1/g;

If applied via Perl to the sample XML mentioned above, it results in the following:

<TitleText>&#9996; I Don’t Know How to Create a XML File Properly &#9996; I Should Just Color Books with My Fellow Ni&#x000F1;os for 3 A&#x000F1;os &#9996; - Help Me Color Ni&#x000F1;os - Xae Ni&#x000F1;os! - Nix000os!</TitleText>
<Subtitle>YayYay All Play and No Work for Me &gt; &sum; Just Play D&D and D & D with My Boys - Moy Fun &#8364; (Spanish Edition) &#x000F1; </Subtitle>

And voilà! Your validation issues are all gone, and the rest of your data has not been mauled or decimated. Well, not terribly, anyway. Plus, it’s pretty darn fast. (Unless, of course, you’re running Perl on Windows. Then you might as well take a long lunch and a nice nap before it’s finished.) Now, in my case, I only wanted to remove the numeric encodings via the second alternative (i.e., &#\d*), and I wanted to keep the hex encodings (like “&#x000”) and alphabetic encodings (like “&#Xae”) around for further analysis. So, you may want to modify the expression if you want to handle the latter two in a different way. Also, it should be noted that it does not handle incomplete HTML entities. For example, if the provider gives you something like “&gt” where it’s missing the semicolon, this expression will not help you. Likewise, if the provider gives you an incorrect value for an encoding (like “&#x000;”), it definitely won’t help you. However, you could modify it or use it as a template for an expression that targets those problems specifically.
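(And if you want to chase that first gap, here’s one hedged possibility as a template: a Java snippet that only repairs the predefined XML entities when the trailing semicolon is missing, with a negative lookahead so that complete entities are left alone. The entity list and the sample string are my own inventions, so adjust them to whatever your provider actually butchers.)

—————-

// Repairs bare predefined entities like "&gt" -> "&gt;" while leaving "&gt;" and "&amp;" untouched
String sample = "All Play &gt Work &amp; No &lt;Subtitle&gt; Fun";
String repaired = sample.replaceAll("&(gt|lt|amp|quot|apos)(?![A-Za-z0-9;])", "&$1;");
System.out.println(repaired); // All Play &gt; Work &amp; No &lt;Subtitle&gt; Fun

—————-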

In future posts, I’ll talk about other options for this kind of situation, making use of either C# or Java. Hopefully, this post will save you the hours that I had to spend. And if you have any useful advice on how to address such bad data, I’d welcome it. Since data providers will always issue bad data, we’ll always need more tools at our disposal. Unfortunately, though, I am forced to ignore any tips on arson or demolition, since violence is never the answer.

At least, that’s what I’ve been told.

Red Shirt Tour NYC

A few weeks ago, Microsoft VP Scott Guthrie stopped in NYC as part of his promotional tour for Azure’s cloud services. Normally, I don’t really care for these long infomercials, but I decided to go in this case for two reasons. One, it took place in Cooper Union’s Great Hall, which was something historic I had always wanted to check out. Two, even though I’ve played with Azure’s offerings on occasion, I was curious what Guthrie would highlight in his presentation, especially after friends and colleagues had talked up Azure in the last couple of years. So, I went, and I was surprisingly glad that I did.

A few years ago, when I learned of some of Microsoft’s ambitions in the cloud space, I bought some MSFT shares, thinking that they might catch up with Amazon. Impatiently, after a year, I began to have my doubts, and I sold off the shares. If you look at the latest price, that was clearly a mistake. I should have held onto them, and while listening to Guthrie’s presentation, it became painfully obvious as to why. Of course, he talked about the inherent power of Azure, with its various data centers around the world. (Which were all shown in a dramatic video seemingly directed by Michael Bay, with an intensely dramatic score blaring in the background.) However, it was the maturity of the platform, with its various tools and considerations for the user, that impressed me most.

Even though I’ve only dabbled with cloud platforms, I especially appreciate their raw power and penchant for structure. Even when you create apps that are meant to be deployed to the cloud, Azure heavily reinforces structure in its project templates for developers. For example, if you build a web app within Visual Studio destined for Azure, the project template refers to queues by default (which are automatically available in Azure). You might question the need to reinforce such a feature through templates, thinking that the usage of queues in a web app is obvious. “What vital web service that receives a POST wouldn’t automatically queue that request, since the resources to fulfill the request might be temporarily unavailable? Who wouldn’t do that?” Oh, you’d be surprised. Let’s just say that such bug-ridden deployments are not unheard of.
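(To make that idea concrete, here’s a toy, framework-agnostic sketch in Java, which has nothing to do with Azure’s actual templates or queue service: the handler just parks the payload in a queue and returns, and a background worker drains it whenever the downstream resource is actually available. All the names are made up for illustration.)

—————-

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy illustration of "accept the POST now, do the work later".
public class QueuedIntake {

    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();

    // What a POST handler would call: just enqueue the payload and acknowledge.
    public void handlePost(String payload) {
        requests.offer(payload);
    }

    // A background worker drains the queue as downstream resources allow.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    String payload = requests.take(); // blocks until work arrives
                    process(payload);                 // retry/dead-letter logic omitted
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void process(String payload) {
        System.out.println("Fulfilling request: " + payload);
    }
}

—————-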

First, Guthrie talked about the typical use cases of cloud offerings, like creating and deploying a serverless app with ease or creating a container and deploying it with Kubernetes. Even when he talked about deploying apps to specific data centers, those features were interesting but expected. (I still have mixed feelings about the Azure Portal dashboard, but since that’s an argument about interfaces, that’s an entirely different subject.)

However, I was more impressed when he started talking about the supplemental tools offered by the platform for free. For one, its built-in monitoring system was akin to Splunk: it could be used to monitor and query your entire setup (apps, databases, hosted servers, etc.), and you could customize your system instances with networking rules (like locking port 54545 every morning). Next, the DevOps options seemed more diverse than I remembered (with build options including Maven). There was even a tool that would suggest how you could reduce your expenses, like by consolidating servers and minimizing resources (storage, number of dedicated CPUs, etc.). After watching some examples of machine learning and image recognition, my only regret was that I had no reason to use them in my own projects.

After several years of concentrated effort, they had created an impressive, mature cloud platform that definitely could give Amazon a run for its money. I only had two points of contention. One, this promotional tour was named The Red Shirt Tour after the shirts that, according to the staff, Guthrie has a certain love for. Aside from being a terrible name for a tour, I would say ditch the symbolic shirt, since that will forevermore belong to Jobs. Pick a hat instead, and in order to reinforce Azure, go for a blue beret.

Two, they served Subway for lunch. To which the answer is always no.

Yes, it was free. However, just like if you were offered torture for free, you shouldn’t accept it.

Aside from that, though, I walked away impressed. Nicely done, Guthrie and MS!

And While I’m Asking For Things

Speaking of requests, I have another quick one. My department has used Oracle.ManagedDataAccess.dll for quite a while now in our C# (i.e., .NET) deployments, and even though we’ve had some performance issues at times, we’ve created some workarounds and eventually achieved harmony with this miscreant spawn of Oracle (a company that has been known to be developer-unfriendly). However, we came across a new issue the other day:

Value cannot be null.
Parameter name: byteArray
Server stack trace:
at System.BitConverter.ToString(Byte[] value, Int32 startIndex, Int32 length)
at OracleInternal.TTC.TTCLob.GetLobIdString(Byte[] lobLocator)
at OracleInternal.ServiceObjects.OracleDataReaderImpl.CollectTempLOBsToBeFreed(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.ProcessAnyTempLOBs(Int32 rowNumber)
at Oracle.ManagedDataAccess.Client.OracleDataReader.Read()

Basically, if you have a null value in a CLOB column of an Oracle table, you might get this error when you attempt to access that row in the result set. After trying the most recent version of the library (as recommended by Oracle), it became obvious that they still haven’t fixed it. Strangely, the error seems to occur randomly, since it won’t always happen when running the executable against the same data. After spending some time trying to diagnose the error, I eventually capitulated, accepted Oracle’s foibles, and simply set all null CLOB values to EMPTY_CLOB(). Problem solved.
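(For reference, the cleanup itself lives on the database side, so it can be run from SQL*Plus or any client you like; here’s a hedged sketch of the idea via plain JDBC, where the table, column, and connection details are all hypothetical.)

—————-

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// One-time cleanup: replace null CLOBs with empty ones so the reader never
// trips over a null LOB locator. Table, column, and connection details are made up.
public class ClobCleanup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "some_user", "some_password");
             Statement stmt = conn.createStatement()) {
            int updated = stmt.executeUpdate(
                "UPDATE onix_staging SET description_clob = EMPTY_CLOB() " +
                "WHERE description_clob IS NULL");
            System.out.println("Rows fixed: " + updated);
        }
    }
}

—————-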

But for posterity and all those after me, I implore you, Oracle: can you address this simple problem in your library? I’d ask you to open-source that library, but we all know that’s not gonna happen. So, please go ahead and fix the bug for us. It’d probably take you a whole 15 minutes, the same amount of time that Ellison needs to check the rigging on one of his catamarans. We’d all appreciate it!

A Simple Feature Request for the Spring Community

Not long ago, I was asked by some colleagues to help troubleshoot an issue; they were having some difficulty with the production deployment of a monolithic web service using WildFly and Spring. How could I resist, considering how much I love event-driven, IoC frameworks and their ridiculously verbose log files? (You can cut the sarcasm with a knife.) In any case, it seemed that the server was failing during initialization. After wading through the gazillions of lines written due to an exception from the JedisConnectionFactory, I finally found a null-pointer exception from our code that was a clue. It seemed to indicate that a variable was missing from our active profile. So, I looked inside the standalone.xml file to see if we were pointing to the right profile:

<server xmlns="urn:jboss:domain:1.4">

    <extensions>
        <!-- list of extensions -->
    </extensions>

    <system-properties>
        <property name="spring.profiles.active" value="production"/>
    </system-properties>

That was the right profile name, all right…so I copied the deployed .WAR file and opened it up. Alas, the profile’s properties file wasn’t inside the “/resources” directory! It turned out that they weren’t deploying the right version. Problem solved.

So, that leads me to my point. Can the Spring community do me a big favor? If the target profile does not exist, then maybe the log file (among its gazillions of other lines) should say that THE TARGET PROFILE DOES NOT EXIST!

That would be a big timesaver and most appreciated.
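(In the meantime, here’s a hedged sketch of a do-it-yourself guard, not anything that Spring actually ships: a listener that checks whether each active profile has a matching properties file on the classpath and shouts if it doesn’t. The “application-<profile>.properties” naming is an assumption about a conventional layout, so adjust it to however your project stores its profiles.)

—————-

import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextRefreshedEvent;
import org.springframework.core.env.Environment;
import org.springframework.core.io.ClassPathResource;

// A homemade sanity check: complain loudly if an active profile has no backing properties file.
public class ProfileSanityCheck implements ApplicationListener<ContextRefreshedEvent> {

    @Override
    public void onApplicationEvent(ContextRefreshedEvent event) {
        Environment env = event.getApplicationContext().getEnvironment();
        for (String profile : env.getActiveProfiles()) {
            // Assumed convention: one properties file per profile under /resources
            ClassPathResource props = new ClassPathResource("application-" + profile + ".properties");
            if (!props.exists()) {
                System.err.println("THE TARGET PROFILE DOES NOT EXIST: " + profile);
            }
        }
    }
}

—————-

Register it as a bean, and it runs once the context finishes refreshing.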