Resist Bad Data, Part 2: How to Filter Incomplete XHTML Entities and Encodings

As I said in the previous piece, dealing with poorly formed XML is a necessary (though infuriating) part of the day for some people, including me. In that article, I described a regular expression you can use (with Perl) to cleanse your files of such garbage. That’s fine, especially when you’re scripting…but these days, I suspect most systems are less about batch jobs and more about application code (distributed architectures, microservices, etc.). So it’d be nice to have a solution that can live inside a platform built on a more robust programming language.

So, how do we do it in Java? Easy enough. We can reuse the same regular expression from last time, since Java’s regex support closely mirrors that of other languages (like Perl); the only visible difference is that backslashes are doubled inside the Java string literal:

—————-

// Needs: import java.io.IOException; import java.util.List;
java.nio.file.Path wiki_path = java.nio.file.Paths.get("C:/onix_test_data/test_files", "test_file_ONIX.xml");
java.nio.charset.Charset charset = java.nio.charset.Charset.forName("ISO-8859-1");

try {
    // Read the whole file into one string, normalizing line endings to "\n"
    StringBuilder FileContents = new StringBuilder();
    List<String> lines = java.nio.file.Files.readAllLines(wiki_path, charset);
    for (String line : lines) {
        FileContents.append(line).append("\n");
    }
    String sAllFileContents = FileContents.toString();

    // Keep complete entities (group 1); drop a bare "&#" and any digits trailing it
    String sUpdatedFileContents = sAllFileContents.replaceAll("(&#?x?[A-Za-z0-9]+;)|&#\\d*", "$1");

    // Write with the same charset used for reading, not the platform default
    try (java.io.PrintWriter out = new java.io.PrintWriter("C:/onix_test_data/output/test_file_ONIX.filtered.xml", charset.name())) {
        out.println(sUpdatedFileContents);
    }
} catch (IOException e) {
    System.out.println(e);
}

—————-
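To make the behavior of that expression concrete, here is a minimal, self-contained sketch that runs it against a few in-memory strings instead of a file. The class name `EntityFilterDemo`, the method name `filter`, and the sample strings are made up for illustration; only the regular expression comes from the code above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the article's code.
final class EntityFilterDemo {

    // Group 1 captures a complete reference (ends in ';').
    // The second alternative matches a bare "&#" plus any decimal digits after it.
    private static final Pattern INCOMPLETE =
            Pattern.compile("(&#?x?[A-Za-z0-9]+;)|&#\\d*");

    static String filter(String input) {
        // "$1" re-emits complete references. When group 1 did not take part in
        // the match, Java substitutes nothing, so the broken fragment vanishes.
        return INCOMPLETE.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(filter("Caf&#233; &amp; Bar")); // unchanged
        System.out.println(filter("Price &#163 10"));      // "&#163" is removed
        System.out.println(filter("Broken &# marker"));    // "&#" is removed
        System.out.println(filter("AT&T"));                // unchanged, no "&#"
    }
}
```

One limitation worth knowing: the second alternative only consumes decimal digits, so an unterminated hex reference such as `&#x41` loses its `&#` but leaves `x41` behind in the text.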

And what about C#? Well, as I’ve pointed out again and again, the .NET platform isn’t exactly a source of inspiration when it comes to handling XML. And, of course, it has to do everything its own way, regular expressions included. So my personal recommendation is to use the regular expression only for pattern matching and let a callback (a MatchEvaluator delegate) do the actual filtering:

—————-

// Requires: using System; using System.Text.RegularExpressions;
var sAlteredOutput = Regex.Replace(sOnixTestXml, @"&#?x?[A-Za-z0-9]*;?", FilterMatcher, RegexOptions.Singleline);

...

// Decides, match by match, what survives in the output
public static string FilterMatcher(Match m)
{
    string sResult = "";

    try
    {
        if (m.Value == "&#")
            sResult = "";                       // bare "&#": drop it
        else if (m.Value.StartsWith("&#") && (m.Value.Length > 2))
        {
            int nEncodingVal = 0;
            string sInsideEncoding = m.Value.Substring(2);

            if (m.Value.EndsWith(";"))
                sResult = m.Value;              // complete reference: keep it
            else if (!Int32.TryParse(sInsideEncoding, out nEncodingVal))
            {
                // Skip the leading run of 'x' and digit characters
                int nIdx;
                for (nIdx = 0; nIdx < sInsideEncoding.Length; ++nIdx)
                {
                    char cTemp = sInsideEncoding[nIdx];
                    if (!(cTemp == 'x') && !Char.IsDigit(cTemp)) break;
                }

                if (nIdx == sInsideEncoding.Length)
                    sResult = m.Value;          // nothing but 'x' and digits: leave as-is
                else
                    sResult = m.Value.Substring(nIdx + 2);  // strip the broken prefix, keep the text after it
            }
            else
                sResult = "";                   // digits with no ';': drop the fragment
        }
        else if (m.Value.StartsWith("&"))
            sResult = m.Value;                  // named entity or plain '&': keep it
    }
    catch (Exception)
    {
        sResult = m.Value;                      // on any failure, leave the text untouched
    }

    return sResult;
}

}

—————-

In fact, I’d rather use this solution with callbacks since I love the idea of having more programmatic control. But don’t tell Microsoft that. 😛
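For what it’s worth, the callback style isn’t exclusive to .NET. The sketch below is a rough Java port of the C# callback, built on `Matcher.appendReplacement`. The names `EntityCallbackFilter`, `filter`, and `decide` are hypothetical. It follows the same branches as the C# code, with one simplification I should flag: a run of digits is treated as a number without the `Int32` range check that `TryParse` performs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical port of the C# FilterMatcher callback; names are illustrative.
final class EntityCallbackFilter {

    // Same candidate pattern as the C# Regex.Replace call.
    private static final Pattern CANDIDATE =
            Pattern.compile("&#?x?[A-Za-z0-9]*;?");

    static String filter(String input) {
        Matcher m = CANDIDATE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps '$' and '\' in the result from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(decide(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Decides what a single match turns into.
    static String decide(String match) {
        if (!match.startsWith("&#")) return match;   // named entity or plain '&'
        if (match.equals("&#")) return "";           // bare "&#"
        if (match.endsWith(";")) return match;       // complete reference

        String inside = match.substring(2);
        // Assumption: all-digit text counts as a number (no Int32 overflow check).
        if (inside.chars().allMatch(Character::isDigit)) return "";

        // Skip the leading run of 'x' and digit characters.
        int idx = 0;
        while (idx < inside.length()
                && (inside.charAt(idx) == 'x' || Character.isDigit(inside.charAt(idx)))) {
            idx++;
        }
        // Only 'x' and digits: left as-is, like the C# version. Otherwise keep
        // just the text that follows the broken prefix.
        return idx == inside.length() ? match : inside.substring(idx);
    }
}
```

With this sketch, `EntityCallbackFilter.filter("&#12abc def")` returns `abc def`, while `Caf&#233; &amp; Bar` and `AT&T` pass through unchanged.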

And there you go. Two solutions to help you work around the various forms of ineptitude of data suppliers…of which there seems to be no end in sight.
