Strip Illegal XML Characters based on W3C standard

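The W3C defines which characters are legal in XML documents. The relevant sections of the specifications are:

XML 1.0: http://www.w3.org/TR/2006/REC-xml-20060816/#charsets | XML 1.1: http://www.w3.org/TR/xml11/#charsets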

Here is a function to remove these characters from a specified XML file:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace XMLUtils
{
    class Standards
    {
        /// <summary>
        /// Strips hexadecimal sequences (e.g. "#x7F") that name characters
        /// restricted or discouraged by the W3C XML specifications.
        /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
        /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
        /// </summary>
        /// <param name="filePath">Full path to the File</param>
        /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
        private void StripIllegalXMLChars(string filePath, string XMLVersion)
        {
            // Read the whole file as UTF-8, then remove illegal character sequences
            string tmpContents = File.ReadAllText(filePath, Encoding.UTF8);

            string pattern = String.Empty;
            switch (XMLVersion)
            {
                case "1.0":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]|9[0-9A-F])";
                    break;
                case "1.1":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                    break;
                default:
                    throw new ArgumentException("Error: Invalid XML Version!", "XMLVersion");
            }

            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            if (regex.IsMatch(tmpContents))
            {
                tmpContents = regex.Replace(tmpContents, String.Empty);
                File.WriteAllText(filePath, tmpContents, Encoding.UTF8);
            }
            tmpContents = string.Empty;
        }
    }
}
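The function above removes textual sequences such as "#x7F" from the file. If the goal is instead to drop the raw characters themselves, a minimal sketch along the following lines may help. It assumes the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) is the rule to enforce; the class name CharFilter and both method names are hypothetical and not part of the original code.

```csharp
using System;
using System.Text;

namespace XMLUtils
{
    // Hypothetical helper, shown only as an illustration.
    static class CharFilter
    {
        // True when the code point is allowed by the XML 1.0 Char production.
        public static bool IsLegalXml10(int codePoint)
        {
            return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
        }

        // Returns a copy of the input with every illegal code point dropped.
        public static string StripIllegalXml10Chars(string input)
        {
            if (input == null) throw new ArgumentNullException("input");

            StringBuilder sb = new StringBuilder(input.Length);
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];

                // A valid surrogate pair encodes a single code point above #xFFFF.
                if (char.IsHighSurrogate(c) && i + 1 < input.Length
                    && char.IsLowSurrogate(input[i + 1]))
                {
                    int codePoint = char.ConvertToUtf32(c, input[i + 1]);
                    if (IsLegalXml10(codePoint))
                    {
                        sb.Append(c).Append(input[i + 1]);
                    }
                    i++; // skip the low surrogate that was just consumed
                }
                else if (IsLegalXml10(c))
                {
                    // Lone surrogates (#xD800-#xDFFF) fail this test and are dropped.
                    sb.Append(c);
                }
            }
            return sb.ToString();
        }
    }
}
```

For example, CharFilter.StripIllegalXml10Chars("a\u0001b") returns "ab". The filter works on an in-memory string, so it can be applied to the text returned by File.ReadAllText before writing the file back.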

18 Responses

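  1. Here is the PHP version:

    function strip_invalid_xml_chars2( $in )
    {
        $out = "";
        $length = strlen($in);

        for ( $i = 0; $i < $length; $i++ )
        {
            // Byte value of the current character
            $current = ord($in[$i]);

            // Keep only values allowed by the XML 1.0 Char production
            if ( ($current == 0x9) || ($current == 0xA) || ($current == 0xD)
                || (($current >= 0x20) && ($current <= 0xD7FF))
                || (($current >= 0xE000) && ($current <= 0xFFFD))
                || (($current >= 0x10000) && ($current <= 0x10FFFF)) )
            {
                $out .= chr($current);
            }
            else
            {
                // Replace anything else with a space
                $out .= " ";
            }
        }

        return $out;
    }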

  2. Hi Ramesh,
    I just want the code for trimming non-printable characters.
    The one placed on this page is going beyond the margins.

    I would appreciate it if you could just send it across to my mail ID mentioned above.

    thanks,
    Nagesh

  3. Hi Nagesh,

    I have sent the code via email as requested.

    Cheers,
    Balaji

  4. Hi Ramesh,
    This class is just what I’ve been looking for. Would you mind mailing it to me as well?

    Thanks,
    Ryan

  5. Hi Ryan,

    I have mailed across the code.

    Cheers,
    Balaji

  6. Hi, Ryan ……

    Any chance at this late date you could mail me a copy of this code as well?

    And thanks very much for providing it.

  7. Hi, Balaji ……

    Any chance at this late date you could mail me a copy of this code as well?

    And thanks very much for providing it.

  8. Got it, Balaji — thanks very much for sharing it.

    Bud

  9. Hello, this is great.

    Can you send it to me as well? Thank you.

  10. Hi, would it be possible to get the code? Thanks very much.

  11. hi Balaji

    Could you please email the code as I am unable to view it due to margin issues on the webpage? I would really appreciate that.

    Thanks again

  12. Hi!

    I tried implementing your function, but now I get an error saying “Illegal characters in path”. Any suggestions please?

    many thanks,
    TS

  13. Hello Balaji,

    I just came across your implementation. Is it possible to email me your solution? Thanks very much.

    -Minhas
