.net, C#, Code, Microsoft, Standards, Technical, XML

Strip Illegal XML Characters based on W3C standard

W3C has defined a set of illegal characters for use in XML . You can find info about the same here:

XML 1.0 | XML 1.1

Here is a function to remove these characters from a specified XML file:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

namespace XMLUtils
    class Standards
        /// <summary>
        /// Strips non-printable ascii characters 
        /// Refer to http://www.w3.org/TR/xml11/#charsets for XML 1.1
        /// Refer to http://www.w3.org/TR/2006/REC-xml-20060816/#charsets for XML 1.0
        /// </summary>
        /// <param name="filePath">Full path to the File</param>
        /// <param name="XMLVersion">XML Specification to use. Can be 1.0 or 1.1</param>
        private void StripIllegalXMLChars(string filePath, string XMLVersion)
            //Remove illegal character sequences
            string tmpContents = File.ReadAllText(filePath, Encoding.UTF8);

            string pattern = String.Empty;
            switch (XMLVersion)
                case "1.0":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F])";
                case "1.1":
                    pattern = @"#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF])";
                    throw new Exception("Error: Invalid XML Version!");

            Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
            if (regex.IsMatch(tmpContents))
                tmpContents = regex.Replace(tmpContents, String.Empty);
                File.WriteAllText(filePath, tmpContents, Encoding.UTF8);
            tmpContents = string.Empty;