{"id":1180,"date":"2014-08-23T18:02:22","date_gmt":"2014-08-23T17:02:22","guid":{"rendered":"https:\/\/www.autoitconsulting.com\/site\/?p=1180"},"modified":"2025-07-26T09:28:55","modified_gmt":"2025-07-26T08:28:55","slug":"utf-8-utf-16-text-encoding-detection-library","status":"publish","type":"post","link":"https:\/\/www.autoitconsulting.com\/site\/development\/utf-8-utf-16-text-encoding-detection-library\/","title":{"rendered":"UTF-8 and UTF-16 Text Encoding Detection Library"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Overview<\/h2>\n\n\n\n<p>This post shows how to detect UTF-8 and UTF-16 text and presents a fully functional C++ and C# library that can be used to help with the detection.<\/p>\n\n\n\n<p>I recently had to upgrade the text file handling feature of <a title=\"AutoIt\" href=\"https:\/\/www.autoitscript.com\/site\/autoit\/\" target=\"_blank\" rel=\"noopener noreferrer\">AutoIt<\/a> to better handle text files where no <a title=\"Byte order mark\" href=\"http:\/\/en.wikipedia.org\/wiki\/Byte_order_mark\" target=\"_blank\" rel=\"noopener noreferrer\">byte order mark<\/a> (BOM) was present. The older version of the code I was using worked fine for UTF-8 files (with or without a BOM) but it wasn&#8217;t able to detect UTF-16 files without a BOM. I tried the <a title=\"IsTextUnicode on MSDN\" href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/windows\/desktop\/dd318672(v=vs.85).aspx\">IsTextUnicode<\/a> Win32 API function but this seemed extremely unreliable and wouldn&#8217;t detect UTF-16 Big-Endian text in my tests.<\/p>\n\n\n\n<p>Note that, especially for UTF-16 detection, there is always an element of ambiguity. 
<a title=\"Notepad Redux\" href=\"http:\/\/blogs.msdn.com\/b\/oldnewthing\/archive\/2007\/04\/17\/2158334.aspx\" target=\"_blank\" rel=\"noopener noreferrer\">This post<\/a> by Raymond Chen shows that however you try to detect an encoding, there will always be some sequence of bytes that makes your guesses look stupid.<\/p>\n\n\n\n<p>Here are the detection methods I&#8217;m currently using for the various types of text file. The checks are performed in the following order:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>BOM<\/li>\n\n\n\n<li>UTF-8<\/li>\n\n\n\n<li>UTF-16 (newline)<\/li>\n\n\n\n<li>UTF-16 (null distribution)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Downloads<\/h2>\n\n\n\n<p>The C# and C++ library can be downloaded from GitHub here:&nbsp;<a href=\"https:\/\/github.com\/AutoItConsulting\/text-encoding-detect\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/github.com\/AutoItConsulting\/text-encoding-detect<br><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1226\" src=\"https:\/\/www.autoitconsulting.com\/site\/wp-content\/uploads\/2018\/06\/download_github_106x51@2x.png\" alt=\"download_zip_106x51@2x\" width=\"106\" height=\"51\"><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">BOM Detection<\/h2>\n\n\n\n<p>If I find a BOM at the start of the file, I assume that it is valid. Although it&#8217;s possible that those bytes are just ANSI text that happens to look like a BOM, it&#8217;s highly unlikely. The BOMs are as follows:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><th>Encoding<\/th><th>BOM<\/th><\/tr><tr><td>UTF-8<\/td><td>0xEF, 0xBB, 0xBF<\/td><\/tr><tr><td>UTF-16 Little Endian<\/td><td>0xFF, 0xFE<\/td><\/tr><tr><td>UTF-16 Big Endian<\/td><td>0xFE, 0xFF<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">UTF-8 Detection<\/h2>\n\n\n\n<p>UTF-8 checking is reliable with a very low chance of false positives, so it is the first check performed after the BOM check. 
If the text is valid UTF-8 but all the characters are in the range <strong>0-127<\/strong> then this is essentially ASCII text and can be treated as such &#8211; in this case I don&#8217;t continue to check for UTF-16.<\/p>\n\n\n\n<p>If a byte is in the range <strong>0-127<\/strong> then it is a complete single-byte character and nothing more needs to be done. A first byte <strong>above 127<\/strong> starts a multibyte sequence that continues for the next 1, 2 or 3 bytes, depending on its value:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><tbody><tr><th>First byte<\/th><th>Number of bytes in sequence<\/th><\/tr><tr><td>0-127<\/td><td>1 byte<\/td><\/tr><tr><td>194-223<\/td><td>2 bytes<\/td><\/tr><tr><td>224-239<\/td><td>3 bytes<\/td><\/tr><tr><td>240-244<\/td><td>4 bytes<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Each additional (continuation) byte must be in the range <strong>128-191<\/strong>, and a first byte outside the ranges in the table (128-193 or 245-255) can never start a valid sequence. This scheme means that if we decode the byte stream using these rules and no unexpected sequences occur then this is almost certainly UTF-8 text.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">UTF-16 Detection<\/h2>\n\n\n\n<p>UTF-16 text is generally made up of 2-byte sequences (technically, there can be a 4-byte sequence with surrogate pairs). Depending on the endianness of the file, the Unicode character U+1234 could be represented in the byte stream as &#8220;0x12 0x34&#8221; (big endian) or &#8220;0x34 0x12&#8221; (little endian). The BOM is usually used to easily determine if the file is in big or little endian mode. Without a BOM this is a little more tricky to determine.<\/p>\n\n\n\n<p>I use two methods to try to determine whether the text is UTF-16 and, if so, its endianness. The first looks at the newline characters 0x0a and 0x0d. Depending on the endianness, a 0x0a will be encoded as &#8220;0x0a 0x00&#8221; (little endian) or &#8220;0x00 0x0a&#8221; (big endian). If every instance of these characters in a text file is encoded the same way then that is a good sign that the text is UTF-16, and it also tells us whether it is big or little endian. 
The drawback of this method is that it won&#8217;t work for very small amounts of text, or for files that don&#8217;t contain newlines.<\/p>\n\n\n\n<p>The second method relies on the fact that many files contain large amounts of pure ASCII text in the range 0-127. This applies especially to files generally used in IT like scripts and logs. When encoded in UTF-16, each of these characters is represented as the ASCII byte plus a null byte. For example, a space (0x20) would be encoded as &#8220;0x00 0x20&#8221; (big endian) or &#8220;0x20 0x00&#8221; (little endian). Depending on the endianness this will result in a large number of nulls in either the odd or the even byte positions. We just need to scan the file for these odd and even nulls, and if there is a significant percentage in the expected position then we can assume the text is UTF-16.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Library<\/h2>\n\n\n\n<p>The C# and C++ library can be downloaded from GitHub here:&nbsp;<a href=\"https:\/\/github.com\/AutoItConsulting\/text-encoding-detect\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/github.com\/AutoItConsulting\/text-encoding-detect<br><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-1226\" src=\"https:\/\/www.autoitconsulting.com\/site\/wp-content\/uploads\/2018\/06\/download_github_106x51@2x.png\" alt=\"download_zip_106x51@2x\" width=\"106\" height=\"51\"><\/a><\/p>\n\n\n\n<p>Using C# as the example, the two main public functions are:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 0 16px;font-size:0.8em;width:100%;text-align:left;background-color:#1E1E1E;font-style:italic;color:#D4D4D4\"><span style=\"border-bottom:1px solid rgba(234, 191, 191, 
0.2)\">C#<\/span><\/span><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>public Encoding CheckBOM(byte[] buffer, int size)\npublic Encoding DetectEncoding(byte[] buffer, int size)<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">public<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #4EC9B0\">Encoding<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #DCDCAA\">CheckBOM<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #569CD6\">byte<\/span><span style=\"color: #D4D4D4\">[] <\/span><span style=\"color: #9CDCFE\">buffer<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #569CD6\">int<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">size<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">public Encoding DetectEncoding(<\/span><span style=\"color: #569CD6\">byte<\/span><span style=\"color: #D4D4D4\">[] <\/span><span style=\"color: 
#9CDCFE\">buffer<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #569CD6\">int<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">size<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>These functions return the Encoding which is the following enum:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 0 16px;font-size:0.8em;width:100%;text-align:left;background-color:#1E1E1E;font-style:italic;color:#D4D4D4\"><span style=\"border-bottom:1px solid rgba(234, 191, 191, 0.2)\">C#<\/span><\/span><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>public enum Encoding\n{\n    None,               \/\/ Unknown or binary\n    ANSI,               \/\/ 0-255\n    ASCII,              \/\/ 0-127\n    UTF8_BOM,           \/\/ UTF8 with BOM\n    UTF8_NOBOM,         \/\/ UTF8 without BOM\n    UTF16_LE_BOM,       \/\/ UTF16 LE with BOM\n    UTF16_LE_NOBOM,     \/\/ UTF16 LE without BOM\n    UTF16_BE_BOM,       \/\/ UTF16-BE with BOM\n    UTF16_BE_NOBOM      \/\/ UTF16-BE without BOM\n}<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 
002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #569CD6\">public<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #569CD6\">enum<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #4EC9B0\">Encoding<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">None<\/span><span style=\"color: #D4D4D4\">,               <\/span><span style=\"color: #6A9955\">\/\/ Unknown or binary<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">ANSI<\/span><span style=\"color: #D4D4D4\">,               <\/span><span style=\"color: #6A9955\">\/\/ 0-255<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">ASCII<\/span><span style=\"color: #D4D4D4\">,              <\/span><span style=\"color: #6A9955\">\/\/ 0-127<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF8_BOM<\/span><span style=\"color: #D4D4D4\">,           <\/span><span style=\"color: #6A9955\">\/\/ UTF8 with BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF8_NOBOM<\/span><span style=\"color: #D4D4D4\">,         <\/span><span style=\"color: #6A9955\">\/\/ UTF8 without BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF16_LE_BOM<\/span><span style=\"color: #D4D4D4\">,       
<\/span><span style=\"color: #6A9955\">\/\/ UTF16 LE with BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF16_LE_NOBOM<\/span><span style=\"color: #D4D4D4\">,     <\/span><span style=\"color: #6A9955\">\/\/ UTF16 LE without BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF16_BE_BOM<\/span><span style=\"color: #D4D4D4\">,       <\/span><span style=\"color: #6A9955\">\/\/ UTF16-BE with BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">UTF16_BE_NOBOM<\/span><span style=\"color: #D4D4D4\">      <\/span><span style=\"color: #6A9955\">\/\/ UTF16-BE without BOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>The DetectEncoding function takes a byte buffer and a size parameter. The larger the buffer that is used, the more accurate the result will be. 
I&#8217;d recommend at least 4KB.<\/p>\n\n\n\n<p>Here is an example of passing a buffer to the DetectEncoding function:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:flex;align-items:center;padding:10px 0px 0 16px;font-size:0.8em;width:100%;text-align:left;background-color:#1E1E1E;font-style:italic;color:#D4D4D4\"><span style=\"border-bottom:1px solid rgba(234, 191, 191, 0.2)\">C#<\/span><\/span><span role=\"button\" tabindex=\"0\" style=\"color:#D4D4D4;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><pre class=\"code-block-pro-copy-button-pre\" aria-hidden=\"true\"><textarea class=\"code-block-pro-copy-button-textarea\" tabindex=\"-1\" aria-hidden=\"true\" readonly>\/\/ Detect encoding\nvar textDetect = new TextEncodingDetect();\nTextEncodingDetect.Encoding encoding = textDetect.DetectEncoding(buffer, buffer.Length);\n\nConsole.Write(\"Encoding: \");\nif (encoding == TextEncodingDetect.Encoding.None)\n{\n    Console.WriteLine(\"Binary\");\n}\nelse if (encoding == TextEncodingDetect.Encoding.ASCII)\n{\n    Console.WriteLine(\"ASCII (chars in the 0-127 range)\");\n}\nelse if (encoding == TextEncodingDetect.Encoding.ANSI)\n{\n    Console.WriteLine(\"ANSI (chars in the range 0-255 range)\");\n}\nelse if (encoding == TextEncodingDetect.Encoding.UTF8_BOM || encoding == TextEncodingDetect.Encoding.UTF8_NOBOM)\n{\n    Console.WriteLine(\"UTF-8\");\n}\nelse if (encoding == TextEncodingDetect.Encoding.UTF16_LE_BOM || encoding == TextEncodingDetect.Encoding.UTF16_LE_NOBOM)\n{\n    Console.WriteLine(\"UTF-16 Little Endian\");\n}\nelse if (encoding == TextEncodingDetect.Encoding.UTF16_BE_BOM || encoding == TextEncodingDetect.Encoding.UTF16_BE_NOBOM)\n{\n    
Console.WriteLine(\"UTF-16 Big Endian\");\n}<\/textarea><\/pre><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki dark-plus\" style=\"background-color: #1E1E1E\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #6A9955\">\/\/ Detect encoding<\/span><\/span>\n<span class=\"line\"><span style=\"color: #569CD6\">var<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">textDetect<\/span><span style=\"color: #D4D4D4\"> = <\/span><span style=\"color: #569CD6\">new<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #4EC9B0\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">();<\/span><\/span>\n<span class=\"line\"><span style=\"color: #4EC9B0\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #4EC9B0\">Encoding<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> = <\/span><span style=\"color: #9CDCFE\">textDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">DetectEncoding<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #9CDCFE\">buffer<\/span><span style=\"color: #D4D4D4\">, <\/span><span style=\"color: #9CDCFE\">buffer<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Length<\/span><span style=\"color: 
#D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">Write<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;Encoding: &quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">None<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;Binary&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">else<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">ASCII<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\"> 
   <\/span><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;ASCII (chars in the 0-127 range)&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">else<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">ANSI<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;ANSI (chars in the range 0-255 range)&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">else<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF8_BOM<\/span><span style=\"color: 
#D4D4D4\"> || <\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF8_NOBOM<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;UTF-8&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">else<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF16_LE_BOM<\/span><span style=\"color: #D4D4D4\"> || <\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF16_LE_NOBOM<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">Console<\/span><span 
style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;UTF-16 Little Endian&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span>\n<span class=\"line\"><span style=\"color: #C586C0\">else<\/span><span style=\"color: #D4D4D4\"> <\/span><span style=\"color: #C586C0\">if<\/span><span style=\"color: #D4D4D4\"> (<\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF16_BE_BOM<\/span><span style=\"color: #D4D4D4\"> || <\/span><span style=\"color: #9CDCFE\">encoding<\/span><span style=\"color: #D4D4D4\"> == <\/span><span style=\"color: #9CDCFE\">TextEncodingDetect<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">Encoding<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #9CDCFE\">UTF16_BE_NOBOM<\/span><span style=\"color: #D4D4D4\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">{<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">    <\/span><span style=\"color: #9CDCFE\">Console<\/span><span style=\"color: #D4D4D4\">.<\/span><span style=\"color: #DCDCAA\">WriteLine<\/span><span style=\"color: #D4D4D4\">(<\/span><span style=\"color: #CE9178\">&quot;UTF-16 Big Endian&quot;<\/span><span style=\"color: #D4D4D4\">);<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D4D4D4\">}<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Null and Binary Handling<\/h2>\n\n\n\n<p>One quirk of the library is how I chose to handle nulls (0x00). 
These are technically valid in UTF-8 text, but by default I&#8217;ve assumed that any file that contains a null is not ANSI\/ASCII\/UTF-8. Allowing nulls in UTF-8 can cause a false positive, where UTF-16 text containing only ASCII characters appears to be valid UTF-8. To disable this behaviour just set the&nbsp;<strong>NullSuggestsBinary<\/strong> property on the library to&nbsp;<strong>false<\/strong>&nbsp;before calling&nbsp;<strong>DetectEncoding<\/strong>. In practice, most text files don&#8217;t contain nulls and the defaults work well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Overview This post shows how to detect UTF-8 and UTF-16 text and presents a fully functional C++ and C# library that can be used to help with the detection. I recently had to upgrade the text file handling feature of AutoIt to better handle text files where no byte order mark (BOM) was present. The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1181,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[53,52,90,49,51,50],"class_list":["post-1180","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-development","tag-csharp","tag-cplusplus","tag-development","tag-unicode","tag-utf-16","tag-utf-8"],"_links":{"self":[{"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/posts\/1180","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/comments?post=1180"}],"version-history":[{"count":12,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/posts\/1180\/revis
ions"}],"predecessor-version":[{"id":100096,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/posts\/1180\/revisions\/100096"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/media\/1181"}],"wp:attachment":[{"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/media?parent=1180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/categories?post=1180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.autoitconsulting.com\/site\/wp-json\/wp\/v2\/tags?post=1180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}