How do you handle XML and HTML text files whether your strings in memory are UNICODE/wchar_t, MBCS, or plain C? Loading and saving text in a Unicode or ANSI file may involve a conversion to and from the text encoding of your C++ strings.
Hey, in case you are just trying to use a text editor to save your ANSI file into a Unicode encoding, see Convert ANSI file to Unicode.
UTF-8 is the recommended encoding for XML and HTML files. If your text is all ASCII, then it is also valid UTF-8. But if your XML file is not ASCII and not Unicode (i.e. not UTF-8 or UTF-16/UCS-2), then you really should use an XML declaration to declare your encoding. For example, if it is the default U.S. ANSI charset, use this declaration:
<?xml version="1.0" encoding="windows-1252"?>
In Windows programming, the term ANSI is used to refer collectively to all the non-Unicode single and multibyte character sets that can be selected as the system locale code page. These include the single-byte character sets for Europe and the "double-byte" character sets for Chinese, Japanese, and Korean, which actually use one or two bytes per character.
In character sets like Windows-1252 and Cyrillic Windows-1251, a character is always one byte with a value up to 255. The problem is that values 128 to 255 (hex 80 to FF) are assigned to specific characters that differ between charsets. For example, the Euro sign (€) is hex 80 (decimal 128) in Windows-1252, but in Windows-1251 hex 80 represents the capital letter DJE (Ђ) and hex 88 is the Euro. So when the computer sees the value hex 80, how it displays that character depends on the current system locale setting.
| Character | Windows-1252 Latin-1 | Windows-1251 Cyrillic | UTF-8 | UTF-16 |
|---|---|---|---|---|
| € (Euro) | 80 (128) | 88 (136) | E2 82 AC | 20AC |
| ® (Registered) | AE (174) | AE (174) | C2 AE | 00AE |
In double-byte character sets like GB2312 (Chinese Simplified), a character in the ASCII range is one byte, but other characters can be one or two bytes (some byte values over hex 80 are lead bytes that are interpreted together with a trailing byte). For example, the value of the ASCII character z does not change (much) between the different encodings, but the sample Chinese character is completely different:
| Character | GB2312 | UTF-8 | UTF-16 |
|---|---|---|---|
| z | 7A | 7A | 007A |
| 中 (middle) | D6 D0 | E4 B8 AD | 4E2D |
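For the curious, the UTF-8 bytes in the table can be checked by hand: a three-byte UTF-8 sequence carries 4 + 6 + 6 bits of the code point. A minimal sketch (the byte values are taken from the table above):

```cpp
#include <cstdio>

int main()
{
    // Decode the 3-byte UTF-8 sequence E4 B8 AD from the table above.
    // Byte 1 (1110xxxx) holds 4 bits; bytes 2-3 (10xxxxxx) hold 6 bits each.
    unsigned char b1 = 0xE4, b2 = 0xB8, b3 = 0xAD;
    unsigned int codepoint = ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
    printf("U+%04X\n", codepoint); // prints U+4E2D, matching the UTF-16 column
    return 0;
}
```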
Windows systems allow non-UNICODE programs to work with one ANSI character set at a time. To change charsets, you have to change the computer's system locale setting (that is one reason Unicode is great: you do not have to do this).
In terms of the charset of your in-memory text, there are 3 ways to build your Windows C++ program (see the sketch after this list):

- UNICODE means wide char UTF-16 (or its precursor UCS-2) strings.
- MBCS means the Windows system locale charset, single byte or multibyte (DBCS).
- Neither define means plain char strings, which CMarkup treats as UTF-8 (see the UTF-8 section below).

OS X and Linux C++ programs do not have the ANSI and UNICODE modes that Windows programs do. But you can use char-based strings, which are usually UTF-8, or wchar_t-based UTF-32 strings. The corresponding STL string classes are std::string and std::wstring. Although I recommend UTF-8, which requires less space and fewer conversions, CMarkup supports wide strings on these platforms with the MARKUP_WCHAR define.
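As a rough illustration of what each build mode implies for your string type (this is a simplification for orientation, not CMarkup's actual typedefs, which use MCD_STR and more platform detail):

```cpp
#include <string>

// Simplified sketch of the three build modes described above.
#if defined(MARKUP_WCHAR) || defined(UNICODE)
typedef std::wstring DocString; // wide chars: UTF-16 on Windows, UTF-32 on OS X/Linux
#elif defined(_MBCS) || defined(MBCS)
typedef std::string DocString;  // bytes in the system locale ANSI/DBCS charset
#else
typedef std::string DocString;  // bytes treated as UTF-8 (CMarkup's default)
#endif
```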
Update December 17, 2008: With CMarkup release 10.1, the Save and Load methods, and the underlying WriteTextFile and ReadTextFile functions have greatly expanded character conversion capabilities to handle most common ANSI and double-byte encodings specified in the XML declaration or HTML Content-Type meta tag (see GetDeclaredEncoding) of the document.
Whichever charset you use in memory, CMarkup converts between that and the encoding of your file when you read or write the file.
In Windows, the CMarkup text conversion functionality uses the MultiByteToWideChar and WideCharToMultiByte Windows APIs; see the preprocessor define MARKUP_WINCONV. In Visual C++, MARKUP_WINCONV is selected automatically. In g++ for cygwin and other compilers for Windows, add MARKUP_WINCONV to your preprocessor defines or specify -DMARKUP_WINCONV on the command line.
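As a sketch of what these two APIs do (this is not CMarkup's internal code), here is the Windows-1252 Euro byte from the first table converted to UTF-8 by way of UTF-16:

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Windows-1252 byte 0x80 is the Euro sign (see the first table above).
    const char ansi[] = "\x80";

    // ANSI -> UTF-16: code page 1252 is passed explicitly; CP_ACP would
    // use the current system locale code page instead.
    wchar_t wide[4] = {0};
    MultiByteToWideChar(1252, 0, ansi, -1, wide, 4);

    // UTF-16 -> UTF-8
    char utf8[8] = {0};
    WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, 8, NULL, NULL);

    // Prints 20AC then E2 82 AC, matching the table.
    printf("UTF-16: %04X\n", (unsigned short)wide[0]);
    printf("UTF-8: %02X %02X %02X\n",
        (unsigned char)utf8[0], (unsigned char)utf8[1], (unsigned char)utf8[2]);
    return 0;
}
```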
If not on Windows, CMarkup uses the iconv API available on OS X, Linux and some other platforms; see the preprocessor define MARKUP_ICONV. The g++ GNU compiler selects MARKUP_ICONV automatically; put MARKUP_ICONV in your preprocessor defines if needed. On OS X you may need to specify the iconv library to the linker. The following command is used to compile the CMarkup test program on OS X:
g++ main.cpp Markup.cpp MarkupTest.cpp -liconv
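A minimal sketch of the same kind of conversion through iconv (again, not CMarkup's internal code; the encoding names accepted depend on your iconv implementation, and some older platforms declare the input buffer parameter as const char**, which would need a cast here):

```cpp
#include <iconv.h>
#include <cstdio>

int main()
{
    // Convert the Windows-1251 Euro byte (0x88, per the first table) to UTF-8.
    iconv_t cd = iconv_open("UTF-8", "WINDOWS-1251");
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    char in[] = "\x88";
    char out[8] = {0};
    char* pin = in;
    char* pout = out;
    size_t inleft = 1, outleft = sizeof(out);
    if (iconv(cd, &pin, &inleft, &pout, &outleft) == (size_t)-1)
        perror("iconv");
    iconv_close(cd);

    // Prints E2 82 AC, matching the UTF-8 column of the table.
    printf("%02X %02X %02X\n",
        (unsigned char)out[0], (unsigned char)out[1], (unsigned char)out[2]);
    return 0;
}
```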
Define MARKUP_STDCONV to use neither the Windows conversion APIs nor iconv. See non-Unicode text handling in CMarkup.
If you do not use either MARKUP_WINCONV or MARKUP_ICONV, CMarkup still supports conversion between Unicode encodings (UTF-8, UTF-16, and wchar_t, which is UTF-32 on OS X and Linux), as well as the system locale encoding if you call setlocale. A non-Unicode encoding is supported with the ANSI C mbtowc and wctomb functions to convert to/from Unicode.
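A minimal sketch of that ANSI C route (it assumes the system locale uses a charset that defines the byte being converted; in a UTF-8 locale a lone 0x80 is invalid and mbtowc fails):

```cpp
#include <clocale>
#include <cstdlib>
#include <cstdio>

int main()
{
    // Adopt the system locale charset; without this, mbtowc only
    // understands the minimal "C" locale.
    setlocale(LC_ALL, "");

    const char mb[] = "\x80"; // meaningful only if the locale charset defines 0x80
    wchar_t wc = 0;
    int len = mbtowc(&wc, mb, 1); // returns bytes consumed, or -1 on failure
    if (len > 0)
        printf("wide char value: %04X\n", (unsigned int)wc);
    else
        printf("not a valid character in this locale\n");
    return 0;
}
```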
The MBCS define gives the default build mode in older Windows Visual C++ projects, using the system locale ANSI code page in strings and Windows APIs. It means that your strings in memory are not Unicode.
CMarkup is a bit slower in an MBCS build than in a non-MBCS build because it must compute the character length (1 or 2 bytes) as it processes strings. This is not the case with UTF-8: although UTF-8 is multibyte, its design makes the character length computation unnecessary during most string processing. See the next section on using UTF-8 internally.
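The design point is that every byte of a multibyte UTF-8 character has its high bit set (trail bytes are always hex 80 to BF), so a plain byte scan can never mistake the middle of a character for an ASCII byte. A sketch of why markup scanning needs no per-character bookkeeping:

```cpp
#include <cstdio>
#include <cstring>

// Find the next '<' in a UTF-8 string with a plain byte scan. In some
// double-byte charsets (Shift-JIS, Big5, GBK) a trail byte CAN fall in the
// ASCII range, so a byte scan there must track lead bytes; UTF-8 never has
// that collision.
const char* FindMarkup(const char* utf8)
{
    return strchr(utf8, '<');
}

int main()
{
    const char doc[] = "\xE4\xB8\xAD<tag/>"; // "中<tag/>" in UTF-8
    printf("found at offset %d\n", (int)(FindMarkup(doc) - doc)); // offset 3
    return 0;
}
```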
If a Unicode file is loaded into an MBCS build, you will lose any characters not supported by the charset being used in memory. Likewise, if the file is in a different character set than the one in memory, you can lose characters. The conversion process generally replaces lost characters with a question mark and reports in the result string (GetResult) that characters were lost during conversion.
FYI, the automatic conversion from UTF-8 to ANSI in memory in an MBCS build was implemented in CMarkup developer release 7.3, and made part of the evaluation version in release 9.0. CMarkup release 10.1 further adds the ability to automatically convert from an ANSI file even if it does not correspond to the system locale ANSI encoding.
If you do not have MBCS (nor UNICODE nor MARKUP_WCHAR) in your compiler defines, CMarkup is designed to use UTF-8 internally. This allows you to keep the document in Unicode in memory and avoid the multi-language text loss described above for the MBCS build. You should convert strings to ANSI only as needed for displaying or passing to Windows APIs, with AToUTF8 and UTF8ToA.
If your file is UTF-8 (as recommended for XML), this also allows you to keep the document as UTF-8 in memory without conversion on load and save.
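For example, a load/save round trip in such a build might look like the following sketch; it assumes a default (non-MBCS, non-UNICODE) build where strings are std::string holding UTF-8, and the file name is hypothetical:

```cpp
#include "Markup.h"
#include <cstdio>

int main()
{
    CMarkup xml;
    if ( ! xml.Load("doc.xml") ) // a UTF-8 file loads without conversion
    {
        // the result string reports load and conversion issues
        printf("%s\n", xml.GetResult().c_str());
        return 1;
    }
    // ... navigate and modify the document as UTF-8 in memory ...
    xml.Save("doc.xml"); // written back out as UTF-8
    return 0;
}
```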
UTF-8 is the recommended Unicode encoding for XML, but sometimes UTF-16 is used. CMarkup will detect UTF-16 files (this includes UCS-2) containing a Byte Order Mark (BOM). See UTF-16 Files and the Byte Order Mark (BOM).
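For reference, the BOM check amounts to inspecting the first bytes of the file; a minimal sketch (the file name is hypothetical):

```cpp
#include <cstdio>

int main()
{
    unsigned char b[3] = {0};
    FILE* f = fopen("doc.xml", "rb");
    if (!f) return 1;
    size_t n = fread(b, 1, 3, f);
    fclose(f);

    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        printf("UTF-16 little endian BOM\n");
    else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        printf("UTF-16 big endian BOM\n");
    else if (n == 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        printf("UTF-8 BOM (preamble)\n");
    else
        printf("no BOM; check the XML declaration\n");
    return 0;
}
```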
See Also:
UTF-8 Files and the Preamble
Setting the XML Declaration With CMarkup
wchar_t string on Linux, OS X and Windows
Re: Upgrade to 10.1
David Emmerich 28-Nov-2008
We had discovered a problem reading a UTF-8 XML file and I was researching if CMarkup had a way of converting UTF-8 to ANSI. When I noticed the 10.1 version on your web site, I figured I had better try the latest before doing anything else. I downloaded it and [replaced 6.5] with no change in my code, everything just worked, the conversion happening automatically. Good job!