Unicode and UTF-8 in Str Library
This article provides a brief overview of Unicode and extended character set support in Str Library, and includes a few links to more in-depth information on the web.
Unicode can be characterized as a classification system for the world's many scripts and character sets. It starts with the simple Latin alphabet (commonly contained in the ASCII subset, which can be represented with 7 bits per character) and continues through non-Latin but still simple character sets (Cyrillic, Turkish Latin extensions, European Latin extensions, and so on). It also covers special characters, very complex sets (such as Far Eastern languages that rely on ideograms), dead languages studied by scholars, and even extensions such as the invented languages from J.R.R. Tolkien's books.
Unicode is often mistaken for an encoding scheme. It is not one: it is a standard that defines all characters used in some form or another throughout the world. Unicode itself can be encoded in many ways. The two most widely used encodings are:
For a number of reasons, we recommend that programs on all platforms that need to support international characters use the UCS-2 or UCS-4 encoding with Str Library. That is, define the symbol STR_UNICODE and assume that each character has a fixed-length internal representation of exactly 16 bits (32 bits on Linux).
However, for applications that must use an 8-bit encoding, we also support UTF-8. The programmer must take more care when using UTF-8 characters with Str Library, especially when dealing with single characters rather than whole strings, because one character may occupy several bytes. We have attempted to place appropriate cautionary notes in this documentation on methods and datatypes where special care must be taken when using UTF-8.
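The reason single characters need extra care in UTF-8 is that the number of bytes and the number of characters in a string can differ. A minimal sketch in standard C++ follows; `utf8_code_point_count` is a hypothetical helper written for this illustration, not a Str Library function, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count the Unicode code points in a UTF-8 encoded byte string.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new character. Assumes the input is valid UTF-8.
std::size_t utf8_code_point_count(const std::string& bytes) {
    std::size_t count = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte
        }
    }
    return count;
}
```

For example, the byte string "\xC3\xA9" (the letter é) has a length of 2 bytes but contains only 1 character, so indexing it byte by byte would split the character in half.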
All simple (non-class based) characters in Str Library are represented with one of the following C-compatible datatypes:
For performance reasons, the Char class in Str Library holds a single character represented by one STRCHAR value. Therefore, if you build the library in ANSI mode, the Char class is not suitable for cases where a UTF-8 encoded character longer than 1 byte may occur.
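To see why a single 8-bit storage unit cannot hold every UTF-8 character, it helps to encode a few code points by hand. The sketch below uses only standard C++; `utf8_encode` and `fits_in_one_byte` are hypothetical names chosen for this example and are not part of Str Library.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8. Returns an empty string for
// values that are not valid scalar values (surrogates or above 0x10FFFF).
std::string utf8_encode(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte: ASCII
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogate: reject
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// True when the code point occupies exactly one byte in UTF-8, that is,
// when it would fit in a single 8-bit character slot.
bool fits_in_one_byte(char32_t cp) {
    return utf8_encode(cp).size() == 1;
}
```

Only code points below 0x80 (plain ASCII) pass `fits_in_one_byte`; anything else, such as U+00E9 or U+20AC, needs two or more bytes and therefore cannot be stored in one 8-bit character.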
String and character class methods can generally be classified as two types: locale independent and locale dependent.
The majority of methods a program will use are locale-independent. These include copying strings, concatenating strings, removing individual characters from a string, and almost all other operations.
A small number of methods are locale-dependent. They differ from the previous category in that they may behave differently depending on the selected system locale. This applies to Unix, Linux, and Windows environments alike. These methods are:
Str Library uses the underlying C RTL exclusively on Unix / Linux platforms. On Windows, it uses Windows API functions in some places and C RTL functions in others. This has no effect on the operation of the application, because on Windows the C RTL ultimately relies on Windows locale information itself.
| To... | Under Unix / Linux | Under Windows |
| Select a locale | Call the standard C setlocale function, and Str Library will automatically adjust itself to the selected locale. Either use the default locale taken from the OS environment variables, or explicitly name your own (and be sure your system has prebuilt locale information corresponding to your choice). | Use the default system locale, or select an application locale with the Windows API. |
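On Unix / Linux, selecting a locale through the C runtime might look like the sketch below. `select_locale` is a hypothetical wrapper written for this illustration; only `std::setlocale` itself is a standard function. How Str Library picks up the change internally is not shown here.

```cpp
#include <clocale>

// Select a locale for the C runtime. Pass "" to take the locale from the
// environment (LANG, LC_ALL, and related variables), or pass an explicit
// locale name. Returns false when the requested locale is not available
// on this system, in which case the previous locale stays in effect.
bool select_locale(const char* name) {
    return std::setlocale(LC_ALL, name) != nullptr;
}
```

A program would typically call `select_locale("")` once at startup to honor the user's environment. Explicit names such as "tr_TR.UTF-8" are system-specific, so the return value should be checked before relying on them.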
Note that the current locale typically defines rules for a single cultural environment only. For example, if you have selected a Turkish locale, the collating sequences will be correct for Turkish. However, if you define STR_UNICODE and rely on UCS-2 / UCS-4 encoding within the application, you can still process all languages correctly; the only Turkish-specific behavior is the collation sequence. If your data also happens to contain Hebrew letters, they will still be recognized correctly by IsAlpha / IsUpper / etc.
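The locale dependence of collation can be observed with the standard C runtime directly. The sketch below is not Str Library code; `compare_collated` is a hypothetical helper that simply forwards to the standard `std::wcscoll` function, whose result follows the LC_COLLATE category of whatever locale is currently selected.

```cpp
#include <clocale>
#include <cwchar>
#include <string>

// Compare two wide strings using the collation rules of the current
// LC_COLLATE locale. The result is negative when a sorts before b,
// zero when they collate equally, and positive when a sorts after b.
int compare_collated(const std::wstring& a, const std::wstring& b) {
    return std::wcscoll(a.c_str(), b.c_str());
}
```

In the default "C" locale this comparison is a plain code-value comparison. After a call such as `std::setlocale(LC_ALL, "")`, the same call may order accented or language-specific letters differently, depending on the locale data installed on the system.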
No matter which locale is chosen, regular 7-bit ASCII characters are always processed uniformly on all systems.
On Windows, a convention is used to distinguish text files (or just chunks of text data) encoded in UCS-2 from those encoded in ASCII, UTF-8, or some MBCS character set. This practice is also spreading to Unix / Linux systems, so it may be appropriate for cross-platform code too. The algorithm for reading text data is as follows:
Str Library will not prepend or attempt to analyze this optional two-byte header by itself.
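Since the library leaves the header to the application, a program has to inspect it itself. One way to do that is sketched below in standard C++, under the assumption that the header is the usual byte order mark: 0xFF 0xFE for little-endian 16-bit data and 0xFE 0xFF for big-endian. The `TextEncoding` enum and `detect_bom` function are hypothetical names for this example only.

```cpp
#include <cstddef>

enum class TextEncoding { Utf16LittleEndian, Utf16BigEndian, EightBit };

// Inspect the first two bytes of a buffer for a 16-bit byte order mark.
// Without a header the data is treated as 8-bit text (ASCII, UTF-8, or
// an MBCS character set); telling those apart is outside this sketch.
TextEncoding detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE) {
            return TextEncoding::Utf16LittleEndian;
        }
        if (data[0] == 0xFE && data[1] == 0xFF) {
            return TextEncoding::Utf16BigEndian;
        }
    }
    return TextEncoding::EightBit;
}
```

When a header is found, the caller would skip those two bytes before handing the remaining data to the string classes.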
Also note that when you use Str Library in Unicode mode, all 16-bit characters are represented internally in the native byte order of the CPU of that machine. So if you write out string data from a Unix application that may be picked up by a Windows application, make sure you emit the header that matches the byte order you write: 0xFF 0xFE for little-endian data or 0xFE 0xFF for big-endian data. When reading, be prepared to swap the high-order and low-order byte of each 16-bit value if the header indicates that the endianness of the file data differs from the endianness of the CPU your program is running on.
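The writing and byte-swapping steps can be sketched in standard C++ as follows. All three function names are hypothetical and exist only for this illustration; the sketch also uses `std::u16string` as a stand-in for 16-bit string data rather than any Str Library class.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Swap the high-order and low-order byte of one 16-bit value.
std::uint16_t swap_bytes16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// True when this CPU stores the low-order byte first.
bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Serialize 16-bit characters in native byte order, preceded by U+FEFF.
// Written natively, U+FEFF comes out as 0xFF 0xFE on a little-endian CPU
// and as 0xFE 0xFF on a big-endian CPU, so the header always matches
// the byte order of the data that follows it.
std::string serialize_with_bom(const std::u16string& text) {
    std::string out;
    const char16_t bom = 0xFEFF;
    out.reserve((text.size() + 1) * sizeof(char16_t));
    out.append(reinterpret_cast<const char*>(&bom), sizeof(char16_t));
    out.append(reinterpret_cast<const char*>(text.data()),
               text.size() * sizeof(char16_t));
    return out;
}
```

A reader that finds the header in the opposite order from its own CPU would apply `swap_bytes16` to every 16-bit value after the header before using the data.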
More detailed discussions can be found in the following publicly available documents on the web:
| The definitive Markus Kuhn FAQ on using Unicode and UTF-8 in the Unix/Linux environment. Not light reading, but it contains excellent in-depth information. | |
| Unicode Support in the Solaris Operating Environment | A Sun whitepaper providing additional information on current and future UTF-8 support under Solaris. |
| The UTF and BOM FAQ from unicode.org | An excellent official FAQ containing information on UTF-8 for beginners and experienced developers alike. |
See also: UTF-8 Encoding, STRCHAR type, Conditional symbols