UTF-8 Encoding


The two methods described here convert between the internal Unicode representation and the UTF-8 encoding standard, which is widely used on the Internet and in Linux C/C++ applications to represent Unicode characters with a variable-length encoding. These functions are available only in Unicode builds (i.e. when the internal representation of characters in the system is 16-bit). They also cannot be used when the STR_NO_RTLIBRARY flag is turned on.

Don't know what UTF-8 is? Refer to Unicode and UTF-8 in Str Library.

 

 
Str::ToUTF8
Only available in Unicode builds
 
STRACHAR* ToUTF8 (STRACHAR* buffer, int buflen) const

Use this method to convert a string to its UTF-8 representation.

Typically, application programs should let ToUTF8 allocate a memory block to contain the converted string.  In that case, the buffer parameter must be null, and buflen must be -1.  The value returned by the method points to a dynamically allocated memory block containing a null-terminated UTF-8 representation of the data.  After the application has finished using this memory block, it should release it by calling Str::OS_free.

Alternatively, the caller may supply its own buffer and pass its length (in bytes) in the buflen parameter.  In this situation the method result will always be equal to the buffer parameter, unless the buffer is not big enough; in that case, a value of -1 (cast to STRACHAR*) is returned.
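
When supplying your own buffer, a safe size can be worked out in advance, because a 16-bit character never needs more than three bytes in UTF-8. The helper below is a hypothetical sketch and not part of Str Library. It also reserves one byte for a null terminator, on the assumption that ToUTF8 terminates a caller-supplied buffer the same way it terminates the blocks it allocates itself.

```cpp
#include <cstddef>

// Hypothetical helper (not a Str Library function).
// Upper bound on the bytes needed to hold the UTF-8 form of `chars`
// 16-bit characters: each one takes at most 3 bytes, plus 1 byte for a
// null terminator (assumed to be written into the caller's buffer too).
std::size_t max_utf8_bytes(std::size_t chars)
{
    return chars * 3 + 1;
}
```

A buffer of this size is usually larger than strictly necessary, but it should rule out the "buffer not big enough" error without a trial conversion.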

If the string contains characters that cannot be represented in the UTF-8 encoding (i.e. 'surrogate' codes in code positions U+D800 to U+DFFF, as well as U+FFFE and U+FFFF), the method will throw a StrException with an error code of SE_MalformedUtf8Char.

 

 
Str::FromUTF8
Only available in Unicode builds
 
void FromUTF8(const STRACHAR* buffer, int length = -1)

Use this method to "import" a UTF-8 encoded string into a Str object.  The previous content of the Str object is destroyed.

The buffer parameter must point to the beginning of a UTF-8 encoded character string.  If the string is null-terminated, the length parameter can be specified as -1.  Otherwise, the length of the string in bytes (not in characters, which in UTF-8 can have variable length) must be passed.
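
The difference between byte length and character count can be illustrated with a small stand-alone function. This is a hypothetical helper, not a Str Library call, and it assumes the input is well-formed UTF-8: it counts characters by skipping continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>

// Hypothetical helper (not a Str Library function).
// Counts the characters in `bytes` bytes of well-formed UTF-8 by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t utf8_char_count(const char* buffer, std::size_t bytes)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes; ++i) {
        unsigned char b = static_cast<unsigned char>(buffer[i]);
        if ((b & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the five-byte buffer C3 A9 E2 82 AC (U+00E9 followed by U+20AC) the helper reports 2 characters, whereas the value to pass as the length parameter is 5.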

Windows only: If the UTF-8 string contains characters that cannot be represented in Str Library's UCS-2 encoding (i.e. Unicode symbols beyond the 16-bit Basic Multilingual Plane, which covers the characters used by the vast majority of applications), or if the UTF-8 string is malformed, the method will throw a StrException with an error code of SE_MalformedUtf8Char.
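
The kinds of input that fall into these two categories can be sketched with a stand-alone decoder for a single character. The function below is a hypothetical illustration and not the library's actual implementation. It returns false both for malformed sequences and for characters that do not fit in 16 bits; rejecting surrogate codes on input is an additional assumption made here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Str Library code).
// Decodes one UTF-8 sequence starting at p, where `avail` bytes are readable.
// On success stores the character in `out`, the sequence length in `used`,
// and returns true. Returns false for malformed input and for characters
// that do not fit in 16 bits.
bool decode_bmp_char(const unsigned char* p, std::size_t avail,
                     std::uint16_t& out, std::size_t& used)
{
    if (avail == 0) return false;

    std::uint32_t cp;   // code point being assembled
    std::uint32_t min;  // smallest value allowed for this sequence length
    std::size_t len;

    if (p[0] < 0x80)                { len = 1; cp = p[0];        min = 0x0000; }
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; min = 0x0080; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; min = 0x0800; }
    else return false;  // stray continuation byte, or the lead byte of a
                        // sequence of 4 or more bytes (beyond U+FFFF)

    if (avail < len) return false;  // truncated sequence

    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return false;  // not of the form 10xxxxxx
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    if (cp < min) return false;                      // not the shortest form
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code

    out = static_cast<std::uint16_t>(cp);
    used = len;
    return true;
}
```

With this sketch, E2 82 AC decodes to U+20AC, while the overlong pair C0 80 and the four-byte sequence F0 9F 98 80 (U+1F600, outside the BMP) are both rejected.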

 

The UTF-8 encoding rules

The following passage comes from the excellent UTF-8 and Unicode FAQ:

The following byte sequences are used to represent a character:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
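
For 16-bit characters only the first three rows of the table apply. The following sketch shows how those rows translate into code. It is a hypothetical helper, not the library's implementation; it returns 0 for the values that the ToUTF8 description lists as unrepresentable.

```cpp
#include <cstdint>

// Hypothetical helper (not Str Library code).
// Writes the UTF-8 form of a 16-bit character into out[0..2] and returns the
// number of bytes written (1 to 3), or 0 for surrogate codes U+D800..U+DFFF,
// U+FFFE and U+FFFF.
int encode_bmp_char(std::uint16_t c, unsigned char out[3])
{
    if ((c >= 0xD800 && c <= 0xDFFF) || c >= 0xFFFE)
        return 0;

    if (c < 0x80) {                 // 0xxxxxxx
        out[0] = static_cast<unsigned char>(c);
        return 1;
    }
    if (c < 0x800) {                // 110xxxxx 10xxxxxx
        out[0] = static_cast<unsigned char>(0xC0 | (c >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        return 2;
    }
    // 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = static_cast<unsigned char>(0xE0 | (c >> 12));
    out[1] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
    out[2] = static_cast<unsigned char>(0x80 | (c & 0x3F));
    return 3;
}
```

For example, U+20AC (the euro sign) falls in the third row and is encoded as the three bytes E2 82 AC, while U+00E9 falls in the second row and becomes C3 A9.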

 

See also: Unicode, ANSI / Unicode mode, STRCHAR type