utf8rewind
1.4.1
System library for processing UTF-8 encoded text
Public interface for UTF-8 functions.
Macros | |
Version information | |
Macros used to identify the version of the library. | |
#define | UTF8_VERSION_MAKE(_major, _minor, _bugfix) ((_major) * 10000) + ((_minor) * 100) + (_bugfix) |
Macro for creating a version number from a major, minor and bugfix number. | |
#define | UTF8_VERSION_MAJOR 1 |
The major version number of this release. | |
#define | UTF8_VERSION_MINOR 4 |
The minor version number of this release. | |
#define | UTF8_VERSION_BUGFIX 1 |
The bugfix version number of this release. | |
#define | UTF8_VERSION UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX) |
The version number as an integer. | |
#define | UTF8_VERSION_STRING "1.4.1" |
The version number as a string. | |
#define | UTF8_VERSION_GUARD(_major, _minor, _bugfix) (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix)) |
Check if feature is supported by the current release. | |
Error codes | |
Values returned by functions on error. | |
#define | UTF8_ERR_NONE (0) |
No errors. | |
#define | UTF8_ERR_INVALID_DATA (-1) |
Input data is invalid. | |
#define | UTF8_ERR_INVALID_FLAG (-2) |
Input flag is invalid. | |
#define | UTF8_ERR_NOT_ENOUGH_SPACE (-3) |
Not enough space in buffer to store result. | |
#define | UTF8_ERR_OVERLAPPING_PARAMETERS (-4) |
Input and output buffers overlap in memory. | |
Global configuration | |
Defines used for determining the global configuration of the system and your application. | |
#define | UTF8_WCHAR_SIZE (2) |
Specifies the size of the wchar_t type. On Windows this is 2, on POSIX systems it is 4. If not specified on the command line, the compiler tries to automatically determine the size of the wchar_t type based on the environment. | |
#define | UTF8_WCHAR_UTF16 (1) |
The wchar_t type is treated as UTF-16 (2 bytes). | |
#define | UTF8_API |
Calling convention for public functions. | |
Normalization flags | |
Flags used as input for utf8normalize and the result of utf8isnormalized. | |
#define | UTF8_NORMALIZE_COMPOSE 0x00000001 |
Normalize input to Normalization Form C (NFC). | |
#define | UTF8_NORMALIZE_DECOMPOSE 0x00000002 |
Normalize input to Normalization Form D (NFD). | |
#define | UTF8_NORMALIZE_COMPATIBILITY 0x00000004 |
Change Normalization Form from NFC to NFKC or from NFD to NFKD. | |
#define | UTF8_NORMALIZATION_RESULT_YES (0) |
Text is stable and does not have to be normalized. | |
#define | UTF8_NORMALIZATION_RESULT_MAYBE (1) |
Text is unstable, but normalization may be skipped. | |
#define | UTF8_NORMALIZATION_RESULT_NO (2) |
Text is unstable and must be normalized. | |
Category flags | |
Flags to be used with utf8iscategory, to check whether code points in a string are part of that category. | |
#define | UTF8_CATEGORY_LETTER_UPPERCASE 0x00000001 |
Uppercase letter code points, Lu in the Unicode database. | |
#define | UTF8_CATEGORY_LETTER_LOWERCASE 0x00000002 |
Lowercase letter code points, Ll in the Unicode database. | |
#define | UTF8_CATEGORY_LETTER_TITLECASE 0x00000004 |
Titlecase letter code points, Lt in the Unicode database. | |
#define | UTF8_CATEGORY_LETTER_MODIFIER 0x00000008 |
Modifier letter code points, Lm in the Unicode database. | |
#define | UTF8_CATEGORY_LETTER_OTHER 0x00000010 |
Other letter code points, Lo in the Unicode database. | |
#define | UTF8_CATEGORY_LETTER |
Combined flag for all letter categories. | |
#define | UTF8_CATEGORY_CASE_MAPPED |
Combined flag for all letter categories with case mapping. | |
#define | UTF8_CATEGORY_MARK_NON_SPACING 0x00000020 |
Non-spacing mark code points, Mn in the Unicode database. | |
#define | UTF8_CATEGORY_MARK_SPACING 0x00000040 |
Spacing mark code points, Mc in the Unicode database. | |
#define | UTF8_CATEGORY_MARK_ENCLOSING 0x00000080 |
Enclosing mark code points, Me in the Unicode database. | |
#define | UTF8_CATEGORY_MARK |
Combined flag for all mark categories. | |
#define | UTF8_CATEGORY_NUMBER_DECIMAL 0x00000100 |
Decimal number code points, Nd in the Unicode database. | |
#define | UTF8_CATEGORY_NUMBER_LETTER 0x00000200 |
Letter number code points, Nl in the Unicode database. | |
#define | UTF8_CATEGORY_NUMBER_OTHER 0x00000400 |
Other number code points, No in the Unicode database. | |
#define | UTF8_CATEGORY_NUMBER |
Combined flag for all number categories. | |
#define | UTF8_CATEGORY_PUNCTUATION_CONNECTOR 0x00000800 |
Connector punctuation category, Pc in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_DASH 0x00001000 |
Dash punctuation category, Pd in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_OPEN 0x00002000 |
Open punctuation category, Ps in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_CLOSE 0x00004000 |
Close punctuation category, Pe in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_INITIAL 0x00008000 |
Initial punctuation category, Pi in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_FINAL 0x00010000 |
Final punctuation category, Pf in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION_OTHER 0x00020000 |
Other punctuation category, Po in the Unicode database. | |
#define | UTF8_CATEGORY_PUNCTUATION |
Combined flag for all punctuation categories. | |
#define | UTF8_CATEGORY_SYMBOL_MATH 0x00040000 |
Math symbol category, Sm in the Unicode database. | |
#define | UTF8_CATEGORY_SYMBOL_CURRENCY 0x00080000 |
Currency symbol category, Sc in the Unicode database. | |
#define | UTF8_CATEGORY_SYMBOL_MODIFIER 0x00100000 |
Modifier symbol category, Sk in the Unicode database. | |
#define | UTF8_CATEGORY_SYMBOL_OTHER 0x00200000 |
Other symbol category, So in the Unicode database. | |
#define | UTF8_CATEGORY_SYMBOL |
Combined flag for all symbol categories. | |
#define | UTF8_CATEGORY_SEPARATOR_SPACE 0x00400000 |
Space separator category, Zs in the Unicode database. | |
#define | UTF8_CATEGORY_SEPARATOR_LINE 0x00800000 |
Line separator category, Zl in the Unicode database. | |
#define | UTF8_CATEGORY_SEPARATOR_PARAGRAPH 0x01000000 |
Paragraph separator category, Zp in the Unicode database. | |
#define | UTF8_CATEGORY_SEPARATOR |
Combined flag for all separator categories. | |
#define | UTF8_CATEGORY_CONTROL 0x02000000 |
Control category, Cc in the Unicode database. | |
#define | UTF8_CATEGORY_FORMAT 0x04000000 |
Format category, Cf in the Unicode database. | |
#define | UTF8_CATEGORY_SURROGATE 0x08000000 |
Surrogate category, Cs in the Unicode database. | |
#define | UTF8_CATEGORY_PRIVATE_USE 0x10000000 |
Private use category, Co in the Unicode database. | |
#define | UTF8_CATEGORY_UNASSIGNED 0x20000000 |
Unassigned category, Cn in the Unicode database. | |
#define | UTF8_CATEGORY_COMPATIBILITY 0x40000000 |
Flag used for maintaining backwards compatibility with POSIX functions, not found in the Unicode database. | |
#define | UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER 0x80000000 |
Flag used for checking only the general category of code points at the start of a grapheme cluster. | |
#define | UTF8_CATEGORY_ISCNTRL |
Flag used for maintaining backwards compatibility with POSIX iscntrl function. | |
#define | UTF8_CATEGORY_ISPRINT |
Flag used for maintaining backwards compatibility with POSIX isprint function. | |
#define | UTF8_CATEGORY_ISSPACE |
Flag used for maintaining backwards compatibility with POSIX isspace function. | |
#define | UTF8_CATEGORY_ISBLANK |
Flag used for maintaining backwards compatibility with POSIX isblank function. | |
#define | UTF8_CATEGORY_ISGRAPH |
Flag used for maintaining backwards compatibility with POSIX isgraph function. | |
#define | UTF8_CATEGORY_ISPUNCT |
Flag used for maintaining backwards compatibility with POSIX ispunct function. | |
#define | UTF8_CATEGORY_ISALNUM |
Flag used for maintaining backwards compatibility with POSIX isalnum function. | |
#define | UTF8_CATEGORY_ISALPHA |
Flag used for maintaining backwards compatibility with POSIX isalpha function. | |
#define | UTF8_CATEGORY_ISUPPER |
Flag used for maintaining backwards compatibility with POSIX isupper function. | |
#define | UTF8_CATEGORY_ISLOWER |
Flag used for maintaining backwards compatibility with POSIX islower function. | |
#define | UTF8_CATEGORY_ISDIGIT |
Flag used for maintaining backwards compatibility with POSIX isdigit function. | |
#define | UTF8_CATEGORY_ISXDIGIT |
Flag used for maintaining backwards compatibility with POSIX isxdigit function. | |
Typedefs | |
typedef uint16_t | utf16_t |
UTF-16 encoded code point. | |
typedef uint32_t | unicode_t |
UTF-32 encoded code point. | |
Functions | |
UTF8_API size_t | utf8len (const char *text) |
Get the length in code points of a UTF-8 encoded string. | |
UTF8_API size_t | utf16toutf8 (const utf16_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a UTF-16 encoded string to a UTF-8 encoded string. | |
UTF8_API size_t | utf32toutf8 (const unicode_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a UTF-32 encoded string to a UTF-8 encoded string. | |
UTF8_API size_t | widetoutf8 (const wchar_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a wide string to a UTF-8 encoded string. | |
UTF8_API size_t | utf8toutf16 (const char *input, size_t inputSize, utf16_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a UTF-16 encoded string. | |
UTF8_API size_t | utf8toutf32 (const char *input, size_t inputSize, unicode_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a UTF-32 encoded string. | |
UTF8_API size_t | utf8towide (const char *input, size_t inputSize, wchar_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a wide string. | |
UTF8_API const char * | utf8seek (const char *text, size_t textSize, const char *textStart, off_t offset, int direction) |
Seek into a UTF-8 encoded string. | |
UTF8_API size_t | utf8toupper (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert UTF-8 encoded text to uppercase. | |
UTF8_API size_t | utf8tolower (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert UTF-8 encoded text to lowercase. | |
UTF8_API size_t | utf8totitle (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert UTF-8 encoded text to titlecase. | |
UTF8_API size_t | utf8casefold (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Remove case distinction from UTF-8 encoded text. | |
UTF8_API uint8_t | utf8isnormalized (const char *input, size_t inputSize, size_t flags, size_t *offset) |
Check if a string is stable in the specified Unicode Normalization Form. | |
UTF8_API size_t | utf8normalize (const char *input, size_t inputSize, char *target, size_t targetSize, size_t flags, int32_t *errors) |
Normalize a string to the specified Unicode Normalization Form. | |
UTF8_API size_t | utf8iscategory (const char *input, size_t inputSize, size_t flags) |
Check if the input string conforms to the category specified by the flags. | |
Public interface for UTF-8 functions.
#define UTF8_VERSION_MAKE(_major, _minor, _bugfix) ((_major) * 10000) + ((_minor) * 100) + (_bugfix) |
Macro for creating a version number from a major, minor and bugfix number.
#define UTF8_VERSION_MAJOR 1 |
The major version number of this release.
#define UTF8_VERSION_MINOR 4 |
The minor version number of this release.
#define UTF8_VERSION_BUGFIX 1 |
The bugfix version number of this release.
#define UTF8_VERSION UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX) |
The version number as an integer.
#define UTF8_VERSION_STRING "1.4.1" |
The version number as a string.
#define UTF8_VERSION_GUARD(_major, _minor, _bugfix) (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix)) |
Check if feature is supported by the current release.
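As a minimal sketch of how the version macros combine, the block below copies the definitions from the listing above so that it compiles without the library header. The functions `release_note` and `release_is_at_least` are hypothetical helpers introduced only for illustration.

```c
/* Definitions copied from the listing above so that this sketch compiles
   without the library header. Real code would include the header instead. */
#define UTF8_VERSION_MAKE(_major, _minor, _bugfix) \
    ((_major) * 10000) + ((_minor) * 100) + (_bugfix)
#define UTF8_VERSION_MAJOR  1
#define UTF8_VERSION_MINOR  4
#define UTF8_VERSION_BUGFIX 1
#define UTF8_VERSION \
    UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX)
#define UTF8_VERSION_GUARD(_major, _minor, _bugfix) \
    (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix))

/* Compile-time use: select a code path depending on the release. */
const char *release_note(void)
{
#if UTF8_VERSION_GUARD(1, 4, 0)
    return "1.4.0 or newer";
#else
    return "older than 1.4.0";
#endif
}

/* Run-time use (hypothetical helper): the guard is an ordinary expression. */
int release_is_at_least(int major, int minor, int bugfix)
{
    return UTF8_VERSION_GUARD(major, minor, bugfix) ? 1 : 0;
}
```

Because UTF8_VERSION_MAKE expands without outer parentheses, it is safer to write `(UTF8_VERSION)` when combining the value with operators that bind more tightly than `+`.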
#define UTF8_ERR_NONE (0) |
No errors.
#define UTF8_ERR_INVALID_DATA (-1) |
Input data is invalid.
#define UTF8_ERR_INVALID_FLAG (-2) |
Input flag is invalid.
#define UTF8_ERR_NOT_ENOUGH_SPACE (-3) |
Not enough space in buffer to store result.
#define UTF8_ERR_OVERLAPPING_PARAMETERS (-4) |
Input and output buffers overlap in memory.
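The error codes can be turned into readable messages with a small lookup. The following is a sketch only: `describe_utf8_error` is a hypothetical helper, not a library function, and the values are copied from the listing above so the block is self-contained.

```c
#include <stdint.h>

/* Values copied from the listing above; real code would include the header. */
#define UTF8_ERR_NONE                   (0)
#define UTF8_ERR_INVALID_DATA           (-1)
#define UTF8_ERR_INVALID_FLAG           (-2)
#define UTF8_ERR_NOT_ENOUGH_SPACE       (-3)
#define UTF8_ERR_OVERLAPPING_PARAMETERS (-4)

/* Hypothetical helper: maps an error code to the description on this page. */
const char *describe_utf8_error(int32_t code)
{
    switch (code) {
    case UTF8_ERR_NONE:
        return "No errors.";
    case UTF8_ERR_INVALID_DATA:
        return "Input data is invalid.";
    case UTF8_ERR_INVALID_FLAG:
        return "Input flag is invalid.";
    case UTF8_ERR_NOT_ENOUGH_SPACE:
        return "Not enough space in buffer to store result.";
    case UTF8_ERR_OVERLAPPING_PARAMETERS:
        return "Input and output buffers overlap in memory.";
    default:
        return "Unknown error code.";
    }
}
```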
#define UTF8_WCHAR_SIZE (2) |
Specifies the size of the wchar_t type. On Windows this is 2, on POSIX systems it is 4. If not specified on the command line, the compiler tries to automatically determine the size of the wchar_t type based on the environment.
#define UTF8_WCHAR_UTF16 (1) |
The wchar_t type is treated as UTF-16 (2 bytes).
#define UTF8_API |
Calling convention for public functions.
#define UTF8_NORMALIZE_COMPOSE 0x00000001 |
Normalize input to Normalization Form C (NFC).
#define UTF8_NORMALIZE_DECOMPOSE 0x00000002 |
Normalize input to Normalization Form D (NFD).
#define UTF8_NORMALIZE_COMPATIBILITY 0x00000004 |
Change Normalization Form from NFC to NFKC or from NFD to NFKD.
#define UTF8_NORMALIZATION_RESULT_YES (0) |
Text is stable and does not have to be normalized.
#define UTF8_NORMALIZATION_RESULT_MAYBE (1) |
Text is unstable, but normalization may be skipped.
#define UTF8_NORMALIZATION_RESULT_NO (2) |
Text is unstable and must be normalized.
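The three normalization flags combine into four normalization forms. The sketch below shows that mapping with a hypothetical helper, `normalization_form_name`. Treating "neither" or "both" of COMPOSE and DECOMPOSE as invalid is an assumption of this sketch, not behavior confirmed by this page.

```c
#include <stddef.h>

/* Values copied from the listing above; real code would include the header. */
#define UTF8_NORMALIZE_COMPOSE       0x00000001
#define UTF8_NORMALIZE_DECOMPOSE     0x00000002
#define UTF8_NORMALIZE_COMPATIBILITY 0x00000004

/* Hypothetical helper: names the form selected by a flag combination.
   Returns NULL when the combination does not select exactly one of
   COMPOSE or DECOMPOSE (an assumption made for this sketch). */
const char *normalization_form_name(size_t flags)
{
    int compose = (flags & UTF8_NORMALIZE_COMPOSE) != 0;
    int decompose = (flags & UTF8_NORMALIZE_DECOMPOSE) != 0;
    int compatibility = (flags & UTF8_NORMALIZE_COMPATIBILITY) != 0;

    if (compose == decompose) {
        return NULL;
    }
    if (compose) {
        return compatibility ? "NFKC" : "NFC";
    }
    return compatibility ? "NFKD" : "NFD";
}
```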
#define UTF8_CATEGORY_LETTER_UPPERCASE 0x00000001 |
Uppercase letter code points, Lu in the Unicode database.
#define UTF8_CATEGORY_LETTER_LOWERCASE 0x00000002 |
Lowercase letter code points, Ll in the Unicode database.
#define UTF8_CATEGORY_LETTER_TITLECASE 0x00000004 |
Titlecase letter code points, Lt in the Unicode database.
#define UTF8_CATEGORY_LETTER_MODIFIER 0x00000008 |
Modifier letter code points, Lm in the Unicode database.
#define UTF8_CATEGORY_LETTER_OTHER 0x00000010 |
Other letter code points, Lo in the Unicode database.
#define UTF8_CATEGORY_LETTER |
Combined flag for all letter categories.
#define UTF8_CATEGORY_CASE_MAPPED |
Combined flag for all letter categories with case mapping.
#define UTF8_CATEGORY_MARK_NON_SPACING 0x00000020 |
Non-spacing mark code points, Mn in the Unicode database.
#define UTF8_CATEGORY_MARK_SPACING 0x00000040 |
Spacing mark code points, Mc in the Unicode database.
#define UTF8_CATEGORY_MARK_ENCLOSING 0x00000080 |
Enclosing mark code points, Me in the Unicode database.
#define UTF8_CATEGORY_MARK |
Combined flag for all mark categories.
#define UTF8_CATEGORY_NUMBER_DECIMAL 0x00000100 |
Decimal number code points, Nd in the Unicode database.
#define UTF8_CATEGORY_NUMBER_LETTER 0x00000200 |
Letter number code points, Nl in the Unicode database.
#define UTF8_CATEGORY_NUMBER_OTHER 0x00000400 |
Other number code points, No in the Unicode database.
#define UTF8_CATEGORY_NUMBER |
Combined flag for all number categories.
#define UTF8_CATEGORY_PUNCTUATION_CONNECTOR 0x00000800 |
Connector punctuation category, Pc in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_DASH 0x00001000 |
Dash punctuation category, Pd in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_OPEN 0x00002000 |
Open punctuation category, Ps in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_CLOSE 0x00004000 |
Close punctuation category, Pe in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_INITIAL 0x00008000 |
Initial punctuation category, Pi in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_FINAL 0x00010000 |
Final punctuation category, Pf in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION_OTHER 0x00020000 |
Other punctuation category, Po in the Unicode database.
#define UTF8_CATEGORY_PUNCTUATION |
Combined flag for all punctuation categories.
#define UTF8_CATEGORY_SYMBOL_MATH 0x00040000 |
Math symbol category, Sm in the Unicode database.
#define UTF8_CATEGORY_SYMBOL_CURRENCY 0x00080000 |
Currency symbol category, Sc in the Unicode database.
#define UTF8_CATEGORY_SYMBOL_MODIFIER 0x00100000 |
Modifier symbol category, Sk in the Unicode database.
#define UTF8_CATEGORY_SYMBOL_OTHER 0x00200000 |
Other symbol category, So in the Unicode database.
#define UTF8_CATEGORY_SYMBOL |
Combined flag for all symbol categories.
#define UTF8_CATEGORY_SEPARATOR_SPACE 0x00400000 |
Space separator category, Zs in the Unicode database.
#define UTF8_CATEGORY_SEPARATOR_LINE 0x00800000 |
Line separator category, Zl in the Unicode database.
#define UTF8_CATEGORY_SEPARATOR_PARAGRAPH 0x01000000 |
Paragraph separator category, Zp in the Unicode database.
#define UTF8_CATEGORY_SEPARATOR |
Combined flag for all separator categories.
#define UTF8_CATEGORY_CONTROL 0x02000000 |
Control category, Cc in the Unicode database.
#define UTF8_CATEGORY_FORMAT 0x04000000 |
Format category, Cf in the Unicode database.
#define UTF8_CATEGORY_SURROGATE 0x08000000 |
Surrogate category, Cs in the Unicode database.
#define UTF8_CATEGORY_PRIVATE_USE 0x10000000 |
Private use category, Co in the Unicode database.
#define UTF8_CATEGORY_UNASSIGNED 0x20000000 |
Unassigned category, Cn in the Unicode database.
#define UTF8_CATEGORY_COMPATIBILITY 0x40000000 |
Flag used for maintaining backwards compatibility with POSIX functions, not found in the Unicode database.
#define UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER 0x80000000 |
Flag used for checking only the general category of code points at the start of a grapheme cluster.
#define UTF8_CATEGORY_ISCNTRL |
Flag used for maintaining backwards compatibility with POSIX iscntrl function.
#define UTF8_CATEGORY_ISPRINT |
Flag used for maintaining backwards compatibility with POSIX isprint function.
#define UTF8_CATEGORY_ISSPACE |
Flag used for maintaining backwards compatibility with POSIX isspace function.
#define UTF8_CATEGORY_ISBLANK |
Flag used for maintaining backwards compatibility with POSIX isblank function.
#define UTF8_CATEGORY_ISGRAPH |
Flag used for maintaining backwards compatibility with POSIX isgraph function.
#define UTF8_CATEGORY_ISPUNCT |
Flag used for maintaining backwards compatibility with POSIX ispunct function.
#define UTF8_CATEGORY_ISALNUM |
Flag used for maintaining backwards compatibility with POSIX isalnum function.
#define UTF8_CATEGORY_ISALPHA |
Flag used for maintaining backwards compatibility with POSIX isalpha function.
#define UTF8_CATEGORY_ISUPPER |
Flag used for maintaining backwards compatibility with POSIX isupper function.
#define UTF8_CATEGORY_ISLOWER |
Flag used for maintaining backwards compatibility with POSIX islower function.
#define UTF8_CATEGORY_ISDIGIT |
Flag used for maintaining backwards compatibility with POSIX isdigit function.
#define UTF8_CATEGORY_ISXDIGIT |
Flag used for maintaining backwards compatibility with POSIX isxdigit function.
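Each individual category flag occupies its own bit, so flags can be combined with bitwise OR. The sketch below illustrates this. The masks `SKETCH_LETTER_MASK` and `SKETCH_ALNUM_MASK` and the helper `mask_includes` are hypothetical. This page does not show the values of the library's combined flags (such as UTF8_CATEGORY_LETTER), so no equivalence with them is claimed.

```c
#include <stddef.h>

/* Values copied from the listing above; real code would include the header. */
#define UTF8_CATEGORY_LETTER_UPPERCASE 0x00000001
#define UTF8_CATEGORY_LETTER_LOWERCASE 0x00000002
#define UTF8_CATEGORY_LETTER_TITLECASE 0x00000004
#define UTF8_CATEGORY_LETTER_MODIFIER  0x00000008
#define UTF8_CATEGORY_LETTER_OTHER     0x00000010
#define UTF8_CATEGORY_NUMBER_DECIMAL   0x00000100

/* Hypothetical masks built for this sketch only. */
#define SKETCH_LETTER_MASK \
    (UTF8_CATEGORY_LETTER_UPPERCASE | UTF8_CATEGORY_LETTER_LOWERCASE | \
     UTF8_CATEGORY_LETTER_TITLECASE | UTF8_CATEGORY_LETTER_MODIFIER | \
     UTF8_CATEGORY_LETTER_OTHER)
#define SKETCH_ALNUM_MASK \
    (SKETCH_LETTER_MASK | UTF8_CATEGORY_NUMBER_DECIMAL)

/* Returns 1 when every bit of `flag` is present in `mask`. */
int mask_includes(size_t mask, size_t flag)
{
    return (mask & flag) == flag;
}
```

A mask built this way would be passed as the flags argument of utf8iscategory.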
typedef uint16_t utf16_t
UTF-16 encoded code point.
typedef uint32_t unicode_t
UTF-32 encoded code point.
UTF8_API size_t utf8len(const char *text)
Get the length in code points of a UTF-8 encoded string.
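To illustrate why the length in code points differs from the length in bytes, here is a minimal sketch that counts code points in well-formed, NUL-terminated UTF-8 by skipping continuation bytes. `count_code_points` is a hypothetical function, not the library's implementation, and how utf8len treats malformed input is not specified on this page.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in well-formed UTF-8.
   Continuation bytes have the bit pattern 10xxxxxx and are not counted. */
size_t count_code_points(const char *text)
{
    const unsigned char *p;
    size_t count = 0;

    if (text == NULL) {
        return 0;
    }
    for (p = (const unsigned char *)text; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```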
UTF8_API size_t utf16toutf8(const utf16_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a UTF-16 encoded string to a UTF-8 encoded string.
Example:
[in] | input | UTF-16 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf32toutf8(const unicode_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a UTF-32 encoded string to a UTF-8 encoded string.
Example:
[in] | input | UTF-32 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
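For readers unfamiliar with the encoding step itself, the sketch below encodes a single UTF-32 code point as UTF-8. `encode_code_point` is a hypothetical function and not the library's implementation. Returning 0 for surrogates and values above U+10FFFF is a choice made for this sketch, and the library may handle such input differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of `cp` to `out` and
   returns the number of bytes written (1 to 4), or 0 when `cp` is a
   surrogate or lies beyond U+10FFFF. */
size_t encode_code_point(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0; /* surrogates are not valid scalar values */
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```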
UTF8_API size_t widetoutf8(const wchar_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a wide string to a UTF-8 encoded string.
Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a wide string as input and automatically calls the correct conversion function.
This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.
Example:
[in] | input | Wide-encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf8toutf16(const char *input, size_t inputSize, utf16_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a UTF-16 encoded string.
Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf8toutf32(const char *input, size_t inputSize, unicode_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a UTF-32 encoded string.
Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
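The replacement behavior described above can be sketched with a small decoder that yields U+FFFD for missing bytes, illegal bytes and overlong encodings. `decode_code_point` is a hypothetical function, not the library's implementation. In particular, how many bytes are consumed per erroneous sequence is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT_CHARACTER 0xFFFDu

/* Hypothetical helper: decodes one code point from the first `size` bytes
   at `p` (size must be at least 1), stores it in `*out` and returns the
   number of bytes consumed. Erroneous sequences produce U+FFFD. */
size_t decode_code_point(const unsigned char *p, size_t size, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t minimum[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t length;
    size_t i;
    uint32_t cp;

    if (p[0] < 0x80) {
        *out = p[0];
        return 1;
    } else if ((p[0] & 0xE0) == 0xC0) {
        length = 2;
        cp = p[0] & 0x1F;
    } else if ((p[0] & 0xF0) == 0xE0) {
        length = 3;
        cp = p[0] & 0x0F;
    } else if ((p[0] & 0xF8) == 0xF0) {
        length = 4;
        cp = p[0] & 0x07;
    } else {
        /* Stray continuation byte or illegal lead byte. */
        *out = SKETCH_REPLACEMENT_CHARACTER;
        return 1;
    }

    for (i = 1; i < length; ++i) {
        if (i >= size || (p[i] & 0xC0) != 0x80) {
            /* Missing continuation byte: consume only what was valid. */
            *out = SKETCH_REPLACEMENT_CHARACTER;
            return i;
        }
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }

    /* Reject overlong encodings, surrogates and out-of-range values. */
    if (cp < minimum[length] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        cp = SKETCH_REPLACEMENT_CHARACTER;
    }
    *out = cp;
    return length;
}
```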
UTF8_API size_t utf8towide(const char *input, size_t inputSize, wchar_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a wide string.
Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a UTF-8 encoded string as input and automatically calls the correct conversion function.
This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.
Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API const char* utf8seek(const char *text, size_t textSize, const char *textStart, off_t offset, int direction)
Seek into a UTF-8 encoded string.
Working with UTF-8 encoded strings can be tricky due to the nature of the variable-length encoding. Because one character no longer equals one byte, it can be difficult to skip around in a UTF-8 encoded string without decoding the code points.
This function provides an interface similar to fseek in order to enable skipping to another part of the string.
textStart must come before text in memory when seeking from the current or end position.
Example:
[in] | text | Input string. |
[in] | textSize | Size of input string in bytes. |
[in] | textStart | Start of input string. |
[in] | offset | Requested offset in code points. |
[in] | direction | Direction to seek in. |
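The idea of moving through a string in code points rather than bytes can be sketched as follows. `skip_forward` is a hypothetical helper that only moves forward over well-formed UTF-8. It is not the library's implementation and does not cover the backward or end-relative seeking that utf8seek offers.

```c
#include <stddef.h>

/* Hypothetical helper: returns a pointer `count` code points past `text`,
   clamped to the end of the buffer. Assumes well-formed UTF-8. */
const char *skip_forward(const char *text, size_t textSize, size_t count)
{
    const char *end = text + textSize;
    const char *p = text;

    while (count > 0 && p < end) {
        ++p; /* step over the lead byte */
        while (p < end && ((unsigned char)*p & 0xC0) == 0x80) {
            ++p; /* and over its continuation bytes */
        }
        --count;
    }
    return p;
}
```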
UTF8_API size_t utf8toupper(const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert UTF-8 encoded text to uppercase.
This function allows conversion of UTF-8 encoded strings to uppercase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.
Although most code points can be converted in-place, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0053" (LATIN CAPITAL LETTER S and LATIN CAPITAL LETTER S) when converted to uppercase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.
Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.
Note that case mapping is not reversible: toUpper(toLower(x)) != toLower(toUpper(x)).
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
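The advice to size the output first by passing a NULL buffer leads to a two-pass calling pattern, sketched below. Because the library is not linked here, `sketch_toupper` is a hypothetical stand-in with the same parameter list as utf8toupper. It only handles ASCII letters and U+00DF, and its behavior with a NULL target (returning the number of bytes required) is an assumption based on the description above. `to_upper_alloc` is likewise a hypothetical wrapper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define UTF8_ERR_NONE             (0)
#define UTF8_ERR_NOT_ENOUGH_SPACE (-3)

/* Stand-in for the library call (ASCII letters and U+00DF only).
   With a NULL target it only measures the required size. */
size_t sketch_toupper(const char *input, size_t inputSize,
                      char *target, size_t targetSize, int32_t *errors)
{
    size_t written = 0;
    size_t i = 0;

    while (i < inputSize) {
        char upper;
        const char *src;
        size_t n;

        if (i + 1 < inputSize &&
            (unsigned char)input[i] == 0xC3 &&
            (unsigned char)input[i + 1] == 0x9F) {
            src = "SS"; /* U+00DF maps to U+0053 U+0053 */
            n = 2;
            i += 2;
        } else {
            upper = (input[i] >= 'a' && input[i] <= 'z')
                ? (char)(input[i] - ('a' - 'A')) : input[i];
            src = &upper;
            n = 1;
            i += 1;
        }
        if (target != NULL) {
            if (written + n > targetSize) {
                if (errors != NULL) *errors = UTF8_ERR_NOT_ENOUGH_SPACE;
                return written;
            }
            memcpy(target + written, src, n);
        }
        written += n;
    }
    if (errors != NULL) *errors = UTF8_ERR_NONE;
    return written;
}

/* Two-pass pattern: measure, allocate, then convert.
   The caller owns the returned NUL-terminated buffer. */
char *to_upper_alloc(const char *input, size_t inputSize, size_t *outputSize)
{
    int32_t errors = UTF8_ERR_NONE;
    char *buffer;
    size_t needed = sketch_toupper(input, inputSize, NULL, 0, &errors);

    if (errors != UTF8_ERR_NONE) return NULL;
    buffer = (char *)malloc(needed + 1);
    if (buffer == NULL) return NULL;
    *outputSize = sketch_toupper(input, inputSize, buffer, needed, &errors);
    if (errors != UTF8_ERR_NONE) {
        free(buffer);
        return NULL;
    }
    buffer[*outputSize] = '\0';
    return buffer;
}
```

The same measure-then-convert shape should apply to the other functions on this page that accept a NULL target, since they share this parameter list.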
UTF8_API size_t utf8tolower(const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert UTF-8 encoded text to lowercase.
This function allows conversion of UTF-8 encoded strings to lowercase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.
Although most code points can be converted to lowercase in-place, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when converted to lowercase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.
Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.
Note that case mapping is not reversible: toUpper(toLower(x)) != toLower(toUpper(x)).
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf8totitle(const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert UTF-8 encoded text to titlecase.
This function allows conversion of UTF-8 encoded strings to titlecase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.
Titlecase requires a bit more explanation than uppercase and lowercase, because it is not a common text transformation. Titlecase uses uppercase for the first letter of each word and lowercase for the rest. Words are defined as "collections of code points with general category Lu, Ll, Lt, Lm or Lo according to the Unicode database".
Effectively, any type of punctuation can break up a word, even if this is not grammatically valid. This happens because the titlecasing algorithm does not and cannot take grammar rules into account.
Text | Titlecase |
---|---|
The running man | The Running Man |
NATO Alliance | Nato Alliance |
You're amazing at building libraries | You'Re Amazing At Building Libraries |
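The word-breaking rule behind the table can be reproduced with an ASCII-only sketch: the first letter of each run of letters is uppercased, the rest are lowercased, and any non-letter ends the word. `ascii_titlecase` is a hypothetical helper that assumes the default "C" locale. It is not the library's implementation, which works on Unicode general categories.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: ASCII-only titlecasing. `target` must have room for
   the same number of bytes as `input`, including the terminating NUL. */
void ascii_titlecase(const char *input, char *target)
{
    int inside_word = 0;
    size_t i;

    for (i = 0; input[i] != '\0'; ++i) {
        unsigned char c = (unsigned char)input[i];

        if (isalpha(c)) {
            target[i] = (char)(inside_word ? tolower(c) : toupper(c));
            inside_word = 1;
        } else {
            /* Punctuation, digits and spaces all end the current word. */
            target[i] = (char)c;
            inside_word = 0;
        }
    }
    target[i] = '\0';
}
```

With this rule the apostrophe in "You're" ends the first word, which is why the following letter is capitalized.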
Although most code points can be converted to titlecase in-place, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0073" (LATIN CAPITAL LETTER S and LATIN SMALL LETTER S) when converted to titlecase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.
Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.
Note that case mapping is not reversible: toUpper(toLower(x)) != toLower(toUpper(x)).
Example:
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf8casefold(const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Remove case distinction from UTF-8 encoded text.
Case folding is the process of eliminating differences between code points concerning case mapping. It is most commonly used for comparing strings in a case-insensitive manner. Conversion is fully compliant with the Unicode 7.0 standard.
Although similar to lowercasing text, there are significant differences. For one, case folding does not take locale into account when converting. In some cases, case folding can be up to 20% faster than lowercasing the same text, but the result cannot be treated as correct lowercased text.
Only two locale-specific exception are made when case folding text. In Turkish, U+0049 LATIN CAPITAL LETTER I maps to U+0131 LATIN SMALL LETTER DOTLESS I and U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE maps to U+0069 LATIN SMALL LETTER I.
Although most code points can be case folded in-place, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when case folded. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.
Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.
Example:
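A minimal sketch of the two-pass pattern described above (size query with a NULL target, then conversion), followed by a case-insensitive comparison. It is written against a function-pointer type that copies the utf8casefold signature so it builds without the library. The names fold_fn, ascii_fold_stub, fold_alloc and equal_folded are hypothetical, and the stub only folds ASCII; it imitates the calling convention, not Unicode case folding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same parameter list as utf8casefold in the signature above. */
typedef size_t (*fold_fn)(const char* input, size_t inputSize,
                          char* target, size_t targetSize, int32_t* errors);

/* Hypothetical stand-in so the sketch builds without the library.
   Folds ASCII A-Z only and answers size queries when target is NULL. */
static size_t ascii_fold_stub(const char* input, size_t inputSize,
                              char* target, size_t targetSize, int32_t* errors)
{
    size_t i;
    if (errors != NULL) *errors = 0;       /* UTF8_ERR_NONE */
    if (target == NULL) return inputSize;  /* size query */
    if (targetSize < inputSize) {
        if (errors != NULL) *errors = -3;  /* UTF8_ERR_NOT_ENOUGH_SPACE */
        return 0;
    }
    for (i = 0; i < inputSize; ++i) {
        char c = input[i];
        target[i] = (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
    }
    return inputSize;
}

/* Pass 1 asks for the size with a NULL target, pass 2 converts.
   Returns a heap buffer holding *outSize bytes, or NULL on any error. */
static char* fold_alloc(fold_fn fold, const char* input, size_t inputSize,
                        size_t* outSize)
{
    int32_t errors = 0;
    size_t needed;
    char* result;

    assert(fold != NULL && input != NULL && outSize != NULL);
    needed = fold(input, inputSize, NULL, 0, &errors);
    if (errors != 0) return NULL;
    result = (char*)malloc(needed > 0 ? needed : 1);
    if (result == NULL) return NULL;
    *outSize = fold(input, inputSize, result, needed, &errors);
    if (errors != 0) { free(result); return NULL; }
    return result;
}

/* Fold both strings, then compare the bytes.
   Returns 1 if equal, 0 if different, -1 on error. */
static int equal_folded(fold_fn fold, const char* a, const char* b)
{
    size_t sizeA = 0, sizeB = 0;
    char* foldedA = fold_alloc(fold, a, strlen(a), &sizeA);
    char* foldedB = fold_alloc(fold, b, strlen(b), &sizeB);
    int result = -1;
    if (foldedA != NULL && foldedB != NULL) {
        result = (sizeA == sizeB && memcmp(foldedA, foldedB, sizeA) == 0);
    }
    free(foldedA);
    free(foldedB);
    return result;
}
```

With the library linked, passing utf8casefold in place of ascii_fold_stub should work unchanged, because the function-pointer type mirrors its parameter list.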
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API uint8_t utf8isnormalized | ( | const char * | input, |
size_t | inputSize, | ||
size_t | flags, | ||
size_t * | offset | ||
) |
Check if a string is stable in the specified Unicode Normalization Form.
This function can be used as a preprocessing step, before attempting to normalize a string. Normalization is a very expensive process, so it is often cheaper to first determine if the string is unstable in the requested normalization form.
The result of the check will be YES if the string is stable and MAYBE or NO if it is unstable. If the result is MAYBE, the string does not necessarily have to be normalized.
If the string is unstable, the offset parameter is set to the offset of the first unstable code point. If the string is stable, the offset is equivalent to the length of the string in bytes.
You must specify the desired Unicode Normalization Form by using a combination of flags:
Unicode | Flags |
---|---|
Normalization Form C (NFC) | UTF8_NORMALIZE_COMPOSE |
Normalization Form KC (NFKC) | UTF8_NORMALIZE_COMPOSE + UTF8_NORMALIZE_COMPATIBILITY |
Normalization Form D (NFD) | UTF8_NORMALIZE_DECOMPOSE |
Normalization Form KD (NFKD) | UTF8_NORMALIZE_DECOMPOSE + UTF8_NORMALIZE_COMPATIBILITY |
For more information, please review Unicode Standard Annex #15 - Unicode Normalization Forms.
Example:
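A minimal sketch of how the result and the offset might be consumed. It is written against a function-pointer type that copies the utf8isnormalized signature so it builds without the library. The names check_fn, ascii_check_stub and needs_normalization are hypothetical, and the CHECK_* constants are stand-ins for UTF8_NORMALIZATION_RESULT_YES, UTF8_NORMALIZATION_RESULT_MAYBE and UTF8_NORMALIZATION_RESULT_NO whose numeric values are assumptions. The sketch makes the conservative choice of normalizing on MAYBE as well as NO.

```c
#include <stddef.h>
#include <stdint.h>

/* Same parameter list as utf8isnormalized in the signature above. */
typedef uint8_t (*check_fn)(const char* input, size_t inputSize,
                            size_t flags, size_t* offset);

/* Stand-ins for the UTF8_NORMALIZATION_RESULT_* macros.
   The values are assumptions; real code should use the library's macros. */
enum { CHECK_YES = 0, CHECK_MAYBE = 1, CHECK_NO = 2 };

/* Hypothetical stub: pure ASCII is unchanged by NFC, NFD, NFKC and NFKD,
   so report YES for it. Any non-ASCII byte is pessimistically reported
   as NO, which is stricter than a real check would be. */
static uint8_t ascii_check_stub(const char* input, size_t inputSize,
                                size_t flags, size_t* offset)
{
    size_t i;
    (void)flags; /* the stub gives the same answer for every form */
    for (i = 0; i < inputSize; ++i) {
        if ((unsigned char)input[i] >= 0x80) {
            if (offset != NULL) *offset = i;
            return CHECK_NO;
        }
    }
    if (offset != NULL) *offset = inputSize;
    return CHECK_YES;
}

/* Returns 1 if the caller should go on to normalize, 0 if the input can be
   used as-is. *firstUnstable receives the offset reported by the check. */
static int needs_normalization(check_fn check, const char* input,
                               size_t inputSize, size_t flags,
                               size_t* firstUnstable)
{
    size_t offset = 0;
    uint8_t result = check(input, inputSize, flags, &offset);
    if (firstUnstable != NULL) *firstUnstable = offset;
    return result != CHECK_YES;
}
```

With the library linked, utf8isnormalized would be passed as the check argument and one of the flag combinations from the table above as flags.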
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[in] | flags | Desired normalization form. Must be a combination of UTF8_NORMALIZE_COMPOSE, UTF8_NORMALIZE_DECOMPOSE and UTF8_NORMALIZE_COMPATIBILITY. |
[out] | offset | Offset to first unstable code point or length of input in bytes if stable. |
UTF8_NORMALIZATION_RESULT_YES | Input is stable and does not have to be normalized. |
UTF8_NORMALIZATION_RESULT_MAYBE | Input is unstable, but normalization may be skipped. |
UTF8_NORMALIZATION_RESULT_NO | Input is unstable and must be normalized. |
UTF8_API size_t utf8normalize | ( | const char * | input, |
size_t | inputSize, | ||
char * | target, | ||
size_t | targetSize, | ||
size_t | flags, | ||
int32_t * | errors | ||
) |
Normalize a string to the specified Unicode Normalization Form.
The Unicode standard defines two kinds of equivalence between characters: canonical and compatibility equivalence. Canonically equivalent characters and sequences represent the same abstract character and must be rendered with the same appearance and behavior. Compatibility equivalent characters have a weaker equivalence and may be rendered differently.
Unicode Normalization Forms are formally defined standards that can be used to test whether any two strings of characters are equivalent to each other. This equivalence may be canonical or compatibility.
The algorithm puts all combining marks into a specified order and uses the rules for decomposition and composition to transform the string into one of four Unicode Normalization Forms. A binary comparison can then be used to determine equivalence.
These are the Unicode Normalization Forms:
Form | Description |
---|---|
Normalization Form D (NFD) | Canonical decomposition |
Normalization Form C (NFC) | Canonical decomposition, followed by canonical composition |
Normalization Form KD (NFKD) | Compatibility decomposition |
Normalization Form KC (NFKC) | Compatibility decomposition, followed by canonical composition |
utf8normalize can be used to transform text into one of these forms. You must specify the desired Unicode Normalization Form by using a combination of flags:
Form | Flags |
---|---|
Normalization Form D (NFD) | UTF8_NORMALIZE_DECOMPOSE |
Normalization Form C (NFC) | UTF8_NORMALIZE_COMPOSE |
Normalization Form KD (NFKD) | UTF8_NORMALIZE_DECOMPOSE + UTF8_NORMALIZE_COMPATIBILITY |
Normalization Form KC (NFKC) | UTF8_NORMALIZE_COMPOSE + UTF8_NORMALIZE_COMPATIBILITY |
For more information, please review Unicode Standard Annex #15 - Unicode Normalization Forms.
Example:
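A small sketch of the form-to-flags table above as code. The EX_* constants are hypothetical stand-ins for UTF8_NORMALIZE_COMPOSE, UTF8_NORMALIZE_DECOMPOSE and UTF8_NORMALIZE_COMPATIBILITY; their bit values are assumptions made so the sketch builds without the library. flags_for_form is likewise a hypothetical helper, not part of the library.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins for the UTF8_NORMALIZE_* macros. The bit values are assumed;
   real code should use the macros from the library header. */
#define EX_COMPOSE       ((size_t)0x01)
#define EX_DECOMPOSE     ((size_t)0x02)
#define EX_COMPATIBILITY ((size_t)0x04)

/* Translate a form name ("NFD", "NFC", "NFKD", "NFKC") into a flag
   combination. Returns 1 and writes *flags on success, 0 if the name
   is not one of the four forms. */
static int flags_for_form(const char* form, size_t* flags)
{
    if (strcmp(form, "NFD") == 0) {
        *flags = EX_DECOMPOSE;
        return 1;
    }
    if (strcmp(form, "NFC") == 0) {
        *flags = EX_COMPOSE;
        return 1;
    }
    if (strcmp(form, "NFKD") == 0) {
        *flags = EX_DECOMPOSE | EX_COMPATIBILITY;
        return 1;
    }
    if (strcmp(form, "NFKC") == 0) {
        *flags = EX_COMPOSE | EX_COMPATIBILITY;
        return 1;
    }
    return 0;
}
```

The resulting value would be passed as the flags argument of utf8normalize, which otherwise follows the same NULL-target size query convention as the other conversion functions.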
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result, can be NULL. |
[in] | targetSize | Size of the output buffer in bytes. |
[in] | flags | Desired normalization form. Must be a combination of UTF8_NORMALIZE_COMPOSE, UTF8_NORMALIZE_DECOMPOSE and UTF8_NORMALIZE_COMPATIBILITY. |
[out] | errors | Output for errors. |
UTF8_ERR_NONE | No errors. |
UTF8_ERR_INVALID_FLAG | Invalid combination of flags was specified. |
UTF8_ERR_INVALID_DATA | Failed to decode data. |
UTF8_ERR_OVERLAPPING_PARAMETERS | Input and output buffers overlap in memory. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer size is insufficient for result. |
UTF8_API size_t utf8iscategory | ( | const char * | input, |
size_t | inputSize, | ||
size_t | flags | ||
) |
Check if the input string conforms to the category specified by the flags.
This function can be used to check if the code points in a string are part of a category. Valid flags are part of the UTF8_CATEGORY_* list of defines. The category for a code point is defined as part of the entry in UnicodeData.txt, the data file for the Unicode code point database.
Use utf8seek to seek in the input first before matching it with the category flags.
By default, the function will treat grapheme clusters as a single code point. This means that a string like:
Code point | Canonical combining class | General category | Name |
---|---|---|---|
U+0045 | 0 | Lu (Uppercase letter) | LATIN CAPITAL LETTER E |
U+0300 | 230 | Mn (Non-spacing mark) | COMBINING GRAVE ACCENT |
will match with UTF8_CATEGORY_LETTER_UPPERCASE fully, because the COMBINING GRAVE ACCENT is treated as part of the grapheme cluster. This is useful when e.g. creating a text parser, because you do not have to normalize the text first.
If this is undesired behavior, specify the UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER flag.
In order to maintain backwards compatibility with POSIX functions like isdigit and isspace, compatibility flags have been provided. Note, however, that the result is only guaranteed to be correct for code points in the Basic Latin range, between U+0000 and U+007F. Combining a compatibility flag with a regular category flag will result in undefined behavior.
Example:
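A toy model of the grapheme cluster behavior described above, not the library's implementation. match_uppercase is a hypothetical matcher that only knows ASCII uppercase letters and the combining marks U+0300 to U+036F, and EX_IGNORE_GRAPHEME_CLUSTER is a stand-in for UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER with an assumed value. The sketch also assumes the return value is the number of leading bytes that matched.

```c
#include <stddef.h>

/* Stand-in for UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER; the value is assumed. */
#define EX_IGNORE_GRAPHEME_CLUSTER ((size_t)1)

/* True if the next two bytes encode a code point in U+0300..U+036F
   (Combining Diacritical Marks): 0xCC 0x80..0xBF or 0xCD 0x80..0xAF. */
static int is_combining_mark(const unsigned char* p, size_t remaining)
{
    if (remaining < 2) return 0;
    if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF;
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF;
    return 0;
}

/* Count the leading bytes that form uppercase letters. Unless the ignore
   flag is set, combining marks that follow a letter are absorbed into its
   grapheme cluster and counted as part of the match. */
static size_t match_uppercase(const char* input, size_t inputSize,
                              size_t flags)
{
    const unsigned char* p = (const unsigned char*)input;
    size_t i = 0;
    while (i < inputSize && p[i] >= 'A' && p[i] <= 'Z') {
        ++i;
        if ((flags & EX_IGNORE_GRAPHEME_CLUSTER) == 0) {
            while (is_combining_mark(p + i, inputSize - i)) {
                i += 2;
            }
        }
    }
    return i;
}
```

For the string from the table, U+0045 U+0300 encoded as the three bytes 0x45 0xCC 0x80, the toy matcher covers all three bytes by default and only the first byte when the ignore flag is set.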
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[in] | flags | Requested category. Must be a combination of UTF8_CATEGORY_* flags or a single UTF8_CATEGORY_IS* flag. |