utf8rewind
Functions for working with UTF-8 encoded text.
Macros | |
#define | UTF8_ERR_INVALID_CHARACTER (-1) |
#define | UTF8_ERR_INVALID_DATA (-2) |
#define | UTF8_ERR_NOT_ENOUGH_SPACE (-3) |
#define | UTF8_ERR_OUT_OF_RANGE (-4) |
#define | UTF8_ERR_UNHANDLED_SURROGATE_PAIR (-5) |
#define | UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR (-6) |
#define | UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR (-7) |
#define | UTF8_WCHAR_SIZE (2) |
#define | UTF8_WCHAR_UTF16 (1) |
Typedefs | |
typedef uint32_t | unicode_t |
typedef uint16_t | ucs2_t |
typedef uint16_t | utf16_t |
Functions | |
int8_t | utf8charvalid (char encodedCharacter) |
Check if a character is valid according to UTF-8 encoding.
size_t | utf8charlen (char encodedCharacter) |
Returns the length in bytes of the encoded character.
size_t | utf8len (const char *text) |
Get the length in codepoints of a UTF-8 encoded string.
size_t | utf16toutf8 (const utf16_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a UTF-16 encoded string to a UTF-8 encoded string.
size_t | utf32toutf8 (const unicode_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a UTF-32 encoded string to a UTF-8 encoded string.
size_t | widetoutf8 (const wchar_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors) |
Convert a wide string to a UTF-8 encoded string.
size_t | utf8toutf16 (const char *input, size_t inputSize, utf16_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a UTF-16 encoded string.
size_t | utf8toutf32 (const char *input, size_t inputSize, unicode_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a UTF-32 encoded string.
size_t | utf8towide (const char *input, size_t inputSize, wchar_t *target, size_t targetSize, int32_t *errors) |
Convert a UTF-8 encoded string to a wide string.
const char * | utf8seek (const char *text, const char *textStart, off_t offset, int direction) |
Seek into a UTF-8 encoded string.
typedef uint16_t ucs2_t
UCS-2 encoded codepoint.
typedef uint32_t unicode_t
Unicode codepoint.
typedef uint16_t utf16_t
UTF-16 encoded codepoint.
size_t utf16toutf8 (const utf16_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a UTF-16 encoded string to a UTF-8 encoded string.
[in] | input | UTF-16 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for encoding. |
UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR | High surrogate pair was not matched. |
UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR | Low surrogate pair was not matched. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
UTF8_ERR_INVALID_CHARACTER | Codepoint could not be encoded. |
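The two surrogate errors listed above follow from how UTF-16 stores codepoints beyond U+FFFF as a high/low pair of code units. The following is a minimal, self-contained sketch of that pairing rule, not utf8rewind's implementation. The name sketch_utf16_decode and the SKETCH_ERR_* constants are hypothetical, introduced only for illustration, and the count is given in code units rather than in bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error values used by this sketch only. */
#define SKETCH_ERR_INVALID_DATA   (-1)
#define SKETCH_ERR_UNMATCHED_HIGH (-2)
#define SKETCH_ERR_UNMATCHED_LOW  (-3)

/* Decode one codepoint from `count` UTF-16 code units.
   Returns the number of units consumed (1 or 2) or a negative error. */
static int sketch_utf16_decode(const uint16_t *in, size_t count, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (count == 0) {
        return SKETCH_ERR_INVALID_DATA;
    }
    uint16_t unit = in[0];

    if (unit >= 0xD800 && unit <= 0xDBFF) {
        /* High surrogate: a low surrogate must follow. */
        if (count < 2 || in[1] < 0xDC00 || in[1] > 0xDFFF) {
            return SKETCH_ERR_UNMATCHED_HIGH;
        }
        *cp = 0x10000u + (((uint32_t)unit - 0xD800u) << 10)
                       + ((uint32_t)in[1] - 0xDC00u);
        return 2;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        /* Low surrogate without a preceding high surrogate. */
        return SKETCH_ERR_UNMATCHED_LOW;
    }
    *cp = unit; /* Basic Multilingual Plane: one unit, one codepoint. */
    return 1;
}
```

With the units 0xD83D 0xDE00 as input, the sketch consumes both units and yields U+1F600. Passing only the first unit reports the unmatched high surrogate.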
size_t utf32toutf8 (const unicode_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a UTF-32 encoded string to a UTF-8 encoded string.
[in] | input | UTF-32 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for encoding. |
UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR | High surrogate pair was not matched. |
UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR | Low surrogate pair was not matched. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
UTF8_ERR_INVALID_CHARACTER | Codepoint could not be encoded. |
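As a rough illustration of the encoding step, and of when a codepoint cannot be encoded or does not fit, the sketch below writes a single codepoint as UTF-8. It is a stand-in under stated assumptions rather than the library's code: sketch_utf8_encode is a hypothetical helper, it returns 0 instead of reporting an error code, and it treats surrogates and values above U+10FFFF as unencodable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8 into `out` (capacity `size` bytes).
   Returns the number of bytes written, or 0 if the codepoint is not
   encodable or the buffer is too small. */
static size_t sketch_utf8_encode(uint32_t cp, char *out, size_t size)
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0; /* outside the Unicode range, or a lone surrogate */
    }
    size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (need > size) {
        return 0; /* not enough space in the target buffer */
    }
    switch (need) {
    case 1:
        out[0] = (char)cp;
        break;
    case 2:
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        break;
    default:
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        break;
    }
    return need;
}
```

U+20AC, for instance, is written as the three bytes E2 82 AC, and a two-byte buffer is rejected for it.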
size_t utf8charlen (char encodedCharacter)
Returns the length in bytes of the encoded character.
A UTF-8 encoded codepoint must start with a special byte. This byte indicates how many bytes are used to encode the codepoint, up to a maximum of 4.
This function can be used to determine the number of bytes used to encode a codepoint.
[in] | encodedCharacter | Byte to check. |
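The mapping from lead byte to sequence length can be sketched as follows. This is a hypothetical stand-in (sketch_lead_length is not part of the library), and it assumes that a byte which cannot start a sequence yields 0. The value utf8charlen itself reports for such bytes is not stated here.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by a lead byte,
   or 0 if the byte cannot start a sequence (an assumption of this sketch). */
static size_t sketch_lead_length(char encodedCharacter)
{
    unsigned char b = (unsigned char)encodedCharacter;

    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                          /* continuation byte or invalid lead */
}
```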
int8_t utf8charvalid (char encodedCharacter)
Check if a character is valid according to UTF-8 encoding.
[in] | encodedCharacter | Byte to check. |
size_t utf8len (const char *text)
Get the length in codepoints of a UTF-8 encoded string.
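Counting codepoints rather than bytes can be sketched by skipping continuation bytes, which have the form 10xxxxxx. This is only an illustration that assumes well-formed, NUL-terminated input; sketch_utf8_count is a hypothetical name and says nothing about how utf8len treats invalid data.

```c
#include <assert.h>
#include <stddef.h>

/* Count codepoints in a well-formed, NUL-terminated UTF-8 string by
   counting every byte that is not a continuation byte. */
static size_t sketch_utf8_count(const char *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        if (((unsigned char)*text & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the six-byte string "h\xC3\xA9llo" the sketch returns 5, because the two bytes C3 A9 form a single codepoint.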
const char* utf8seek (const char *text, const char *textStart, off_t offset, int direction)
Seek into a UTF-8 encoded string.
Working with UTF-8 encoded strings can be tricky due to the nature of the variable-length encoding. Because one character no longer equals one byte, it can be difficult to skip around in a UTF-8 encoded string without decoding the codepoints.
This function provides an interface similar to fseek in order to enable skipping to another part of the string.
textStart must come before text in memory when seeking from the current or end position.
[in] | text | Input string. |
[in] | textStart | Start of input string. |
[in] | offset | Requested offset in codepoints. |
[in] | direction | Direction to seek in. |
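To show why the start of the string is needed when moving backwards, here is a small sketch of stepping over whole codepoints in either direction. It assumes well-formed, NUL-terminated UTF-8 and does not reproduce utf8seek's interface. The helpers sketch_seek_forward and sketch_seek_backward are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* True for bytes of the form 10xxxxxx. */
static int sketch_is_continuation(char c)
{
    return ((unsigned char)c & 0xC0) == 0x80;
}

/* Advance by up to `n` codepoints, stopping at the NUL terminator. */
static const char *sketch_seek_forward(const char *text, size_t n)
{
    assert(text != NULL);
    while (n > 0 && *text != '\0') {
        ++text; /* step over the lead byte */
        while (sketch_is_continuation(*text)) {
            ++text; /* then over its continuation bytes */
        }
        --n;
    }
    return text;
}

/* Move back by up to `n` codepoints, never going before `start`. */
static const char *sketch_seek_backward(const char *text, const char *start,
                                        size_t n)
{
    assert(text != NULL && start != NULL && start <= text);
    while (n > 0 && text > start) {
        --text;
        while (text > start && sketch_is_continuation(*text)) {
            --text; /* keep going until a lead byte is reached */
        }
        --n;
    }
    return text;
}
```

Without the start pointer, the backward loop would have no way to know where the buffer begins, which is the role textStart plays above.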
size_t utf8toutf16 (const char *input, size_t inputSize, utf16_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a UTF-16 encoded string.
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for decoding. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
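When a decoded codepoint lies above U+FFFF, it takes two UTF-16 code units, which is one reason the target buffer can run out of space sooner than expected. The sketch below shows that split for a single codepoint. It is illustrative only: sketch_utf16_encode is a hypothetical helper, its capacity is counted in code units rather than bytes, and it returns 0 where the library would report an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write one codepoint as UTF-16 into `out` (capacity `size` code units).
   Returns the number of units written, or 0 if the codepoint is invalid
   or does not fit. */
static size_t sketch_utf16_encode(uint32_t cp, uint16_t *out, size_t size)
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0; /* not a valid Unicode scalar value */
    }
    if (cp < 0x10000) {
        if (size < 1) {
            return 0;
        }
        out[0] = (uint16_t)cp; /* fits in a single unit */
        return 1;
    }
    if (size < 2) {
        return 0;
    }
    cp -= 0x10000; /* 20 bits remain, split 10/10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```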
size_t utf8toutf32 (const char *input, size_t inputSize, unicode_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a UTF-32 encoded string.
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for decoding. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
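The "not enough bytes for decoding" condition can be illustrated with a sketch that decodes one UTF-8 sequence into a codepoint. It is a simplified stand-in, not the library's decoder: sketch_utf8_decode is a hypothetical name, it returns 0 for any failure instead of an error code, and it does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one codepoint from at most `size` bytes of UTF-8.
   Returns the number of bytes consumed, or 0 if the input is truncated
   or malformed. Overlong forms are not detected in this sketch. */
static size_t sketch_utf8_decode(const char *in, size_t size, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (size == 0) {
        return 0;
    }
    unsigned char lead = (unsigned char)in[0];
    size_t len;
    uint32_t value;

    if (lead < 0x80) {
        len = 1; value = lead;
    } else if ((lead & 0xE0) == 0xC0) {
        len = 2; value = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        len = 3; value = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        len = 4; value = lead & 0x07;
    } else {
        return 0; /* continuation byte or invalid lead byte */
    }
    if (len > size) {
        return 0; /* input ends in the middle of a sequence */
    }
    for (size_t i = 1; i < len; ++i) {
        unsigned char c = (unsigned char)in[i];
        if ((c & 0xC0) != 0x80) {
            return 0; /* expected a continuation byte */
        }
        value = (value << 6) | (c & 0x3F);
    }
    *cp = value;
    return len;
}
```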
size_t utf8towide (const char *input, size_t inputSize, wchar_t *target, size_t targetSize, int32_t *errors)
Convert a UTF-8 encoded string to a wide string.
Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a UTF-8 encoded string as input and automatically calls the correct conversion function.
This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.
[in] | input | UTF-8 encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for decoding. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
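One way to picture the platform-dependent choice between UTF-16 and UTF-32 is a compile-time switch on the range of wchar_t. The sketch below is an assumption about how such a selection could be made, using the standard WCHAR_MAX macro. It is not taken from the library, which lists its own UTF8_WCHAR_SIZE and UTF8_WCHAR_UTF16 macros, and SKETCH_WIDE_IS_UTF16 and sketch_wide_units are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical compile-time switch: treat wchar_t as UTF-16 when it
   cannot hold values above 0xFFFF, otherwise as UTF-32. */
#if WCHAR_MAX <= 0xFFFF
#define SKETCH_WIDE_IS_UTF16 1
#else
#define SKETCH_WIDE_IS_UTF16 0
#endif

/* Number of wchar_t units needed to store one codepoint. */
static size_t sketch_wide_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (SKETCH_WIDE_IS_UTF16 && cp >= 0x10000) {
        return 2; /* stored as a surrogate pair */
    }
    return 1;
}
```

Code that sizes a wchar_t buffer from a codepoint count therefore has to allow for two units per codepoint on platforms where wide strings are UTF-16.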
size_t widetoutf8 (const wchar_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
Convert a wide string to a UTF-8 encoded string.
Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a wide string as input and automatically calls the correct conversion function.
This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.
[in] | input | Wide-encoded string. |
[in] | inputSize | Size of the input in bytes. |
[out] | target | Output buffer for the result. |
[in] | targetSize | Size of the output buffer in bytes. |
[out] | errors | Output for errors. |
UTF8_ERR_INVALID_DATA | Input does not contain enough bytes for encoding. |
UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR | High surrogate pair was not matched. |
UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR | Low surrogate pair was not matched. |
UTF8_ERR_NOT_ENOUGH_SPACE | Target buffer could not contain result. |
UTF8_ERR_INVALID_CHARACTER | Codepoint could not be encoded. |
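Since the target buffer can be too small, it helps to know a safe upper bound for its size. The sketch below computes one from the number of wide units, based on the encoding forms themselves rather than on anything the library exposes. The helper sketch_max_utf8_bytes is a hypothetical name, and it returns 0 if the multiplication would overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Upper bound on the UTF-8 bytes needed for `units` wchar_t units,
   not counting a terminator. Returns 0 if the result would overflow. */
static size_t sketch_max_utf8_bytes(size_t units)
{
    /* A 16-bit unit yields at most 3 bytes (a surrogate pair is 2 units
       for 4 bytes); a 32-bit unit yields at most 4 bytes. */
    const size_t per_unit = (sizeof(wchar_t) == 2) ? 3 : 4;

    if (units > SIZE_MAX / per_unit) {
        return 0;
    }
    return units * per_unit;
}
```

A buffer of this size is usually larger than necessary, but it can never be too small for well-formed input of that length.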