utf8rewind
Functions for working with UTF-8 encoded text.
Macros
#define | UTF8_ERR_INVALID_CHARACTER (-1) |
#define | UTF8_ERR_INVALID_DATA (-2) |
#define | UTF8_ERR_NOT_ENOUGH_SPACE (-3) |
#define | UTF8_ERR_OUT_OF_RANGE (-4) |
#define | UTF8_ERR_UNHANDLED_SURROGATE_PAIR (-5) |
#define | UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR (-6) |
#define | UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR (-7) |
Typedefs
typedef unsigned int | unicode_t |
typedef unsigned short | ucs2_t |
typedef unsigned short | utf16_t |
Functions
int | utf8charvalid (char encodedCharacter) |
Check if a character is valid according to UTF-8 encoding.
int | utf8charlen (char encodedCharacter) |
Returns the length in bytes of the encoded character.
int | utf8len (const char *text) |
Get the length in codepoints of a UTF-8 encoded string.
int | utf8encode (unicode_t codepoint, char *target, size_t targetSize) |
Encode a Unicode codepoint to UTF-8.
int | utf8convertucs2 (ucs2_t codepoint, char *target, size_t targetSize) |
Convert a UCS-2 codepoint to UTF-8.
int | wctoutf8 (const wchar_t *input, size_t inputSize, char *target, size_t targetSize) |
Convert a UTF-16 encoded string to UTF-8.
int | utf8decode (const char *text, unicode_t *result) |
Decode a UTF-8 encoded codepoint to a Unicode codepoint.
int | utf8towc (const char *input, size_t inputSize, wchar_t *target, size_t targetSize) |
Convert a UTF-8 encoded string to UTF-16.
const char * | utf8seek (const char *text, const char *textStart, off_t offset, int direction) |
Seek into a UTF-8 encoded string.
typedef unsigned short ucs2_t
UCS-2 encoded codepoint.
typedef unsigned int unicode_t
Unicode codepoint.
typedef unsigned short utf16_t
UTF-16 encoded codepoint.
int utf8charlen(char encodedCharacter)
Returns the length in bytes of the encoded character.
A UTF-8 encoded codepoint must start with a lead byte. This byte indicates how many bytes are used to encode the codepoint, up to a maximum of 6.
This function can be used to determine the number of bytes used to encode a codepoint.
encodedCharacter | Character to check. |
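The relationship between the lead byte and the sequence length can be sketched as follows. This is a minimal illustration, not the library's implementation: the name `sketch_lead_length` and the choice to return 0 for bytes that cannot start a sequence are assumptions. It follows the 6-byte maximum stated above, which matches the original UTF-8 scheme; RFC 3629 restricts valid sequences to 4 bytes.

```c
/* Hypothetical helper, not part of utf8rewind: derive the sequence length
 * from the bit pattern of the lead byte. Returning 0 for a byte that cannot
 * start a sequence is an assumption made for this sketch. */
int sketch_lead_length(char encodedCharacter)
{
    unsigned char c = (unsigned char)encodedCharacter;

    if (c < 0x80)           return 1; /* 0xxxxxxx: ASCII            */
    if ((c & 0xE0) == 0xC0) return 2; /* 110xxxxx                   */
    if ((c & 0xF0) == 0xE0) return 3; /* 1110xxxx                   */
    if ((c & 0xF8) == 0xF0) return 4; /* 11110xxx                   */
    if ((c & 0xFC) == 0xF8) return 5; /* 111110xx (pre-RFC 3629)    */
    if ((c & 0xFE) == 0xFC) return 6; /* 1111110x (pre-RFC 3629)    */
    return 0;                         /* continuation byte, 0xFE, 0xFF */
}
```

A caller could use the result to skip over a whole encoded codepoint without decoding it.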
int utf8charvalid(char encodedCharacter)
Check if a character is valid according to UTF-8 encoding.
encodedCharacter | Character to check. |
int utf8convertucs2(ucs2_t codepoint, char *target, size_t targetSize)
Convert a UCS-2 codepoint to UTF-8.
UCS-2 encoding is similar to UTF-16 encoding, except that it does not use surrogate pairs to encode values beyond U+FFFF.
This encoding was used by older versions of Microsoft Windows. Newer versions of Windows use UTF-16 instead.
If 0 (a null pointer) is passed as the target buffer, this function returns the number of bytes needed to store the codepoint.
codepoint | UCS-2 encoded codepoint. |
target | String to write the result to. |
targetSize | Number of bytes remaining in the target buffer. |
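The size-query convention described above (a null target returns the required byte count) can be sketched with a stand-in function. Everything here is an assumption for illustration: the name `sketch_ucs2_to_utf8`, the use of -3 and -5 to mirror UTF8_ERR_NOT_ENOUGH_SPACE and UTF8_ERR_UNHANDLED_SURROGATE_PAIR, and the decision to reject surrogate values are not confirmed by this page.

```c
#include <stddef.h>

/* Hypothetical stand-in for a UCS-2 to UTF-8 conversion.
 * Returns the number of bytes written (or needed, when target is NULL).
 * The negative return values mirror the error macros only by assumption. */
int sketch_ucs2_to_utf8(unsigned short codepoint, char *target, size_t targetSize)
{
    size_t needed;

    /* UCS-2 has no surrogate pairs, so these values cannot be converted. */
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
        return -5;

    needed = (codepoint < 0x80) ? 1 : (codepoint < 0x800) ? 2 : 3;

    if (target == NULL)
        return (int)needed;      /* size query: nothing is written */
    if (targetSize < needed)
        return -3;               /* buffer too small */

    if (needed == 1) {
        target[0] = (char)codepoint;
    } else if (needed == 2) {
        target[0] = (char)(0xC0 | (codepoint >> 6));
        target[1] = (char)(0x80 | (codepoint & 0x3F));
    } else {
        target[0] = (char)(0xE0 | (codepoint >> 12));
        target[1] = (char)(0x80 | ((codepoint >> 6) & 0x3F));
        target[2] = (char)(0x80 | (codepoint & 0x3F));
    }
    return (int)needed;
}
```

Calling the function once with a null target and once with a buffer of the reported size is the usual two-pass pattern for this kind of interface.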
int utf8decode(const char *text, unicode_t *result)
Decode a UTF-8 encoded codepoint to a Unicode codepoint.
The return value of this function can be added to the input pointer in order to step through the string and decode every codepoint in turn.
text | Input string. |
result | Pointer to the codepoint to write the result to. |
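The pattern of advancing the input by the decoder's return value can be sketched as below. Both functions are hypothetical stand-ins, not the library's code: `sketch_decode` assumes the return value is the number of bytes consumed (0 on malformed input) and handles sequences of at most 4 bytes, and `sketch_decode_all` is an illustrative loop built on top of it.

```c
#include <stddef.h>

/* Hypothetical decoder: returns bytes consumed, or 0 on malformed input. */
int sketch_decode(const char *text, unsigned int *result)
{
    const unsigned char *s = (const unsigned char *)text;
    unsigned int cp;
    int length, i;

    if (s[0] < 0x80)                { length = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { length = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { length = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { length = 4; cp = s[0] & 0x07; }
    else return 0;

    for (i = 1; i < length; i++) {
        /* A NUL terminator fails this test too, so truncated input
         * is rejected before reading past the end of the string. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *result = cp;
    return length;
}

/* Decode a whole NUL-terminated string into out[].
 * Returns the number of codepoints, or -1 on error or overflow. */
int sketch_decode_all(const char *text, unsigned int *out, size_t outCount)
{
    size_t n = 0;

    while (*text != '\0') {
        unsigned int cp;
        int consumed = sketch_decode(text, &cp);
        if (consumed <= 0 || n >= outCount)
            return -1;
        out[n++] = cp;
        text += consumed;   /* offset the input by the return value */
    }
    return (int)n;
}
```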
int utf8encode(unicode_t codepoint, char *target, size_t targetSize)
Encode a Unicode codepoint to UTF-8.
Unicode codepoints must be in the range U+0000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for surrogate pairs and cannot be encoded.
codepoint | Unicode codepoint. |
target | String to write the result to. |
targetSize | Number of bytes remaining in the target buffer. |
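The range rules above can be illustrated with a small stand-in encoder. The name `sketch_encode`, the single generic -1 error value, and the requirement that `target` be non-null are assumptions of this sketch; which of the library's error codes applies to each case is not stated on this page.

```c
#include <stddef.h>

/* Hypothetical encoder: returns bytes written, or -1 on any error.
 * target must be non-null in this sketch. */
int sketch_encode(unsigned int codepoint, char *target, size_t targetSize)
{
    /* Lead-byte prefixes indexed by sequence length (index 0 unused). */
    static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0 };
    size_t needed, i;

    if (codepoint > 0x10FFFF)
        return -1;                              /* beyond Unicode range */
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
        return -1;                              /* reserved for surrogates */

    needed = codepoint < 0x80    ? 1
           : codepoint < 0x800   ? 2
           : codepoint < 0x10000 ? 3 : 4;
    if (targetSize < needed)
        return -1;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = needed - 1; i > 0; i--) {
        target[i] = (char)(0x80 | (codepoint & 0x3F));
        codepoint >>= 6;
    }
    target[0] = (char)(lead[needed] | codepoint);
    return (int)needed;
}
```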
int utf8len(const char *text)
Get the length in codepoints of a UTF-8 encoded string.
text | UTF-8 encoded string. |
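The difference between length in bytes and length in codepoints can be shown with a short sketch. It assumes well-formed, NUL-terminated input and simply counts every byte that is not a continuation byte; the name `sketch_utf8_count` is hypothetical, and the library itself may validate the input more strictly.

```c
/* Hypothetical counter: number of codepoints in a well-formed UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are not counted. */
int sketch_utf8_count(const char *text)
{
    int count = 0;

    for (; *text != '\0'; text++) {
        if (((unsigned char)*text & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the six-byte string "a\xC3\xA9\xE2\x82\xAC" this counts three codepoints, whereas strlen reports six bytes.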
const char* utf8seek(const char *text, const char *textStart, off_t offset, int direction)
Seek into a UTF-8 encoded string.
Working with UTF-8 encoded strings can be tricky due to the nature of the variable-length encoding. Because one character no longer equals one byte, it can be difficult to skip around in a UTF-8 encoded string without decoding the codepoints.
This function provides an interface similar to fseek in order to enable skipping to another part of the string.
Directions:
SEEK_SET | Offset is from the start of the string. |
SEEK_CUR | Offset is from the current position of the string. |
SEEK_END | Offset is from the end of the string. |
textStart must come before text in memory when seeking from the current or end position.
text | Input string. |
textStart | Start of input string. |
offset | Requested offset in codepoints. |
direction | Direction to seek in. |
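A codepoint-based seek along these lines might be sketched as follows. This is a stand-in under several assumptions, not the library's behavior: the names `sketch_step` and `sketch_seek` are hypothetical, `long` replaces `off_t` for portability, the input is well-formed and NUL-terminated, a negative offset moves backward as with fseek, and seeking stops at the string boundaries instead of reporting an error.

```c
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>  /* strlen */

/* Move p by `offset` codepoints, never leaving [start, terminating NUL]. */
static const char *sketch_step(const char *p, const char *start, long offset)
{
    while (offset > 0 && *p != '\0') {
        p++;                                            /* leave the lead byte */
        while (((unsigned char)*p & 0xC0) == 0x80)
            p++;                                        /* skip continuation bytes */
        offset--;
    }
    while (offset < 0 && p > start) {
        p--;                                            /* step back one byte */
        while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
            p--;                                        /* back up to the lead byte */
        offset++;
    }
    return p;
}

/* Hypothetical seek with fseek-like direction constants. */
const char *sketch_seek(const char *text, const char *textStart,
                        long offset, int direction)
{
    switch (direction) {
    case SEEK_SET: return sketch_step(textStart, textStart, offset);
    case SEEK_CUR: return sketch_step(text, textStart, offset);
    case SEEK_END: return sketch_step(text + strlen(text), textStart, offset);
    default:       return text;   /* unknown direction: leave position unchanged */
    }
}
```

The backward loop is the reason `textStart` is needed: without a known start, stepping back over continuation bytes could run past the beginning of the buffer.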
int utf8towc(const char *input, size_t inputSize, wchar_t *target, size_t targetSize)
Convert a UTF-8 encoded string to UTF-16.
input | UTF-8 encoded string. |
inputSize | Size of the input in bytes. |
target | String to write the result to. |
targetSize | Number of bytes remaining in the target buffer. |
int wctoutf8(const wchar_t *input, size_t inputSize, char *target, size_t targetSize)
Convert a UTF-16 encoded string to UTF-8.
UTF-16 encoded text uses either two or four bytes per encoded codepoint. A codepoint beyond U+FFFF does not fit in a single 16-bit integer and is encoded as a high and low surrogate pair, which allows the full range of Unicode characters to be represented.
If 0 (a null pointer) is passed as the target buffer, this function returns the number of bytes needed to store the string.
input | UTF-16 encoded string. |
inputSize | Size of the input in bytes. |
target | String to write the result to. |
targetSize | Number of bytes remaining in the target buffer. |
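How a high and low surrogate combine into one codepoint can be sketched as below. The function name `sketch_utf16_next` is hypothetical, and returning -6 and -7 for unmatched surrogates only mirrors UTF8_ERR_UNMATCHED_HIGH_SURROGATE_PAIR and UTF8_ERR_UNMATCHED_LOW_SURROGATE_PAIR by assumption; this page does not say when the library returns them. Note that `inputCount` here is in 16-bit units, not bytes.

```c
#include <stddef.h>

/* Hypothetical reader: decode one codepoint from UTF-16 code units.
 * Returns the number of units consumed (1 or 2), 0 on empty input,
 * or a negative value for an unmatched surrogate (assumed mapping). */
int sketch_utf16_next(const unsigned short *input, size_t inputCount,
                      unsigned int *result)
{
    unsigned int high, low;

    if (inputCount == 0)
        return 0;

    high = input[0];
    if (high >= 0xDC00 && high <= 0xDFFF)
        return -7;                       /* low surrogate without a high one */

    if (high >= 0xD800 && high <= 0xDBFF) {
        if (inputCount < 2)
            return -6;                   /* high surrogate at end of input */
        low = input[1];
        if (low < 0xDC00 || low > 0xDFFF)
            return -6;                   /* high surrogate not followed by low */
        /* Each surrogate carries 10 bits of the value above U+FFFF. */
        *result = 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
        return 2;
    }

    *result = high;                      /* ordinary BMP codepoint */
    return 1;
}
```

The resulting codepoint can then be encoded to UTF-8 one value at a time, which is the general shape of a UTF-16 to UTF-8 conversion.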