utf8rewind  1.4.1
System library for processing UTF-8 encoded text
utf8rewind.h File Reference

Public interface for UTF-8 functions. More...


Macros

Version information

Macros used to identify the version of the library.

#define UTF8_VERSION_MAKE(_major, _minor, _bugfix)   ((_major) * 10000) + ((_minor) * 100) + (_bugfix)
 Macro for creating a version number from a major, minor and bugfix number. More...
 
#define UTF8_VERSION_MAJOR   1
 The major version number of this release. More...
 
#define UTF8_VERSION_MINOR   4
 The minor version number of this release. More...
 
#define UTF8_VERSION_BUGFIX   1
 The bugfix version number of this release. More...
 
#define UTF8_VERSION   UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX)
 The version number as an integer. More...
 
#define UTF8_VERSION_STRING   "1.4.1"
 The version number as a string. More...
 
#define UTF8_VERSION_GUARD(_major, _minor, _bugfix)   (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix))
 Check if feature is supported by the current release. More...
 
Error codes

Values returned by functions on error.

#define UTF8_ERR_NONE   (0)
 No errors. More...
 
#define UTF8_ERR_INVALID_DATA   (-1)
 Input data is invalid. More...
 
#define UTF8_ERR_INVALID_FLAG   (-2)
 Input flag is invalid. More...
 
#define UTF8_ERR_NOT_ENOUGH_SPACE   (-3)
 Not enough space in buffer to store result. More...
 
#define UTF8_ERR_OVERLAPPING_PARAMETERS   (-4)
 Input and output buffers overlap in memory. More...
 
Global configuration

Defines used for determining the global configuration of the system and your application.

#define UTF8_WCHAR_SIZE   (2)
 Specifies the size of the wchar_t type. On Windows this is 2, on POSIX systems it is 4. If not specified on the command line, the compiler tries to automatically determine the size of the wchar_t type based on the environment. More...
 
#define UTF8_WCHAR_UTF16   (1)
 The wchar_t type is treated as UTF-16 (2 bytes). More...
 
#define UTF8_API
 Calling convention for public functions. More...
 
Normalization flags

Flags used as input for utf8normalize and the result of utf8isnormalized.

#define UTF8_NORMALIZE_COMPOSE   0x00000001
 Normalize input to Normalization Form C (NFC). More...
 
#define UTF8_NORMALIZE_DECOMPOSE   0x00000002
 Normalize input to Normalization Form D (NFD). More...
 
#define UTF8_NORMALIZE_COMPATIBILITY   0x00000004
 Change Normalization Form from NFC to NFKC or from NFD to NFKD. More...
 
#define UTF8_NORMALIZATION_RESULT_YES   (0)
 Text is stable and does not have to be normalized. More...
 
#define UTF8_NORMALIZATION_RESULT_MAYBE   (1)
 Text is unstable, but normalization may be skipped. More...
 
#define UTF8_NORMALIZATION_RESULT_NO   (2)
 Text is unstable and must be normalized. More...
 
Category flags

Flags to be used with utf8iscategory, to check whether code points in a string are part of that category.

#define UTF8_CATEGORY_LETTER_UPPERCASE   0x00000001
 Uppercase letter code points, Lu in the Unicode database. More...
 
#define UTF8_CATEGORY_LETTER_LOWERCASE   0x00000002
 Lowercase letter code points, Ll in the Unicode database. More...
 
#define UTF8_CATEGORY_LETTER_TITLECASE   0x00000004
 Titlecase letter code points, Lt in the Unicode database. More...
 
#define UTF8_CATEGORY_LETTER_MODIFIER   0x00000008
 Modifier letter code points, Lm in the Unicode database. More...
 
#define UTF8_CATEGORY_LETTER_OTHER   0x00000010
 Other letter code points, Lo in the Unicode database. More...
 
#define UTF8_CATEGORY_LETTER
 Combined flag for all letter categories. More...
 
#define UTF8_CATEGORY_CASE_MAPPED
 Combined flag for all letter categories with case mapping. More...
 
#define UTF8_CATEGORY_MARK_NON_SPACING   0x00000020
 Non-spacing mark code points, Mn in the Unicode database. More...
 
#define UTF8_CATEGORY_MARK_SPACING   0x00000040
 Spacing mark code points, Mc in the Unicode database. More...
 
#define UTF8_CATEGORY_MARK_ENCLOSING   0x00000080
 Enclosing mark code points, Me in the Unicode database. More...
 
#define UTF8_CATEGORY_MARK
 Combined flag for all mark categories. More...
 
#define UTF8_CATEGORY_NUMBER_DECIMAL   0x00000100
 Decimal number code points, Nd in the Unicode database. More...
 
#define UTF8_CATEGORY_NUMBER_LETTER   0x00000200
 Letter number code points, Nl in the Unicode database. More...
 
#define UTF8_CATEGORY_NUMBER_OTHER   0x00000400
 Other number code points, No in the Unicode database. More...
 
#define UTF8_CATEGORY_NUMBER
 Combined flag for all number categories. More...
 
#define UTF8_CATEGORY_PUNCTUATION_CONNECTOR   0x00000800
 Connector punctuation category, Pc in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_DASH   0x00001000
 Dash punctuation category, Pd in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_OPEN   0x00002000
 Open punctuation category, Ps in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_CLOSE   0x00004000
 Close punctuation category, Pe in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_INITIAL   0x00008000
 Initial punctuation category, Pi in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_FINAL   0x00010000
 Final punctuation category, Pf in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION_OTHER   0x00020000
 Other punctuation category, Po in the Unicode database. More...
 
#define UTF8_CATEGORY_PUNCTUATION
 Combined flag for all punctuation categories. More...
 
#define UTF8_CATEGORY_SYMBOL_MATH   0x00040000
 Math symbol category, Sm in the Unicode database. More...
 
#define UTF8_CATEGORY_SYMBOL_CURRENCY   0x00080000
 Currency symbol category, Sc in the Unicode database. More...
 
#define UTF8_CATEGORY_SYMBOL_MODIFIER   0x00100000
 Modifier symbol category, Sk in the Unicode database. More...
 
#define UTF8_CATEGORY_SYMBOL_OTHER   0x00200000
 Other symbol category, So in the Unicode database. More...
 
#define UTF8_CATEGORY_SYMBOL
 Combined flag for all symbol categories. More...
 
#define UTF8_CATEGORY_SEPARATOR_SPACE   0x00400000
 Space separator category, Zs in the Unicode database. More...
 
#define UTF8_CATEGORY_SEPARATOR_LINE   0x00800000
 Line separator category, Zl in the Unicode database. More...
 
#define UTF8_CATEGORY_SEPARATOR_PARAGRAPH   0x01000000
 Paragraph separator category, Zp in the Unicode database. More...
 
#define UTF8_CATEGORY_SEPARATOR
 Combined flag for all separator categories. More...
 
#define UTF8_CATEGORY_CONTROL   0x02000000
 Control category, Cc in the Unicode database. More...
 
#define UTF8_CATEGORY_FORMAT   0x04000000
 Format category, Cf in the Unicode database. More...
 
#define UTF8_CATEGORY_SURROGATE   0x08000000
 Surrogate category, Cs in the Unicode database. More...
 
#define UTF8_CATEGORY_PRIVATE_USE   0x10000000
 Private use category, Co in the Unicode database. More...
 
#define UTF8_CATEGORY_UNASSIGNED   0x20000000
 Unassigned category, Cn in the Unicode database. More...
 
#define UTF8_CATEGORY_COMPATIBILITY   0x40000000
 Flag used for maintaining backwards compatibility with POSIX functions, not found in the Unicode database. More...
 
#define UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER   0x80000000
 Flag used for checking only the general category of code points at the start of a grapheme cluster. More...
 
#define UTF8_CATEGORY_ISCNTRL
 Flag used for maintaining backwards compatibility with POSIX iscntrl function. More...
 
#define UTF8_CATEGORY_ISPRINT
 Flag used for maintaining backwards compatibility with POSIX isprint function. More...
 
#define UTF8_CATEGORY_ISSPACE
 Flag used for maintaining backwards compatibility with POSIX isspace function. More...
 
#define UTF8_CATEGORY_ISBLANK
 Flag used for maintaining backwards compatibility with POSIX isblank function. More...
 
#define UTF8_CATEGORY_ISGRAPH
 Flag used for maintaining backwards compatibility with POSIX isgraph function. More...
 
#define UTF8_CATEGORY_ISPUNCT
 Flag used for maintaining backwards compatibility with POSIX ispunct function. More...
 
#define UTF8_CATEGORY_ISALNUM
 Flag used for maintaining backwards compatibility with POSIX isalnum function. More...
 
#define UTF8_CATEGORY_ISALPHA
 Flag used for maintaining backwards compatibility with POSIX isalpha function. More...
 
#define UTF8_CATEGORY_ISUPPER
 Flag used for maintaining backwards compatibility with POSIX isupper function. More...
 
#define UTF8_CATEGORY_ISLOWER
 Flag used for maintaining backwards compatibility with POSIX islower function. More...
 
#define UTF8_CATEGORY_ISDIGIT
 Flag used for maintaining backwards compatibility with POSIX isdigit function. More...
 
#define UTF8_CATEGORY_ISXDIGIT
 Flag used for maintaining backwards compatibility with POSIX isxdigit function. More...
 

Typedefs

typedef uint16_t utf16_t
 UTF-16 encoded code point. More...
 
typedef uint32_t unicode_t
 UTF-32 encoded code point. More...
 

Functions

UTF8_API size_t utf8len (const char *text)
 Get the length in code points of a UTF-8 encoded string. More...
 
UTF8_API size_t utf16toutf8 (const utf16_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert a UTF-16 encoded string to a UTF-8 encoded string. More...
 
UTF8_API size_t utf32toutf8 (const unicode_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert a UTF-32 encoded string to a UTF-8 encoded string. More...
 
UTF8_API size_t widetoutf8 (const wchar_t *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert a wide string to a UTF-8 encoded string. More...
 
UTF8_API size_t utf8toutf16 (const char *input, size_t inputSize, utf16_t *target, size_t targetSize, int32_t *errors)
 Convert a UTF-8 encoded string to a UTF-16 encoded string. More...
 
UTF8_API size_t utf8toutf32 (const char *input, size_t inputSize, unicode_t *target, size_t targetSize, int32_t *errors)
 Convert a UTF-8 encoded string to a UTF-32 encoded string. More...
 
UTF8_API size_t utf8towide (const char *input, size_t inputSize, wchar_t *target, size_t targetSize, int32_t *errors)
 Convert a UTF-8 encoded string to a wide string. More...
 
UTF8_API const char * utf8seek (const char *text, size_t textSize, const char *textStart, off_t offset, int direction)
 Seek into a UTF-8 encoded string. More...
 
UTF8_API size_t utf8toupper (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert UTF-8 encoded text to uppercase. More...
 
UTF8_API size_t utf8tolower (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert UTF-8 encoded text to lowercase. More...
 
UTF8_API size_t utf8totitle (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Convert UTF-8 encoded text to titlecase. More...
 
UTF8_API size_t utf8casefold (const char *input, size_t inputSize, char *target, size_t targetSize, int32_t *errors)
 Remove case distinction from UTF-8 encoded text. More...
 
UTF8_API uint8_t utf8isnormalized (const char *input, size_t inputSize, size_t flags, size_t *offset)
 Check if a string is stable in the specified Unicode Normalization Form. More...
 
UTF8_API size_t utf8normalize (const char *input, size_t inputSize, char *target, size_t targetSize, size_t flags, int32_t *errors)
 Normalize a string to the specified Unicode Normalization Form. More...
 
UTF8_API size_t utf8iscategory (const char *input, size_t inputSize, size_t flags)
 Check if the input string conforms to the category specified by the flags. More...
 

Detailed Description

Public interface for UTF-8 functions.

Macro Definition Documentation

#define UTF8_VERSION_MAKE (   _major,
  _minor,
  _bugfix 
)    ((_major) * 10000) + ((_minor) * 100) + (_bugfix)

Macro for creating a version number from a major, minor and bugfix number.

#define UTF8_VERSION_MAJOR   1

The major version number of this release.

#define UTF8_VERSION_MINOR   4

The minor version number of this release.

#define UTF8_VERSION_BUGFIX   1

The bugfix version number of this release.

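#define UTF8_VERSION   UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX)

The version number as an integer.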

#define UTF8_VERSION_STRING   "1.4.1"

The version number as a string.

#define UTF8_VERSION_GUARD (   _major,
  _minor,
  _bugfix 
)    (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix))

Check if feature is supported by the current release.
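
As a minimal sketch of how the guard might be used, the snippet below repeats the version macros as they are listed in this reference (so that it compiles without the header) and checks the release both at run time and in the preprocessor. The names `library_is_at_least` and `HAVE_RELEASE_1_4` are hypothetical and not part of the library; real code would include utf8rewind.h instead of redefining the macros.

```c
#include <assert.h>

/* Macros repeated from this reference so the sketch compiles on its own;
   real code would get them from utf8rewind.h. */
#define UTF8_VERSION_MAKE(_major, _minor, _bugfix) \
    ((_major) * 10000) + ((_minor) * 100) + (_bugfix)
#define UTF8_VERSION_MAJOR  1
#define UTF8_VERSION_MINOR  4
#define UTF8_VERSION_BUGFIX 1
#define UTF8_VERSION \
    UTF8_VERSION_MAKE(UTF8_VERSION_MAJOR, UTF8_VERSION_MINOR, UTF8_VERSION_BUGFIX)
#define UTF8_VERSION_GUARD(_major, _minor, _bugfix) \
    (UTF8_VERSION >= UTF8_VERSION_MAKE(_major, _minor, _bugfix))

/* Hypothetical helper: 1 when the library is at least the given release. */
static int library_is_at_least(int major, int minor, int bugfix)
{
    assert(major >= 0 && minor >= 0 && bugfix >= 0);
    return UTF8_VERSION_GUARD(major, minor, bugfix);
}

/* The guard is a constant expression, so it also works in #if. */
#if UTF8_VERSION_GUARD(1, 4, 0)
#define HAVE_RELEASE_1_4 1   /* hypothetical feature switch */
#else
#define HAVE_RELEASE_1_4 0
#endif
```

Note that UTF8_VERSION_MAKE, as listed, has no outer parentheses, so it is safest to wrap it in parentheses when embedding it in a larger arithmetic expression.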

#define UTF8_ERR_NONE   (0)

No errors.

#define UTF8_ERR_INVALID_DATA   (-1)

Input data is invalid.

#define UTF8_ERR_INVALID_FLAG   (-2)

Input flag is invalid.

#define UTF8_ERR_NOT_ENOUGH_SPACE   (-3)

Not enough space in buffer to store result.

#define UTF8_ERR_OVERLAPPING_PARAMETERS   (-4)

Input and output buffers overlap in memory.
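
The examples in this reference check the returned size together with the `errors` output. The sketch below wraps that check and maps each code to the description given above. Both helper names (`conversion_succeeded`, `utf8_error_description`) are hypothetical; the error macros are repeated from this reference only so the snippet compiles without the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes repeated from this reference; real code would include utf8rewind.h. */
#define UTF8_ERR_NONE                   (0)
#define UTF8_ERR_INVALID_DATA           (-1)
#define UTF8_ERR_INVALID_FLAG           (-2)
#define UTF8_ERR_NOT_ENOUGH_SPACE       (-3)
#define UTF8_ERR_OVERLAPPING_PARAMETERS (-4)

/* Hypothetical helper: 1 when a conversion call produced output without errors. */
static int conversion_succeeded(size_t converted_size, const int32_t *errors)
{
    assert(errors != NULL); /* the caller is expected to have passed &errors */
    return converted_size > 0 && *errors == UTF8_ERR_NONE;
}

/* Hypothetical helper: the description this reference gives for each code. */
static const char *utf8_error_description(int32_t error)
{
    switch (error) {
    case UTF8_ERR_NONE:                   return "No errors.";
    case UTF8_ERR_INVALID_DATA:           return "Input data is invalid.";
    case UTF8_ERR_INVALID_FLAG:           return "Input flag is invalid.";
    case UTF8_ERR_NOT_ENOUGH_SPACE:       return "Not enough space in buffer to store result.";
    case UTF8_ERR_OVERLAPPING_PARAMETERS: return "Input and output buffers overlap in memory.";
    default:                              return "Unknown error code.";
    }
}
```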

#define UTF8_WCHAR_SIZE   (2)

Specifies the size of the wchar_t type. On Windows this is 2, on POSIX systems it is 4. If not specified on the command line, the compiler tries to automatically determine the size of the wchar_t type based on the environment.

#define UTF8_WCHAR_UTF16   (1)

The wchar_t type is treated as UTF-16 (2 bytes).
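
One way such automatic detection can be done is with the standard `WCHAR_MAX` limit. The sketch below is an assumption about the general technique, not the library's actual detection logic, and the `MY_` macros and `wide_encoding_name` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>  /* WCHAR_MAX */
#include <wchar.h>   /* wchar_t */

/* Hypothetical equivalents of UTF8_WCHAR_SIZE and UTF8_WCHAR_UTF16,
   derived from the largest value wchar_t can hold. */
#if WCHAR_MAX > 0xFFFF
#define MY_WCHAR_SIZE (4)
#else
#define MY_WCHAR_SIZE (2)
#endif
#define MY_WCHAR_UTF16 (MY_WCHAR_SIZE == 2)

/* Names the encoding a wide string is treated as on this platform. */
static const char *wide_encoding_name(void)
{
    /* Sanity check: the detected size should match the real type. */
    assert(sizeof(wchar_t) == MY_WCHAR_SIZE);
    return MY_WCHAR_UTF16 ? "UTF-16" : "UTF-32";
}
```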

#define UTF8_API

Calling convention for public functions.

#define UTF8_NORMALIZE_COMPOSE   0x00000001

Normalize input to Normalization Form C (NFC).

#define UTF8_NORMALIZE_DECOMPOSE   0x00000002

Normalize input to Normalization Form D (NFD).

#define UTF8_NORMALIZE_COMPATIBILITY   0x00000004

Change Normalization Form from NFC to NFKC or from NFD to NFKD.
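
The three flags above combine into four normalization forms. A minimal sketch of that mapping follows; the helper names are hypothetical, the flag values are repeated from this reference so the snippet compiles on its own, and treating `COMPOSE | DECOMPOSE` as selecting no form is an assumption about how such a combination should be handled.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values repeated from this reference; real code would include utf8rewind.h. */
#define UTF8_NORMALIZE_COMPOSE       0x00000001
#define UTF8_NORMALIZE_DECOMPOSE     0x00000002
#define UTF8_NORMALIZE_COMPATIBILITY 0x00000004

/* Hypothetical helper: turns NFC into NFKC, or NFD into NFKD. */
static size_t to_compatibility_form(size_t flags)
{
    /* Only a canonical form (NFC or NFD) can be switched. */
    assert(flags == UTF8_NORMALIZE_COMPOSE || flags == UTF8_NORMALIZE_DECOMPOSE);
    return flags | UTF8_NORMALIZE_COMPATIBILITY;
}

/* Hypothetical helper: name of the form a flag combination selects,
   or NULL when the combination does not select exactly one form. */
static const char *normalization_form_name(size_t flags)
{
    switch (flags) {
    case UTF8_NORMALIZE_COMPOSE:
        return "NFC";
    case UTF8_NORMALIZE_DECOMPOSE:
        return "NFD";
    case UTF8_NORMALIZE_COMPOSE | UTF8_NORMALIZE_COMPATIBILITY:
        return "NFKC";
    case UTF8_NORMALIZE_DECOMPOSE | UTF8_NORMALIZE_COMPATIBILITY:
        return "NFKD";
    default:
        return NULL;
    }
}
```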

#define UTF8_NORMALIZATION_RESULT_YES   (0)

Text is stable and does not have to be normalized.

#define UTF8_NORMALIZATION_RESULT_MAYBE   (1)

Text is unstable, but normalization may be skipped.

#define UTF8_NORMALIZATION_RESULT_NO   (2)

Text is unstable and must be normalized.

#define UTF8_CATEGORY_LETTER_UPPERCASE   0x00000001

Uppercase letter code points, Lu in the Unicode database.

#define UTF8_CATEGORY_LETTER_LOWERCASE   0x00000002

Lowercase letter code points, Ll in the Unicode database.

#define UTF8_CATEGORY_LETTER_TITLECASE   0x00000004

Titlecase letter code points, Lt in the Unicode database.

#define UTF8_CATEGORY_LETTER_MODIFIER   0x00000008

Modifier letter code points, Lm in the Unicode database.

#define UTF8_CATEGORY_LETTER_OTHER   0x00000010

Other letter code points, Lo in the Unicode database.

#define UTF8_CATEGORY_LETTER
Value:
(UTF8_CATEGORY_LETTER_UPPERCASE | UTF8_CATEGORY_LETTER_LOWERCASE | \
 UTF8_CATEGORY_LETTER_TITLECASE | UTF8_CATEGORY_LETTER_MODIFIER | \
 UTF8_CATEGORY_LETTER_OTHER)

Combined flag for all letter categories.
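
Because each general category occupies its own bit, a combined flag can be tested with a single bitwise AND. The sketch below illustrates this with the five letter categories. `MY_CATEGORY_LETTER` and `category_matches` are hypothetical names, and defining the combined flag as the bitwise OR of its members is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Values repeated from this reference; real code would include utf8rewind.h. */
#define UTF8_CATEGORY_LETTER_UPPERCASE  0x00000001
#define UTF8_CATEGORY_LETTER_LOWERCASE  0x00000002
#define UTF8_CATEGORY_LETTER_TITLECASE  0x00000004
#define UTF8_CATEGORY_LETTER_MODIFIER   0x00000008
#define UTF8_CATEGORY_LETTER_OTHER      0x00000010
#define UTF8_CATEGORY_MARK_NON_SPACING  0x00000020

/* Assumed composition of the combined letter flag. */
#define MY_CATEGORY_LETTER \
    (UTF8_CATEGORY_LETTER_UPPERCASE | UTF8_CATEGORY_LETTER_LOWERCASE | \
     UTF8_CATEGORY_LETTER_TITLECASE | UTF8_CATEGORY_LETTER_MODIFIER | \
     UTF8_CATEGORY_LETTER_OTHER)

/* Hypothetical helper: 1 when a code point's general category is
   included in the requested flags. */
static int category_matches(size_t general_category, size_t flags)
{
    /* A code point has exactly one general category, so exactly one bit is set. */
    assert(general_category != 0 &&
           (general_category & (general_category - 1)) == 0);
    return (general_category & flags) != 0;
}
```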

#define UTF8_CATEGORY_CASE_MAPPED
Value:
(UTF8_CATEGORY_LETTER_UPPERCASE | UTF8_CATEGORY_LETTER_LOWERCASE | \
 UTF8_CATEGORY_LETTER_TITLECASE)

Combined flag for all letter categories with case mapping.

#define UTF8_CATEGORY_MARK_NON_SPACING   0x00000020

Non-spacing mark code points, Mn in the Unicode database.

#define UTF8_CATEGORY_MARK_SPACING   0x00000040

Spacing mark code points, Mc in the Unicode database.

#define UTF8_CATEGORY_MARK_ENCLOSING   0x00000080

Enclosing mark code points, Me in the Unicode database.

#define UTF8_CATEGORY_MARK
Value:
(UTF8_CATEGORY_MARK_NON_SPACING | UTF8_CATEGORY_MARK_SPACING | \
 UTF8_CATEGORY_MARK_ENCLOSING)

Combined flag for all mark categories.

#define UTF8_CATEGORY_NUMBER_DECIMAL   0x00000100

Decimal number code points, Nd in the Unicode database.

#define UTF8_CATEGORY_NUMBER_LETTER   0x00000200

Letter number code points, Nl in the Unicode database.

#define UTF8_CATEGORY_NUMBER_OTHER   0x00000400

Other number code points, No in the Unicode database.

#define UTF8_CATEGORY_NUMBER
Value:
(UTF8_CATEGORY_NUMBER_DECIMAL | UTF8_CATEGORY_NUMBER_LETTER | \
 UTF8_CATEGORY_NUMBER_OTHER)

Combined flag for all number categories.

#define UTF8_CATEGORY_PUNCTUATION_CONNECTOR   0x00000800

Connector punctuation category, Pc in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_DASH   0x00001000

Dash punctuation category, Pd in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_OPEN   0x00002000

Open punctuation category, Ps in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_CLOSE   0x00004000

Close punctuation category, Pe in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_INITIAL   0x00008000

Initial punctuation category, Pi in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_FINAL   0x00010000

Final punctuation category, Pf in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION_OTHER   0x00020000

Other punctuation category, Po in the Unicode database.

#define UTF8_CATEGORY_PUNCTUATION
Value:
(UTF8_CATEGORY_PUNCTUATION_CONNECTOR | UTF8_CATEGORY_PUNCTUATION_DASH | \
 UTF8_CATEGORY_PUNCTUATION_OPEN | UTF8_CATEGORY_PUNCTUATION_CLOSE | \
 UTF8_CATEGORY_PUNCTUATION_INITIAL | UTF8_CATEGORY_PUNCTUATION_FINAL | \
 UTF8_CATEGORY_PUNCTUATION_OTHER)

Combined flag for all punctuation categories.

#define UTF8_CATEGORY_SYMBOL_MATH   0x00040000

Math symbol category, Sm in the Unicode database.

#define UTF8_CATEGORY_SYMBOL_CURRENCY   0x00080000

Currency symbol category, Sc in the Unicode database.

#define UTF8_CATEGORY_SYMBOL_MODIFIER   0x00100000

Modifier symbol category, Sk in the Unicode database.

#define UTF8_CATEGORY_SYMBOL_OTHER   0x00200000

Other symbol category, So in the Unicode database.

#define UTF8_CATEGORY_SYMBOL
Value:
(UTF8_CATEGORY_SYMBOL_MATH | UTF8_CATEGORY_SYMBOL_CURRENCY | \
 UTF8_CATEGORY_SYMBOL_MODIFIER | UTF8_CATEGORY_SYMBOL_OTHER)

Combined flag for all symbol categories.

#define UTF8_CATEGORY_SEPARATOR_SPACE   0x00400000

Space separator category, Zs in the Unicode database.

#define UTF8_CATEGORY_SEPARATOR_LINE   0x00800000

Line separator category, Zl in the Unicode database.

#define UTF8_CATEGORY_SEPARATOR_PARAGRAPH   0x01000000

Paragraph separator category, Zp in the Unicode database.

#define UTF8_CATEGORY_SEPARATOR
Value:
(UTF8_CATEGORY_SEPARATOR_SPACE | UTF8_CATEGORY_SEPARATOR_LINE | \
 UTF8_CATEGORY_SEPARATOR_PARAGRAPH)

Combined flag for all separator categories.

#define UTF8_CATEGORY_CONTROL   0x02000000

Control category, Cc in the Unicode database.

#define UTF8_CATEGORY_FORMAT   0x04000000

Format category, Cf in the Unicode database.

#define UTF8_CATEGORY_SURROGATE   0x08000000

Surrogate category, Cs in the Unicode database.

#define UTF8_CATEGORY_PRIVATE_USE   0x10000000

Private use category, Co in the Unicode database.

#define UTF8_CATEGORY_UNASSIGNED   0x20000000

Unassigned category, Cn in the Unicode database.

#define UTF8_CATEGORY_COMPATIBILITY   0x40000000

Flag used for maintaining backwards compatibility with POSIX functions, not found in the Unicode database.

#define UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER   0x80000000

Flag used for checking only the general category of code points at the start of a grapheme cluster.

#define UTF8_CATEGORY_ISCNTRL
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_CONTROL)

Flag used for maintaining backwards compatibility with POSIX iscntrl function.

#define UTF8_CATEGORY_ISPRINT
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_LETTER | UTF8_CATEGORY_NUMBER | \
 UTF8_CATEGORY_PUNCTUATION | UTF8_CATEGORY_SYMBOL | \
 UTF8_CATEGORY_SEPARATOR)

Flag used for maintaining backwards compatibility with POSIX isprint function.

#define UTF8_CATEGORY_ISSPACE
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_SEPARATOR_SPACE)

Flag used for maintaining backwards compatibility with POSIX isspace function.

#define UTF8_CATEGORY_ISBLANK
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_SEPARATOR_SPACE | UTF8_CATEGORY_PRIVATE_USE)

Flag used for maintaining backwards compatibility with POSIX isblank function.

#define UTF8_CATEGORY_ISGRAPH
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_LETTER | UTF8_CATEGORY_NUMBER | \
 UTF8_CATEGORY_PUNCTUATION | UTF8_CATEGORY_SYMBOL)

Flag used for maintaining backwards compatibility with POSIX isgraph function.

#define UTF8_CATEGORY_ISPUNCT
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_PUNCTUATION | UTF8_CATEGORY_SYMBOL)

Flag used for maintaining backwards compatibility with POSIX ispunct function.

#define UTF8_CATEGORY_ISALNUM
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_LETTER | UTF8_CATEGORY_NUMBER)

Flag used for maintaining backwards compatibility with POSIX isalnum function.

#define UTF8_CATEGORY_ISALPHA
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_LETTER)

Flag used for maintaining backwards compatibility with POSIX isalpha function.

#define UTF8_CATEGORY_ISUPPER
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_LETTER_UPPERCASE)

Flag used for maintaining backwards compatibility with POSIX isupper function.

#define UTF8_CATEGORY_ISLOWER
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_LETTER_LOWERCASE)

Flag used for maintaining backwards compatibility with POSIX islower function.

#define UTF8_CATEGORY_ISDIGIT
Value:
(UTF8_CATEGORY_COMPATIBILITY | UTF8_CATEGORY_NUMBER)

Flag used for maintaining backwards compatibility with POSIX isdigit function.

#define UTF8_CATEGORY_ISXDIGIT
Value:
(UTF8_CATEGORY_COMPATIBILITY | \
 UTF8_CATEGORY_NUMBER | UTF8_CATEGORY_PRIVATE_USE)

Flag used for maintaining backwards compatibility with POSIX isxdigit function.

Typedef Documentation

typedef uint16_t utf16_t

UTF-16 encoded code point.

typedef uint32_t unicode_t

UTF-32 encoded code point.

Function Documentation

UTF8_API size_t utf8len ( const char *  text)

Get the length in code points of a UTF-8 encoded string.

Example:

uint8_t CheckPassword(const char* password)
{
    size_t length = utf8len(password);

    return (length == utf8len("hunter2"));
}
Parameters
[in]   text  UTF-8 encoded string.
Returns
Length in code points.
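
To make "length in code points" concrete, here is a minimal sketch that counts code points by skipping UTF-8 continuation bytes. `count_code_points` is a hypothetical stand-in, not the library's implementation: it assumes valid, NUL-terminated UTF-8, and utf8len may treat malformed input differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for utf8len. Every code point starts with exactly
   one byte that is NOT a continuation byte (bit pattern 10xxxxxx), so
   counting those bytes counts code points in well-formed input. */
static size_t count_code_points(const char *text)
{
    size_t count = 0;

    assert(text != NULL);
    for (; *text != '\0'; ++text) {
        if (((unsigned char)*text & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For ASCII text the result equals `strlen`; for text with multi-byte sequences it is smaller than the byte count.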
UTF8_API size_t utf16toutf8 ( const utf16_t *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a UTF-16 encoded string to a UTF-8 encoded string.

Note
This function should only be called directly if you are positive that you are working with UTF-16 encoded text. If you're working with wide strings, take a look at widetoutf8 instead.

Example:

uint8_t Player_SetNameUtf16(const utf16_t* name, size_t nameSize)
{
    char buffer[256];
    size_t buffer_size = 255; /* leave room for the terminating zero */
    size_t converted_size;
    int32_t errors;

    converted_size = utf16toutf8(name, nameSize, buffer, buffer_size, &errors);
    if (converted_size == 0 ||
        errors != UTF8_ERR_NONE)
    {
        return 0;
    }

    buffer[converted_size] = 0;

    return Player_SetName(buffer);
}
Parameters
[in]   input       UTF-16 encoded string.
[in]   inputSize   Size of the input in bytes.
[out]  target      Output buffer for the result, can be NULL.
[in]   targetSize  Size of the output buffer in bytes.
[out]  errors      Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE                    No errors.
UTF8_ERR_INVALID_DATA            Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS  Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE        Target buffer size is insufficient for result.
See also
utf32toutf8
widetoutf8
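
Part of decoding UTF-16 is recombining surrogate pairs into a single code point before it can be written out as UTF-8. The sketch below shows that step only. The helper names are hypothetical and the code is not taken from the library; the two typedefs repeat the ones documented in this reference.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t utf16_t;   /* as documented in the Typedefs section */
typedef uint32_t unicode_t;

/* High (leading) surrogates occupy U+D800..U+DBFF. */
static int is_high_surrogate(utf16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
static int is_low_surrogate(utf16_t unit)
{
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/* Hypothetical helper: combines a surrogate pair into a code point in the
   range U+10000..U+10FFFF. Each surrogate contributes 10 bits. */
static unicode_t combine_surrogates(utf16_t high, utf16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u + (((unicode_t)(high - 0xD800) << 10) |
                       (unicode_t)(low - 0xDC00));
}
```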
UTF8_API size_t utf32toutf8 ( const unicode_t *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a UTF-32 encoded string to a UTF-8 encoded string.

Note
This function should only be called directly if you are positive that you are working with UTF-32 encoded text. If you're working with wide strings, take a look at widetoutf8 instead.

Example:

uint8_t Database_ExecuteQuery_Unicode(const unicode_t* query, size_t querySize)
{
    char* converted = NULL;
    size_t converted_size;
    uint8_t result = 0;
    int32_t errors;

    /* First pass: query the required size without writing output. */
    converted_size = utf32toutf8(query, querySize, NULL, 0, &errors);
    if (converted_size == 0 ||
        errors != UTF8_ERR_NONE)
    {
        goto cleanup;
    }

    converted = (char*)malloc(converted_size + 1);
    if (converted == NULL)
    {
        goto cleanup;
    }

    /* Second pass: convert into the allocated buffer. */
    utf32toutf8(query, querySize, converted, converted_size, NULL);
    converted[converted_size] = 0;

    result = Database_ExecuteQuery(converted);

cleanup:
    if (converted != NULL)
    {
        free(converted);
        converted = NULL;
    }

    return result;
}
Parameters
[in]   input       UTF-32 encoded string.
[in]   inputSize   Size of the input in bytes.
[out]  target      Output buffer for the result, can be NULL.
[in]   targetSize  Size of the output buffer in bytes.
[out]  errors      Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE                    No errors.
UTF8_ERR_INVALID_DATA            Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS  Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE        Target buffer size is insufficient for result.
See also
utf16toutf8
widetoutf8
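
For readers unfamiliar with the encoding itself, the sketch below shows how a single code point maps to one to four UTF-8 bytes. `encode_utf8` is a hypothetical helper, not the library's implementation. Returning 0 for surrogates and for values above U+10FFFF is a choice made for this sketch; the library may handle such input differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t unicode_t; /* as documented in the Typedefs section */

/* Hypothetical helper: writes the UTF-8 encoding of `cp` to `out` and
   returns the number of bytes written, or 0 if `cp` is not encodable. */
static size_t encode_utf8(unicode_t cp, char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {     /* surrogates are not code points */
        return 0;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}
```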
UTF8_API size_t widetoutf8 ( const wchar_t *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a wide string to a UTF-8 encoded string.

Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a wide string as input and automatically calls the correct conversion function.

This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.

Example:

texture_t Texture_Load_Wide(const wchar_t* input)
{
    char* converted = NULL;
    size_t converted_size;
    size_t input_size = wcslen(input) * sizeof(wchar_t);
    texture_t result = NULL;
    int32_t errors;

    /* First pass: query the required size without writing output. */
    converted_size = widetoutf8(input, input_size, NULL, 0, &errors);
    if (converted_size == 0 ||
        errors != UTF8_ERR_NONE)
    {
        goto cleanup;
    }

    converted = (char*)malloc(converted_size + 1);
    if (converted == NULL)
    {
        goto cleanup;
    }

    widetoutf8(input, input_size, converted, converted_size, NULL);
    /* converted_size is a byte count and converted is a char buffer,
       so the terminator goes at converted_size itself. */
    converted[converted_size] = 0;

    result = Texture_Load(converted);

cleanup:
    if (converted != NULL)
    {
        free(converted);
        converted = NULL;
    }

    return result;
}
Parameters
[in]   input       Wide-encoded string.
[in]   inputSize   Size of the input in bytes.
[out]  target      Output buffer for the result, can be NULL.
[in]   targetSize  Size of the output buffer in bytes.
[out]  errors      Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE                    No errors.
UTF8_ERR_INVALID_DATA            Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS  Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE        Target buffer size is insufficient for result.
See also
utf8towide
utf16toutf8
utf32toutf8
UTF8_API size_t utf8toutf16 ( const char *  input,
size_t  inputSize,
utf16_t *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a UTF-8 encoded string to a UTF-16 encoded string.

Note
This function should only be called directly if you are positive that you must convert to UTF-16, independent of platform. If you're working with wide strings, take a look at utf8towide instead.

Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.
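
The replacement behavior described above can be illustrated with a small stand-alone decoder. This is only a sketch under stated assumptions, not the library's implementation: sketch_decode and SKETCH_REPLACEMENT are hypothetical names, the caller is assumed to pass at least one readable byte, and the number of bytes consumed on an error is a choice made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_REPLACEMENT 0xFFFDu

/* Hypothetical helper, not part of utf8rewind. Decodes one code point from
   `in` (size must be at least 1) and returns the number of bytes consumed.
   Illegal lead bytes, missing continuation bytes, overlong encodings,
   surrogates and values above U+10FFFF all yield U+FFFD. */
size_t sketch_decode(const unsigned char* in, size_t size, uint32_t* out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes. */
    static const uint32_t minimum[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t length;
    size_t i;
    uint32_t cp;

    if (in[0] < 0x80)
    {
        *out = in[0];
        return 1;
    }
    else if ((in[0] & 0xE0) == 0xC0) { length = 2; cp = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { length = 3; cp = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { length = 4; cp = in[0] & 0x07; }
    else
    {
        /* Stray continuation byte or a lead byte for 5+ byte sequences. */
        *out = SKETCH_REPLACEMENT;
        return 1;
    }

    for (i = 1; i < length; ++i)
    {
        if (i >= size || (in[i] & 0xC0) != 0x80)
        {
            /* Missing byte: report the bytes read so far as one error. */
            *out = SKETCH_REPLACEMENT;
            return i;
        }
        cp = (cp << 6) | (uint32_t)(in[i] & 0x3F);
    }

    if (cp < minimum[length] ||
        cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
    {
        cp = SKETCH_REPLACEMENT;
    }

    *out = cp;
    return length;
}
```

For instance, the overlong sequence C0 AF (a two-byte encoding of U+002F) decodes to U+FFFD in this sketch, while E2 82 AC decodes to U+20AC.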

Example:

void Font_DrawText(int x, int y, const char* text)
{
utf16_t buffer[256];
size_t buffer_size = 255 * sizeof(utf16_t);
int32_t errors;
size_t converted_size = utf8toutf16(text, strlen(text), buffer, buffer_size, &errors);
if (converted_size > 0 &&
errors == UTF8_ERR_NONE)
{
Legacy_DrawText(g_FontCurrent, x, y, (unsigned short*)buffer, converted_size / sizeof(utf16_t));
}
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8towide
utf8toutf32
UTF8_API size_t utf8toutf32 ( const char *  input,
size_t  inputSize,
unicode_t *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a UTF-8 encoded string to a UTF-32 encoded string.

Note
This function should only be called directly if you are positive that you must convert to UTF-32, independent of platform. If you're working with wide strings, take a look at utf8towide instead.

Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.

Example:

void TextField_AddCharacter(const char* encoded)
{
unicode_t code_point = 0;
int32_t errors;
utf8toutf32(encoded, strlen(encoded), &code_point, sizeof(unicode_t), &errors);
if (errors == UTF8_ERR_NONE)
{
TextField_AddCodePoint(code_point);
}
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8towide
utf8toutf16
UTF8_API size_t utf8towide ( const char *  input,
size_t  inputSize,
wchar_t *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert a UTF-8 encoded string to a wide string.

Depending on the platform, wide strings are either UTF-16 or UTF-32 encoded. This function takes a UTF-8 encoded string as input and automatically calls the correct conversion function.

This allows for a cross-platform treatment of wide text and is preferable to using the UTF-16 or UTF-32 versions directly.

Erroneous byte sequences such as missing or illegal bytes or overlong encoding of code points (e.g. using five bytes to encode a sequence that can be represented by two bytes) are converted to the replacement character U+FFFD.

Note
Code points outside the Basic Multilingual Plane (BMP) are converted to surrogate pairs when using UTF-16. This means that strings containing characters outside the BMP converted on a platform with UTF-32 wide strings are not compatible with platforms with UTF-16 wide strings.
Hence, it is preferable to store all data as UTF-8 and only convert to wide strings when required by a third-party interface.
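
The surrogate pair arithmetic mentioned in the note can be sketched as follows. This is a stand-alone illustration, not the library's code: sketch_encode_utf16 is a hypothetical helper, and it assumes the input is a valid code point (at most U+10FFFF and not itself a surrogate).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8rewind. Writes the UTF-16 code
   units for `code_point` to `out` and returns how many were written:
   1 for code points inside the BMP, 2 (a surrogate pair) otherwise. */
size_t sketch_encode_utf16(uint32_t code_point, uint16_t out[2])
{
    if (code_point < 0x10000)
    {
        out[0] = (uint16_t)code_point;
        return 1;
    }

    /* Split the 20 bits above U+FFFF into a high and a low half. */
    code_point -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (code_point >> 10));
    out[1] = (uint16_t)(0xDC00 + (code_point & 0x3FF));
    return 2;
}
```

U+1F600 becomes the pair D83D DE00, which occupies 4 bytes as UTF-16 and 4 bytes as UTF-32, but with different contents.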

Example:

void Window_SetTitle(void* windowHandle, const char* text)
{
size_t input_size = strlen(text);
wchar_t* converted = NULL;
size_t converted_size;
int32_t errors;
converted_size = utf8towide(text, input_size, NULL, 0, &errors);
if (converted_size == 0 ||
errors != UTF8_ERR_NONE)
{
goto cleanup;
}
converted = (wchar_t*)malloc(converted_size + sizeof(wchar_t));
utf8towide(text, input_size, converted, converted_size, NULL);
converted[converted_size / sizeof(wchar_t)] = 0;
SetWindowTextW((HWND)windowHandle, converted);
cleanup:
if (converted != NULL)
{
free(converted);
converted = NULL;
}
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
widetoutf8
utf8toutf16
utf8toutf32
UTF8_API const char* utf8seek ( const char *  text,
size_t  textSize,
const char *  textStart,
off_t  offset,
int  direction 
)

Seek into a UTF-8 encoded string.

Working with UTF-8 encoded strings can be tricky due to the nature of the variable-length encoding. Because one character no longer equals one byte, it can be difficult to skip around in a UTF-8 encoded string without decoding the code points.

This function provides an interface similar to fseek in order to enable skipping to another part of the string.
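
To show why skipping by code points differs from skipping by bytes, here is a minimal forward-only sketch. It is not how utf8seek is implemented: sketch_seek_forward is a hypothetical helper that assumes well-formed UTF-8 and simply treats every byte of the form 10xxxxxx as a continuation of the preceding code point.

```c
#include <stddef.h>

/* Hypothetical helper, not part of utf8rewind. Advances `count` code
   points from `text`, never moving past `text + size`. Assumes the input
   is well-formed UTF-8. */
const char* sketch_seek_forward(const char* text, size_t size, size_t count)
{
    const char* end = text + size;

    while (count > 0 && text < end)
    {
        /* Step over the lead byte. */
        ++text;

        /* Step over any continuation bytes (10xxxxxx). */
        while (text < end && ((unsigned char)*text & 0xC0) == 0x80)
        {
            ++text;
        }

        --count;
    }

    return text;
}
```

In the string "h\xC3\xA9llo", seeking two code points forward lands three bytes in, because U+00E9 occupies two bytes.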

Note
textStart must come before text in memory when seeking from the current or end position.

Example:

const char* text = "Press \xE0\x80\x13 to continue.";
char fixed[1024];
const char* commandStart;
const char* commandEnd;
memset(fixed, 0, sizeof(fixed));
commandStart = strstr(text, "\xE0\x80\x13");
if (commandStart == 0)
{
return 0;
}
strncpy(fixed, text, commandStart - text);
strcat(fixed, "ENTER");
commandEnd = utf8seek(commandStart, strlen(commandStart), text, 1, SEEK_CUR);
if (commandEnd != commandStart)
{
strcat(fixed, commandEnd);
}
Parameters
[in] text: Input string.
[in] textSize: Size of input string in bytes.
[in] textStart: Start of input string.
[in] offset: Requested offset in code points.
[in] direction: Direction to seek in.
  • SEEK_SET Offset is from the start of the string.
  • SEEK_CUR Offset is from the current position of the string.
  • SEEK_END Offset is from the end of the string.
Returns
Pointer to offset string or no change on error.
See also
utf8iscategory
UTF8_API size_t utf8toupper ( const char *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert UTF-8 encoded text to uppercase.

This function allows conversion of UTF-8 encoded strings to uppercase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.

Although most code points can be converted in-place, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0053" (LATIN CAPITAL LETTER S and LATIN CAPITAL LETTER S) when converted to uppercase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.

Only a handful of scripts make a distinction between upper and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Note
Case mapping is not reversible. That is, toUpper(toLower(x)) != toLower(toUpper(x)).
This function checks the (thread-local) system locale in order to support languages with exceptional behavior on specific code points. Unfortunately, no cross-platform way of setting and retrieving the system locale is available without adding dependencies to the library. Please refer to your operating system's manual to see how to set up the system locale on your target system.
For more information on these exceptional code points, please refer to the text file made available by the Unicode Consortium: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Example:

void Button_Draw(int32_t x, int32_t y, const char* text)
{
size_t input_size = strlen(text);
char* converted = NULL;
size_t converted_size;
int32_t text_box_width, text_box_height;
int32_t errors;
converted_size = utf8toupper(text, input_size, NULL, 0, &errors);
if (converted_size == 0 ||
errors != UTF8_ERR_NONE)
{
goto cleanup;
}
converted = (char*)malloc(converted_size + 1);
utf8toupper(text, input_size, converted, converted_size, NULL);
converted[converted_size] = 0;
Font_GetTextDimensions(converted, &text_box_width, &text_box_height);
Draw_BoxFilled(x - 4, y - 4, text_box_width + 8, text_box_height + 8, 0x088A08);
Draw_BoxOutline(x - 4, y - 4, text_box_width + 8, text_box_height + 8, 0xA9F5A9);
Font_DrawText(x + 2, y + 1, converted, 0x000000);
Font_DrawText(x, y, converted, 0xFFFFFF);
cleanup:
if (converted != NULL)
{
free(converted);
converted = NULL;
}
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed to contain output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8tolower
utf8totitle
utf8casefold
UTF8_API size_t utf8tolower ( const char *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert UTF-8 encoded text to lowercase.

This function allows conversion of UTF-8 encoded strings to lowercase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.

Although most code points can be converted to lowercase in-place, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when converted to lowercase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.
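
The size change in the U+0130 example can be checked with a little arithmetic on UTF-8 encoded lengths. The helper below, sketch_utf8_length, is hypothetical and not part of utf8rewind; it assumes a valid code point of at most U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of utf8rewind: number of bytes needed to
   encode a single valid code point as UTF-8. */
size_t sketch_utf8_length(uint32_t code_point)
{
    if (code_point < 0x80)
    {
        return 1;
    }
    else if (code_point < 0x800)
    {
        return 2;
    }
    else if (code_point < 0x10000)
    {
        return 3;
    }

    return 4;
}
```

U+0130 takes two bytes, while its lowercase mapping U+0069 U+0307 takes one plus two, so the output is one byte longer than the input and cannot be written in-place.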

Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Note
Case mapping is not reversible. That is, toUpper(toLower(x)) != toLower(toUpper(x)).
This function checks the (thread-local) system locale in order to support languages with exceptional behavior on specific code points. Unfortunately, no cross-platform way of setting and retrieving the system locale is available without adding dependencies to the library. Please refer to your operating system's manual to see how to set up the system locale on your target system.
For more information on these exceptional code points, please refer to the text file made available by the Unicode Consortium: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Example:

author_t* Author_ByName(const char* name)
{
author_t* result = NULL;
size_t name_size = strlen(name);
char* converted = NULL;
size_t converted_size;
int32_t errors;
size_t i;
converted_size = utf8tolower(name, name_size, NULL, 0, &errors);
if (converted_size == 0 ||
errors != UTF8_ERR_NONE)
{
goto cleanup;
}
converted = (char*)malloc(converted_size + 1);
utf8tolower(name, name_size, converted, converted_size, NULL);
converted[converted_size] = 0;
for (i = 0; i < g_AuthorCount; ++i)
{
if (!strcmp(g_Author[i].name, converted))
{
result = &g_Author[i];
break;
}
}
cleanup:
if (converted != NULL)
{
free(converted);
converted = NULL;
}
return result;
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8toupper
utf8totitle
utf8casefold
UTF8_API size_t utf8totitle ( const char *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Convert UTF-8 encoded text to titlecase.

This function allows conversion of UTF-8 encoded strings to titlecase without first changing the encoding to UTF-32. Conversion is fully compliant with the Unicode 7.0 standard.

Titlecase requires a bit more explanation than uppercase and lowercase, because it is not a common text transformation. Titlecase uses uppercase for the first letter of each word and lowercase for the rest. Words are defined as "collections of code points with general category Lu, Ll, Lt, Lm or Lo according to the Unicode database".

Effectively, any type of punctuation can break up a word, even if this is not grammatically valid. This happens because the titlecasing algorithm does not and cannot take grammar rules into account.

Text | Titlecase
The running man | The Running Man
NATO Alliance | Nato Alliance
You're amazing at building libraries | You'Re Amazing At Building Libraries
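
The word-breaking behavior shown in the table can be reproduced with a much simpler, ASCII-only sketch. This is not the library's algorithm: sketch_titlecase_ascii is a hypothetical helper that treats ASCII letters as word characters and everything else, including the apostrophe, as a word break. It assumes the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical helper, not part of utf8rewind. Titlecases a NUL-terminated
   ASCII string in place: the first letter of every run of letters becomes
   uppercase, the remaining letters become lowercase. */
void sketch_titlecase_ascii(char* text)
{
    int in_word = 0;

    for (; *text != '\0'; ++text)
    {
        unsigned char c = (unsigned char)*text;

        if (isalpha(c))
        {
            *text = (char)(in_word ? tolower(c) : toupper(c));
            in_word = 1;
        }
        else
        {
            /* Any non-letter, such as ' or a space, ends the word. */
            in_word = 0;
        }
    }
}
```

Applied to "You're amazing", the apostrophe ends the first word, so the following "r" is capitalized just as in the last row of the table.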

Although most code points can be converted to titlecase in-place, there are notable exceptions. For example, U+00DF (LATIN SMALL LETTER SHARP S) maps to "U+0053 U+0073" (LATIN CAPITAL LETTER S and LATIN SMALL LETTER S) when converted to titlecase. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.

Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Note
Case mapping is not reversible. That is, toUpper(toLower(x)) != toLower(toUpper(x)).
This function checks the (thread-local) system locale in order to support languages with exceptional behavior on specific code points. Unfortunately, no cross-platform way of setting and retrieving the system locale is available without adding dependencies to the library. Please refer to your operating system's manual to see how to set up the system locale on your target system.
For more information on these exceptional code points, please refer to the text file made available by the Unicode Consortium: ftp://ftp.unicode.org/Public/UNIDATA/SpecialCasing.txt

Example:

void Book_SetTitle(book_t* book, const char* title)
{
size_t converted_size;
int32_t errors;
size_t i;
converted_size = utf8totitle(title, strlen(title), book->title, sizeof(book->title) - 1, &errors);
if (converted_size == 0 ||
errors != UTF8_ERR_NONE)
{
memset(book->title, 0, sizeof(book->title));
return;
}
book->title[converted_size] = 0;
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8tolower
utf8toupper
utf8casefold
UTF8_API size_t utf8casefold ( const char *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
int32_t *  errors 
)

Remove case distinction from UTF-8 encoded text.

Case folding is the process of eliminating differences between code points concerning case mapping. It is most commonly used for comparing strings in a case-insensitive manner. Conversion is fully compliant with the Unicode 7.0 standard.
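
The idea of comparing strings after removing case distinctions can be sketched for plain ASCII without the library. The helper sketch_caseless_equal_ascii below is hypothetical; it folds each byte with tolower in the default "C" locale and is not a substitute for Unicode case folding, which also handles multi-code-point mappings.

```c
#include <ctype.h>

/* Hypothetical helper, not part of utf8rewind. Returns 1 when two
   NUL-terminated ASCII strings are equal after folding case, 0 otherwise. */
int sketch_caseless_equal_ascii(const char* left, const char* right)
{
    while (*left != '\0' && *right != '\0')
    {
        if (tolower((unsigned char)*left) != tolower((unsigned char)*right))
        {
            return 0;
        }

        ++left;
        ++right;
    }

    /* Equal only if both strings ended at the same time. */
    return *left == *right;
}
```

With this sketch, "-UserName" and "-username" compare as equal, while "-user" and "-username" do not.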

Although similar to lowercasing text, there are significant differences. For one, case folding does not take locale into account when converting. In some cases, case folding can be up to 20% faster than lowercasing the same text, but the result cannot be treated as correct lowercased text.

Only two locale-specific exceptions are made when case folding text. In Turkish, U+0049 LATIN CAPITAL LETTER I maps to U+0131 LATIN SMALL LETTER DOTLESS I and U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE maps to U+0069 LATIN SMALL LETTER I.

Although most code points can be case folded in-place, there are notable exceptions. For example, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) maps to "U+0069 U+0307" (LATIN SMALL LETTER I and COMBINING DOT ABOVE) when case folded. Therefore, it is advised to first determine the size in bytes of the output by calling the function with a NULL output buffer.

Only a handful of scripts make a distinction between upper- and lowercase. In addition to modern scripts, such as Latin, Greek, Armenian and Cyrillic, a few historic or archaic scripts have case. The vast majority of scripts do not have case distinctions.

Note
This function checks the (thread-local) system locale in order to support languages with exceptional behavior on specific code points. Unfortunately, no cross-platform way of setting and retrieving the system locale is available without adding dependencies to the library. Please refer to your operating system's manual to see how to set up the system locale on your target system.

Example:

int32_t Command_ParseCommand(const char* argument)
{
char* buffer = NULL;
size_t buffer_size = 0;
int32_t errors;
int32_t result = 0;
buffer_size = utf8casefold(argument, strlen(argument), NULL, 0, &errors);
if (buffer_size == 0 ||
errors != UTF8_ERR_NONE)
{
result = -1;
goto cleanup;
}
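buffer = (char*)malloc(buffer_size + 1);
if (buffer == NULL)
{
result = -1;
goto cleanup;
}
utf8casefold(argument, strlen(argument), buffer, buffer_size, &errors);
if (errors != UTF8_ERR_NONE)
{
result = -1;
goto cleanup;
}
/* Terminate the buffer so the strncmp calls below cannot read past it. */
buffer[buffer_size] = 0;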
if (!strncmp(buffer, "-username", strlen("-username")))
{
result = eCommand_Username;
}
else if (
!strncmp(buffer, "-password", strlen("-password")))
{
result = eCommand_Password;
}
else if (
!strncmp(buffer, "-message", strlen("-message")))
{
result = eCommand_Message;
}
cleanup:
if (buffer != NULL)
{
free(buffer);
buffer = NULL;
}
return result;
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8tolower
utf8toupper
utf8totitle
UTF8_API uint8_t utf8isnormalized ( const char *  input,
size_t  inputSize,
size_t  flags,
size_t *  offset 
)

Check if a string is stable in the specified Unicode Normalization Form.

This function can be used as a preprocessing step before attempting to normalize a string. Normalization is a very expensive process, so it is often cheaper to first determine whether the string is unstable in the requested normalization form.

The result of the check will be YES if the string is stable and MAYBE or NO if it is unstable. If the result is MAYBE, the string does not necessarily have to be normalized.

If the result is unstable, the offset parameter is set to the offset for the first unstable code point. If the string is stable, the offset is equivalent to the length of the string in bytes.

You must specify the desired Unicode Normalization Form by using a combination of flags:

Unicode | Flags
Normalization Form C (NFC) | UTF8_NORMALIZE_COMPOSE
Normalization Form KC (NFKC) | UTF8_NORMALIZE_COMPOSE + UTF8_NORMALIZE_COMPATIBILITY
Normalization Form D (NFD) | UTF8_NORMALIZE_DECOMPOSE
Normalization Form KD (NFKD) | UTF8_NORMALIZE_DECOMPOSE + UTF8_NORMALIZE_COMPATIBILITY

For more information, please review Unicode Standard Annex #15 - Unicode Normalization Forms.

Example:

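uint8_t Text_InspectComposed(const char* text)
{
const char* src = text;
size_t src_size = strlen(text);
size_t offset;
size_t total_offset;
if (utf8isnormalized(src, src_size, UTF8_NORMALIZE_COMPOSE, &offset) == UTF8_NORMALIZATION_RESULT_YES)
{
printf("Clean!\n");
return 1;
}
total_offset = 0;
do
{
const char* next;
/* offset is relative to src, so move to the unstable code point. */
src += offset;
src_size -= offset;
total_offset += offset;
printf("Unstable at byte %lu\n", (unsigned long)total_offset);
/* Skip the unstable code point before checking the remainder. */
next = utf8seek(src, src_size, text, 1, SEEK_CUR);
if (next == src)
{
break;
}
total_offset += (size_t)(next - src);
src_size -= (size_t)(next - src);
src = next;
}
while (utf8isnormalized(src, src_size, UTF8_NORMALIZE_COMPOSE, &offset) != UTF8_NORMALIZATION_RESULT_YES);
return 0;
}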
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[in] flags: Desired normalization form. Must be a combination of UTF8_NORMALIZE_COMPOSE, UTF8_NORMALIZE_DECOMPOSE and UTF8_NORMALIZE_COMPATIBILITY.
[out] offset: Offset to first unstable code point or length of input in bytes if stable.
Return values
UTF8_NORMALIZATION_RESULT_YES: Input is stable and does not have to be normalized.
UTF8_NORMALIZATION_RESULT_MAYBE: Input is unstable, but normalization may be skipped.
UTF8_NORMALIZATION_RESULT_NO: Input is unstable and must be normalized.
See also
utf8normalize
UTF8_API size_t utf8normalize ( const char *  input,
size_t  inputSize,
char *  target,
size_t  targetSize,
size_t  flags,
int32_t *  errors 
)

Normalize a string to the specified Unicode Normalization Form.

The Unicode standard defines two standards for equivalence between characters: canonical and compatibility equivalence. Canonically equivalent characters and sequences represent the same abstract character and must be rendered with the same appearance and behavior. Compatibility equivalent characters have a weaker equivalence and may be rendered differently.

Unicode Normalization Forms are formally defined standards that can be used to test whether any two strings of characters are equivalent to each other. This equivalence may be canonical or compatibility.

The algorithm puts all combining marks into a specified order and uses the rules for decomposition and composition to transform the string into one of four Unicode Normalization Forms. A binary comparison can then be used to determine equivalence.
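
The need for normalization before a binary comparison can be seen with two canonically equivalent spellings of the same character. The sketch below uses only the C standard library; sketch_binary_equal is a hypothetical helper, and the byte strings are the UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8rewind. Returns 1 when two
   NUL-terminated strings are byte-for-byte identical, 0 otherwise. */
int sketch_binary_equal(const char* left, const char* right)
{
    size_t left_size = strlen(left);

    return left_size == strlen(right) &&
        memcmp(left, right, left_size) == 0;
}
```

Comparing "\xC3\xA9" (precomposed) with "e\xCC\x81" (decomposed) reports a mismatch, even though both represent the same abstract character. Only after both strings have been brought into the same Normalization Form does a comparison like this reflect canonical equivalence.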

These are the Unicode Normalization Forms:

Form | Description
Normalization Form D (NFD) | Canonical decomposition
Normalization Form C (NFC) | Canonical decomposition, followed by canonical composition
Normalization Form KD (NFKD) | Compatibility decomposition
Normalization Form KC (NFKC) | Compatibility decomposition, followed by canonical composition

utf8normalize can be used to transform text into one of these forms. You must specify the desired Unicode Normalization Form by using a combination of flags:

Form | Flags
Normalization Form D (NFD) | UTF8_NORMALIZE_DECOMPOSE
Normalization Form C (NFC) | UTF8_NORMALIZE_COMPOSE
Normalization Form KD (NFKD) | UTF8_NORMALIZE_DECOMPOSE + UTF8_NORMALIZE_COMPATIBILITY
Normalization Form KC (NFKC) | UTF8_NORMALIZE_COMPOSE + UTF8_NORMALIZE_COMPATIBILITY

For more information, please review Unicode Standard Annex #15 - Unicode Normalization Forms.

Note
Unnormalized text is rare in the wild. As an example, the W3C recommends that text on the web, such as HTML source code, be encoded as NFC.

Example:

void Font_RenderTextNormalized(const char* input)
{
const char* src = NULL;
const char* src_start;
size_t src_size;
char* converted = NULL;
size_t converted_size = 0;
size_t input_size = strlen(input);
{
int32_t errors;
converted_size = utf8normalize(input, input_size, NULL, 0, UTF8_NORMALIZE_COMPOSE, &errors);
if (converted_size > 0 &&
errors == UTF8_ERR_NONE)
{
converted = (char*)malloc(converted_size + 1);
utf8normalize(input, input_size, converted, converted_size, UTF8_NORMALIZE_COMPOSE, NULL);
converted[converted_size] = 0;
src = (const char*)converted;
src_size = converted_size;
}
}
if (src == NULL)
{
src = (const char*)input;
src_size = input_size;
}
src_start = src;
while (src_size > 0)
{
const char* next;
int32_t errors;
next = utf8seek(src, src_size, src_start, 1, SEEK_CUR);
if (next == src)
{
break;
}
unicode_t code_point;
utf8toutf32(src, (size_t)(next - src), &code_point, sizeof(unicode_t), &errors);
if (errors != UTF8_ERR_NONE)
{
break;
}
Font_RenderCodePoint(code_point);
src_size -= next - src;
src = next;
}
if (converted != NULL)
{
free(converted);
converted = NULL;
}
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[out] target: Output buffer for the result, can be NULL.
[in] targetSize: Size of the output buffer in bytes.
[in] flags: Desired normalization form. Must be a combination of UTF8_NORMALIZE_COMPOSE, UTF8_NORMALIZE_DECOMPOSE and UTF8_NORMALIZE_COMPATIBILITY.
[out] errors: Output for errors.
Returns
Amount of bytes needed for storing output.
Return values
UTF8_ERR_NONE: No errors.
UTF8_ERR_INVALID_FLAG: Invalid combination of flags was specified.
UTF8_ERR_INVALID_DATA: Failed to decode data.
UTF8_ERR_OVERLAPPING_PARAMETERS: Input and output buffers overlap in memory.
UTF8_ERR_NOT_ENOUGH_SPACE: Target buffer size is insufficient for result.
See also
utf8isnormalized
UTF8_API size_t utf8iscategory ( const char *  input,
size_t  inputSize,
size_t  flags 
)

Check if the input string conforms to the category specified by the flags.

This function can be used to check if the code points in a string are part of a category. Valid flags are part of the UTF8_CATEGORY_* list of defines. The category for a code point is defined as part of the entry in UnicodeData.txt, the data file for the Unicode code point database.

Note
The function is greedy. This means it will try to match as many code points with the matching category flags as possible and return the offset in the input in bytes. If this is undesired behavior, use utf8seek to seek in the input first before matching it with the category flags.
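
Greedy matching that returns a byte offset can be sketched for the ASCII whitespace case alone. The helper sketch_match_space_ascii is hypothetical and not part of utf8rewind; it relies on isspace in the default "C" locale and ignores every non-ASCII space character.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of utf8rewind. Consumes as many leading
   ASCII whitespace bytes as possible and returns how many were matched. */
size_t sketch_match_space_ascii(const char* input, size_t size)
{
    size_t matched = 0;

    while (matched < size && isspace((unsigned char)input[matched]))
    {
        ++matched;
    }

    return matched;
}
```

A return value of 0 means the input does not start with the requested category, and a value equal to the input size means the entire input matched.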

By default, the function will treat grapheme clusters as a single code point. This means that a string like:

Code point | Canonical combining class | General category | Name
U+0045 | 0 | Lu (Uppercase letter) | LATIN CAPITAL LETTER E
U+0300 | 230 | Mn (Non-spacing mark) | COMBINING GRAVE ACCENT

Will match with UTF8_CATEGORY_LETTER_UPPERCASE fully, because the COMBINING GRAVE ACCENT is treated as part of the grapheme cluster. This is useful when e.g. creating a text parser, because you do not have to normalize the text first.

If this is undesired behavior, specify the UTF8_CATEGORY_IGNORE_GRAPHEME_CLUSTER flag.

In order to maintain backwards compatibility with POSIX functions like isdigit and isspace, compatibility flags have been provided. Note, however, that the result is only guaranteed to be correct for code points in the Basic Latin range, between U+0000 and U+007F. Combining a compatibility flag with a regular category flag will result in undefined behavior.

Example:

const char* Parser_NextIdentifier(char** output, size_t* outputSize, const char* input, size_t inputSize)
{
const char* src = input;
size_t src_size = inputSize;
size_t whitespace_size;
size_t identifier_size;
whitespace_size = utf8iscategory(src, src_size, UTF8_CATEGORY_SEPARATOR_SPACE);
if (whitespace_size == 0)
{
whitespace_size = utf8iscategory(src, src_size, UTF8_CATEGORY_ISSPACE);
}
if (whitespace_size > 0)
{
if (whitespace_size >= src_size)
{
return src + src_size;
}
src += whitespace_size;
src_size -= whitespace_size;
}
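/* Which categories make up an identifier is application-specific; letters are assumed here. */
identifier_size = utf8iscategory(src, src_size, UTF8_CATEGORY_LETTER);
if (identifier_size == 0)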
{
return src;
}
*output = (char*)malloc(identifier_size + 1);
memcpy(*output, src, identifier_size);
(*output)[identifier_size] = 0;
*outputSize = identifier_size;
if (identifier_size >= src_size)
{
return src + src_size;
}
return src + identifier_size;
}
Parameters
[in] input: UTF-8 encoded string.
[in] inputSize: Size of the input in bytes.
[in] flags: Requested category. Must be a combination of UTF8_CATEGORY_* flags or a single UTF8_CATEGORY_IS* flag.
Returns
Number of bytes in the input that conform to the specified category flags.
See also
utf8seek