utf8rewind
Examples

Changes to existing code

Suppose you maintain a client application written in C. In order for clients to log in, the application needs to check their credentials. This is accomplished by the following function:

int Login_CheckCredentials(const char* username, const char* password)
{
    const char* salt;
    const char* hashed_password;
    char verify_password[256] = { 0 };
    char hashed_verify_password[256] = { 0 };

    /* For the purposes of brevity, ignore the fact
       that this is a terrible way to generate a salt,
       because it has insufficient entropy. */
    salt = md5(username);
    hashed_password = md5(password);

    strcat(verify_password, hashed_password);
    strcat(verify_password, salt);

    strcpy(hashed_verify_password, md5(verify_password));

    return Database_CheckLogin(username, hashed_verify_password);
}

Now we want to improve security by allowing the full range of Unicode in passwords, encoded as UTF-8.

What would we have to change? The answer is: nothing.

A password like "MôntiPythônikdenHôlieGrâilen" would be encoded as

"M\xC3\xB4Pyth\xC3\xB4" "denH\xC3\xB4Gr\xC3\xA2" "en"

This means that the calls to strcat and strcpy still work as intended. Because UTF-8 is backwards-compatible with ASCII, the password can be treated as a plain byte string, without changing the code.
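To see why, here's a minimal standalone sketch in plain C (not utf8rewind code): strlen and strcat count and copy bytes, not characters, so multi-byte UTF-8 sequences pass through them unharmed.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Môn" in UTF-8; 'ô' (U+00F4) encodes to the two bytes 0xC3 0xB4 */
    char password[16] = { 0 };
    strcat(password, "M\xC3\xB4");
    strcat(password, "n");

    /* Prints 4: three characters, but four bytes */
    printf("%zu\n", strlen(password));

    return 0;
}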

When converting your project to use UTF-8, you only have to worry about two surface areas: input and display. Let's look at those separately.

Dealing with user input

Continuing with the previous example, the old password field accepted only ASCII input:

/* An enum constant, so it can be used as a file-scope array size */
enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];

int PasswordField_EnterCharacter(char input)
{
    char text_input[2] = { input, 0 };

    if ((strlen(g_PasswordInput) + 1) > g_PasswordInputMaximum)
    {
        return 0;
    }

    strcat(g_PasswordInput, text_input);

    return 1;
}

We'll have to make sure that we provide UTF-32 codepoints instead:

int PasswordField_EnterCharacter(unicode_t input)

This can be accomplished using a simple cast, because all ASCII codepoints fit in a unicode_t type.
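At a hypothetical call site (PasswordField_OnKeyPress is an assumed name, not part of the example so far), casting through unsigned char first avoids sign-extending byte values above 0x7F on platforms where char is signed:

void PasswordField_OnKeyPress(char ascii_input)
{
    /* Cast through unsigned char so the value is never sign-extended */
    PasswordField_EnterCharacter((unicode_t)(unsigned char)ascii_input);
}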

Next, we'll have to increase the size of the text_input string in order to accommodate the conversion of the codepoint to UTF-8:

char text_input[16] = { 0 };
int32_t errors = 0;

/* utf32toutf8 takes a pointer to the input codepoint(s) */
utf32toutf8(&input, sizeof(unicode_t), text_input, 16, &errors);
if (errors != 0)
{
    return 0;
}

Because a single codepoint can encode to multiple bytes, we'll also have to update the check that guards against overflowing the password input string. A two-byte character like 'ô', for example, must be rejected when only one byte of space remains:

if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
{
    return 0;
}

Finally, putting it all together:

enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];

int PasswordField_EnterCharacter(unicode_t input)
{
    char text_input[16] = { 0 };
    int32_t errors = 0;

    /* Convert the codepoint to UTF-8 */
    utf32toutf8(&input, sizeof(unicode_t), text_input, 16, &errors);
    if (errors != 0)
    {
        return 0;
    }

    /* Reject input that would overflow the password buffer */
    if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
    {
        return 0;
    }

    strcat(g_PasswordInput, text_input);

    return 1;
}

With a few changes, the password field now accepts UTF-8 input. The benefit is that we didn't need to change the algorithm itself; only the input had to be converted.
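As a quick sanity check, a hypothetical test driver (assuming <assert.h> and <string.h> are included) shows that a two-byte character occupies two bytes of the buffer:

void PasswordField_Test(void)
{
    g_PasswordInput[0] = '\0';

    /* 'M' (U+004D) encodes to one byte, 'ô' (U+00F4) to two */
    PasswordField_EnterCharacter(0x004D);
    PasswordField_EnterCharacter(0x00F4);

    /* The buffer now holds "M\xC3\xB4": two characters, three bytes */
    assert(strlen(g_PasswordInput) == 3);
}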

Displaying Unicode text

One problem remains: the user may be able to enter Unicode characters in text fields, but they won't be displayed correctly. That's because the draw function feeds each byte of a multi-byte sequence to the font renderer as if it were a separate character.

Let's look at the offending function:

void InputField_Draw(int x, int y, const char* text)
{
    size_t i;
    const char* src = text;

    FontBatch_Start("Arial20");

    for (i = 0; i < strlen(text); ++i)
    {
        FontBatch_AddCharacter(*src);
        src++;
    }

    FontBatch_End();

    FontBatch_Draw(x, y);
}

The first thing that will have to change is that FontBatch_AddCharacter should accept UTF-32 codepoints. Fortunately, that change is backwards-compatible: every ASCII byte value is below 0x80, so the cast yields the same codepoint the old code passed along:

FontBatch_AddCharacter((unicode_t)*src);
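For reference, the signature change itself might look like this (FontBatch is a hypothetical rendering API in this example, not part of utf8rewind):

/* Before: one byte per glyph, ASCII only */
void FontBatch_AddCharacter(char character);

/* After: one UTF-32 codepoint per glyph */
void FontBatch_AddCharacter(unicode_t codepoint);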

Next, we'll have to treat the input as UTF-8 encoded text. We'll convert it to UTF-32 and add the codepoints one by one. Like other conversion functions in utf8rewind, utf8toutf32 is called twice: first with a NULL target to determine the required buffer size in bytes, then again to perform the actual conversion.

size_t text_size = strlen(text);
int32_t errors = 0;
unicode_t* decoded = NULL;
size_t decoded_size = 0;

/* Determine the size in bytes of the text as UTF-32 codepoints */
decoded_size = utf8toutf32(text, text_size, NULL, 0, &errors);
if (decoded_size == 0 || errors != 0)
{
    goto cleanup;
}

decoded = (unicode_t*)malloc(decoded_size);
if (decoded == NULL)
{
    goto cleanup;
}

/* Convert the text to UTF-32 */
utf8toutf32(text, text_size, decoded, decoded_size, &errors);
if (errors != 0)
{
    goto cleanup;
}

cleanup:
    if (decoded != NULL)
    {
        free(decoded);
        decoded = NULL;
    }

Putting it all together again:

void InputField_Draw(int x, int y, const char* text)
{
    size_t text_size = strlen(text);
    int32_t errors = 0;
    unicode_t* decoded = NULL;
    size_t decoded_size = 0;
    size_t i;

    /* Determine the size in bytes of the text as UTF-32 codepoints */
    decoded_size = utf8toutf32(text, text_size, NULL, 0, &errors);
    if (decoded_size == 0 || errors != 0)
    {
        goto cleanup;
    }

    decoded = (unicode_t*)malloc(decoded_size);
    if (decoded == NULL)
    {
        goto cleanup;
    }

    /* Convert the text to UTF-32 */
    utf8toutf32(text, text_size, decoded, decoded_size, &errors);
    if (errors != 0)
    {
        goto cleanup;
    }

    /* Add the codepoints to the batch */
    FontBatch_Start("Arial20");

    for (i = 0; i < decoded_size / sizeof(unicode_t); ++i)
    {
        FontBatch_AddCharacter(decoded[i]);
    }

    FontBatch_End();

    FontBatch_Draw(x, y);

cleanup:
    if (decoded != NULL)
    {
        free(decoded);
        decoded = NULL;
    }
}
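A hypothetical call site, reusing the FontBatch renderer assumed above:

/* Batches five glyphs: "Grâil" is six bytes of UTF-8,
   but decodes to five UTF-32 codepoints */
InputField_Draw(10, 20, "Gr\xC3\xA2il");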