Changes to existing code
Suppose you maintain a client application written in C. Before clients can log in, the application needs to check their credentials. This is accomplished by the following function:
int Login_CheckCredentials(const char* username, const char* password)
{
    const char* salt;
    const char* hashed_password;
    char verify_password[256] = { 0 };
    char hashed_verify_password[256] = { 0 };

    /* The salt is derived from the username. */
    salt = md5(username);
    hashed_password = md5(password);

    /* Combine the hashed password with the salt and hash the result. */
    strcat(verify_password, hashed_password);
    strcat(verify_password, salt);
    strcpy(hashed_verify_password, md5(verify_password));

    return Database_CheckLogin(username, hashed_verify_password);
}
Now we want to improve our security by allowing the full range of Unicode in passwords, using the UTF-8 encoding.
What would we have to change? The answer is: nothing.
A password like "MôntiPythônikdenHôlieGrâilen" would be encoded as
"M\xC3\xB4ntiPyth\xC3\xB4nikdenH\xC3\xB4lieGr\xC3\xA2ilen"
which means that the calls to strcat
and strcpy
would still work as intended. Because UTF-8 never introduces zero bytes into the middle of a string, the password can be treated as if it were ASCII without changing the code.
When converting your project to use UTF-8, you only have to worry about two surface areas: input and display. Let's look at those separately.
Dealing with user input
Continuing with the previous example, the old password field accepted only ASCII input:
enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];
int PasswordField_EnterCharacter(char input)
{
    char text_input[2] = { input, 0 };

    /* Reject the input if it would overflow the buffer. */
    if ((strlen(g_PasswordInput) + 1) > g_PasswordInputMaximum)
    {
        return 0;
    }

    strcat(g_PasswordInput, text_input);

    return 1;
}
We'll have to change the function to accept UTF-32 codepoints instead:

int PasswordField_EnterCharacter(unicode_t input)

Existing callers can keep passing ASCII characters with a simple cast, because all ASCII codepoints fit in a unicode_t.
Next, we'll have to increase the size of the text_input
string to accommodate the conversion of the codepoint to UTF-8, and perform that conversion before appending:

char text_input[16] = { 0 };
int32_t errors = 0;

utf32toutf8(&input, sizeof(input), text_input, sizeof(text_input) - 1, &errors);
if (errors != 0)
{
    return 0;
}
Because a single codepoint can encode to multiple bytes, we'll have to change the check that prevents overflowing the password input string:

if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
{
    return 0;
}
Finally, putting it all together:
enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];

int PasswordField_EnterCharacter(unicode_t input)
{
    char text_input[16] = { 0 };
    int32_t errors = 0;

    /* Convert the codepoint to UTF-8. */
    utf32toutf8(&input, sizeof(input), text_input, sizeof(text_input) - 1, &errors);
    if (errors != 0)
    {
        return 0;
    }

    /* Reject the input if it would overflow the buffer. */
    if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
    {
        return 0;
    }

    strcat(g_PasswordInput, text_input);

    return 1;
}
With a few changes, the password field now accepts UTF-8 input. The benefit is that the algorithm itself didn't need to change; only the input had to be converted.
Displaying Unicode text
One problem remains: the user may be able to enter Unicode characters in text fields, but they won't be displayed correctly.
Let's look at the offending function:
void InputField_Draw(int x, int y, const char* text)
{
    size_t i;
    const char* src = text;

    FontBatch_Start("Arial20");

    /* Add the text to the batch one byte at a time. */
    for (i = 0; i < strlen(text); ++i)
    {
        FontBatch_AddCharacter(*src);
        src++;
    }

    FontBatch_End();
    FontBatch_Draw(x, y);
}
The first thing that will have to change is that FontBatch_AddCharacter
should accept UTF-32 codepoints. Fortunately, that change is backwards-compatible: every ASCII value maps to the same UTF-32 codepoint, so existing call sites keep working.
Next, we'll have to treat the input as UTF-8 encoded text. We'll convert it to UTF-32 and add the codepoints one by one. First we ask the conversion function for the required size in bytes, then we allocate a buffer and do the actual conversion:

size_t text_size = strlen(text);
unicode_t* decoded = NULL;
size_t decoded_size = 0;
int32_t errors = 0;

/* First pass: determine the size of the converted text in bytes. */
decoded_size =
    utf8toutf32(text, text_size, NULL, 0, &errors);
if (decoded_size == 0 || errors != 0)
{
    goto cleanup;
}

/* Second pass: allocate a buffer and convert. */
decoded = (unicode_t*)malloc(decoded_size);
if (decoded == NULL)
{
    goto cleanup;
}

utf8toutf32(text, text_size, decoded, decoded_size, &errors);
if (errors != 0)
{
    goto cleanup;
}

cleanup:
if (decoded != NULL)
{
    free(decoded);
    decoded = NULL;
}
Putting it all together again:
void InputField_Draw(int x, int y, const char* text)
{
    size_t text_size = strlen(text);
    unicode_t* decoded = NULL;
    size_t decoded_size = 0;
    int32_t errors = 0;
    size_t i;

    /* First pass: determine the size of the converted text in bytes. */
    decoded_size =
        utf8toutf32(text, text_size, NULL, 0, &errors);
    if (decoded_size == 0 || errors != 0)
    {
        goto cleanup;
    }

    /* Second pass: allocate a buffer and convert. */
    decoded = (unicode_t*)malloc(decoded_size);
    if (decoded == NULL)
    {
        goto cleanup;
    }

    utf8toutf32(text, text_size, decoded, decoded_size, &errors);
    if (errors != 0)
    {
        goto cleanup;
    }

    FontBatch_Start("Arial20");

    /* Add the text to the batch one codepoint at a time. */
    for (i = 0; i < decoded_size / sizeof(unicode_t); ++i)
    {
        FontBatch_AddCharacter(decoded[i]);
    }

    FontBatch_End();
    FontBatch_Draw(x, y);

cleanup:
    if (decoded != NULL)
    {
        free(decoded);
        decoded = NULL;
    }
}