Changes to existing code
Suppose you maintain a client application written in C. Before clients can log in, the application needs to check their credentials. This is done by the following function:
int Login_CheckCredentials(const char* username, const char* password)
{
    const char* salt;
    const char* hashed_password;
    char verify_password[256] = { 0 };
    char hashed_verify_password[256] = { 0 };
    salt = md5(username);
    hashed_password = md5(password);
    strcat(verify_password, hashed_password);
    strcat(verify_password, salt);
    strcpy(hashed_verify_password, md5(verify_password));
    return Database_CheckLogin(username, hashed_verify_password);
}
Now we want to improve security by allowing the full range of Unicode in passwords, encoded as UTF-8.
What would we have to change? The answer is: nothing.
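A password like "MôntiPythônikdenHôlieGrâilen" would be encoded as "M\xC3\xB4ntiPyth\xC3\xB4nikdenH\xC3\xB4lieGr\xC3\xA2ilen". The accented letters become two-byte sequences, but none of those bytes is zero, so the calls to strcat and strcpy still work as intended. The password can be treated as an opaque, null-terminated byte string without changing the code.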
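A small check makes this concrete. The sketch below uses only the C standard library; `g_ExamplePassword` and `Example_JoinStrings` are hypothetical names introduced for illustration and are not part of the application above. It shows that the byte-oriented string functions count and copy UTF-8 bytes without caring that some of them belong to multi-byte sequences.

```c
#include <stddef.h>
#include <string.h>

/* The example password written out as UTF-8 bytes:
   U+00F4 (o with circumflex) is C3 B4, U+00E2 (a with circumflex) is C3 A2. */
static const char* const g_ExamplePassword =
    "M\xC3\xB4ntiPyth\xC3\xB4nikdenH\xC3\xB4lieGr\xC3\xA2ilen";

/* Hypothetical helper: joins two strings into 'target' with strcpy and
   strcat, the same calls Login_CheckCredentials relies on.
   Returns the resulting length in bytes, or 0 if it would not fit. */
size_t Example_JoinStrings(
    char* target, size_t target_size,
    const char* left, const char* right)
{
    if (strlen(left) + strlen(right) + 1 > target_size)
    {
        return 0;
    }
    strcpy(target, left);
    strcat(target, right);
    return strlen(target);
}
```

The password has 28 characters but occupies 32 bytes, because each of the four accented letters takes two bytes. `strlen` reports the byte count, which is exactly what buffer arithmetic needs.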
When converting your project to use UTF-8, you only have to worry about two surface areas: input and display. Let's look at those separately.
Dealing with user input
Continuing with the previous example, the old password field accepted only ASCII input:
/* An enum constant is used because, in C, a const int is not a constant
   expression and cannot size a file-scope array. */
enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];
int PasswordField_EnterCharacter(char input)
{
    char text_input[2] = { input, 0 };
    if ((strlen(g_PasswordInput) + 1) > g_PasswordInputMaximum)
    {
        return 0;
    }
    strcat(g_PasswordInput, text_input);
    return 1;
}
We'll have to make sure that we provide UTF-32 codepoints instead:
int PasswordField_EnterCharacter(unicode_t input)
Existing callers that pass ASCII characters can do so with a simple cast, because every ASCII character has the same value as its Unicode codepoint and fits in the unicode_t type.
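One detail is worth guarding against when casting: plain `char` may be signed, so a byte above 0x7F could sign-extend into a very large value. A minimal sketch, assuming `unicode_t` is a 32-bit unsigned integer (the typedef below is a stand-in) and using a hypothetical helper named `Example_CodepointFromAscii`:

```c
#include <stdint.h>

/* Assumption: unicode_t holds one UTF-32 codepoint. */
typedef uint32_t unicode_t;

/* Hypothetical helper: widens an ASCII character to a codepoint.
   Going through unsigned char avoids sign extension on platforms
   where plain char is signed. Bytes outside ASCII are rejected
   by returning 0. */
unicode_t Example_CodepointFromAscii(char input)
{
    unsigned char byte = (unsigned char)input;
    return (byte < 0x80) ? (unicode_t)byte : 0;
}
```

A caller could then write `PasswordField_EnterCharacter(Example_CodepointFromAscii(ch))` wherever it previously passed `ch` directly.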
Next, we'll have to increase the size of the text_input string in order to accommodate the conversion of the codepoint to UTF-8:
char text_input[16] = { 0 };
{
return 0;
}
Because a single codepoint can be encoded as multiple bytes, we'll have to change the check that prevents the password input string from overflowing:
if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
{
    return 0;
}
Finally, putting it all together:
enum { g_PasswordInputMaximum = 255 };
static char g_PasswordInput[g_PasswordInputMaximum + 1];
int PasswordField_EnterCharacter(unicode_t input)
{
char text_input[16] = { 0 };
{
return 0;
}
if ((strlen(g_PasswordInput) + strlen(text_input)) > g_PasswordInputMaximum)
{
return 0;
}
strcat(g_PasswordInput, text_input);
return 1;
}
With a few changes, the password field now accepts Unicode input and stores it as UTF-8. The benefit here is that we didn't need to change the algorithm itself; only the input had to be converted.
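For readers who want to try the idea without the library, here is a self-contained sketch of the same flow. Everything prefixed with `Example_` or `EXAMPLE_` is hypothetical: `Example_EncodeUtf8` is a hand-written stand-in for the library's conversion call, not its actual API, and `unicode_t` is assumed to be a 32-bit unsigned integer.

```c
#include <stdint.h>
#include <string.h>

/* Assumption: unicode_t holds one UTF-32 codepoint. */
typedef uint32_t unicode_t;

enum { EXAMPLE_PASSWORD_MAXIMUM = 255 };
static char g_ExamplePasswordInput[EXAMPLE_PASSWORD_MAXIMUM + 1];

/* Hypothetical stand-in for the conversion call: writes the UTF-8
   encoding of 'codepoint' plus a terminator into 'target' (at least
   5 bytes). Returns the number of encoded bytes, or 0 for U+0000,
   surrogates and values above U+10FFFF. */
static int Example_EncodeUtf8(unicode_t codepoint, char* target)
{
    int length;
    if (codepoint == 0 || codepoint > 0x10FFFF ||
        (codepoint >= 0xD800 && codepoint <= 0xDFFF))
    {
        return 0;
    }
    if (codepoint < 0x80)
    {
        target[0] = (char)codepoint;
        length = 1;
    }
    else if (codepoint < 0x800)
    {
        target[0] = (char)(0xC0 | (codepoint >> 6));
        target[1] = (char)(0x80 | (codepoint & 0x3F));
        length = 2;
    }
    else if (codepoint < 0x10000)
    {
        target[0] = (char)(0xE0 | (codepoint >> 12));
        target[1] = (char)(0x80 | ((codepoint >> 6) & 0x3F));
        target[2] = (char)(0x80 | (codepoint & 0x3F));
        length = 3;
    }
    else
    {
        target[0] = (char)(0xF0 | (codepoint >> 18));
        target[1] = (char)(0x80 | ((codepoint >> 12) & 0x3F));
        target[2] = (char)(0x80 | ((codepoint >> 6) & 0x3F));
        target[3] = (char)(0x80 | (codepoint & 0x3F));
        length = 4;
    }
    target[length] = '\0';
    return length;
}

/* Same shape as PasswordField_EnterCharacter: convert the codepoint,
   check that the encoded bytes still fit, then append them. */
int Example_EnterCharacter(unicode_t input)
{
    char text_input[16] = { 0 };
    if (Example_EncodeUtf8(input, text_input) <= 0)
    {
        return 0;
    }
    if (strlen(g_ExamplePasswordInput) + strlen(text_input) >
        (size_t)EXAMPLE_PASSWORD_MAXIMUM)
    {
        return 0;
    }
    strcat(g_ExamplePasswordInput, text_input);
    return 1;
}
```

Entering U+00E2 appends the two bytes C3 A2, so the overflow check has to count encoded bytes rather than assume one byte per keystroke.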
Displaying Unicode text
One problem remains: the user can now enter Unicode characters in text fields, but they won't be displayed correctly.
Let's look at the offending function:
void InputField_Draw(int x, int y, const char* text)
{
    size_t i;
    const char* src = text;
    FontBatch_Start("Arial20");
    for (i = 0; i < strlen(text); ++i)
    {
        FontBatch_AddCharacter(*src);
        src++;
    }
    FontBatch_End();
    FontBatch_Draw(x, y);
}
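The loop above submits one character per byte, which is only correct while every character is a single byte. A minimal sketch of the mismatch, using a hypothetical helper `Example_CountCodepoints` that assumes well-formed UTF-8 and counts every byte that is not a continuation byte (bit pattern 10xxxxxx):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts codepoints in well-formed UTF-8 text by
   skipping continuation bytes, which always have the form 10xxxxxx. */
size_t Example_CountCodepoints(const char* text)
{
    size_t count = 0;
    const unsigned char* src = (const unsigned char*)text;
    for (; *src != 0; ++src)
    {
        if ((*src & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

For the text "Gr\xC3\xA2il", `strlen` reports 6 bytes while the helper reports 5 codepoints. The byte-wise loop would therefore call FontBatch_AddCharacter six times and pass 0xC3 and 0xA2 as if they were two separate characters.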
The first thing that will have to change is that FontBatch_AddCharacter should accept UTF-32 codepoints instead of single bytes. Fortunately, that change is backwards-compatible, because every ASCII character has the same value as its codepoint.
Next, we'll have to treat the input as UTF-8 encoded text, which means reading codepoints one by one and advancing by the number of bytes each one occupies.
size_t i;
int offset;
const char* src = text;
for (i = 0; i < utf8len(text); ++i)
{
if (offset <= 0)
{
break;
}
src += offset;
}
Putting it all together again:
void InputField_Draw(int x, int y, const char* text)
{
size_t i;
int offset;
const char* src = text;
FontBatch_Start("Arial20");
for (i = 0; i < utf8len(text); ++i)
{
if (offset <= 0)
{
break;
}
FontBatch_AddCharacter(codepoint);
src += offset;
}
FontBatch_End();
FontBatch_Draw(x, y);
}
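The decoding loop can also be tried in isolation. The sketch below is self-contained and makes several assumptions: `unicode_t` is taken to be a 32-bit unsigned integer, `Example_DecodeUtf8` is a simplified, hypothetical stand-in for the library's decoding call (it does not reject overlong encodings or surrogates), and `Example_CollectCodepoints` stores codepoints in an array where the real function would hand them to FontBatch_AddCharacter.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: unicode_t holds one UTF-32 codepoint. */
typedef uint32_t unicode_t;

/* Hypothetical stand-in for the decoding call: reads one codepoint from
   'text' and returns the number of bytes consumed, or 0 at the end of
   the string or on a malformed sequence. */
static int Example_DecodeUtf8(const char* text, unicode_t* codepoint)
{
    const unsigned char* src = (const unsigned char*)text;
    int length;
    int i;
    if (src[0] == 0)
    {
        return 0;
    }
    if (src[0] < 0x80)
    {
        *codepoint = src[0];
        return 1;
    }
    if ((src[0] & 0xE0) == 0xC0)      { length = 2; *codepoint = src[0] & 0x1F; }
    else if ((src[0] & 0xF0) == 0xE0) { length = 3; *codepoint = src[0] & 0x0F; }
    else if ((src[0] & 0xF8) == 0xF0) { length = 4; *codepoint = src[0] & 0x07; }
    else
    {
        return 0; /* stray continuation byte or invalid lead byte */
    }
    for (i = 1; i < length; ++i)
    {
        /* Also stops at a terminator that cuts the sequence short. */
        if ((src[i] & 0xC0) != 0x80)
        {
            return 0;
        }
        *codepoint = (*codepoint << 6) | (src[i] & 0x3F);
    }
    return length;
}

/* Walks the text the way the draw loop does, advancing by the number of
   bytes each codepoint occupied. Returns how many codepoints were stored. */
size_t Example_CollectCodepoints(
    const char* text, unicode_t* out, size_t out_capacity)
{
    size_t count = 0;
    const char* src = text;
    while (count < out_capacity)
    {
        unicode_t codepoint = 0;
        int offset = Example_DecodeUtf8(src, &codepoint);
        if (offset <= 0)
        {
            break;
        }
        out[count++] = codepoint;
        src += offset;
    }
    return count;
}
```

Because the decoder reports the end of the string itself, this variant does not need a separate pass to count codepoints before the loop starts.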