A couple of years ago I put together a class for UTF-8 strings, kind of a smart wrapper around Harry Fuecks’s phputf8 string functions. All Utf8String objects (if I’m not missing bugs!) are guaranteed to house valid UTF-8 strings. The factory, make(), enforces UTF-8 validity, stripping non-UTF-8 bytes (default) or replacing them with Utf8String::$replacement (? by default).
$str = Utf8String::make('āll īs & ōk');
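To make the strip-or-replace behaviour concrete, here is a rough sketch of how such a sanitizing step could work using only PCRE. The function name sketchSanitizeUtf8() and its parameters are hypothetical and are not taken from the class itself, which builds on phputf8; the byte-range pattern is the commonly circulated one for well-formed UTF-8.

```php
<?php
// Hypothetical helper for illustration; not the actual Utf8String::make() code.
// With $replacement = null invalid bytes are stripped, otherwise each invalid
// byte is swapped for $replacement.
function sketchSanitizeUtf8(string $bytes, ?string $replacement = null): string
{
    // Fast path: with the /u modifier PCRE refuses invalid UTF-8 subjects,
    // so a successful match of the empty pattern means the input is valid.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }

    // Group 1: one well-formed UTF-8 sequence. Group 2: any other single byte.
    // No /u here, so PCRE works byte by byte.
    $pattern = '/(
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )|(.)/xs';

    $clean = preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string {
            // Keep well-formed sequences; drop or replace stray bytes.
            return $m[1] !== '' ? $m[1] : (string) $replacement;
        },
        $bytes
    );

    return $clean ?? '';
}
```

If the mbstring extension is available, mb_check_encoding() and (from PHP 7.2) mb_scrub() cover similar ground; the regex version is shown only because it has no extension dependency.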
All the major string functions are methods with similar signatures (dropping the input string argument). Since native string assignment is always a copy, I felt all Utf8String objects should be immutable to avoid accidental referencing. After the following line, $str actually points to a fresh Utf8String object:
$str = $str->ucwords(); // Āll Īs & Ōk
Of course this means they’re chainable:
$str = $str->strtolower()->toAscii(); // all is & ok
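The immutability and chaining described above boil down to one pattern: every method returns a new object and never touches the one it was called on. A minimal sketch of that pattern follows, assuming PHP 7.4 or later. ImmutableText and its methods are hypothetical names, and the sketch skips UTF-8 validation entirely to stay short.

```php
<?php
// Hypothetical sketch of the immutable, chainable pattern; not the real Utf8String.
final class ImmutableText
{
    private string $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // Native trim() only removes ASCII whitespace by default, so it cannot
    // cut a multibyte character in half.
    public function trim(): self
    {
        return new self(trim($this->bytes));
    }

    // Returns a new object; $this is left untouched.
    public function append(string $suffix): self
    {
        return new self($this->bytes . $suffix);
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

$original = new ImmutableText('  ōk ');
$alias = $original;                          // second handle to the same object
$original = $original->trim()->append('!');  // chaining builds fresh objects
// $alias still holds the untouched '  ōk ', while $original now holds 'ōk!'.
```

Because no method mutates the object, the second handle in $alias behaves just like a copied native string would.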
Casting to string does what you’d think:
echo $str; // out come UTF-8 bytes.
Checking for UTF-8/ASCII-ness can be slow, so methods that create new objects propagate this info into the constructor so the new objects don’t have to check again. The constructor is private to avoid misuse of those arguments. I also threw in some convenience methods:
// input() returns false if $_POST['msg'] isn't present
if ($msg = Utf8String::input($_POST, 'msg')) {
    echo $msg->_; // escape with htmlspecialchars
}
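The flag propagation mentioned before the convenience methods (new objects skipping the validity check because the parent already knows the answer) might look roughly like the sketch below, assuming PHP 7.4 or later. FlaggedUtf8, concat(), and the $scans counter are hypothetical names made up for this illustration, and unlike the real factory this one simply rejects invalid input rather than repairing it.

```php
<?php
// Hypothetical sketch; none of these names come from the actual class.
final class FlaggedUtf8
{
    /** Counts how often the potentially slow byte scan runs. */
    public static int $scans = 0;

    private string $bytes;
    private bool $isAscii;

    // Private on purpose: outside code must not be able to vouch for
    // bytes that were never checked.
    private function __construct(string $bytes, bool $isAscii)
    {
        $this->bytes = $bytes;
        $this->isAscii = $isAscii;
    }

    // The only public entry point, and the only place that scans the bytes.
    public static function make(string $bytes): ?self
    {
        self::$scans++;
        if (preg_match('//u', $bytes) !== 1) {
            return null; // this sketch rejects invalid input instead of fixing it
        }
        // ASCII means no byte with the high bit set.
        $isAscii = preg_match('/[^\x00-\x7F]/', $bytes) === 0;
        return new self($bytes, $isAscii);
    }

    // Valid + valid is valid, ASCII + ASCII is ASCII, so the result can be
    // built through the private constructor without scanning again.
    public function concat(self $other): self
    {
        return new self($this->bytes . $other->bytes, $this->isAscii && $other->isAscii);
    }

    public function isAscii(): bool
    {
        return $this->isAscii;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}
```

Knowing a string is pure ASCII is useful beyond validation, since byte-oriented native functions are then safe to use as a fast path.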
In theory a framework could use a class like this to force safer string handling throughout:
function handleStrings(Utf8String $input) { /**/ }
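As a small illustration of what that type hint buys, the sketch below shows PHP rejecting a plain string at the function boundary. Utf8StringStandIn and tryHandle() are hypothetical stand-ins so the example runs on its own (PHP 7.4 or later assumed); the real class would take the stand-in's place.

```php
<?php
// Hypothetical stand-in for the real class, just enough to show the type check.
final class Utf8StringStandIn
{
    private string $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

// Only accepts the wrapper type; a raw string never gets in.
function handleStrings(Utf8StringStandIn $input): string
{
    return 'got: ' . $input;
}

// Helper that reports whether the call was accepted.
function tryHandle($value): string
{
    try {
        return handleStrings($value);
    } catch (TypeError $e) {
        return 'rejected';
    }
}
```

Calling tryHandle() with a wrapped value goes through, while a raw string triggers a TypeError before handleStrings() runs any of its own code.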
It’s a proof-of-concept anyway (it’s missing preg_* methods and a lot of other stuff), but if the API could be ironed out, someone could make it into a proper C extension.
Looking over this, the only thing I’m not happy about is __toString() returning a regular PHP string. Even with the ability to type hint function arguments, this makes it really easy to pass these objects into regular string-handling code and silently lose the validation-enforcing property of the class: