Unicode and UTF-8

Learn how to work with the Unicode character set and the UTF-8 encoding.

We'll cover the following...

Unicode
UTF-8

We have seen that the underlying bytes of a string can lead to strange results when using the strlen function. We also discussed a bit of the history of character encodings in general, but one question we did not answer was: which character encoding was used when converting our strings to byte arrays? When we looked at those byte arrays, we could see the byte representation of each character. Still, without knowing the character encoding, all of those values are meaningless, since we would not know how to interpret them. The short answer we could give here is, “well, it’s Unicode,” but that only gets us so far.

Unicode

Unicode, like ASCII, is a character set that maps each character to an integer (Unicode refers to these integers as code points). However, unlike ASCII, Unicode does not dictate how those values are stored or transferred (remember, in ASCII, each character is mapped to, and persisted as, a single byte).
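To make the distinction between a code point and its stored bytes concrete, here is a minimal sketch. It assumes PHP with the mbstring extension enabled and a source file saved as UTF-8; the helper name encodeAsHex is hypothetical and only used for illustration. It shows the same code point producing different byte sequences depending on the encoding chosen.

```php
<?php
// Sketch only: assumes the mbstring extension is available and that
// this source file is saved as UTF-8. encodeAsHex() is a hypothetical helper.

/**
 * Convert a UTF-8 string to the given encoding and return its bytes as hex.
 */
function encodeAsHex(string $text, string $encoding): string
{
    // mb_convert_encoding(string, to_encoding, from_encoding)
    return bin2hex(mb_convert_encoding($text, $encoding, 'UTF-8'));
}

$char = '你'; // Code point U+4F60

// The code point is the same in every case; only the stored bytes differ.
foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    echo str_pad($encoding, 10), encodeAsHex($char, $encoding), PHP_EOL;
}
```

The code point stays U+4F60 throughout, while the number and value of the bytes depend entirely on the encoding, which is why a byte array is meaningless without knowing the encoding that produced it.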

Let’s take a look at our Traditional Chinese text again, but with the Unicode code points:

Character	Code Point	Name
你	U+4F60	CJK Unified Ideograph-4F60
好	U+597D	CJK Unified Ideograph-597D
荒	U+8352	CJK Unified Ideograph-8352
野	U+91CE	CJK Unified Ideograph-91CE
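The code points in the table above can be reproduced with a short script. This is a sketch under a few assumptions: PHP 7.4 or later, the mbstring extension enabled, and a source file saved as UTF-8. The function name codePoints is hypothetical and not part of any library.

```php
<?php
// Sketch only: assumes PHP 7.4+ (for mb_str_split), the mbstring extension,
// and a UTF-8 encoded source file. codePoints() is a hypothetical helper.

/**
 * Return a list of [character, "U+XXXX"] pairs for a UTF-8 string.
 */
function codePoints(string $text): array
{
    $pairs = [];

    // Split on characters rather than bytes.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // mb_ord() returns the character's code point as an integer.
        $pairs[] = [$char, sprintf('U+%04X', mb_ord($char, 'UTF-8'))];
    }

    return $pairs;
}

foreach (codePoints('你好荒野') as [$char, $codePoint]) {
    echo $char, "\t", $codePoint, PHP_EOL;
}
```

Splitting with mb_str_split matters here: iterating over the string byte by byte would yield twelve fragments instead of the four characters in the table.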
