Re-writing Learning Perl to cover Unicode means I have to figure out how to type some of the characters that don’t show up on my keyboard. Not only that, I need to figure out their character names and code points for the examples. I want to convert from any of those (name, code point, character) to a description of the character. I want something like this:
$ perl unichar ã
Processing ã
match grapheme
code point U+00E3
decimal 227
name LATIN SMALL LETTER A WITH TILDE
character ã
I wrote a short program I called unichar, which I have on GitHub.
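The three kinds of input (name, code point, character) can all be reduced to an ordinal and then described from there. The following is a minimal sketch of that idea using the core charnames module, not the code from unichar itself; the subroutine names to_ordinal and describe are hypothetical, and the output format is simplified to one line.

```perl
use strict;
use warnings;
use charnames ();   # load only the runtime lookups viacode() and vianame()

# Hypothetical helper (not from unichar): accept a code point written as
# U+XXXX or 0xXXXX, a single already-decoded character, or a character name,
# and return the ordinal. Returns undef for an unknown name.
sub to_ordinal {
    my ($arg) = @_;
    return hex $1 if $arg =~ /\A(?:U\+|0x)([0-9A-Fa-f]+)\z/;
    return ord $arg if length($arg) == 1;
    return charnames::vianame($arg);
}

# Hypothetical helper: build a one-line description from any of the inputs.
sub describe {
    my ($arg) = @_;
    my $ord = to_ordinal($arg);
    return "no match for $arg" unless defined $ord;
    my $name = charnames::viacode($ord) // '';   # some code points have no name
    return sprintf "code point U+%04X, decimal %d, name %s", $ord, $ord, $name;
}

print describe($_), "\n" for 'U+2057', '0x05d0', 'TAMIL LETTER HA';
```

The single-character branch assumes the argument has already been decoded into a character string; raw UTF-8 bytes from the command line would have a length greater than one for anything outside ASCII.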
There are some interesting parts of the script (which might change since I’m still tinkering with it). Even though my locale is set to en_US.UTF-8 and the command-line arguments are UTF-8, Perl still hands them to the script as undecoded bytes, so I have to decode them myself. The decode subroutine from Encode takes those octets and turns them into a Perl character string. Rather than hard-coding UTF-8, I ask I18N::Langinfo for the locale’s codeset, and I decode all the elements of @ARGV with it:
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);
my $codeset = langinfo(CODESET);
@ARGV = map { decode $codeset, $_ } @ARGV;
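To see why the decoding step matters, here is a small sketch that compares a string before and after decode. It hard-codes 'UTF-8' and the two bytes that encode ã instead of reading the locale and @ARGV, so it is an illustration under those assumptions rather than part of unichar.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two octets a UTF-8 terminal sends for the character U+00E3.
my $octets = "\xC3\xA3";
my $before = length $octets;          # counts bytes

# Assumption for this sketch: the encoding is known to be UTF-8.
my $chars = decode('UTF-8', $octets);
my $after = length $chars;            # counts characters

printf "before decode: length %d\n", $before;
printf "after decode:  length %d, code point U+%04X\n", $after, ord $chars;
```

Without the decode, ord would report only the first byte (0xC3) and the lookup would describe the wrong character.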
There are some other interesting bits in there too, but they are a bit advanced for Learning Perl.
Here are some more examples of the output. I handle unprintable and invisible characters specially:
$ perl unichar 䣱
Processing 䣱
match grapheme
code point U+48F1
decimal 18673
name
character 䣱
$ perl unichar ↞
Processing ↞
match grapheme
code point U+219E
decimal 8606
name LEFTWARDS TWO HEADED ARROW
character ↞
$ perl unichar U+2057
Processing U+2057
match code point
code point U+2057
decimal 8279
name QUADRUPLE PRIME
character ⁗
$ perl unichar "TAMIL LETTER HA"
Processing TAMIL LETTER HA
match name
code point U+0BB9
decimal 3001
name TAMIL LETTER HA
character ஹ
$ perl unichar 0x05d0
Processing 0x05d0
match code point
code point U+05D0
decimal 1488
name HEBREW LETTER ALEF
character א
$ perl unichar "CYRILLIC CAPITAL LETTER I WITH GRAVE"
Processing CYRILLIC CAPITAL LETTER I WITH GRAVE
match name
code point U+040D
decimal 1037
name CYRILLIC CAPITAL LETTER I WITH GRAVE
character Ѝ
$ perl unichar 0x9
Processing 0x9
match code point
code point U+0009
decimal 9
name CHARACTER TABULATION
character
$ perl unichar 0x07
Processing 0x07
match code point
code point U+0007
decimal 7
name BELL
character
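The empty character field in the tab and bell examples is what you get if control characters are suppressed before printing. One way to do that is sketched here; this is a guess at the approach, not the code from unichar, and display_char is a hypothetical name. It only covers the Control general category, so other invisible characters would need their own rules.

```perl
use strict;
use warnings;

# Hypothetical helper: return the character for an ordinal, or an empty
# string when the character is in the Unicode Control category (Cc).
sub display_char {
    my ($ord) = @_;
    my $char = chr $ord;
    return $char =~ /\A\p{Cc}\z/ ? '' : $char;
}

# Tab and bell come back empty; a letter comes back unchanged.
printf "U+%04X => [%s]\n", $_, display_char($_) for 0x09, 0x07, 0x41;
```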