Re-writing Learning Perl to cover Unicode means I have to figure out how to type some of the characters that don’t show up on my keyboard. Not only that, I need to figure out their character names and code points for the examples. I want to convert from any of those (name, code point, character) to a description of the character. I want something like this:
    $ perl unichar ã
    Processing ã
        match       grapheme
        code point  U+00E3
        decimal     227
        name        LATIN SMALL LETTER A WITH TILDE
        character   ã
I wrote a short program I called unichar, which I have on GitHub.
There are some interesting parts of the script (which might change since I’m still tinkering with it). Even though my locale is set to en_US.UTF-8 and the command-line arguments are UTF-8, the script still sees them as raw octets, so I have to decode them myself. The decode subroutine from Encode takes octets in a named encoding and turns them into a Perl character string. I ask I18N::Langinfo for the locale’s codeset and decode every element of @ARGV with it:
    use Encode qw(decode);
    use I18N::Langinfo qw(langinfo CODESET);

    my $codeset = langinfo(CODESET);
    @ARGV = map { decode $codeset, $_ } @ARGV;
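To see why the decoding step matters, here is a minimal sketch that is not part of unichar. It assumes the shell hands over ã as its two UTF-8 octets, and it hard-codes the UTF-8 encoding name instead of asking the locale; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: these are the two octets a UTF-8 terminal would pass for "ã".
my $octets = "\xC3\xA3";

# decode turns the octets into a Perl character string.
my $chars = decode( 'UTF-8', $octets );

printf "octets: %d, characters: %d\n", length $octets, length $chars;
printf "code point: U+%04X\n", ord $chars;
```

Before decoding, Perl counts two "characters" (really octets); after decoding there is a single character whose code point is U+00E3.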
There are some other interesting bits in there too, but they are a bit advanced for Learning Perl.
Here are some more examples of the output. I handle unprintable and invisible characters specially:
    $ perl unichar 䣱
    Processing 䣱
        match       grapheme
        code point  U+48F1
        decimal     18673
        name
        character   䣱

    $ perl unichar ↞
    Processing ↞
        match       grapheme
        code point  U+219E
        decimal     8606
        name        LEFTWARDS TWO HEADED ARROW
        character   ↞

    $ perl unichar U+2057
    Processing U+2057
        match       code point
        code point  U+2057
        decimal     8279
        name        QUADRUPLE PRIME
        character   ⁗

    $ perl unichar "TAMIL LETTER HA"
    Processing TAMIL LETTER HA
        match       name
        code point  U+0BB9
        decimal     3001
        name        TAMIL LETTER HA
        character   ஹ

    $ perl unichar 0x05d0
    Processing 0x05d0
        match       code point
        code point  U+05D0
        decimal     1488
        name        HEBREW LETTER ALEF
        character   א

    $ perl unichar "CYRILLIC CAPITAL LETTER I WITH GRAVE"
    Processing CYRILLIC CAPITAL LETTER I WITH GRAVE
        match       name
        code point  U+040D
        decimal     1037
        name        CYRILLIC CAPITAL LETTER I WITH GRAVE
        character   Ѝ

    $ perl unichar 0x9
    Processing 0x9
        match       code point
        code point  U+0009
        decimal     9
        name        CHARACTER TABULATION
        character

    $ perl unichar 0x07
    Processing 0x07
        match       code point
        code point  U+0007
        decimal     7
        name        BELL
        character
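The conversions in these examples can be sketched with the core charnames module. This is not the unichar source, only a rough illustration: describe() is a hypothetical helper, and the "(not printable)" placeholder is my assumption about one way to treat control characters, using the \p{Print} property.

```perl
use strict;
use warnings;
use charnames ();    # load viacode and vianame without importing anything

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: build a one-line description from a code point.
sub describe {
    my ($code) = @_;
    my $name   = charnames::viacode($code) // '';    # may have no name
    my $char   = chr $code;

    # Assumption: anything outside \p{Print} gets a placeholder.
    my $shown = $char =~ /\p{Print}/ ? $char : '(not printable)';

    return sprintf 'U+%04X  %d  %s  %s', $code, $code, $name, $shown;
}

print describe( ord "\x{E3}" ), "\n";                            # from a character
print describe( hex '05D0' ), "\n";                              # from a code point
print describe( charnames::vianame('TAMIL LETTER HA') ), "\n";   # from a name
print describe( 0x09 ), "\n";                                    # a control character
```

Every input form is first reduced to a code point, so the name and character lookups only have to be written once.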