Character maps and code pages


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Character maps and code pages

Post by MochiMoppel »

Some time ago I opened an ASCII table thread and asked where to find character maps in Puppy. The files seem to be scattered and not consistent across distros.

Folder /usr/share/cups/charmaps works for me but may be deprecated
Folder /usr/lib/aspell seems OK, but the charmaps sometimes contain errors
@step pointed out that Fatdog64 uses /usr/lib64/aspell instead
@some1 mentioned /usr/share/i18n/charmaps - few MS code pages, many IBM variants (seldom used and poorly supported by iconv)

Presently, whenever I need charmaps in my applications, I scan all of these folders, since I can't be sure which one exists on the distro in use. Not ideal, since the syntax of the files also differs.
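
A minimal sketch of such a scan, trying the candidate folders in a fixed order (the folder list is the one above; the variable name is just illustrative):

Code: Select all

#!/bin/bash
# Probe the charmap locations mentioned above and use the first
# one that exists on the running distro.
for d in /usr/share/cups/charmaps /usr/lib/aspell /usr/lib64/aspell /usr/share/i18n/charmaps; do
    [ -d "$d" ] && { CHARMAPDIR=$d; break; }
done
echo "Using charmaps from: ${CHARMAPDIR:-none found}"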

As far as I can see nobody mentioned /usr/lib/siconv
I found this by chance on my Slacko5.6 and on the new Slacko7. It contains lots of charmaps (an Apple code page for Icelandic, anyone :) ?). The folder seems to be required by cdrtools.

Is it OK to assume that this folder is present in all major distros? I know that it exists in Fossapup64.

step
Posts: 546
Joined: Thu Aug 13, 2020 9:55 am
Has thanked: 57 times
Been thanked: 198 times
Contact:

Re: Character maps and code pages

Post by step »

Fatdog64 isn't a major distro but it does carry /usr/lib/siconv.

User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

Thanks. Checked a couple of other distros. With the notable exception of Tahrpup all have it.

User avatar
rockedge
Site Admin
Posts: 6541
Joined: Mon Dec 02, 2019 1:38 am
Location: Connecticut,U.S.A.
Has thanked: 2748 times
Been thanked: 2620 times
Contact:

Re: Character maps and code pages

Post by rockedge »

I see the directory /usr/lib/siconv on Bionic64. Good to know.

User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Character maps and code pages

Post by MochiMoppel »

Hmmm, an interesting directory indeed. However, the music plays here: /usr/lib/gconv, used by iconv, Geany, Leafpad etc. Unfortunately (for me) the character maps in this directory are compiled, so there is no straightforward way to access their contents. What's more, in my Slacko5.6 only a few exist, and "borrowing" the missing .so files from other distros does not work, so character encoding/recoding is very limited.
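
Though the compiled modules are opaque, iconv can at least show which encodings they provide and do a quick round trip (a sketch; the CP1252 example is mine):

Code: Select all

# iconv -l lists every encoding the installed gconv modules support:
iconv -l | grep -io 'CP1252' | head -1
# Round-trip test - the CP1252 bytes for 'Grüße' are 47 72 fc df 65
# (the trailing 0a is echo's newline):
echo 'Grüße' | iconv -f UTF-8 -t CP1252 | od -An -tx1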

Now it appears that /usr/lib/siconv, despite its name, is not used by iconv, and I wonder what its purpose is. Education? Amusement?

OK, then let's have fun. Here is a little script that scans all ~50 codepages in /usr/lib/siconv and displays the characters mapped to a particular (hexa)decimal value. The challenge was not to make it work. The challenge was to make it work fast.
It could be mildly useful for determining the encoding of a file when the file is rendered with cryptic block characters (containing hex values).

Code: Select all

#!/bin/bash
cd /usr/lib/siconv || exit
# Menu data for Xdialog: tag/item pairs "0xNN" / "(NN)" for 128..255
XDG=$(for i in {128..255}; do printf "0x%X\n  (%d)\n" $i $i ;done)
IFS=$'\n\t'
while : ;do
    HEX=$(Xdialog --stdout --left --menubox "Please select a value of Extended ASCII range\nhex 0x80 - 0xFF (decimal 128 - 255)" 0 0 10 $XDG)
    (($?)) && break
    DEC=$(printf $((HEX)))              # hex -> decimal
    gPATTERN="^[^#]*$HEX.*0x.*[^>]$"    # candidate charmap lines for $HEX
    sPATTERN="([^:]+).*[ 	]0x([0-9A-Fa-f]+).*#[ 	]*(.*)"
    rPATTERN="\1	\\\U\2	U+\2	\3"
    {  printf "   Representation of hex $HEX (dec $DEC)\n   CODEPAGE CHR UTF    CHARACTER NAME\n----------------------------------------------------\n"
     { printf '%11s  %b\t%s %s\n' $(grep -im1 "$gPATTERN" * | sed -r "s/$sPATTERN/$rPATTERN/")
       printf '%11s  -- UNDEFINED --\n' *    # fallback line for every codepage
     } | sort -uk 1,1                        # one line per codepage; a real match wins over the fallback
    } | gxmessage -fn 'monospace,12' -c -but Continue:0,Cancel:1 -de Continue -file -
    (($?)) && break
done

EDIT: Fixed tabs in sPATTERN and rPATTERN

Attachment: codepages.jpg
User avatar
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

MochiMoppel wrote: Tue Jun 01, 2021 5:33 am

It could be mildly useful for determining the encoding of a file when the file is rendered with cryptic block characters (containing hex values).

In Debian there is encguess. That Perl script guesses the encoding of a file pretty well.
The other one is uchardet: https://www.freedesktop.org/wiki/Software/uchardet/
That one is not so accurate, in my experience.


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

misko_2083 wrote: Fri Jun 11, 2021 12:30 am

In Debian there is encguess. That Perl script guesses the encoding of a file pretty well.

How could any script or program guess the encoding? Sounds like mission impossible.
I mean, surely a program can roll a virtual die and guess, but what are the chances of success? Let's say I have a text that contains only one extended ASCII character, and its value is 0xDE like in the screenshot. This could represent any of 25 different characters, each tied to a corresponding codepage - but which one? Unless the program understands the context it will most certainly fail.
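
To make the point concrete, a sketch (it assumes the common file utility is installed): a detector gets almost nothing to work with from a lone high byte.

Code: Select all

# A file whose only non-ASCII content is the single byte 0xDE:
printf 'price \xde 100\n' > /tmp/one_byte.txt
file --mime-encoding /tmp/one_byte.txt   # typically reports iso-8859-1 - just a guess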

BTW: I fixed my posted code. I use Geany for writing code, and there I use tabs for indentation. The forum software uses different tab widths, resulting in alignment problems, so before posting I usually convert Geany's tabs to spaces. This also changed some crucial tabs in the sed patterns. Sorry :oops:

User avatar
greengeek
Posts: 1383
Joined: Thu Jul 16, 2020 11:06 pm
Has thanked: 534 times
Been thanked: 192 times

Re: Character maps and code pages

Post by greengeek »

MochiMoppel wrote: Mon May 03, 2021 1:07 pm

Checked a couple of other distros. With the notable exception of Tahrpup all have it.

I see it here on my Tahr32 6.0.6
Contents as follows:
cp10000
cp10006
cp10007
cp10029
cp10079
cp10081
cp1250
cp1251
cp1252
cp1253
cp1254
cp1255
cp1256
cp1257
cp1258
cp437
cp737
cp775
cp850
cp852
cp855
cp857
cp860
cp861
cp862
cp863
cp864
cp865
cp866
cp869
cp874
iso8859-1
iso8859-10
iso8859-11
iso8859-13
iso8859-14
iso8859-15
iso8859-16
iso8859-2
iso8859-3
iso8859-4
iso8859-5
iso8859-6
iso8859-7
iso8859-8
iso8859-9
koi8-r
koi8-u
siconv.txt

User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Character maps and code pages

Post by MochiMoppel »

greengeek wrote:

I see it here on my Tahr32 6.0.6

I don't see it on my Tahr32 6.0.5

User avatar
wiak
Posts: 4079
Joined: Tue Dec 03, 2019 6:10 am
Location: Packing - big job
Has thanked: 65 times
Been thanked: 1206 times
Contact:

Re: Character maps and code pages

Post by wiak »

I know next to nothing about this topic, and I'm not sure the issue I had is related, but I'll post it anyway just in case.

I've recently been using rclone to save some large directories up to a Google Drive account. Unfortunately, some of these directories, which contain many subdirectories and are large (around 400MB), contain files that rclone flags as having filenames with characters that (I think) are not in the UTF-8 character set. I'm not sure that was the exact message, since I didn't have time to note it down, but I did find one or two of the problematic filenames by searching manually and noticing a character that appeared as an odd black diamond shape - like I say, I have no idea about this topic.

What I want is a script that will recursively find these problematic characters and change them into underscores, prior to using rclone to save the directory up to Google Drive... Any help appreciated.

I don't want to try the save again now, since it takes too long and is no use with the dodgy characters still there anyway, so I can't reproduce the error message exactly just now. If it happens to occur at some other stage I'll try to post better details, but I'm pretty sure the error was along the lines of the above. rclone seemed to be auto-fixing the error, but that doesn't help me with the local copy, which I want to be the same as the Google Drive version without syncing it back down again (though if there's no other choice I could try that alternative - nevertheless, a find/replace script for such characters would be useful).

EDIT: I just found this:
https://rclone.org/local/

Filenames
Filenames should be encoded in UTF-8 on disk. This is the normal case for Windows and OS X.

There is a bit more uncertainty in the Linux world, but new distributions will have UTF-8 encoded files names. If you are using an old Linux filesystem with non UTF-8 file names (e.g. latin1) then you can use the convmv tool to convert the filesystem to UTF-8. This tool is available in most distributions' package managers.

If an invalid (non-UTF8) filename is read, the invalid characters will be replaced with a quoted representation of the invalid bytes. The name gro\xdf will be transferred as gro‛DF. rclone will emit a debug message in this case (use -v to see), e.g.

Local file system at .: Replacing invalid UTF-8 characters in "gro\xdf"

So I guess I can use that convmv tool prior to the rclone attempt. I'm installing it now. On reflection, it seems to me that rclone was itself doing the above quoted representation of the invalid bytes automatically, so I am hoping that running convmv on my local drive copy will result in a similarly fixed version (though I'll have to test that). Nevertheless, I would still be interested in any script that would reveal such filenames.
EDIT2: Re-reading that replacement info, it seems likely that convmv doesn't itself use the "quoted representation of the invalid bytes" but instead converts everything to UTF-8, so I'd need something else to convert my local copy to the quoted-representation version. As for using convmv, it seems I need to know the original encoding (the "from", -f) and make the "to" (-t) encoding UTF-8, but I don't know the original encoding. I suppose I can try that encguess program mentioned in misko's earlier post to find the original encoding(?).
EDIT3: I have apparently managed to rename the last odd filename and get it the same on the local copy and on Google Drive (I used rclone's --dry-run option to find out which file it was). However, I still need an overall fix/script for other directories I am likely to upload to Google Drive using rclone copy.


User avatar
puppy_apprentice
Posts: 691
Joined: Tue Oct 06, 2020 8:43 pm
Location: land of bigos and schabowy ;)
Has thanked: 5 times
Been thanked: 115 times

Re: Character maps and code pages

Post by puppy_apprentice »

1. Make a list of all problematic directories:

Code: Select all

find . -maxdepth X -type d >dirlist.txt
where X is the recursion depth

2. Determine the encoding of dirlist.txt:

Code: Select all

file --mime-encoding dirlist.txt

3. Use convmv:

Code: Select all

convmv --notest -f xxxx -t utf8 .
where xxxx is the encoding reported in step 2.
User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

wiak wrote: Sun Jun 20, 2021 9:42 pm

Nevertheless, I would still be interested in any script that would reveal such filenames.

Found no useful solutions on stackoverflow, so I had to roll my own. Very simple and works for me:

Code: Select all

find /path/to/dir -not -name "*"

It searches the directory recursively. Invalid characters in filenames will appear as '?'. In ROX-Filer such filenames are displayed in red, with a tooltip reading "This filename is not valid UTF-8. You should rename it". Just opening the rename dialog and closing it again will then convert the names to UTF-8. For only a few affected names this method may be the easiest and safest.

If hundreds of files are affected, or if the invalid characters should be converted to something else, then Internet wisdom recommends convmv. Of course a script could be more flexible (and more fun). It would be a bit tricky, depending on the required level of flexibility and safety, but it should be possible. Using the above find command in combination with the stat command could be a good start for experiments, as sketched below.
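
For instance (a sketch; nothing is renamed, this only reveals each invalid name with its bytes quoted by stat):

Code: Select all

find /path/to/dir -not -name "*" -exec stat -c %N {} \;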

User avatar
wiak
Posts: 4079
Joined: Tue Dec 03, 2019 6:10 am
Location: Packing - big job
Has thanked: 65 times
Been thanked: 1206 times
Contact:

Re: Character maps and code pages

Post by wiak »

MochiMoppel wrote: Sat Jun 26, 2021 1:58 am

Found no useful solutions on stackoverflow, so I had to roll my own. Very simple and works for me:

Code: Select all

find /path/to/dir -not -name "*"

Yes, in practice I'm using convmv for the conversion - it certainly does the job. Quite a number of files are affected, received regularly from various business sources. However, that find trick looks simple and convenient for the quick checks I want to do. I'll give it a go. Thanks.

I think I mentioned that the issue occurs when I'm using rclone to back up to Google Drive. rclone actually does a conversion of its own (but a fudged substitution sort of conversion, unlike the proper UTF-8 conversion convmv does). I could therefore simply accept the rclone result and rclone sync back so that both the local and remote copies are the same. Currently I'm using convmv, though, since the resulting names correctly show the accented characters.

wiak


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

wiak wrote: Sat Jun 26, 2021 2:11 am

Rclone actually does a conversion of its own

I find it odd that a tool with "clone" in its name converts file names on its own. They may contain invalid UTF-8 bytes, but they are nevertheless perfectly valid file names. IMO a sync tool that makes changes to the files is a no-go.

but a kind of fudge substitution sort of conversion unlike the appropriate utf8 convmv does

Replacing the invalid UTF-8 characters with their hex representation avoids the ambiguities of codepage translation. It's not a bad idea. On the other hand, if you receive a bunch of files with a mix of French, Russian and Greek file names, you couldn't convert them all with a single "appropriate" convmv command.

User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Invalid UTF filenames

Post by MochiMoppel »

For anyone wondering what this is all about, here is a small script that creates a few files with invalid UTF-8 names.
The find command searches the whole /tmp dir and passes the result to Xdialog, which has its own way of marking the invalid characters.
Note also how the files are displayed by ROX-Filer in /tmp/testfiles, and try to open them with Geany, using Geany's Open dialog. The dialog displays yet another way of flagging these files. Spoiler alert: Geany may crash when actually trying to open the files; Leafpad will not.
Have fun!

Code: Select all

#!/bin/bash
mkdir /tmp/testfiles 2>/dev/null
> /tmp/testfiles/$'Malm\xf6'  #Malmö invalid UTF-8
> /tmp/testfiles/$'Malm\xf8'  #Malmø invalid UTF-8
> /tmp/testfiles/$'\x80uro'   #€uro  invalid UTF-8

FOUND=$(find /tmp/ -not -name "*")
Xdialog -left -back "Invalid UTF-8 filenames:" -msg "$FOUND" x

[EDIT] Changed find /tmp to find /tmp/. Without the trailing slash the find command in newer Puppies will not work as expected: it would only find the /tmp directory itself but would not search for files in the /tmp directory.



User avatar
wiak
Posts: 4079
Joined: Tue Dec 03, 2019 6:10 am
Location: Packing - big job
Has thanked: 65 times
Been thanked: 1206 times
Contact:

Re: Character maps and code pages

Post by wiak »

Honestly, I know nothing about this. The files my partner receives come from a Spanish-speaking country. I think I converted the filenames using the following (but forgot to take special note of the command):

Code: Select all

convmv -r -f ISO-8859-1 -t utf8  /directory_to_convert

The above is a test run that doesn't rename the files. Use the --notest option to actually rename; include the -i option if you want interactive (yes/no) mode.

Despite still knowing next to nothing about such matters, at least I found some references that may explain things to me once I digest them more fully:

http://dwheeler.com/essays/fixing-unix- ... .html#utf8
Mojibake: https://en.wikipedia.org/wiki/Mojibake

MochiMoppel wrote: Sat Jun 26, 2021 3:59 am

Code: Select all

> /tmp/testfiles/$'Malm\xf6'  #Malmö invalid UTF-8

This is an interesting construct, though I'm lost as to how it works. I can get by with most bash constructs, but I am certainly no genius at it. $'Malm\xf6' ... how does that do what it does???! (EDIT: found out - per my next post)
Anyway, the main question I have is: what is that script supposed to do? When I run it I get the Xdialog, but all it says is per the attached screenshot.

Code: Select all

find /tmp -not -name "*"

doesn't seem to be working; meaning, it doesn't find anything - so what was to be expected? EDIT: it works now; I had not set the LANG locale correctly (previously I just had LANG=C; not that I really have a clue about locale issues...).

Maybe I am missing some language-related component in my installation - perfectly possible, since my own usage isn't usually concerned with such matters, but it would be good to sort out, and I'm unlikely to know how to fix it myself; the ignorance that comes from only knowing and using one language, English... Well, I do know quite a bit of Norwegian, a slight bit of French and a minute bit of German, and I once studied Russian for about six weeks because I liked chess (though I found I was spending more time on the optional Russian course than on my main Electronic Engineering degree courses), but I never bothered configuring any computer to work with the various non-English characters involved in any of these.

Note: I only had the C and POSIX locales (per the locale -a command). I tried adding the locales
en_GB.UTF-8 UTF-8
en_GB ISO-8859-1
by uncommenting those two lines in /etc/locale.gen (on my WDL_Arch64 system). Of course NZ (since I'm there...) or maybe US would suit me better, but force of old habit made me use GB... So, if wanting US, one could manually edit /etc/locale.conf or use the command (per the Arch wiki Locale link below):

Code: Select all

localectl set-locale LANG=en_US.UTF-8

followed by running the locale-gen command. But this made no difference to my results...
https://wiki.archlinux.org/title/Locale

I find it all so frustrating that for tonight's bed-time reading I'm annoying myself further with:
"The UNIX-Haters Handbook"
http://web.mit.edu/~simsong/www/ugh.pdf

EUREKA!!! Okay, I was daft. After the above, I had forgotten to set LANG=en_GB.utf8 (in /etc/locale.conf), which I have now done, and maybe the new attached image (fixed_I_think.png) is what I'm supposed to get... Mind you, what a weird but working idea to use the command:

Code: Select all

find /tmp -not -name "*"
Attachments: fixed_I_think.png, pcmanfm_display.png, file_list.png, invalid_utf8.png


User avatar
wiak
Posts: 4079
Joined: Tue Dec 03, 2019 6:10 am
Location: Packing - big job
Has thanked: 65 times
Been thanked: 1206 times
Contact:

Re: Character maps and code pages

Post by wiak »

Hmmm... This stuff is kind of fascinating:

https://www.smashingmagazine.com/2012/0 ... cter-sets/

UTF-8 is clever. It works a bit like the Shift key on your keyboard. Normally when you press the H on your keyboard a lower-case “h” appears on the screen. But if you press Shift first, a capital H will appear.

UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-192 as the key to be shifted. For instance, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. The exact calculation is (208%32)*64 + (175%64) = 1071. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.
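
The quoted arithmetic is easy to verify in the shell (illustrative only):

Code: Select all

# Two-byte UTF-8 sequence: the lead byte 208 (0xD0) contributes its low
# 5 bits, the continuation byte 175 (0xAF) its low 6 bits.
lead=208 cont=175
printf 'U+%04X\n' $(( (lead % 32) * 64 + (cont % 64) ))   # U+042F = 1071, Cyrillic Я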

Alas, there are not enough hours in the day. Okay, so I found this when reading through one of my links above:
http://dwheeler.com/essays/fixing-unix- ... s.html#ifs

A slightly more pleasant approach in Bourne-like shells is to use the $'...' extension. This isn’t standard, but it’s widely supported, including by the bash, ksh (korn shell), and zsh shells. In these shells you can just say IFS=$'\n\t' and you’re done, which is slightly more pleasant. As the korn shell documentation says, the purpose of '...' is to ‘solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all “...” strings handle ANSI-C escapes, but that would not be backwards compatible.’ It might even be more efficient; some shells might implement ‘printf ...’ by invoking a separate process, which would have nontrivial overhead (shells can optimize this away, too, since printf is typically a builtin). But this $'...' extension isn’t supported by some Bourne-like shells, including dash (the default /bin/sh in Ubuntu) and the busybox shell, and the portable version isn’t too bad. I’d like to see $'...' added to a future POSIX standard and these other shells, as it’s a widely implemented and useful extension. I think $'...' will be in the next version of the POSIX specification (you can blame me for proposing it).

and: https://tldp.org/LDP/abs/html/abs-guide.html#STRQ
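
So, as a quick illustration of the construct from the earlier script: $'...' applies those ANSI-C rules before the value is used, and \xf6 becomes the single raw byte 0xF6:

Code: Select all

printf '%s' $'Malm\xf6' | od -An -tx1   # 4d 61 6c 6d f6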


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Invalid UTF filenames

Post by MochiMoppel »

wiak wrote: Sat Jun 26, 2021 1:45 pm

maybe the new attached image (fixed_I_think.png) is what I'm supposed to get... Mind you, what a weird but working idea to use command:

Code: Select all

find /tmp -not -name "*"

Not quite what you are supposed to get. Firstly, it surely is weird that the code works at all, but it's also weird that it works on your 64-bit system. It didn't work when I tested it in Fossapup64 and slacko64. It turned out that the find command there works differently and insists on a trailing slash for the search directory. In my slacko5.6 it works with or without the slash, and that's the way it's supposed to work.
Secondly, it's odd that your Xdialog shows boxes with FFFD inside. U+FFFD is the codepoint of the REPLACEMENT CHARACTER (�), and Xdialog should be able to display it, just like pcmanfm does in your screenshot pcmanfm_display.png.

My Xdialog output is also different. These funny crossed rectangles look like characters, but are they really? They don't behave like characters, and when copied to the clipboard the clipboard manager translates them to \Uffffffff. This is not valid Unicode and would be outside the permissible UTF-8 value range.
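
For reference (a sketch): U+FFFD has a perfectly ordinary UTF-8 encoding of its own, so a renderer has everything it needs to display it:

Code: Select all

# bash printf understands \u escapes (bash >= 4.2):
printf '\ufffd\n' | od -An -tx1   # ef bf bd 0a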

The biggest surprise and disappointment in "modern" Puppies is the behavior of ROX-Filer. As shown in my screenshot, the "classic" ROX-Filer displays the invalid UTF-8 names in red and adds a tooltip. At the same time it tries to be clever: instead of marking the problematic characters with one of the placeholders � or '?', it displays "best guess" characters - probably depending on the current locale. This can make the names look like ordinary names, but the red color tells the user that something is fishy and that they are not what they appear to be. However, in newer ROX-Filers these names appear in black and can't be distinguished from UTF-8 encoded names - unless the user hovers the mouse and triggers the tooltip, but why would any user do that? A very bad design decision.

Attachments: Screenshot.png
some1
Posts: 85
Joined: Wed Aug 19, 2020 4:32 am
Has thanked: 18 times
Been thanked: 14 times

Re: Character maps and code pages

Post by some1 »

In my Fossapup9.5, /tmp is a symlink, and find's default is -P (don't follow symlinks) - see the manual.
-----
Fatdog siconv location: /usr/lib64/siconv
-----
The find logic seems to be:
1. The valid namespace is a subset of the stored name bytes; the valid subset is determined by the locale/encoding.
2. find -name "*" -> the valid filename space.
3. find -not -name "*" -> the invalid cruft, part of which might be valid with a different locale.
----
Gnome - and probably other GUI toolkits - have handlers for FILENAMES. Gnome has G_BROKEN_FILENAMES
(so GUI-wise we might be able to see file items which we cannot capture with e.g. a normal find command).
And yes! - we need the red-colored indication in ROX for a faulty filename.

User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Character maps and code pages

Post by MochiMoppel »

some1 wrote: Tue Jun 29, 2021 12:46 pm

In my Fossapup9.5, /tmp is a symlink, and find's default is -P (don't follow symlinks) - see the manual.

OK, that explains the behavior. Setting the option to -H should fix the issue. It looks a bit weird, but I find it more elegant than forcibly adding a trailing slash.

The find logic seems to be:
1. The valid namespace is a subset of the stored name bytes; the valid subset is determined by the locale/encoding.

Hard to digest.
I can reproduce the failed results reported by @wiak.
My locale is en_US.UTF-8, which lets the script work as expected. However, when I override the locale with a preceding
export LC_CTYPE=POSIX
then -not -name "*" finds no invalid names, which is very strange, because POSIX would also make properly UTF-8 encoded filenames invalid, wouldn't it?

Easy to verify: fix the names in /tmp/testfiles to valid UTF-8 names (with the ROX-Filer rename trick).
With LC_CTYPE=en_US.UTF-8 the command find /tmp/testfiles/M* returns such a valid name as /tmp/testfiles/Malmø.
With LC_CTYPE=POSIX the same command returns the (now invalid?) name as /tmp/testfiles/Malm??.
Though POSIX makes the "valid subset" much smaller, find's -not -name "*" condition will not return any invalid filename.
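
The whole experiment in script form (a sketch; assumes the en_US.UTF-8 locale is generated):

Code: Select all

#!/bin/bash
mkdir -p /tmp/testfiles
> /tmp/testfiles/$'Malm\xf6'                              # invalid UTF-8 name
LC_CTYPE=en_US.UTF-8 find /tmp/testfiles -not -name "*"  # prints the invalid name
LC_CTYPE=POSIX       find /tmp/testfiles -not -name "*"  # prints nothing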

[EDIT]

wiak wrote: Sat Jun 26, 2021 2:16 pm

Hmmm... This stuff is kind of fascinating:
https://www.smashingmagazine.com/2012/0 ... cter-sets/

UTF-8 is clever. It works a bit like the Shift key on your keyboard. Normally when you press the H on your keyboard a lower-case “h” appears on the screen. But if you press Shift first, a capital H will appear.

UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-192 as the key to be shifted.

That's a nice article and description. However, I think there is a mistake: "Shift keys" don't start at 192, they start at 194 (192 can't be a Shift key and also a key to be shifted, as the article's ranges suggest - it can't be both).
2-byte UTF-8 characters start with the Unicode block "Latin-1 Supplement" (U+0080 ~ U+00FF). The first character in this block (the non-printable U+0080 PADDING CHARACTER) is formed with the hex bytes \xc2\x80.
\xc2 is decimal 194. Any attempt to form a UTF-8 character with a leading \xc0 or \xc1 (192 or 193) results in an invalid character.
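
Easy to confirm with iconv (a sketch): \xc2\x80 decodes to U+0080, while a \xc0 lead byte (an overlong form of the same codepoint) is rejected:

Code: Select all

printf '\xc2\x80' | iconv -f UTF-8 -t UCS-2BE | od -An -tx1   # 00 80
printf '\xc0\x80' | iconv -f UTF-8 -t UCS-2BE                 # iconv: illegal input sequence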

User avatar
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

This extracts the non-ASCII characters:

Code: Select all

nonascii() { LANG=C grep -o '[^ -~]\+'; }
echo $'Malm\xf6' $'Malm\xf8' $'\x80uro' | nonascii
�
�
�

You would think the � are the same character in the output, but:

Code: Select all

printf '%q\n' $(echo $'Malm\xf6' $'Malm\xf8' $'\x80uro' | nonascii)
$'\366'
$'\370'
$'\200'

So � is a "visual aid", not the real character passed to Xdialog.

Let's try to replace the non-ASCII characters with '?'. The following will only echo the mv commands:

Code: Select all

find /tmp/testfiles -print0 | while IFS= read -r -d '' file; do
    dname="$(dirname "$file")"
    fname="$(basename "$file")"
    newname="${fname//[^[:ascii:]]/?}"

    if [ "${fname}" != "${newname}" ]; then     # if equal -> already clean -> skip
        if [ -e "${dname}/${newname}" ]; then
            echo "\"$newname\" and \"$fname\" both exist in \"$dname\":"
            ls -ld "$dname/$newname" "$dname/$fname"
        else
            echo mv "$file" "$dname/$newname"
        fi
    fi
done

The result:

Code: Select all

mv /tmp/testfiles/Malm� /tmp/testfiles/Malm?
mv /tmp/testfiles/�uro /tmp/testfiles/?uro

Any byte value that falls within the ASCII range (0x00 .. 0x7F) fits in one byte and maps to the same codepoint in Unicode (U+0000 .. U+007F),
but any byte in the ANSI range beyond ASCII (0x80 .. 0xFF) is subject to interpretation by whatever character encoding created it.
Some encodings use one byte per character, some use several.
Detecting non-Unicode characters is impossible, because there is no single non-UTF-8 encoding: some characters are one byte, some two, some three or more.


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

misko_2083 wrote: Wed Jul 14, 2021 2:23 am

Let's try to replace the non ascii with '?'

Has the potential to turn a perfectly readable and valid name into a mess like ???????.jpg :)

Detecting non-unicode characters is impossible

We just did that. See previous posts.

User avatar
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

MochiMoppel wrote: Fri Jul 16, 2021 12:23 am
misko_2083 wrote: Wed Jul 14, 2021 2:23 am

Let's try to replace the non ascii with '?'

Has the potential to turn a perfectly readable and valid name into a mess like ???????.jpg :)

Detecting non-unicode characters is impossible

We just did that. See previous posts.

??????.jpg - that's only a bit worse than my method of naming scripts. :)
I meant automatically detecting non-Unicode chars in a script.
There always seems to be a need for user input.
When changing a text from Latin to Cyrillic I always have to do manual corrections.
Hey, the difference is only in three letters.
Our language is completely vocalized, with 30 sounds and 30 Cyrillic characters.
However, we also have a Latin script with 27 characters and 3 digraphs.
Sometimes people use English keyboards that don't have some of the letters (or lack a different keyboard input, or are too lazy to switch).
Instead they type the closest match, and the software transliterates it to a completely different letter in Cyrillic.
And that's why I always have to go through all the lines and fix the letters, relying on the context of the sentence.
I think Google tried to make software that does this automatically, but failed.


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

misko_2083 wrote: Fri Jul 16, 2021 6:35 pm

Our language is completely vocalized, with 30 sounds and 30 Cyrillic characters.
However, we also have a Latin script with 27 characters and 3 digraphs.

Serbian?

Sometimes people use English keyboards that don't have some of the letters (or lack a different keyboard input, or are too lazy to switch).
Instead they type the closest match, and the software transliterates it to a completely different letter in Cyrillic.

I assume that the "software" is not iconv and that there is no codepage around that can convert Basic Latin (ASCII) to Cyrillic equivalents, e.g. changing an ASCII L to a Cyrillic Л. On the other hand, a more or less sophisticated search/replace script should be able to do it, starting with the digraphs and leaving only ambiguous characters like Z (could be Ж or З) for manual correction. I don't know if this describes your task; it's just my imagination. A real-life example would help.
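
Something along those lines, as a toy sketch (the mapping is hypothetical and incomplete; digraphs are handled before single letters, and ambiguous letters are left untouched for manual correction):

Code: Select all

#!/bin/bash
# Serbian Latin -> Cyrillic transliteration sketch (UTF-8 locale assumed)
translit() {
    sed -e 's/Lj/Љ/g; s/lj/љ/g' \
        -e 's/Nj/Њ/g; s/nj/њ/g' \
        -e 's/Dž/Џ/g; s/dž/џ/g' \
        -e 'y/abvgdezijklmnoprstufhc/абвгдезијклмнопрстуфхц/'
}
echo "ljubav" | translit   # -> љубав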

User avatar
Grey
Posts: 2023
Joined: Wed Jul 22, 2020 12:33 am
Location: Russia
Has thanked: 76 times
Been thanked: 376 times

Re: Character maps and code pages

Post by Grey »

MochiMoppel wrote: Sun Jul 18, 2021 2:20 am

Serbian?
<snip>
A real-life example would help.

Yes, he meant Serbian - I remember the words "сликовни" and "уклони" from his screenshots. Serbia (or Montenegro) is doing well.
You have not yet seen how encodings are used in the countries of the former USSR, with English keyboards onto which letter stickers are glued - a mixture of Cyrillic and Latin plus local flavor. But the people have adapted :)


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Character maps and code pages

Post by MochiMoppel »

The Serbian/Cyrillic discussion has been moved to a separate thread.

wiak wrote: Sun Jun 20, 2021 9:42 pm

nevertheless, a find/replace script for such characters would be useful
<snip>
it seems likely that convmv doesn't itself use "quoted representation of the invalid bytes" but instead converts everything to UTF-8

The find part is done. The replace part is not.
I wasn't able to find a copy of the convmv tool to see what it does. If it really can only change the encoding from whatever codepage to UTF-8, then I would consider this tool too limited. What if I want to change the names to ASCII, not UTF-8, e.g. to make them compatible with Unicode-agnostic filesystems or tools?

After a bit of tinkering I've come up with the shortest (and most reliable!) replacement script. A one-liner:

Code: Select all

eval "$(find -H /tmp -not -name "*" -exec stat -c "mv -b $%N %N" {} \; | sed "s/[‘’]/'/g")"

It finds invalid UTF-8 names and changes the "problematic" bytes of a filename to their octal representations, so referring to our earlier examples it would change the invalid Malmö to the valid, pure-ASCII name Malm\366.
For the unlikely case that the renaming would overwrite an already existing file named Malm\366, I use the -b option of the mv command, which creates a backup copy of the overwritten file.
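
To see what the one-liner feeds to eval, the stat part can be run alone (illustrative; newer GNU stat may print the ‘fancy’ quotes that the sed then normalizes to plain '):

Code: Select all

stat -c "mv -b $%N %N" /tmp/testfiles/$'Malm\xf6'
# mv -b $'/tmp/testfiles/Malm\366' '/tmp/testfiles/Malm\366'
# On eval, $'...' expands the octal escape back to the raw byte (the old
# name), while the plain '...' keeps it literal (the new ASCII name).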

Attachments: fixinvalid.png
User avatar
wiak
Posts: 4079
Joined: Tue Dec 03, 2019 6:10 am
Location: Packing - big job
Has thanked: 65 times
Been thanked: 1206 times
Contact:

Re: Character maps and code pages

Post by wiak »

MochiMoppel wrote: Wed Jul 28, 2021 2:40 am

I wasn't able to find a copy of the convmv tool to see what it does. If it really can only change encoding from whatever codepage to UTF-8 then I would consider this tool too limited.

No, it seems I was wrong about that: you can give current-encoding and target-encoding parameters:

https://linux.die.net/man/1/convmv

I just used it once, for a quick latin1 conversion, and I think it was to UTF-8, but I can't recall since I haven't used it since (though I'll probably need it again sometime). Your find/replace one-liner would probably do for my own limited needs, though.


User avatar
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

MochiMoppel wrote: Wed Jul 28, 2021 2:40 am

After a bit of tinkering I've come up with the shortest (and most reliable!) replacement script. A one-liner:

Code: Select all

eval "$(find -H /tmp -not -name "*" -exec stat -c "mv -b $%N %N" {} \; | sed "s/[‘’]/'/g")"

It finds invalid UTF-8 names and changes the "problematic" bytes of a filename to their octal representations, so referring to our earlier examples it would change the invalid Malmö to the valid, pure-ASCII name Malm\366.
For the unlikely case that the renaming would overwrite an already existing file named Malm\366, I use the -b option of the mv command, which creates a backup copy of the overwritten file.

That's very good Mochi.
It's possible to revert it back with character transformation.
I think it was introduced in bash 4.4? I have 5.0.3.

Code: Select all

 VAR='Malm\366'
 echo $VAR
Malm\366
 VAR="${VAR@E}"
 echo ${VAR}
Malm�
echo ${VAR@Q}
$'Malm\366'

I was wondering where ${variable@E} would be useful, other than the obvious use case of expanding ${variable@Q}.


User avatar
MochiMoppel
Posts: 1233
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 21 times
Been thanked: 437 times

Re: Character maps and code pages

Post by MochiMoppel »

misko_2083 wrote: Sat Jul 31, 2021 10:40 am

That's very good Mochi.
It's possible to revert it back with character transformation.
I think it was introduced in bash 4.4? I have 5.0.3.

Introduced in bash 4.4 alpha ("o. There is a new ${parameter@spec} family of operators to transform the value of `parameter'."). Buried in an endless list of bugfixes, which makes me wonder if anything in bash ever worked as intended :lol:

Looks like a more convenient syntax for what was already possible with much older versions. The printf command contains this functionality, and it works with bash 3.0, the oldest version I could find in my archive. I don't know an elegant way to convert a non-ASCII string into a string with octal escape codes, i.e. I miss a command similar to printf %q that would produce only the .... part of a $'....' construct. Apart from that, everything else is possible with simple code:

Code: Select all

var1=$'K\xf8benhavn K\xc3\xb8benhavn' #contains invalid and valid UTF-8 hex escape codes
echo "var1: $var1"          #K�benhavn København
var2=$(printf %q "$var1")   #convert to $'...' construct
var2=${var2//[$\']/}        #convert $'...' to ...
echo "var2: $var2"          #K\370benhavn K\303\270benhavn
var3=$(printf "$var2")      #reconvert octal escape codes to invalid and valid UTF-8 chars
echo "var3: $var3"          #K�benhavn København
var4=$(printf %q "$var3")   #convert string to $'...' construct, with all non-ASCII chars as octal escape codes
echo "var4: $var4"          #$'K\370benhavn K\303\270benhavn'
eval echo "var4: $var4"     #K�benhavn København
User avatar
misko_2083
Posts: 196
Joined: Wed Dec 09, 2020 11:59 pm
Has thanked: 10 times
Been thanked: 20 times

Re: Character maps and code pages

Post by misko_2083 »

MochiMoppel wrote: Sun Aug 01, 2021 7:43 am

Introduced in bash 4.4 alpha ("o. There is a new ${parameter@spec} family of operators to transform the value of `parameter'."). Buried in an endless list of bugfixes which makes me wonder if anything in bash ever worked as intended :lol:

I'm still exploring the usefulness of it.

Q - returns a single-quoted string with any special characters such as \n, \t, ... escaped.
Maybe it can be used to place all the lines of a file on a single line:

Code: Select all

printf "%q" "$(< file)"
$'line1\nline2\nline3\nline4'

var=$(<file)
echo "${var@Q}"
$'line1\nline2\nline3\nline4'

P - is neat for testing the look of prompts like PS1, PS2:

Code: Select all

prompt="\\[$(tput setaf 5)\\]\\u@\\h:\\w #\\[$(tput sgr0)\\]"
echo "${prompt@P} Neat"

E - I'm not sure how useful it is.
It expands all escaped characters:

Code: Select all

var="one\n\ttwo\nthree"
 echo "${var}"
one\n\ttwo\nthree

echo -e "${var}"
one
	two
three

echo "${var@E}"
one
	two
three

There is probably some use case when passing quoted variables to a script, like a Thunar custom action calling a script:

Code: Select all

bash -c 'echo "${@@Q}"' _ a b c
'a' 'b' 'c'
bash -c 'echo "${@@E}"' _ 'a' 'b' 'c'
a b c
bash -c 'printf "%s\n" "${@@Q}"' _ a b c
'a'
'b'
'c'
bash -c 'printf "%s\n" "${@@E}"' _ 'a' 'b' 'c'
a
b
c
bash -c 'printf "%s\n" "${@@Q}"' _ {1..3} | yad --text-info
'1'
'2'
'3'

*If used directly as a Thunar custom action instead of via a script file, the %s option in printf must be escaped as %%s:

Code: Select all

bash -c 'printf "%%s\n" "${@@Q}"' _ %F | yad --text-info

A - prints out the variable as it was assigned, together with its declare options if available:

Code: Select all

var='one\n\ttwo\nthree'
echo "${var@A}"
var='one\n\ttwo\nthree'

declare -i var=5
echo "${var@A}"
declare -i var='5'

var+=1
echo "${var@A}"
declare -i var='6'

a - returns the variable's attributes, if any are declared:

Code: Select all

declare -ri var='5'
echo ${var@a}
ri
MochiMoppel wrote: Sun Aug 01, 2021 7:43 am

Looks like a more convenient syntax for what was already possible with much older versions. The printf command contains this functionality, and it works with bash 3.0, the oldest version I could find in my archive. I don't know an elegant way to convert a non-ASCII string into a string with octal escape codes, i.e. I miss a command similar to printf %q that would produce only the .... part of a $'....' construct. Apart from that, everything else is possible with simple code:

Maybe with sed.

Code: Select all

ls /tmp/testfiles | sed  -n 'l'   # That's small letter L

If there are newlines in the filenames, then it's best to do find -print0 | xargs -0:

Code: Select all

find -H /tmp -not -name "*" -print0 2>/dev/null | xargs --null printf "'%s'\n" | sed  -n 'l' 
'/tmp/testfiles/Malm\366'$
'/tmp/testfiles/Malm\370'$
'/tmp/testfiles/\200uro$
eh'$
'/tmp/testfiles/\200uro'$

And that last character has to be removed :)

Code: Select all

 find -H /tmp -not -name "*" -print0 2>/dev/null | xargs --null printf "'%s'\n" | sed  -n 'l'  | sed  's/.$//'
 '/tmp/testfiles/Malm\366'
'/tmp/testfiles/Malm\370'
'/tmp/testfiles/\200uro
eh'
'/tmp/testfiles/\200uro'
 

Normally it's used to display octal values instead of non-printable characters in a file:

Code: Select all

sed  -n 'l' file.txt

