Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Tue Oct 24, 2023 12:10 pm
by proebler
I have a number of files which, after being copied via a Samba share from one drive to another, are now shown as having 'non valid UTF-8' names.
The file names contain 'Umlaut' character(s) (üäö).
In Rox-Filer:
When I 'Duplicate' such a file, the duplicate file then gets a valid (UTF-8) name.
Example:
(non-valid) AüFilename.pdf >Duplicate> (valid) AüFilename-2.pdf
When I 'Rename' such a file, the renamed file also gets a valid (UTF-8) name.
Example:
(non-valid) AüFilename.pdf >Rename> (valid) AüäöFilename-2.pdf
On the other hand, when I use 'Copy', the copied file remains non-valid and qpdfview refuses to open it.
Example:
(non-valid) AüFilename.pdf >Copy> (non-valid) Copy of AüFilename.pdf
Can someone explain what happens?
Thanks
proebler
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Wed Oct 25, 2023 6:01 pm
by HerrBert
@proebler
I recall having a similar issue using yassm on a fat32 windows share...
If you use yassm to access a Windows share, you may have to set iocharset=utf8 as an option.
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Wed Oct 25, 2023 7:08 pm
by mikewalsh
@proebler :-
I don't know as I can explain what's happening, but it sounds very similar to what happens with certain PNG images I download. I'll try to view them in Viewnior, but Viewnior complains that the file contains no "readable" data.
What I then do is to open said file in the GIMP, followed by re-saving it, still as a PNG, to the same location.....so it 'overwrites' the original. Invariably, this always "fixes" the rogue file, after which Viewnior opens it as normal with no complaints.
Like you, it's one of those wee mysteries that no doubt HAS an explanation.......but, since it's only an occasional occurrence (and I know the simple 'fix' for it), I've never bothered to try & figure out exactly what's behind it. I may BE a 'geek', but I'm not a very 'geeky' one..!
Quite how simply running something through a 'Save' operation fixes functionality, I don't have a clue.......but it always seems to work. And for that, I'm grateful.
Mike.
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Thu Oct 26, 2023 2:37 am
by MochiMoppel
proebler wrote: ↑Tue Oct 24, 2023 12:10 pm
In Rox-Filer:
When I 'Duplicate' such a file, the duplicate file then gets a valid (UTF-8) name.
No surprise. Duplication requires you to type a new name, and ROX-Filer, like any other GUI application, simply doesn't let you type invalid UTF-8, so your new file name will always be valid.
When I 'Rename' such a file, the renamed file also gets a valid (UTF-8) name.
Same as 'Duplicate'.
On the other hand, when I use 'Copy', the copied file remains non-valid and qpdfview refuses to open it.
'Copy' doesn't affect the source name, it just needs a target name. Though the UTF-8 encoding of the file name may be wrong, that doesn't mean the file name itself is invalid.
What may be confusing: ROX-Filer tries to guess what the invalid UTF-8 should be, and instead of rendering such an undefined character with a question mark (as most other applications do), it renders the character with its "best guess".
Let's create an example. The following code will create an empty file in directory /tmp with an invalid UTF-8 character in its file name. In ROX-Filer the name will appear as AüFilename, in other file managers and in the GTK Open dialog it will appear as A?Filename.
The name contains the byte 0xFC, which in many code pages is rendered as the umlaut 'ü', but which is not valid UTF-8.
In UTF-8 the umlaut 'ü' requires 2 bytes, 0xC3 and 0xBC; the Unicode code point that this pair encodes, however, is U+00FC. So this might give ROX-Filer the idea to represent the invalid byte 0xFC with the valid character U+00FC.
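The bytes involved can be checked without creating any file. A minimal sketch, assuming od and iconv are available; hex_bytes and the variable names are made up for illustration. Octal escapes are used so that it also runs in shells whose printf doesn't understand \x.

```shell
#!/bin/sh
# "AüFilename" in two encodings, written with octal escapes so the bytes
# are unambiguous: \374 = 0xFC, \303\274 = 0xC3 0xBC.
legacy_name=$(printf 'A\374Filename')     # single byte 0xFC, not valid UTF-8
utf8_name=$(printf 'A\303\274Filename')   # two bytes 0xC3 0xBC, valid UTF-8

# Helper (hypothetical name): print the bytes of a string as one hex string
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes "$legacy_name"
hex_bytes "$utf8_name"

# Converting the single-byte form from WINDOWS-1252 yields the UTF-8 form
converted=$(printf '%s' "$legacy_name" | iconv -f WINDOWS-1252 -t UTF-8)
hex_bytes "$converted"
```

The first dump should contain a lone fc after the 41 for 'A'; the second and third should both show c3 bc in that position.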
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Thu Oct 26, 2023 6:31 pm
by Burunduk
MochiMoppel wrote: ↑Thu Oct 26, 2023 2:37 am
Let's create an example. Following code will create an empty file in directory /tmp with an invalid UTF-8 character in its file name. In ROX-Filer the name will appear as AüFilename, in other file managers and in the GTK Open dialog it will appear as A?Filename
> instead of touch? Interesting. This is the only explanation I've found: "What occurs when I use redirection without leading commands?" It seems to be POSIX. Works in the busybox shells too.
There is a tool for converting such filenames to a valid utf-8:
Code: Select all
root# >/tmp/A$'\xFC'Filename
root#
root# convmv -f latin-1 -t utf8 -r --notest /tmp/
mv "/tmp/AüFilename" "/tmp/AüFilename"
Ready! I converted 1 files in 0 seconds.
convmv is a perl script from the ubuntu or debian repo.
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Fri Oct 27, 2023 2:22 am
by MochiMoppel
I never use touch for this purpose (though for better readability I often use echo -n > myfile). It's also useful for reducing existing files to zero size.
The bash manual says
3.6.2 Redirecting Output
Redirection of output causes the file whose name results from the expansion of word to be opened for writing on file descriptor n, or the standard output (file descriptor 1) if n is not specified. If the file does not exist it is created; if it does exist it is truncated to zero size.
The general format for redirecting output is:
[n]>[|]word
In other words: The file descriptor n on the left side is optional and the file on the right side is created if it doesn't exist, i.e. the file is created even if no data are actually redirected.
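A quick way to see both effects, creation and truncation. This is only a sketch; the file names are arbitrary and everything happens in a scratch directory made with mktemp.

```shell
#!/bin/sh
dir=$(mktemp -d)

# A bare redirection creates an empty file; no command is needed
> "$dir/new_file"

# On an existing file the same redirection truncates it to zero size
printf 'some data\n' > "$dir/old_file"
> "$dir/old_file"

# ':' is the shell's no-op builtin; ': > file' is an equivalent spelling
: > "$dir/other_file"

ls -l "$dir"
```

All three files should be listed with size 0. Some shells (zsh, for one) treat a bare redirection differently, so ': >' is the safer spelling in scripts that are meant to be portable.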
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Fri Oct 27, 2023 4:14 am
by MochiMoppel
Burunduk wrote: ↑Thu Oct 26, 2023 6:31 pm
There is a tool for converting such filenames to a valid utf-8:
Code: Select all
root# >/tmp/A$'\xFC'Filename
root#
root# convmv -f latin-1 -t utf8 -r --notest /tmp/
mv "/tmp/AüFilename" "/tmp/AüFilename"
Ready! I converted 1 files in 0 seconds.
convmv is a perl script from the ubuntu or debian repo.
I don't know convmv, but I think the same effect can also be achieved with a bash script using iconv, which is probably included in all Puppies. Like convmv, it requires a trailing slash when a directory is passed as argument:
Code: Select all
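#!/bin/bash
# Usage: pass one directory with a trailing slash, e.g. /tmp/
for f in "$1"* ;do
# Skip names that are already valid UTF-8, so they are not double-encoded
printf '%s' "$f" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 && continue
# Skip names that contain bytes undefined in WINDOWS-1252
fconv=$(printf '%s' "$f" | iconv -f WINDOWS-1252 -t UTF-8) || continue
[[ $f != "$fconv" ]] && mv -- "$f" "$fconv"
done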
Seems to work, but haven't tested extensively. For peace of mind try with 'cp' instead of 'mv'
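Another way to get that peace of mind is a dry run that only prints what would be renamed. A rough sketch along the same lines as the script above; converted_name and dry_run are made-up names, and WINDOWS-1252 as source encoding is an assumption that has to match how the names were originally written.

```shell
#!/bin/sh
# Print the name a file would get, without touching the filesystem.
# Names that are already valid UTF-8 come back unchanged.
converted_name() {
    if printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf '%s' "$1"
    else
        printf '%s' "$1" | iconv -f WINDOWS-1252 -t UTF-8
    fi
}

# List the planned renames for one directory; nothing is moved or copied.
dry_run() {
    for f in "${1%/}"/*; do
        [ -e "$f" ] || continue
        new=$(converted_name "$f") || continue
        [ "$f" != "$new" ] && printf 'would rename: %s -> %s\n' "$f" "$new"
    done
    return 0
}

# Example call (the directory is only an illustration):
# dry_run /tmp/
```

If the list looks right, the real rename can be run afterwards. In a terminal the old name will usually show up with a replacement character, since it is printed byte for byte.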
Here is a related post, describing how to find files with improper file name: viewtopic.php?p=29088#p29088
The trick of using find <path> -not -name "*" doesn't work anymore in BW64. Don't know why.
Edit: BW64 uses find (GNU findutils) 4.9.0. It now behaves like busybox find. Older versions are fine.
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Sat Oct 28, 2023 3:58 pm
by Burunduk
MochiMoppel wrote: ↑Fri Oct 27, 2023 4:14 am
Edit:BW64 uses find (GNU findutils) 4.9.0 . It behaves now like busybox find Older versions are fine.
MochiMoppel wrote: ↑Fri Oct 27, 2023 12:40 pm
Version 4.7.0 in @rockedge 's F96-CE still works fine. However using F96-CE's version in BW64 does not work ... and this may be an indication that find is not the culprit. I suspect that in BW64 other changes have been made, because also ROX-Filer interprets '*' names differently.
find 4.9.0 compiled in Fossapup64-9.5 works just like 4.7.0.
However, the locale setting changes the result. I have en_GB.UTF-8 in Fossa. With $LANG set to C or en_US, no files are found:
Code: Select all
root# find /tmp/ -not -name "*"
/tmp/testfiles/?uro
/tmp/testfiles/Malm?
/tmp/testfiles/Malm?
root# LANG=C find /tmp/ -not -name "*"
root#
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Sun Oct 29, 2023 2:11 am
by MochiMoppel
Burunduk wrote: ↑Sat Oct 28, 2023 3:58 pm
the locale setting changes the result. I have en_GB.UTF-8 in Fossa. With $LANG set to C or en_US, no files are found:
I have en_US.UTF-8 (the default), and this used to work. Now it doesn't.
Re: Filename with umlaut not valid UTF-8, but duplicate is valid
Posted: Sun Oct 29, 2023 4:51 am
by Burunduk
MochiMoppel wrote: ↑Sun Oct 29, 2023 2:11 am
I have en_US.UTF-8 (the default), and this used to work. Now it doesn't.
Can see it in VoidPup64-22.02. Probably a change in glibc. At least find /tmp/ -not -regex ".*" still works. It tries to match the whole path, so it's not an exact equivalent, but better than nothing.
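A check that sidesteps find's pattern matching altogether, and with it the locale question, is to let iconv judge each path. A rough sketch, assuming iconv is installed; find_invalid_utf8 is a made-up name. Like the -regex variant it looks at the whole path, not just the file name.

```shell
#!/bin/sh
# List paths under a directory that are not valid UTF-8.
# iconv is given explicit encodings, so $LANG should not change the result.
find_invalid_utf8() {
    find "$1" -exec sh -c '
        for p in "$@"; do
            printf "%s" "$p" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 ||
                printf "%s\n" "$p"
        done
    ' sh {} +
}

# Example call (the directory is only an illustration):
# find_invalid_utf8 /tmp/testfiles
```

It starts one iconv per path, so it is slow on big trees, but it doesn't depend on how a particular find or glibc version treats "*".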