bash: how to filter non-showable Unicode chars?

Posted: Sun Oct 23, 2022 6:47 pm
by blumenwesen

Hi, how can I delete special characters like "💩🐕힚팧" and leave only normally visible ones like "01abŁÐ....".
With the code below I can display them, but not filter them out.

Code:

zcat /usr/share/i18n/charmaps/UTF-8.gz  | awk '{if(NR>25001 && NR<27000) {gsub(/\//, "\\\\"); system("/usr/bin/echo -ne "$2" | grep -oP \"[^\\x00-\\x7F]\""); split($2, z, "\\"); print z[1]}}'

a="💩🐕01abŁÐ힚팧" # should look like this "01abŁÐ" or "💩🐕01abŁÐ"
echo ${a//[![:graph:]]/} # deletes nothing "💩🐕01abŁÐ힚팧"
tr -dc '[[:graph:]]' <<< $a # deletes too much "01ab.."

Re: bash: how to filter non-showable Unicode chars?

Posted: Mon Oct 24, 2022 12:22 am
by MochiMoppel
blumenwesen wrote: Sun Oct 23, 2022 6:47 pm

Hi, how can I delete special characters like "💩🐕힚팧"

These characters are not special. They are just normal Unicode characters: U+1F4A9, U+1F415, U+D79A and U+D327.
The first one is officially called "PILE OF POO" and the last is HANGUL SYLLABLE PAH. Whether you see them as such or as blocks depends mainly on your installed fonts (here in the forum you can see the first character because the BB software can convert shit into an SVG image :o ).
Are you asking how to delete them when they appear as blocks?

and leave only normally visible ones like "01abŁÐ....".

Not all are "normally visible". The last 2 characters are not even standardized, meaning that they are not defined and have no names. They belong to the "Private Use Area" of Unicode. Font developers can use them as they please. If the one after Ð appears like '88', it's only because that's how this character, U+F000, is defined in DejaVu.
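
For anyone who wants to check this themselves, here is a minimal bash sketch that prints a character's code point and tests whether a code point lies in a Private Use Area. The function names codepoint and is_private_use are made up for this example. The ranges are the three Private Use Areas of Unicode (U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD). For non-ASCII input it assumes bash running in a UTF-8 locale.

```shell
#!/usr/bin/env bash

# Print the code point of the first character of $1 as U+XXXX.
# A leading quote in a printf argument ("'c") yields the character's numeric value.
codepoint() {
    printf 'U+%04X\n' "'$1"
}

# Succeed if the numeric code point $1 (decimal or 0x-prefixed hex)
# lies in one of Unicode's three Private Use Areas.
is_private_use() {
    local cp=$(( $1 ))
    (( cp >= 0xE000 && cp <= 0xF8FF )) ||
    (( cp >= 0xF0000 && cp <= 0xFFFFD )) ||
    (( cp >= 0x100000 && cp <= 0x10FFFD ))
}

codepoint 'A'                                        # prints U+0041
is_private_use 0xF000 && echo "U+F000 is private use"
is_private_use 0xD327 || echo "U+D327 is not private use"
```

Whether a private-use character shows anything at all still depends entirely on the font, so a range check like this only says "not standardized", not "not displayable".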


Re: bash: how to filter non-showable Unicode chars?

Posted: Mon Oct 24, 2022 5:27 am
by user1234
MochiMoppel wrote: Mon Oct 24, 2022 12:22 am

Are you asking how to delete them when they appear as blocks?

Strictly speaking, these are not called blocks (even though they look like blocks). They are called tofu, which Google's Noto ("no tofu") fonts try to eliminate for every Unicode character.


Re: bash: how to filter non-showable Unicode chars?

Posted: Mon Oct 24, 2022 2:41 pm
by blumenwesen

OK, thanks, but how do I distinguish the installed characters from the uninstalled ones?
I made this, but is there a faster or better way to compare and filter the characters?

Code:

str='ɨɆ01ab!?🙏⃹'
for ((z=0; z<${#str}; z+=1)); do
    for y in $(fc-list : family | sort | uniq); do
        for x in $(fc-match --format="%{charset}\n" "$y"); do
            for w in $(seq "0x${x%-*}" "0x${x#*-}"); do
                [[ $(echo -e \\U"$w") =~ "${str:$z:1}" ]] && echo -e \\U"$w"" $w $y" && break 3
            done
        done
    done
done

[EDIT]25.10.22 11:26am, I want to rephrase the question: how can I identify the language family of a character faster?[/EDIT]


Re: bash: how to filter non-showable Unicode chars?

Posted: Wed Oct 26, 2022 12:57 pm
by MochiMoppel
blumenwesen wrote: Mon Oct 24, 2022 2:41 pm

I made this here

And you tested it too?
You should have noticed that it doesn't work even with the example you provided.
Your seq expression returns decimal values for $w; however, echo -e \\U"$w" expects the value of $w to be hexadecimal. This means that ASCII characters 0x20 - 0x31 will never be found.
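
The mismatch can be reproduced in a few lines. This is only a sketch, assuming bash 4.2 or later (for the \U escape); the helper name char_from_decimal is made up for the illustration. The value 33 stands for what the loop receives for 0x21 ("!"): read as hex it becomes U+0033 ("3"), so "!" is never produced unless the number is converted with printf '%x' first.

```shell
#!/usr/bin/env bash

dec=$(( 0x21 ))    # 33, the kind of decimal value the loop gets for "!"

# Wrong: the decimal digits "33" are read as hex, giving U+0033 ("3").
wrong=$(echo -e "\\U$dec")

# Right: convert the decimal value to hex before building the \U escape.
char_from_decimal() {
    echo -e "\\U$(printf '%x' "$1")"
}
right=$(char_from_decimal "$dec")

printf 'wrong=%s right=%s\n' "$wrong" "$right"    # prints wrong=3 right=!
```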


Re: bash: how to filter non-showable Unicode chars?

Posted: Thu Oct 27, 2022 10:04 am
by blumenwesen

Thanks for the improvement, I've changed it now.
Therefore, I have to realize that it is not possible with bash, and the blocks can only be deleted manually.
in this sense, the main thing is save, the man himself. :mrgreen:


Re: bash: how to filter non-showable Unicode chars?

Posted: Fri Oct 28, 2022 6:14 am
by MochiMoppel
blumenwesen wrote: Thu Oct 27, 2022 10:04 am

I have to realize that it is not possible with bash, and the blocks can only be deleted manually.

Nothing is impossible:

Code:

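str='▶⏾⏿ℱ✌◀ Paris Αθήνα मुंबई ວຽງຈັນ'
RESULT=

for ((z=0; z<${#str}; z++)); do
    CHAR=${str:$z:1}
    # printf with a leading quote returns the character's code point; ASCII is always kept
    (( $(printf %i "'$CHAR") < 128 )) && RESULT+="$CHAR" && continue        #ASCII
    # fc-list prints something only if at least one installed font covers this code point
    [[ $(fc-list ":charset=$(printf %x "'$CHAR")") ]] && RESULT+="$CHAR"    #UNICODE
done
echo  "Before:	$str
After:	$RESULT"  | leafpad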

Attachment: noblocks(1).png
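
Since the question also asked for something faster, here is one possible variation on the loop above, offered as a sketch rather than a tested replacement. It wraps the loop in a function and remembers the answer for each code point, so fc-list is started only once per distinct character. The names has_glyph and filter_displayable are made up for this example; it assumes bash 4 or later (associative arrays), a UTF-8 locale and fontconfig's fc-list.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeed if any installed font covers code point $1 (hex).
# Same fc-list query as in the loop above.
has_glyph() {
    [[ -n $(fc-list ":charset=$1" 2>/dev/null) ]]
}

# Hypothetical wrapper: print $1 with all characters removed that no
# installed font covers. ASCII is always kept, as in the original loop.
filter_displayable() {
    local str=$1 out= ch hex i
    local -A seen=()    # cache: hex code point -> y/n
    for ((i = 0; i < ${#str}; i++)); do
        ch=${str:i:1}
        hex=$(printf '%x' "'$ch")
        if [[ -z ${seen[$hex]} ]]; then
            if (( 0x$hex < 128 )) || has_glyph "$hex"; then
                seen[$hex]=y
            else
                seen[$hex]=n
            fi
        fi
        [[ ${seen[$hex]} == y ]] && out+=$ch
    done
    printf '%s\n' "$out"
}

# Example call (the result depends on the installed fonts):
#   filter_displayable '▶⏾⏿ℱ✌◀ Paris Αθήνα'
```

The cache only helps when characters repeat; for a string of all-different characters the number of fc-list calls stays the same.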

Re: bash: how to filter non-showable Unicode chars?

Posted: Fri Oct 28, 2022 8:26 am
by user1234
MochiMoppel wrote: Fri Oct 28, 2022 6:14 am
blumenwesen wrote: Thu Oct 27, 2022 10:04 am

I have to realize that it is not possible with bash, and the blocks can only be deleted manually.

Nothing is impossible:

Seems like you need to install a Devanagari (Hindi) font :mrgreen:. This is what you're looking for 8-).


Re: bash: how to filter non-showable Unicode chars?

Posted: Fri Oct 28, 2022 2:44 pm
by blumenwesen

Yes, that would be an idea, but I don't know if all visible characters can be displayed in a file.
In Geany the complete text is deleted, and other editors display an error.
Still, it worked out, so thank you again.