Bash: how to filter non-displayable Unicode characters?

blumenwesen
Posts: 32
Joined: Sun Apr 10, 2022 10:02 pm

Post by blumenwesen »

Hi, how can I delete special characters like "💩🐕힚팧" and leave only normally visible ones like "01abŁÐ...."?
With the code below I can display them, but not filter them out.

Code: Select all

zcat /usr/share/i18n/charmaps/UTF-8.gz  | awk '{if(NR>25001 && NR<27000) {gsub(/\//, "\\\\"); system("/usr/bin/echo -ne "$2" | grep -oP \"[^\\x00-\\x7F]\""); split($2, z, "\\"); print z[1]}}'

a="💩🐕01abŁÐ힚팧" # should look like this "01abŁÐ" or "💩🐕01abŁÐ"
echo ${a//[![:graph:]]/} # delete nothing "💩🐕01abŁÐ힚팧"
tr -dc '[[:graph:]]' <<< $a # delete too mutch "01ab.."
MochiMoppel
Posts: 1137
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 18 times
Been thanked: 371 times

Re: Bash: how to filter non-displayable Unicode characters?

Post by MochiMoppel »

blumenwesen wrote: Sun Oct 23, 2022 6:47 pm

Hi, how can I delete special characters like "💩🐕힚팧"

These characters are not special, just normal Unicode characters: U+1F4A9 U+1F415 U+D79A U+D327.
The first one is officially called "PILE OF POO" and the last is HANGUL SYLLABLE PAH. Whether you see them as such or as blocks depends mainly on your installed fonts (here in the forum you can see the first character because the BB software can convert shit into an SVG image :o ).
Are you asking how to delete them when they appear as blocks?

and leave only normally visible ones like "01abŁÐ....".

Not all are "normally visible". The last 2 characters are not even standardized, meaning that they are not defined and have no names. They belong to the "Private Use Area" of Unicode. Font developers can use them as they please. If the one after Ð appears like '88', it's only because that's how this character U+F000 is defined in DejaVu.
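
The Private Use Area point can be checked by code point alone, without looking at any font. The following is a minimal sketch, assuming bash and the three private-use ranges defined by Unicode (U+E000 to U+F8FF, U+F0000 to U+FFFFD, U+100000 to U+10FFFD). The function name is_private_use is hypothetical and takes the code point as hex digits without the "U+" prefix.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a code point lies in one of the Unicode private-use ranges.
# is_private_use is a hypothetical helper name, not part of bash or fontconfig.
is_private_use() {
    local cp=$((16#$1))                                   # hex digits -> integer
    (( cp >= 0xE000   && cp <= 0xF8FF   )) && return 0    # BMP Private Use Area
    (( cp >= 0xF0000  && cp <= 0xFFFFD  )) && return 0    # Supplementary Private Use Area-A
    (( cp >= 0x100000 && cp <= 0x10FFFD )) && return 0    # Supplementary Private Use Area-B
    return 1
}

# The code points mentioned in this thread.
for cp in 1F4A9 D327 F000; do
    if is_private_use "$cp"; then
        echo "U+$cp: private use"
    else
        echo "U+$cp: not private use"
    fi
done
```

A check like this only says that a character has no standard meaning. Whether it is drawn as a block still depends on the installed fonts.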

user1234
Posts: 413
Joined: Sat Feb 26, 2022 5:48 am
Location: Somewhere on earth
Has thanked: 154 times
Been thanked: 88 times

Re: Bash: how to filter non-displayable Unicode characters?

Post by user1234 »

MochiMoppel wrote: Mon Oct 24, 2022 12:22 am

Are you asking how to delete them when they appear as blocks?

Strictly speaking, these are not called blocks (even though they look like blocks). They are called tofu, which Google's Noto ("no tofu") fonts try to eliminate for every Unicode character.

PuppyLinux 🐾 gives new life to old computers ✨

blumenwesen

Re: Bash: how to filter non-displayable Unicode characters?

Post by blumenwesen »

OK, thanks, but how do I distinguish the installed characters from the uninstalled ones?
I made this here, but is there a faster or better way to compare and filter the characters?

Code: Select all

str='ɨɆ01ab!?🙏⃹'
for ((z=0; z<${#str}; z+=1)); do
    for y in $(fc-list : family | sort | uniq); do
        for x in $(fc-match --format="%{charset}\n" "$y"); do
            for w in $(seq "0x${x%-*}" "0x${x#*-}"); do
                [[ $(echo -e \\U"$w") =~ "${str:$z:1}" ]] && echo -e \\U"$w"" $w $y" && break 3
            done
        done
    done
done

[EDIT]25.10.22 11:26am, I want to rephrase the question: how can I identify the language family of a character faster?[/EDIT]

MochiMoppel

Re: Bash: how to filter non-displayable Unicode characters?

Post by MochiMoppel »

blumenwesen wrote: Mon Oct 24, 2022 2:41 pm

I made this here

And you tested it too?
You should have noticed that it doesn't work even with the example you provided.
Your seq expression returns decimal values for $w, but echo -e \\U"$w" expects $w to be hexadecimal. For a range that starts at 0x20, the first value seq emits is 32, which echo then reads as U+0032. This means that ASCII characters 0x20 - 0x31 will never be found.
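
One way to avoid the decimal/hexadecimal mismatch is to drop seq and let bash count, converting each value back to hex before it is used with \U. This is a minimal sketch, assuming bash 4.2 or later (for \U in printf %b). The function name expand_range is hypothetical, and it accepts one charset entry in the form fontconfig prints, either a range like "20-7e" or a single value like "2c7".

```shell
#!/usr/bin/env bash
# Sketch: expand one charset entry ("20-7e" or a single "2c7") into hex code points.
# expand_range is a hypothetical helper name.
expand_range() {
    local first=$((16#${1%-*})) last=$((16#${1#*-})) w
    for ((w = first; w <= last; w++)); do
        printf '%x\n' "$w"      # decimal loop counter -> hex digits, as \U expects
    done
}

# Each hex value can then be turned into the character it stands for.
for hex in $(expand_range 41-43); do
    printf '%s %b\n' "$hex" "\\U$hex"
done
```

This uses bash arithmetic in place of seq, so it is not the original loop, only a sketch of the same idea with the conversion made explicit.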

blumenwesen

Re: Bash: how to filter non-displayable Unicode characters?

Post by blumenwesen »

Thanks for the correction, I've changed it now.
So I have to realize that it is not possible with bash, and the blocks can only be deleted manually.
In this sense, the main thing is save, the man himself. :mrgreen:

MochiMoppel

Re: Bash: how to filter non-displayable Unicode characters?

Post by MochiMoppel »

blumenwesen wrote: Thu Oct 27, 2022 10:04 am

I have to realize that it is not possible with bash, and the blocks can only be deleted manually.

Nothing is impossible:

Code: Select all

str='▶⏾⏿ℱ✌◀ Paris Αθήνα मुंबई ວຽງຈັນ'

for ((z=0; z<${#str}; z++)); do
    CHAR=${str:$z:1}
    (( $(printf %i "'$CHAR") < 128 )) && RESULT+="$CHAR" && continue        #ASCII
    [[ $(fc-list ":charset=$(printf %x "'$CHAR")") ]] && RESULT+="$CHAR"    #UNICODE
done
echo  "Before:	$str
After:	$RESULT"  | leafpad
noblocks(1).png
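
The loop above starts one fc-list process for every non-ASCII character, so a long string with many repeated characters pays for the same lookup again and again. The following is a minimal sketch of one way to reduce that, assuming bash 4 or later (associative arrays), a UTF-8 locale and fontconfig's fc-list. The names has_glyph, GLYPH_CACHE and filter_displayable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: has_glyph, GLYPH_CACHE and filter_displayable are hypothetical names.

# Succeeds if at least one installed font covers the code point (hex digits).
has_glyph() {
    [[ -n $(fc-list ":charset=$1" 2>/dev/null) ]]
}

declare -A GLYPH_CACHE=()   # hex code point -> 1 (covered) or 0 (not covered)

filter_displayable() {
    local str=$1 out='' ch cp i
    for ((i = 0; i < ${#str}; i++)); do
        ch=${str:i:1}
        printf -v cp '%x' "'$ch"            # code point of the character, in hex
        if (( 16#$cp < 128 )); then         # ASCII: keep without asking fontconfig
            out+=$ch
            continue
        fi
        if [[ -z ${GLYPH_CACHE[$cp]+set} ]]; then   # first time this character is seen
            if has_glyph "$cp"; then GLYPH_CACHE[$cp]=1; else GLYPH_CACHE[$cp]=0; fi
        fi
        if [[ ${GLYPH_CACHE[$cp]} == 1 ]]; then
            out+=$ch
        fi
    done
    printf '%s\n' "$out"
}
```

It would be called as filter_displayable "$str". The cache only helps within one shell, so results are not remembered across calls made inside $(...) subshells.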
user1234

Re: Bash: how to filter non-displayable Unicode characters?

Post by user1234 »

MochiMoppel wrote: Fri Oct 28, 2022 6:14 am
blumenwesen wrote: Thu Oct 27, 2022 10:04 am

I have to realize that it is not possible with bash, and the blocks can only be deleted manually.

Nothing is impossible:

noblocks(1).png

Seems like you need to install a Devanagari (Hindi) font :mrgreen:. This is what you're looking for 8-).

PuppyLinux 🐾 gives new life to old computers ✨

blumenwesen

Re: Bash: how to filter non-displayable Unicode characters?

Post by blumenwesen »

Yes, that would be an idea, but I don't know if all visible characters can be displayed in a file.
In Geany the complete text is deleted, and other editors display an error.
Well, it worked out, so thank you again.
