
text splitting problem

Posted: Thu Nov 16, 2023 3:38 am
by MochiMoppel

I'm trying to split a string of space-delimited words so that each word ends up on a separate line, but substrings enclosed in quotes or brackets should be treated as one word, even if they contain multiple words.

Example:
Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod

should be converted into
Lorem
'ipsum dolor sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod

Is there an elegant way to achieve this? So far I've come up with a solution that seems to work, but I find it a bit clumsy:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
for n in $string;do
    case $n in
        '['*']'|"'"*"'")                    # [eius] and 'consecur'
            echo "$n" 
            ;;
        '['*|*']'|*"'"*)
            if [[ -z $iscompound ]];then    # [elit and 'ipsum
                iscompound=1
                sub+=$n' '
            else                            # do] and sit'
                sub+=$n
                echo "$sub"
                [[ $n =~ ("'"|']') ]] && iscompound= sub=
            fi
            ;;
        *)
            [[ $iscompound ]] && sub+=$n' ' || echo "$n" # dolor or Lorem
            ;;
    esac
done

Is there a better way?
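
For comparison, here is a minimal sketch of an approach that does not appear in this thread: let `grep -o` print every match of a pattern that tries a quoted group, then a bracketed group, then a plain run of non-space characters. The function name `split_words` is made up for illustration, and the sketch assumes a grep that supports `-o` and `-E` (GNU and BusyBox grep do). It relies on POSIX leftmost-longest matching, so a compound is only recognised when the quote or bracket is the first character of a word.

```bash
#!/bin/bash
# Sketch, not from the thread: print each '...' group, [...] group,
# or plain run of non-space characters on its own line.
split_words() {
    grep -oE "'[^']*'|\[[^]]*]|[^ ]+" <<< "$1"
}

split_words "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
```

Unbalanced quotes or brackets are not handled specially; an unmatched `'` simply stays attached to its word.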


Re: text splitting problem

Posted: Thu Nov 16, 2023 8:33 am
by puppy_apprentice

Specifically for this example:

Code: Select all

x="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod";echo $x | tr "'" "\n" | tr "[" "\n" | tr "]" "\n"

But this is not exactly what you want ;)

Icon has better functions for such things.


Re: text splitting problem

Posted: Thu Nov 16, 2023 10:35 am
by HerrBert

Also specifically for this example:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[ "${n::1}" = "'" -o "${n::1}" = "[" ] && i=1
	[ "${n: -1:1}" = "'" -o "${n: -1:1}" = "]" ] && i=0
	[ $i -eq 0 ] && echo $n || echo -n "$n "
done

[edit] shorter:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[[ ${n::1} = [\[\'] ]] && i=1
	[[ ${n: -1:1} = [\]\'] ]] && i=0
	[ $i -eq 0 ] && echo $n || echo -n "$n "
done

(still learning this crazy sh*t :oops: )


Re: text splitting problem

Posted: Thu Nov 16, 2023 11:32 am
by MochiMoppel

Very clever :thumbup:

Even shorter and all lines the same length :lol:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[[ ${n::1}    = [[\'] ]] && ((i++))
	[[ ${n: -1:1} = []\'] ]] && ((i++))
	((i%2)) && echo -n "$n " || echo $n
done

Re: text splitting problem

Posted: Thu Nov 16, 2023 12:07 pm
by HerrBert

:lol:
TBH, ((i++)) was my first attempt, but at least i thought there is no math needed

Very nice exercise :thumbup:


Re: text splitting problem

Posted: Thu Nov 16, 2023 12:21 pm
by MochiMoppel
HerrBert wrote: Thu Nov 16, 2023 12:07 pm

TBH, ((i++)) was my first attempt, but at least i thought there is no math needed

Agreed! Down with Math ! :twisted:

Code: Select all

string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod"
for n in $string; do
  [[ $n =  [[\']* ]] && i=1
  [[ $n = *[]\']  ]] && i=0
  ((i)) && echo -n "$n " || echo $n
done
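
One caveat with the `for n in $string` loops in this thread: an unquoted `$string` is also subject to pathname expansion, so a word like `[eius]` could be replaced by a file named `e`, `i`, `u` or `s` if one exists in the current directory. The following is a sketch of the same open/close flag idea with that risk removed, assuming a single-line input. The function name `split_compounds` and the variable `open` are invented for illustration.

```bash
#!/bin/bash
# Sketch: same flag logic as above, but the words are read into an array
# with `read -ra`, which splits on whitespace without glob expansion.
split_compounds() {
    local -a words
    local n open=0
    read -ra words <<< "$1"
    for n in "${words[@]}"; do
        [[ $n = [[\']* ]] && open=1     # word starts a '...' or [...] compound
        [[ $n = *[]\'] ]] && open=0     # word ends a compound
        if ((open)); then
            printf '%s ' "$n"           # stay on the same output line
        else
            printf '%s\n' "$n"          # finish the output line
        fi
    done
}

split_compounds "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
```

`printf` is used instead of `echo -n` so that a word such as `-n` or `-e` in the input is printed rather than interpreted as an option.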

Re: text splitting problem

Posted: Thu Nov 16, 2023 3:20 pm
by puppy_apprentice

Code: Select all

string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod *hello world*"
for n in $string
do
  [[ $n =  [$1]* ]] && i=1
  [[ $n = *[$2]  ]] && i=0
  ((i)) && echo -n "$n " || echo $n
done

test.sh \*[\' ]\*\'


Re: text splitting problem

Posted: Fri Nov 17, 2023 8:36 am
by MochiMoppel

Math is back :welcome:

@HerrBert Still based on your concept I made the snippet more versatile. The [...] and '...' compounds may now have preceding and/or trailing strings. This makes it possible to filter regex patterns (which after all is the purpose of my exercise).

Code: Select all

string="abc *[\ ]*  ^[0-9].* $'a b c' $'\n'"
for n in $string; do
  [[ $n = *[[\'\]]*     ]] && ((i++)) # matches ^[0-9].* or $'a or ]*
  [[ $n = *[[\']*[]\']* ]] && i=0     # matches ^[0-9].* but not  $'a or ]*
  ((i%2)) && echo -n "$n " || echo $n
done

Output:

Code: Select all

abc
*[\ ]*
^[0-9].*
$'a b c'
$'\n'

Re: text splitting problem

Posted: Sat Dec 02, 2023 8:48 pm
by puppy_apprentice

Code: Select all

arr_of_strings=(Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
	echo ${arr_of_strings[i]}
done

Code: Select all

Lorem ipsum dolor sit amet consecur adipis [elit sed do] [eius] mod
Lorem
ipsum dolor sit
amet
consecur
adipis
[elit sed do]
[eius]
mod

Code: Select all

arr_of_strings=(Lorem \'ipsum dolor sit\' amet \'consecur\' adipis [elit sed do] [eius] mod *asterisk check*)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
	echo ${arr_of_strings[i]}
done

Code: Select all

Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod *asterisk check*
Lorem
'ipsum
dolor
sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod
*asterisk
check*

Arrays can be used only in specific situations.
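
A small sketch of why the quotes behave differently in the two array definitions above: unescaped quotes are shell syntax, so they group words and are then removed, while escaped quotes are plain characters that survive but group nothing. The array names `grouped` and `escaped` are chosen here for illustration only.

```bash
#!/bin/bash
# Unescaped quotes are shell syntax: they group the words, then are removed.
grouped=(Lorem 'ipsum dolor sit' amet)

# Escaped quotes are ordinary characters: they stay in the data and group nothing.
escaped=(Lorem \'ipsum dolor sit\' amet)

echo "grouped: ${#grouped[@]} elements"   # 3 elements
echo "escaped: ${#escaped[@]} elements"   # 5 elements
printf '<%s>\n' "${escaped[@]}"
```

This is why the array approach only helps when the text is typed into the script itself; quotes that arrive inside a variable are data, like the escaped ones here.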


Re: text splitting problem

Posted: Tue Dec 05, 2023 5:57 am
by MochiMoppel
puppy_apprentice wrote: Sat Dec 02, 2023 8:48 pm

Arrays can be used only in specific situations.

at least not here :mrgreen:


Re: text splitting problem

Posted: Tue Dec 05, 2023 6:39 am
by pp4mnklinux

Only a suggestion using the 'awk' command:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"

echo "$string" | awk -v RS="[ \t\n]+" '{gsub(/^'"'"'(.*)'"'"'$/, "\1", $1); print $1}'

As I said, only a suggestion, but if I understood you correctly, this may help you as a general solution:

Code: Select all

#!/bin/bash

split_string() {
    local input_string="$1"
    local result=""
    local current_word=""

    while IFS= read -rn1 char; do
        case "$char" in
            ' '|$'\n'|$'\t')  # Space, newline, or tab
                if [[ -n $current_word ]]; then
                    result+="$current_word"$'\n'
                    current_word=""
                fi
                ;;
            "'")
                in_quote=true
                current_word+="$char"
                ;;
            "[" | "]")
                in_bracket=true
                current_word+="$char"
                ;;
            *)
                current_word+="$char"
                ;;
        esac
    done <<< "$input_string"

    if [[ -n $current_word ]]; then
        result+="$current_word"$'\n'
    fi

    echo "$result"
}
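
The function above sets `in_quote` and `in_bracket` but never consults them, so every space still ends a word. Below is a sketch of how a character-by-character scanner could use such state, assuming compounds do not nest and input is a single line. The function name `split_tracking_state` and the variable `closer` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: split on spaces/tabs unless inside '...' or [...].
split_tracking_state() {
    local input=$1 word="" char closer="" i
    for ((i = 0; i < ${#input}; i++)); do
        char=${input:i:1}
        if [[ -n $closer ]]; then
            # Inside a compound: keep everything up to the matching closer.
            word+=$char
            [[ $char == "$closer" ]] && closer=""
            continue
        fi
        case $char in
            "'") closer="'"; word+=$char ;;   # opening quote
            "[") closer="]"; word+=$char ;;   # opening bracket
            ' '|$'\t')
                [[ -n $word ]] && printf '%s\n' "$word"
                word=""
                ;;
            *)  word+=$char ;;
        esac
    done
    [[ -n $word ]] && printf '%s\n' "$word"
    return 0
}

split_tracking_state "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
```

Because the opener may appear anywhere in a word, this variant also keeps something like `^[0-9 ].*` together, at the cost of being slower than the word-based loops.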



Re: text splitting problem

Posted: Thu Dec 07, 2023 7:39 am
by MochiMoppel
pp4mnklinux wrote: Tue Dec 05, 2023 6:39 am

if I understood you correctly, this may help you as a general solution

Thanks for trying to help. It obviously fails to meet the requirements but it adds to the list of approaches that do not work. Learning from mistakes is almost as much fun as finding a working solution - at least that's what I try to believe when I again end up in a dead-end street of the coding maze :lol:


Re: text splitting problem

Posted: Fri Dec 08, 2023 8:03 am
by greengeek

Here is what Google Bard suggested:

Code: Select all

#!/bin/bash

# Check if input file is provided
if [ -z "$1" ]; then
  echo "Please provide the input file path as an argument."
  exit 1
fi

# Define the text file path
input_file="$1"

# Define output filename with ".processed" extension
output_file="${input_file}.processed"

# Initialize variables
current_buffer=""
within_brackets=false
within_quotes=false

# Process each character in the input file
while IFS= read -r -n1 char; do
  # Check if within brackets or quotes
  if [[ "$char" == "[" ]]; then
    within_brackets=true
  elif [[ "$char" == "]" ]]; then
    within_brackets=false
  elif [[ "$char" == "\"" ]]; then
    within_quotes=not $within_quotes
  fi

  # Check if character is space and not within brackets or quotes
  if [[ "$char" == " " && ! "$within_brackets" && ! "$within_quotes" ]]; then
    # Add current buffer to output file with newline
    echo "$current_buffer" >> "$output_file"
    # Reset current buffer
    current_buffer=""
  else
    # Append character to current buffer
    current_buffer="$current_buffer$char"
  fi
done < "$input_file"

# Ensure last buffer is written to output file
if [[ ! -z "$current_buffer" ]]; then
  echo "$current_buffer" >> "$output_file"
fi

echo "Processed file saved to: $output_file"

Doesn't work.
No idea how Google got so wealthy...
8-)
(Maybe i am as successful telling Bard what to do as i am telling my wife what to do... Skillset lacking..)

Here is what I asked Bard (maybe I misunderstood the question)

Please write a bash script to process a text file and convert whitespace to linefeed except within brackets or single quotes or double quotes


Re: text splitting problem

Posted: Mon Dec 11, 2023 7:11 am
by MochiMoppel

@greengeek
:lol:
I asked Bard exactly the same question. For me he came up with a different suggestion:

Code: Select all

#!/bin/bash

# Check if a character is inside a quote
is_inside_quote () {
  local char="$1"
  local in_single_quote="false"
  local in_double_quote="false"

  # Check for single quote
  if [[ "$char" == "'" ]]; then
    in_single_quote="$in_single_quote"
    in_double_quote="false"
  elif [[ "$in_single_quote" == "false" && "$char" == '"' ]]; then
    in_double_quote="$in_double_quote"
    in_single_quote="false"
  fi

  # Check for bracket
  if [[ "$char" == '[' ]]; then
    in_bracket="true"
  elif [[ "$in_bracket" == "true" && "$char" == ']' ]]; then
    in_bracket="false"
  fi

  # Return true if inside any quote or bracket
  if [[ "$in_single_quote" == "true" || "$in_double_quote" == "true" || "$in_bracket" == "true" ]]; then
    return 0
  else
    return 1
  fi
}

# Input file path
input_file="$1"

# Output file path (optional)
output_file="${2:-$input_file.out}"

# Check if input file exists
if [[ ! -f "$input_file" ]]; then
  echo "Error: Input file '$input_file' does not exist."
  exit 1
fi

# Open output file for writing
exec > "$output_file"

# Process each character in the input file
while read -r -n 1 char; do
  if [[ "$char" == " " || "$char" == "\t" ]]; then
    if ! is_inside_quote "$char"; then
      echo
    fi
  else
    echo -n "$char"
  fi
done < "$input_file"

echo

exit 0

Needless to say, it doesn't work either. It is even worse than your Bard's version because he fails to insert newline characters.
But the comments are nice.
Interestingly, there are no syntax errors in either attempt, and yours contains some funny mistakes. If within_quotes=false, then he might expect within_quotes=not $within_quotes (syntactically correct!) to result in within_quotes=true. That's not how Bash works, but it's how humans think. And this means that he can hardly have copied this stuff from a serious code source.
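
To make the point about `within_quotes=not $within_quotes` concrete, here is a small sketch of what Bash does with that line, why a test like `[[ ! "$within_quotes" ]]` cannot work with string flags, and one possible way to write a real toggle. The names `negated` and `flag` are introduced here for illustration; the `|| true` is only there so the failing `false` command does not abort a script running under `set -e`.

```bash
#!/bin/bash
within_quotes=false

# Bash reads this as: run the command `false` with within_quotes=not in its
# environment. The shell variable itself does not change.
within_quotes=not $within_quotes || true
echo "after the 'not' line: $within_quotes"

# A non-empty string counts as true inside [[ ]], even the string "false",
# so [[ ! "$within_quotes" ]] never succeeds here.
if [[ ! "$within_quotes" ]]; then negated=yes; else negated=no; fi
echo "negated test succeeded: $negated"

# One possible toggle: keep the flag as 0/1 and flip it arithmetically.
flag=0
flag=$(( 1 - flag )); echo "toggled once:  $flag"
flag=$(( 1 - flag )); echo "toggled twice: $flag"
```

With a 0/1 flag, `((flag))` can then be used directly as the condition, as in the shorter loops earlier in the thread.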

I also checked his "Draft 2" and "Draft 3" alternatives. Wouldn't even start because of multiple syntax errors :lol: