text splitting problem

MochiMoppel
Posts: 1290
Joined: Mon Jun 15, 2020 6:25 am
Location: Japan
Has thanked: 22 times
Been thanked: 476 times

text splitting problem

Post by MochiMoppel »

I'm trying to split a string of space-delimited words so that each word ends up on a separate line, but substrings enclosed in quotes or brackets should be treated as one word, even if they contain multiple words.

Example:
Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod

should be converted into
Lorem
'ipsum dolor sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod

Is there an elegant way to achieve this? So far I've come up with a solution that seems to work, but I find it a bit clumsy:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
for n in $string;do
    case $n in
        '['*']'|"'"*"'")                    # [eius] and 'consecur'
            echo "$n" 
            ;;
        '['*|*']'|*"'"*)
            if [[ -z $iscompound ]];then    # [elit and 'ipsum
                iscompound=1
                sub+=$n' '
            else                            # do] and sit'
                sub+=$n
                echo "$sub"
                [[ $n =~ ("'"|']') ]] && iscompound= sub=
            fi
            ;;
        *)
            [[ $iscompound ]] && sub+=$n' ' || echo "$n" # dolor or Lorem
            ;;
    esac
done

Is there a better way?

puppy_apprentice
Posts: 692
Joined: Tue Oct 06, 2020 8:43 pm
Location: land of bigos and schabowy ;)
Has thanked: 5 times
Been thanked: 115 times

Re: text splitting problem

Post by puppy_apprentice »

Specifically for this example:

Code: Select all

x="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod";echo $x | tr "'" "\n" | tr "[" "\n" | tr "]" "\n"

But this is not exactly what you want ;)

Icon has better functions for such things.

HerrBert
Posts: 364
Joined: Mon Jul 13, 2020 6:14 pm
Location: Germany, NRW
Has thanked: 19 times
Been thanked: 135 times

Re: text splitting problem

Post by HerrBert »

Also specifically for this example:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[ "${n::1}" = "'" -o "${n::1}" = "[" ] && i=1
	[ "${n: -1:1}" = "'" -o "${n: -1:1}" = "]" ] && i=0
	[ $i -eq 0 ] && echo $n || echo -n "$n "
done

[edit] shorter:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[[ ${n::1} = [\[\'] ]] && i=1
	[[ ${n: -1:1} = [\]\'] ]] && i=0
	[ $i -eq 0 ] && echo $n || echo -n "$n "
done

(still learning this crazy sh*t :oops: )

Re: text splitting problem

Post by MochiMoppel »

Very clever :thumbup:

Even shorter and all lines the same length :lol:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
	[[ ${n::1}    = [[\'] ]] && ((i++))
	[[ ${n: -1:1} = []\'] ]] && ((i++))
	((i%2)) && echo -n "$n " || echo $n
done

Re: text splitting problem

Post by HerrBert »

:lol:
TBH, ((i++)) was my first attempt, but at least i thought there is no math needed

Very nice exercise :thumbup:

Re: text splitting problem

Post by MochiMoppel »

HerrBert wrote: Thu Nov 16, 2023 12:07 pm

TBH, ((i++)) was my first attempt, but at least i thought there is no math needed

Agreed! Down with Math ! :twisted:

Code: Select all

string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod"
for n in $string; do
  [[ $n =  [[\']* ]] && i=1
  [[ $n = *[]\']  ]] && i=0
  ((i)) && echo -n "$n " || echo $n
done

Re: text splitting problem

Post by puppy_apprentice »

Code: Select all

string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod *hello world*"
for n in $string
do
  [[ $n =  [$1]* ]] && i=1
  [[ $n = *[$2]  ]] && i=0
  ((i)) && echo -n "$n " || echo $n
done

test.sh \*[\' ]\*\'

Re: text splitting problem

Post by MochiMoppel »

Math is back :welcome:

@HerrBert Still based on your concept I made the snippet more versatile. The [...] and '...' compounds may now have preceding and/or trailing strings. This makes it possible to filter regex patterns (which after all is the purpose of my exercise).

Code: Select all

string="abc *[\ ]*  ^[0-9].* $'a b c' $'\n'"
for n in $string; do
  [[ $n = *[[\'\]]*     ]] && ((i++)) # matches ^[0-9].* or $'a or ]*
  [[ $n = *[[\']*[]\']* ]] && i=0     # matches ^[0-9].* but not  $'a or ]*
  ((i%2)) && echo -n "$n " || echo $n
done

Output:

Code: Select all

abc
*[\ ]*
^[0-9].*
$'a b c'
$'\n'
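
One caveat with the unquoted $string in these loops: besides word splitting, the shell also performs pathname expansion, so a word like [eius] is replaced by a file name if the current directory happens to contain a file called e, i, u or s. A minimal sketch of the same counter logic with globbing switched off. The function name split_compounds is made up for illustration, and the body runs in a subshell so that set -f does not leak into the caller:

```shell
#!/bin/bash
# Sketch only: counter logic as in the snippet above, wrapped in a
# hypothetical function. The ( ... ) body is a subshell, so "set -f"
# is undone automatically when the function returns.
split_compounds() (
    set -f                                   # keep *, ? and [...] literal while splitting
    i=0
    for n in $1; do
        [[ $n = *[[\'\]]*     ]] && i=$((i+1))   # word contains [ or ' or ]
        [[ $n = *[[\']*[]\']* ]] && i=0          # opener and closer in the same word
        if ((i % 2)); then
            printf '%s ' "$n"                # inside a compound: stay on the same line
        else
            printf '%s\n' "$n"               # plain word or end of a compound
        fi
    done
)

split_compounds "abc *[\ ]*  ^[0-9].* $'a b c' $'\n'"
```

Without set -f the result depends on which files exist in the working directory, which makes this kind of bug hard to reproduce.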

Re: text splitting problem

Post by puppy_apprentice »

Code: Select all

arr_of_strings=(Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
	echo ${arr_of_strings[i]}
done

Code: Select all

Lorem ipsum dolor sit amet consecur adipis [elit sed do] [eius] mod
Lorem
ipsum dolor sit
amet
consecur
adipis
[elit sed do]
[eius]
mod

Code: Select all

arr_of_strings=(Lorem \'ipsum dolor sit\' amet \'consecur\' adipis [elit sed do] [eius] mod *asterisk check*)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
	echo ${arr_of_strings[i]}
done

Code: Select all

Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod *asterisk check*
Lorem
'ipsum
dolor
sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod
*asterisk
check*

Arrays can be used only in specific situations.
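
The reason, as far as I can tell: quotes typed directly into an array literal are shell syntax and are processed when the script is parsed. The same characters stored in a variable are ordinary data, and word splitting ignores them. A short sketch of the difference (variable names are illustrative):

```shell
#!/bin/bash
# Quotes in an array literal are syntax: the parser removes them and
# keeps the quoted words together in one element.
literal=(Lorem 'ipsum dolor sit' amet)
echo "${#literal[@]}"            # 3 elements

# The same text in a variable is plain data. Unquoted expansion splits
# on whitespace only and leaves the quote characters in place.
string="Lorem 'ipsum dolor sit' amet"
set -f                           # no pathname expansion while splitting
from_var=($string)
set +f
echo "${#from_var[@]}"           # 5 elements
printf '<%s>\n' "${from_var[@]}"
```

So an array helps only when the words are written into the script itself, not when they arrive as a string at run time.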

Re: text splitting problem

Post by MochiMoppel »

puppy_apprentice wrote: Sat Dec 02, 2023 8:48 pm

Arrays can be used only in specific situations.

at least not here :mrgreen:

pp4mnklinux
Posts: 1238
Joined: Wed Aug 19, 2020 5:43 pm
Has thanked: 658 times
Been thanked: 321 times

Re: text splitting problem

Post by pp4mnklinux »

Only a suggestion using the 'awk' command:

Code: Select all

string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"

echo "$string" | awk -v RS="[ \t\n]+" '{gsub(/^'"'"'(.*)'"'"'$/, "\1", $1); print $1}'

As I said, it's only a suggestion, but if I understood you correctly, this might help you as a general solution:

Code: Select all

#!/bin/bash

split_string() {
    local input_string="$1"
    local result=""
    local current_word=""

    while IFS= read -rn1 char; do
        case "$char" in
            ' '|$'\n'|$'\t')  # Space, newline, or tab
                if [[ -n $current_word ]]; then
                    result+="$current_word"$'\n'
                    current_word=""
                fi
                ;;
            "'")
                in_quote=true
                current_word+="$char"
                ;;
            "[" | "]")
                in_bracket=true
                current_word+="$char"
                ;;
            *)
                current_word+="$char"
                ;;
        esac
    done <<< "$input_string"

    if [[ -n $current_word ]]; then
        result+="$current_word"$'\n'
    fi

    echo "$result"
}


PP4MNK

Re: text splitting problem

Post by MochiMoppel »

pp4mnklinux wrote: Tue Dec 05, 2023 6:39 am

if I understood you correctly, this might help you as a general solution

Thanks for trying to help. It obviously fails to meet the requirements but it adds to the list of approaches that do not work. Learning from mistakes is almost as much fun as finding a working solution - at least that's what I try to believe when I again end up in a dead-end street of the coding maze :lol:

greengeek
Posts: 1463
Joined: Thu Jul 16, 2020 11:06 pm
Has thanked: 593 times
Been thanked: 208 times

Re: text splitting problem

Post by greengeek »

Here is what Google Bard suggested:

Code: Select all

#!/bin/bash

# Check if input file is provided
if [ -z "$1" ]; then
  echo "Please provide the input file path as an argument."
  exit 1
fi

# Define the text file path
input_file="$1"

# Define output filename with ".processed" extension
output_file="${input_file}.processed"

# Initialize variables
current_buffer=""
within_brackets=false
within_quotes=false

# Process each character in the input file
while IFS= read -r -n1 char; do
  # Check if within brackets or quotes
  if [[ "$char" == "[" ]]; then
    within_brackets=true
  elif [[ "$char" == "]" ]]; then
    within_brackets=false
  elif [[ "$char" == "\"" ]]; then
    within_quotes=not $within_quotes
  fi

  # Check if character is space and not within brackets or quotes
  if [[ "$char" == " " && ! "$within_brackets" && ! "$within_quotes" ]]; then
    # Add current buffer to output file with newline
    echo "$current_buffer" >> "$output_file"
    # Reset current buffer
    current_buffer=""
  else
    # Append character to current buffer
    current_buffer="$current_buffer$char"
  fi
done < "$input_file"

# Ensure last buffer is written to output file
if [[ ! -z "$current_buffer" ]]; then
  echo "$current_buffer" >> "$output_file"
fi

echo "Processed file saved to: $output_file"

Doesn't work.
No idea how Google got so wealthy...
8-)
(Maybe i am as successful telling Bard what to do as i am telling my wife what to do... Skillset lacking..)

Here is what I asked Bard (maybe I misunderstood the question)

Please write a bash script to process a text file and convert whitespace to linefeed except within brackets or single quotes or double quotes

Re: text splitting problem

Post by MochiMoppel »

@greengeek
:lol:
I asked Bard exactly the same question. For me he came up with a different suggestion:

Code: Select all

#!/bin/bash

# Check if a character is inside a quote
is_inside_quote () {
  local char="$1"
  local in_single_quote="false"
  local in_double_quote="false"

  # Check for single quote
  if [[ "$char" == "'" ]]; then
    in_single_quote="$in_single_quote"
    in_double_quote="false"
  elif [[ "$in_single_quote" == "false" && "$char" == '"' ]]; then
    in_double_quote="$in_double_quote"
    in_single_quote="false"
  fi

  # Check for bracket
  if [[ "$char" == '[' ]]; then
    in_bracket="true"
  elif [[ "$in_bracket" == "true" && "$char" == ']' ]]; then
    in_bracket="false"
  fi

  # Return true if inside any quote or bracket
  if [[ "$in_single_quote" == "true" || "$in_double_quote" == "true" || "$in_bracket" == "true" ]]; then
    return 0
  else
    return 1
  fi
}

# Input file path
input_file="$1"

# Output file path (optional)
output_file="${2:-$input_file.out}"

# Check if input file exists
if [[ ! -f "$input_file" ]]; then
  echo "Error: Input file '$input_file' does not exist."
  exit 1
fi

# Open output file for writing
exec > "$output_file"

# Process each character in the input file
while read -r -n 1 char; do
  if [[ "$char" == " " || "$char" == "\t" ]]; then
    if ! is_inside_quote "$char"; then
      echo
    fi
  else
    echo -n "$char"
  fi
done < "$input_file"

echo

exit 0

Needless to say that it doesn't work either. Even worse than your Bard because he fails to insert newline characters.
But the comments are nice.
Interestingly there are no syntax errors in either attempt, and in your case there are some funny mistakes. If within_quotes=false, he might expect within_quotes=not $within_quotes (syntactically correct!) to result in within_quotes=true. That's not how Bash works, but it's how humans think. And this means that he can hardly have copied this stuff from a serious code source.

I also checked his "Draft 2" and "Draft 3" alternatives. Wouldn't even start because of multiple syntax errors :lol:
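
For the record, Bash parses within_quotes=not $within_quotes as a temporary assignment in front of a command. With within_quotes=false it runs the command false and leaves the shell variable untouched. The later test [[ ! "$within_quotes" ]] only checks for an empty string, so the non-empty word "false" never passes it. A small sketch of this behaviour and of one conventional way to flip a flag (the names flag, in_quotes and toggle are illustrative):

```shell
#!/bin/bash
# What the generated line really does: a temporary assignment in front of
# the command "false". The shell variable itself is not changed.
flag=false
flag=not $flag || true           # runs "false" with flag=not in its environment
echo "$flag"                     # still prints: false

# One conventional alternative: keep the state as 0/1 and flip it arithmetically.
in_quotes=0
toggle() { in_quotes=$(( 1 - in_quotes )); }

toggle                           # opening quote seen
if (( in_quotes )); then echo "inside"; fi
toggle                           # closing quote seen
if (( in_quotes == 0 )); then echo "outside"; fi
```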

superhik
Posts: 52
Joined: Mon Jun 19, 2023 7:56 pm
Has thanked: 6 times
Been thanked: 21 times

Re: text splitting problem

Post by superhik »

@MochiMoppel

Grep:

Code: Select all

grep -oP "(?:[^'\s\[]+|'[^']+'|\[[^\]]+\])" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"

The regular expression used is:

Code: Select all

(?:[^'\s\[]+|'[^']+'|\[[^\]]+\])

Here's a breakdown of how it works:

(?: starts a non-capturing group
[^'\s\[]+ matches one or more characters that are not quotes, spaces, or brackets
| or...
'[^']+' matches a quoted substring: a quote, followed by one or more characters that are not quotes, followed by a quote
| or...
\[[^\]]+\] matches a bracketed substring: a left bracket, followed by one or more characters that are not right brackets, followed by a right bracket
) closes the non-capturing group

The -o option tells grep to print only the matched text, and the -P option enables PCRE (Perl-compatible regular expressions) syntax.

Note that this assumes that the input string is well-formed, with balanced quotes and brackets. If the input string can contain unbalanced or malformed quoted or bracketed substrings, you may need to add additional error handling or preprocessing.

Sed:

Code: Select all

sed -E "s/ ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+)/\1\n/g" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"

Matching Patterns:
'[^']+' Matches substrings enclosed in single quotes.
\[[^]]+\] Matches substrings enclosed in brackets.
[^'\[[:space:]]+ Matches any sequence of characters that are not single quotes, brackets, or spaces.

Substitution:
/ ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+)/ Matches any of the patterns above, preceded by an optional space.
\1\n Replaces each match with itself followed by a newline.

PS. if you need to remove the last trailing newline add s/\n$//

Code: Select all

sed -E "s/ ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+)/\1\n/g;s/\n$//" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
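
The three alternatives used in the grep and sed patterns above can also be applied in bash itself. A rough sketch using [[ =~ ]] with a POSIX extended regular expression. The function name tokenize is made up here, and like the grep version it assumes well-formed input (anything after an unbalanced quote or bracket is silently dropped):

```shell
#!/bin/bash
# Sketch: peel one token at a time off the front of the string.
tokenize() {
    local rest=$1
    # optional leading whitespace, then one of:
    #   '...'   quoted substring
    #   [...]   bracketed substring (a ] outside a bracket expression is literal)
    #   a run of characters that are not quotes, left brackets or whitespace
    local re="^[[:space:]]*('[^']+'|\[[^]]+]|[^'[[:space:]]+)"
    while [[ $rest =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"
        rest=${rest:${#BASH_REMATCH[0]}}     # drop what was just matched
    done
}

tokenize "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
```

This avoids starting an external process per string, but it shares the limitation of the grep and sed variants: a compound with text glued to it is split into separate tokens.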
retiredt00
Posts: 224
Joined: Fri Sep 04, 2020 12:11 pm
Has thanked: 11 times
Been thanked: 36 times

Re: text splitting problem

Post by retiredt00 »

This is a very informative thread and very close to what I would like to do. I have a file in which some lines contain a mix of letters, symbols and numbers, while other lines contain only numbers, only letters or only symbols. I want to eliminate the lines that have only numbers or only symbols.
The problem I have is that all the lines in the file have leading spaces that I do not want to eliminate, as the spacing is important for the layout:

Code: Select all

        bbbb
    aaaaa       
  bbaaabbbbb > 123
   *   |   *
    123
	456
     ccc
  abcdefghi || 789
    45

to become this

Code: Select all

        bbbb
    aaaaa       
  bbaaabbbbb > 123
     ccc
  abcdefghi || 789

or this

Code: Select all

        bbbb
    aaaaa       
  bbaaabbbbb > 123
    123
	456
     ccc
  abcdefghi || 789
    45

Re: text splitting problem

Post by MochiMoppel »

@superhik
Thanks for the new variants. Well, they work for Lorem but not for my "real life" string abc *[\ ]* ^[0-9].* $'a b c' $'\n' (see my post of 2023-11-17)
I usually prefer sed or grep but here they are not only slower than @HerrBert's approach but probably not suitable at all.

@retiredt00
Your problem may be not that "all the lines in the files have spaces in the front" because some don't. Your line 456 starts with a tab, not a space.
To get the desired results you posted:

Code: Select all

sed '/^[[:blank:]]*[^a-z]*$/d' /path/to/file

or

Code: Select all

sed '/^[[:blank:]]*[^0-9a-z]*$/d'  /path/to/file
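
A side note on these two patterns: [^a-z]* already matches blanks, so the leading [[:blank:]]* is optional, and "delete every line without a lowercase letter" is the same as "keep every line that has one". A sketch of that equivalence with grep, using a shortened copy of the sample lines (the variable name sample is just for this demo):

```shell
#!/bin/bash
# Shortened sample from the post above; the 456 line starts with a tab.
sample=$'        bbbb\n  bbaaabbbbb > 123\n   *   |   *\n    123\n\t456\n     ccc\n    45'

# First variant: drop lines that contain no lowercase letter ...
printf '%s\n' "$sample" | sed '/^[^a-z]*$/d'
# ... which is the same as keeping lines with at least one lowercase letter.
printf '%s\n' "$sample" | grep '[a-z]'

# Second variant: keep lines with at least one digit or lowercase letter.
printf '%s\n' "$sample" | grep '[0-9a-z]'
```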

Re: text splitting problem

Post by retiredt00 »

Thank you MochiMoppel but this still does not work
It works in the example I provided but not in the actual file, a part of which I attach.
This is part of an old dos file that dos2unix was transforming into 1 line, so I used sed -e "s/\r/\n/g" filename to make it unix like and now the number lines are not removed by the sed '/^[[:blank:]]*[^0-9]*$/d' /path/to/file command

Attachments
test.txt
(742 Bytes) Downloaded 40 times

Re: text splitting problem

Post by MochiMoppel »

retiredt00 wrote: Thu Jul 04, 2024 2:06 pm

Thank you MochiMoppel but this still does not work
It works in the example I provided but not in the actual file

Then you can hardly call it an example :roll:

now the number lines are not removed by the sed '/^[[:blank:]]*[^0-9]*$/d' /path/to/file command

Of course not. That's not the code I posted.

I'm not sure anymore what you want. To remove lines containing only numbers and only non alphanumerics you could try
sed '/^[0-9 ]*$/d ; /^[^0-9a-zA-Z]*$/d' test.txt

Above sed command consists of 2 patterns:
/^[0-9 ]*$/d deletes lines containing only numbers and spaces (mind the space after 0-9 !)
/^[^0-9a-zA-Z]*$/d deletes lines containing no alphanumerics ; e.g. | |* || | *|

This example

Code: Select all

 >HI  <RI         P0I     <qII   stMI  <III   >ciI 
 >HI  <AII        Ns   fcI  T   <AI  >Ea  BJI  HpII
 |  |    ||  |  |       | |    |   |||||||| ||| |||
  >  250   
  *   1>
  <
 |  |*   ||  | *|       |*|    |   |||||||| ||| |||
 126     134            149    156   162    169    
   126     134            149        160      169   

will be converted to

Code: Select all

 >HI  <RI         P0I     <qII   stMI  <III   >ciI 
 >HI  <AII        Ns   fcI  T   <AI  >Ea  BJI  HpII
  >  250   
  *   1>
Last edited by MochiMoppel on Fri Jul 05, 2024 1:22 am, edited 1 time in total.

Re: text splitting problem

Post by superhik »

MochiMoppel wrote: Thu Jul 04, 2024 9:02 am

@superhik
Thanks for the new variants. Well, they work for Lorem but not for my "real life" string abc *[\ ]* ^[0-9].* $'a b c' $'\n' (see my post of 2023-11-17)
I usually prefer sed or grep but here they are not only slower than @HerrBert's approach but probably not suitable at all.

There are too many variations that don't work with @HerrBert's approach like abc *[\ ]* ^[a-z0-9].* $'a b c' $'\n'.
The next script loops through each character. Outside of bracketed and quoted strings it replaces every space with a temporary replacement string that is unlikely to occur in the text, _@@@_. IFS is then set to the characters of that replacement string, and a second loop prints every element while skipping the empty ones, which result from consecutive separators outside of the bracketed and quoted strings.

Code: Select all

#!/bin/bash
#input="abc *[\ ]*  ^[0-9].* $'a b c' $'\n'"
input="abc *[\ ]*  ^[a-z0-9].* $'a b c' $'\n'"

output=""
in_quotes=false
in_brackets=0

for (( i=0; i<${#input}; i++ )); do
    char="${input:$i:1}"

    if [[ $char == "'" ]]; then
        if $in_quotes; then
            in_quotes=false
        else
            in_quotes=true
        fi
        output+="$char"
    elif [[ $char == "[" ]]; then
        ((in_brackets++))
        output+="$char"
    elif [[ $char == "]" && $in_brackets -gt 0 ]]; then
        ((in_brackets--))
        output+="$char"
    elif [[ $char == " " ]] && [[ $in_quotes == false && $in_brackets -eq 0 ]]; then
        output+="_@@@_"
    else
        output+="$char"
    fi
done

OLDIFS=$IFS
IFS="_@@@_"
for w in $output; do
       [[ -z "$w" ]] && continue 
       echo  "$w"
done
IFS=$OLDIFS
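
One thing to watch in the script above: IFS is a set of single-character separators, not a string, so IFS="_@@@_" splits on every _ and every @, including any that occur in the data. A sketch of the same character scan that avoids the placeholder altogether by printing each word as soon as it is complete. The function name scan_split is hypothetical:

```shell
#!/bin/bash
# Sketch: same state machine as above, but words are emitted directly
# instead of being joined with a placeholder and split again via IFS.
scan_split() {
    local input=$1 word= char in_quotes=0 depth=0 i
    for (( i = 0; i < ${#input}; i++ )); do
        char=${input:i:1}
        case $char in
            "'") in_quotes=$(( 1 - in_quotes )) ;;                    # toggle quote state
            "[") depth=$(( depth + 1 )) ;;                            # enter a bracket
            "]") if (( depth > 0 )); then depth=$(( depth - 1 )); fi ;;  # leave a bracket
        esac
        if [[ $char == " " ]] && (( in_quotes == 0 && depth == 0 )); then
            if [[ -n $word ]]; then printf '%s\n' "$word"; fi         # skip empty fields
            word=
        else
            word+=$char
        fi
    done
    if [[ -n $word ]]; then printf '%s\n' "$word"; fi                 # last word
}

scan_split "abc *[\ ]*  ^[a-z0-9].* $'a b c' $'\n'"
scan_split "user@host 'a_b c' x"
```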

Re: text splitting problem

Post by retiredt00 »

MochiMoppel wrote: Thu Jul 04, 2024 3:24 pm

To remove lines containing only numbers and only non alphanumerics you could try
sed '/^[0-9 ]*$/d ; /^[^0-9a-zA-Z]*$/d' test.txt

Above sed command consists of 2 patterns:
/^[0-9 ]*$/d deletes lines containing only numbers and spaces (mind the space after 0-9 !)
/^[^0-9a-zA-Z]*$/d deletes lines containing no alphanumerics ; e.g. | |* || | *|

It worked!
Thanks

Re: text splitting problem

Post by superhik »

Made a small app for this in C.
Let's name it Block Splitter or bsplit.
Save it as bsplit.c and compile with gcc bsplit.c -o bsplit -std=c11 -Wall

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <sys/select.h>
#include <sys/stat.h>

// Define ssize_t if it's not available
#ifndef HAVE_SSIZE_T
typedef long ssize_t;
#endif

void printHelp(const char *program_name) {
    printf("Usage: %s [-b block_start] [-e block_end] [-f file]\n", program_name);
    printf("Options:\n");
    printf("  -b <block_start>   Specify the start character of a block (can be repeated)\n");
    printf("  -e <block_end>     Specify the end character of a block corresponding to the last specified start character\n");
    printf("  -f <file>          read input from a file instead of stdin\n");
    printf("  -h                 print out help\n");
}

// Implement a simple getline function if getline is not available
ssize_t custom_getline(char **lineptr, size_t *n, FILE *stream) {
    ssize_t rd = 0;
    char *buf = NULL;
    size_t bufsize = 0;
    int c;

    if (lineptr == NULL || n == NULL || stream == NULL) {
        return -1;
    }

    buf = *lineptr;
    bufsize = *n;
    *lineptr = NULL;
    *n = 0;

    while ((c = fgetc(stream)) != EOF) {
        if ((size_t)rd + 1 >= bufsize) { // keep one byte free for the terminating '\0'
            bufsize += 128;
            buf = realloc(buf, bufsize);
            if (buf == NULL) {
                return -1;
            }
        }

        buf[rd++] = c;

        if (c == '\n') {
            break;
        }
    }

    if (rd == 0 || ferror(stream)) {
        free(buf);
        return -1;
    }

    // Trim newline character if present
    if (buf[rd - 1] == '\n') {
        buf[rd - 1] = '\0';
        rd--; // Adjust read length
    }

    buf[rd] = '\0';
    *lineptr = buf;
    *n = bufsize;

    return rd;
}

typedef struct {
    char start;
    char end;
} BlockDelimiter;

void printWords(const char *input, BlockDelimiter *block_delimiters, int num_blocks) {
    bool in_block = false;
    char current_word[100]; // Assuming maximum word length
    int current_index = 0;
    size_t input_len = strlen(input);

    for (size_t i = 0; i < input_len; ++i) {
        char ch = input[i];

        for (int j = 0; j < num_blocks; j++) {
            if (ch == block_delimiters[j].start && !in_block) {
                in_block = true;
                break;
            } else if (ch == block_delimiters[j].end && in_block) {
                in_block = false;
                break;
            }
        }

        if (ch == ' ' && !in_block) {
            if (current_index > 0) {
                current_word[current_index] = '\0';
                printf("%s\n", current_word);
                fflush(stdout);
                current_index = 0;
            }
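        } else if (current_index < (int)sizeof(current_word) - 1) {
            // Characters beyond the fixed buffer size are dropped instead of overflowing it
            current_word[current_index++] = ch;
        }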

    }

    // Print the last word if any
    if (current_index > 0) {
        current_word[current_index] = '\0';
        printf("%s\n", current_word);
        fflush(stdout);
    }
}

int main(int argc, char *argv[]) {
    FILE *input_file = NULL;
    BlockDelimiter block_delimiters[10]; // Support up to 10 different block delimiters
    int num_blocks = 0;

    int opt;
    while ((opt = getopt(argc, argv, "b:e:f:h")) != -1) {
        switch (opt) {
            case 'b':
                if (num_blocks < 10) {
                    block_delimiters[num_blocks].start = optarg[0];
                }
                break;
            case 'e':
                if (num_blocks < 10) {
                    block_delimiters[num_blocks].end = optarg[0];
                    num_blocks++;
                }
                break;
            case 'f':
                input_file = fopen(optarg, "r");
                if (input_file == NULL) {
                    perror("Error opening file");
                    exit(EXIT_FAILURE);
                }
                break;
            case 'h':
                printHelp(argv[0]);
                exit(0);
            default:
                printHelp(argv[0]);
                exit(EXIT_FAILURE);
        }
    }

    char *input = NULL;
    size_t len = 0;
    ssize_t rd;

    if (input_file) {
        while ((rd = custom_getline(&input, &len, input_file)) != -1) {
            // Process each line here
            if (rd > 0 && input[rd - 1] == '\n') {
                input[rd - 1] = '\0'; // Remove newline if present
            }
            printWords(input, block_delimiters, num_blocks);
        }
    } else if (argc > optind) {
        // Process remaining arguments as input
        size_t input_len = 0;
        for (int i = optind; i < argc; i++) {
            input_len += strlen(argv[i]) + 1; // +1 for space
        }
        input = malloc(input_len + 1); // +1 for null terminator
        if (input == NULL) {
            perror("malloc");
            exit(EXIT_FAILURE);
        }
        input[0] = '\0';
        for (int i = optind; i < argc; i++) {
            strcat(input, argv[i]);
            if (i < argc - 1) {
                strcat(input, " ");
            }
        }
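        rd = strlen(input);
        printWords(input, block_delimiters, num_blocks); // the joined arguments were never processed before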
    } else {
        // Check if stdin is connected to a terminal
        int fd = fileno(stdin); // Get the file descriptor for stdin

        // Use fstat to check if input is available on stdin
        struct stat statbuf;
        if (fstat(fd, &statbuf) == 0 &&
            (S_ISFIFO(statbuf.st_mode) || S_ISREG(statbuf.st_mode))) {
            // Input is available from a pipe or from a redirected regular file
            while ((rd = custom_getline(&input, &len, stdin)) != -1) {
                // Process each line here
                if (rd > 0 && input[rd - 1] == '\n') {
                    input[rd - 1] = '\0'; // Remove newline if present
                }
                printWords(input, block_delimiters, num_blocks);
            }
        } else {
            // No input provided via arguments, file, or stdin
            fprintf(stderr, "No input provided via arguments, file, or stdin.\n");
            exit(EXIT_FAILURE);
        } 
    }

    if (input_file) {
        fclose(input_file);
    }

    free(input);

    return 0;
}

Up to 10 blocks can be defined.
Running:

Code: Select all

 echo "\"a b c\"1 *[\ ]*  ^[a-z0-9].* $'a b c' $'\n'" | ./bsplit -b '[' -e ']'  -b "'" -e "'"  -b '"' -e'"'
while :; do echo "\"a b c\"1 *[\ ]*  ^[a-z0-9].* $'a b c' $'\n'"; sleep 1; done | ./bsplit -b '[' -e ']'  -b "'" -e "'"  -b '"' -e'"'
./bsplit -b '[' -e ']'  -b "'" -e "'"  -b '"' -e'"' -f "file"
./bsplit -b '[' -e ']'  -b "'" -e "'"  -b '"' -e'"' -b '<'  -e '>' <<<"\"a b c\"1 *[\ ]*  ^[a-z0-9].* $'a b c' $'\n' < a >"

It uses a method similar to the one I posted before (negative lookahead-like). It's slightly faster than the shell version.

Essentially bsplit replaces spaces with newlines and has an option to define a range between two characters where spaces are ignored, so that blocks are preserved.
@MochiMoppel it's funny how a simple script can become a full application.

Re: text splitting problem

Post by MochiMoppel »

superhik wrote: Fri Jul 05, 2024 1:08 am

There are too many variations that don't work with @HerrBert's approach like abc *[\ ]* ^[a-z0-9].* $'a b c' $'\n'.

Well, *this* one would work. I think you mean the example you use for your script which uses multiple spaces as word separators. That would indeed be problematic with @HerrBert's approach. For my use case it's irrelevant because my strings never have consecutive spaces. I always "sanitize" them with a string=$(echo $string) command, which removes any leading, trailing and consecutive whitespace.
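
A note on the string=$(echo $string) idiom: because the expansion is unquoted it is open to pathname expansion, and echo may treat a leading -n or -e as an option. A sketch of a variant that avoids both, assuming the string holds a single line. The function name squeeze is illustrative:

```shell
#!/bin/bash
# Sketch: squeeze leading, trailing and repeated whitespace without
# letting the shell glob the words.
squeeze() {
    local IFS=$' \t\n'
    local -a words
    read -r -a words <<< "$1"        # read splits on IFS but never globs; -r keeps backslashes
    printf '%s\n' "${words[*]}"      # join with the first IFS character (a space)
}

string="   abc   *[\ ]*    ^[0-9].*  "
string=$(squeeze "$string")
echo "[$string]"
```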

It would be easy though to preserve consecutive whitespace within the [...] and '...' compounds by temporarily replacing double spaces with a letter unlikely to appear in the text. I normally use hex01:

Code: Select all

string="abc *[\ ]* ^[a-z	0-9].* $'a    b c' $'\n'"
TMP=$IFS
IFS=' '
for n in ${string//  /$'\x1'}; do
  n=${n//$'\x1'/  }
  [[ $n = *[[\'\]]*     ]] && ((i++))
  [[ $n = *[[\']*[]\']* ]] && i=0
  ((i%2)) && echo -n "$n " || echo "$n"
done
IFS=$TMP

@MochiMoppel it's funny how a simple script can become a full application.

Now that we have a solution all we need is a suitable problem. :lol:

Re: text splitting problem

Post by superhik »

MochiMoppel wrote: Fri Jul 12, 2024 9:55 am

@MochiMoppel it's funny how a simple script can become a full application.

Now that we have a solution all we need is a suitable problem. :lol:

:lol: The Engineering of Consent, "use of an engineering approach - that is, action based only on thorough knowledge of the situation and on the application of scientific principles and tried practices to the task of getting people to support ideas and programs."
