text splitting problem
Posted: Thu Nov 16, 2023 3:38 am
by MochiMoppel
I'm trying to split a string of space-delimited words so that each word ends up on a separate line, but substrings enclosed in quotes or brackets should be treated as one word, even if they contain multiple words.
Example:
Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod
should be converted into
Lorem
'ipsum dolor sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod
Is there an elegant way to achieve this? So far I've come up with a solution that seems to work, but I find it a bit clumsy:
Code: Select all
string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
for n in $string;do
case $n in
'['*']'|"'"*"'") # [eius] and 'consecur'
echo "$n"
;;
'['*|*']'|*"'"*)
if [[ -z $iscompound ]];then # [elit and 'ipsum
iscompound=1
sub+=$n' '
else # do] and sit'
sub+=$n
echo "$sub"
[[ $n =~ ("'"|']') ]] && iscompound= sub=
fi
;;
*)
[[ $iscompound ]] && sub+=$n' ' || echo "$n" # dolor or Lorem
;;
esac
done
Is there a better way?
Re: text splitting problem
Posted: Thu Nov 16, 2023 8:33 am
by puppy_apprentice
Specifically for this example:
Code: Select all
x="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod";echo $x | tr "'" "\n" | tr "[" "\n" | tr "]" "\n"
But this is not exactly what you want
Icon has better functions for such things.
Re: text splitting problem
Posted: Thu Nov 16, 2023 10:35 am
by HerrBert
Also specifically for this example:
Code: Select all
string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
[ "${n::1}" = "'" -o "${n::1}" = "[" ] && i=1
[ "${n: -1:1}" = "'" -o "${n: -1:1}" = "]" ] && i=0
[ $i -eq 0 ] && echo $n || echo -n "$n "
done
[edit] shorter:
Code: Select all
string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
[[ ${n::1} = [\[\'] ]] && i=1
[[ ${n: -1:1} = [\]\'] ]] && i=0
[ $i -eq 0 ] && echo $n || echo -n "$n "
done
(still learning this crazy sh*t )
Re: text splitting problem
Posted: Thu Nov 16, 2023 11:32 am
by MochiMoppel
Very clever
Even shorter and all lines same length
Code: Select all
string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
i=0
for n in $string; do
[[ ${n::1} = [[\'] ]] && ((i++))
[[ ${n: -1:1} = []\'] ]] && ((i++))
((i%2)) && echo -n "$n " || echo $n
done
Re: text splitting problem
Posted: Thu Nov 16, 2023 12:07 pm
by HerrBert
TBH, ((i++)) was my first attempt, but at least i thought there is no math needed
Very nice exercise
Re: text splitting problem
Posted: Thu Nov 16, 2023 12:21 pm
by MochiMoppel
HerrBert wrote: Thu Nov 16, 2023 12:07 pm
TBH, ((i++)) was my first attempt, but at least i thought there is no math needed
Agreed! Down with Math !
Code: Select all
string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod"
for n in $string; do
[[ $n = [[\']* ]] && i=1
[[ $n = *[]\'] ]] && i=0
((i)) && echo -n "$n " || echo $n
done
Re: text splitting problem
Posted: Thu Nov 16, 2023 3:20 pm
by puppy_apprentice
Code: Select all
string="Lorem 'ipsum dolor sit' 'consecur' adipis [elit sed do] [eius] mod *hello world*"
for n in $string
do
[[ $n = [$1]* ]] && i=1
[[ $n = *[$2] ]] && i=0
((i)) && echo -n "$n " || echo $n
done
Re: text splitting problem
Posted: Fri Nov 17, 2023 8:36 am
by MochiMoppel
Math is back
@HerrBert Still based on your concept I made the snippet more versatile. The [...] and '...' compounds may now have preceding and/or trailing strings. This makes it possible to filter regex patterns (which after all is the purpose of my exercise).
Code: Select all
string="abc *[\ ]* ^[0-9].* $'a b c' $'\n'"
for n in $string; do
[[ $n = *[[\'\]]* ]] && ((i++)) # matches ^[0-9].* or $'a or ]*
[[ $n = *[[\']*[]\']* ]] && i=0 # matches ^[0-9].* but not $'a or ]*
((i%2)) && echo -n "$n " || echo $n
done
Output:
Code: Select all
abc
*[\ ]*
^[0-9].*
$'a b c'
$'\n'
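For reuse, the loop above can be wrapped in a function. This is only a sketch under a few assumptions: the name split_compounds is made up, the input has no consecutive spaces, and set -f is added so that words such as *[\ ]* are not expanded as file globs. The two pattern tests are taken unchanged from the snippet above.

```bash
#!/bin/bash
# split_compounds is a made-up name; the two [[ ]] tests come from the post above.
split_compounds() (
    set -f          # assumption: words such as *[\ ]* must not be expanded as file globs
    i=0 line=
    for n in $1; do
        [[ $n = *[[\'\]]* ]] && ((i++))      # word contains [ or ' or ]
        [[ $n = *[[\']*[]\']* ]] && i=0      # opener and closer inside one word
        if ((i%2)); then
            line+="$n "                      # inside a compound: keep collecting
        else
            printf '%s\n' "$line$n"          # complete word or compound
            line=
        fi
    done
    # An unterminated compound is still printed, minus its trailing space
    if [[ -n $line ]]; then printf '%s\n' "${line% }"; fi
)

string="abc *[\ ]* ^[0-9].* $'a b c' $'\n'"
split_compounds "$string"
```

The function body runs in a subshell (parentheses instead of braces), so set -f and the helper variables do not leak into the calling shell.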
Re: text splitting problem
Posted: Sat Dec 02, 2023 8:48 pm
by puppy_apprentice
Code: Select all
arr_of_strings=(Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
echo ${arr_of_strings[i]}
done
Code: Select all
Lorem ipsum dolor sit amet consecur adipis [elit sed do] [eius] mod
Lorem
ipsum dolor sit
amet
consecur
adipis
[elit sed do]
[eius]
mod
Code: Select all
arr_of_strings=(Lorem \'ipsum dolor sit\' amet \'consecur\' adipis [elit sed do] [eius] mod *asterisk check*)
echo ${arr_of_strings[@]}
for (( i=0; i<${#arr_of_strings[@]};i++ ))
do
echo ${arr_of_strings[i]}
done
Code: Select all
Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod *asterisk check*
Lorem
'ipsum
dolor
sit'
amet
'consecur'
adipis
[elit sed do]
[eius]
mod
*asterisk
check*
Arrays can be used only in specific situations.
Re: text splitting problem
Posted: Tue Dec 05, 2023 5:57 am
by MochiMoppel
puppy_apprentice wrote: Sat Dec 02, 2023 8:48 pm
Arrays can be used only in specific situations.
at least not here
Re: text splitting problem
Posted: Tue Dec 05, 2023 6:39 am
by pp4mnklinux
Only a suggestion using the 'awk' command:
Code: Select all
string="Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
echo "$string" | awk -v RS="[ \t\n]+" '{gsub(/^'"'"'(.*)'"'"'$/, "\1", $1); print $1}'
As I said, only a suggestion, but if I understood you correctly, this may help you as a general solution:
Code: Select all
#!/bin/bash
split_string() {
local input_string="$1"
local result=""
local current_word=""
while IFS= read -rn1 char; do
case "$char" in
' '|$'\n'|$'\t') # Space, newline, or tab
if [[ -n $current_word ]]; then
result+="$current_word"$'\n'
current_word=""
fi
;;
"'")
in_quote=true
current_word+="$char"
;;
"[" | "]")
in_bracket=true
current_word+="$char"
;;
*)
current_word+="$char"
;;
esac
done <<< "$input_string"
if [[ -n $current_word ]]; then
result+="$current_word"$'\n'
fi
echo "$result"
}
Re: text splitting problem
Posted: Thu Dec 07, 2023 7:39 am
by MochiMoppel
pp4mnklinux wrote: Tue Dec 05, 2023 6:39 am if I understood you correctly, this may help you as a general solution
Thanks for trying to help. It obviously fails to meet the requirements but it adds to the list of approaches that do not work. Learning from mistakes is almost as much fun as finding a working solution - at least that's what I try to believe when I again end up in a dead-end street of the coding maze
Re: text splitting problem
Posted: Fri Dec 08, 2023 8:03 am
by greengeek
Here is what Google Bard suggested:
Code: Select all
#!/bin/bash
# Check if input file is provided
if [ -z "$1" ]; then
echo "Please provide the input file path as an argument."
exit 1
fi
# Define the text file path
input_file="$1"
# Define output filename with ".processed" extension
output_file="${input_file}.processed"
# Initialize variables
current_buffer=""
within_brackets=false
within_quotes=false
# Process each character in the input file
while IFS= read -r -n1 char; do
# Check if within brackets or quotes
if [[ "$char" == "[" ]]; then
within_brackets=true
elif [[ "$char" == "]" ]]; then
within_brackets=false
elif [[ "$char" == "\"" ]]; then
within_quotes=not $within_quotes
fi
# Check if character is space and not within brackets or quotes
if [[ "$char" == " " && ! "$within_brackets" && ! "$within_quotes" ]]; then
# Add current buffer to output file with newline
echo "$current_buffer" >> "$output_file"
# Reset current buffer
current_buffer=""
else
# Append character to current buffer
current_buffer="$current_buffer$char"
fi
done < "$input_file"
# Ensure last buffer is written to output file
if [[ ! -z "$current_buffer" ]]; then
echo "$current_buffer" >> "$output_file"
fi
echo "Processed file saved to: $output_file"
Doesn't work.
No idea how Google got so wealthy...
(Maybe i am as successful telling Bard what to do as i am telling my wife what to do... Skillset lacking..)
Here is what I asked Bard (maybe I misunderstood the question)
Please write a bash script to process a text file and convert whitespace to linefeed except within brackets or single quotes or double quotes
Re: text splitting problem
Posted: Mon Dec 11, 2023 7:11 am
by MochiMoppel
@greengeek
I asked Bard exactly the same question. For me he came up with a different suggestion:
Code: Select all
#!/bin/bash
# Check if a character is inside a quote
is_inside_quote () {
local char="$1"
local in_single_quote="false"
local in_double_quote="false"
# Check for single quote
if [[ "$char" == "'" ]]; then
in_single_quote="$in_single_quote"
in_double_quote="false"
elif [[ "$in_single_quote" == "false" && "$char" == '"' ]]; then
in_double_quote="$in_double_quote"
in_single_quote="false"
fi
# Check for bracket
if [[ "$char" == '[' ]]; then
in_bracket="true"
elif [[ "$in_bracket" == "true" && "$char" == ']' ]]; then
in_bracket="false"
fi
# Return true if inside any quote or bracket
if [[ "$in_single_quote" == "true" || "$in_double_quote" == "true" || "$in_bracket" == "true" ]]; then
return 0
else
return 1
fi
}
# Input file path
input_file="$1"
# Output file path (optional)
output_file="${2:-$input_file.out}"
# Check if input file exists
if [[ ! -f "$input_file" ]]; then
echo "Error: Input file '$input_file' does not exist."
exit 1
fi
# Open output file for writing
exec > "$output_file"
# Process each character in the input file
while read -r -n 1 char; do
if [[ "$char" == " " || "$char" == "\t" ]]; then
if ! is_inside_quote "$char"; then
echo
fi
else
echo -n "$char"
fi
done < "$input_file"
echo
exit 0
Needless to say that it doesn't work either. Even worse than your Bard because he fails to insert newline characters.
But the comments are nice.
Interestingly, there are no syntax errors in either attempt, and in your case the mistakes are funny. If within_quotes=false, then he might expect that within_quotes=not $within_quotes (syntactically correct!) results in within_quotes=true. That's not how Bash works, but it's how humans think. And this means that he can hardly have copied this stuff from a serious code source.
I also checked his "Draft 2" and "Draft 3" alternatives. Wouldn't even start because of multiple syntax errors
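For comparison, here is a small sketch of how a flag can actually be toggled in Bash, and what the generated line does instead. The variable names in_quotes and flag are only illustrative; within_quotes is the name from the generated script.

```bash
#!/bin/bash
# 1) Arithmetic flag: 0 = off, 1 = on
in_quotes=0
in_quotes=$(( 1 - in_quotes ))      # 0 -> 1, and 1 -> 0 on the next call
echo "arithmetic flag: $in_quotes"

# 2) String flag that holds the command name true or false
flag=false
if $flag; then flag=false; else flag=true; fi   # false -> true
echo "string flag: $flag"

# 3) The generated line is valid syntax but toggles nothing: it runs the
#    command "false" with within_quotes=not set only for that one command.
within_quotes=false
within_quotes=not $within_quotes
echo "after 'not': $within_quotes"
```

A related trap in the same script: inside [[ ]], a test like ! "$within_brackets" only checks whether the string is empty, so it is false for both "true" and "false".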
Re: text splitting problem
Posted: Wed Jul 03, 2024 12:20 pm
by superhik
@MochiMoppel
Grep:
Code: Select all
grep -oP "(?:[^'\s\[]+|'[^']+'|\[[^\]]+\])" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
The regular expression used is: (?:[^'\s\[]+|'[^']+'|\[[^\]]+\])
Here's a breakdown of how it works:
(?: starts a non-capturing group
[^'\s\[]+ matches one or more characters that are not quotes, spaces, or brackets
| or...
'[^']+' matches a quoted substring: a quote, followed by one or more characters that are not quotes, followed by a quote
| or...
\[[^\]]+\] matches a bracketed substring: a left bracket, followed by one or more characters that are not right brackets, followed by a right bracket
) closes the non-capturing group
The -o option tells grep to print only the matched text, and the -P option enables PCRE (Perl-compatible regular expressions) syntax.
Note that this assumes that the input string is well-formed, with balanced quotes and brackets. If the input string can contain unbalanced or malformed quoted or bracketed substrings, you may need to add additional error handling or preprocessing.
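Where grep -P is not available, the same three alternatives can be sketched with Bash's built-in =~ operator (POSIX ERE). The function name tokenize is hypothetical, and the sketch inherits the limitation noted above: it expects well-formed input and stops at the first piece it cannot match.

```bash
#!/bin/bash
# tokenize is a hypothetical helper mirroring the three alternatives of the grep pattern.
tokenize() {
    local rest=$1
    # optional leading spaces, then: 'quoted' | [bracketed] | run without quote, bracket or space
    local re="^ *('[^']+'|\[[^]]+\]|[^'[ ]+)(.*)"
    while [[ $rest =~ $re ]]; do
        printf '%s\n' "${BASH_REMATCH[1]}"   # the token just matched
        rest=${BASH_REMATCH[2]}              # continue with the remainder
    done
}

tokenize "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
```

Like the grep version, this treats a bracket or quote in the middle of a word as the start of a new token, so a word such as ^[0-9].* is split into several pieces.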
Sed:
Code: Select all
sed -E "s/ ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+)/\1\n/g" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
Matching Patterns:
'[^']+' Matches substrings enclosed in single quotes.
\[[^]]+\] Matches substrings enclosed in brackets.
[^'\[[:space:]]+ Matches any sequence of characters that are not single quotes, brackets, or spaces.
Substitution:
 ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+) Matches any of the patterns above, preceded by an optional space.
\1\n Replaces each match with itself followed by a newline.
PS. if you need to remove the last trailing newline add s/\n$//
Code: Select all
sed -E "s/ ?('[^']+'|\[[^]]+\]|[^'\[[:space:]]+)/\1\n/g;s/\n$//" <<< "Lorem 'ipsum dolor sit' amet 'consecur' adipis [elit sed do] [eius] mod"
Re: text splitting problem
Posted: Thu Jul 04, 2024 6:57 am
by retiredt00
This is a very informative thread and very close to what I would like to do. I have a file in which some lines contain a mix of letters, symbols and numbers, while other lines contain only numbers, only letters or only symbols. I would like to eliminate the lines that have only numbers or only symbols.
The problem I have is that all the lines in the file have spaces at the front that I do not want to eliminate, as the spacing is important for the layout.
Code: Select all
bbbb
aaaaa
bbaaabbbbb > 123
* | *
123
456
ccc
abcdefghi || 789
45
to become this
Code: Select all
bbbb
aaaaa
bbaaabbbbb > 123
ccc
abcdefghi || 789
or this
Code: Select all
bbbb
aaaaa
bbaaabbbbb > 123
123
456
ccc
abcdefghi || 789
45
Re: text splitting problem
Posted: Thu Jul 04, 2024 9:02 am
by MochiMoppel
@superhik
Thanks for the new variants. Well, they work for Lorem but not for my "real life" string abc *[\ ]* ^[0-9].* $'a b c' $'\n'
(see my post of 2023-11-17)
I usually prefer sed or grep but here they are not only slower than @HerrBert's approach but probably not suitable at all.
@retiredt00
Your problem may be not that "all the lines in the files have spaces in the front" because some don't. Your line 456 starts with a tab, not a space.
To get the desired results you posted:
Code: Select all
sed '/^[[:blank:]]*[^a-z]*$/d' /path/to/file
or
Code: Select all
sed '/^[[:blank:]]*[^0-9a-z]*$/d' /path/to/file
Re: text splitting problem
Posted: Thu Jul 04, 2024 2:06 pm
by retiredt00
Thank you MochiMoppel but this still does not work
It works in the example I provided but not in the actual file, a part of which I attach.
This is part of an old DOS file that dos2unix was transforming into 1 line, so I used sed -e "s/\r/\n/g" filename to make it Unix-like, and now the number lines are not removed by the sed '/^[[:blank:]]*[^0-9]*$/d' /path/to/file command.
Re: text splitting problem
Posted: Thu Jul 04, 2024 3:24 pm
by MochiMoppel
retiredt00 wrote: Thu Jul 04, 2024 2:06 pm
Thank you MochiMoppel but this still does not work
It works in the example I provided but not in the actual file
Then you can hardly call it an example
now the number lines are not removed by the sed '/^[[:blank:]]*[^0-9]*$/d' /path/to/file
command
Of course not. That's not the code I posted.
I'm not sure anymore what you want. To remove lines containing only numbers and only non alphanumerics you could try
sed '/^[0-9 ]*$/d ; /^[^0-9a-zA-Z]*$/d' test.txt
Above sed command consists of 2 patterns:
/^[0-9 ]*$/d
deletes lines containing only numbers and spaces (mind the space after 0-9 !)
/^[^0-9a-zA-Z]*$/d
deletes lines containing no alphanumerics ; e.g. | |* || | *|
This example
Code: Select all
>HI <RI P0I <qII stMI <III >ciI
>HI <AII Ns fcI T <AI >Ea BJI HpII
| | || | | | | | |||||||| ||| |||
> 250
* 1>
<
| |* || | *| |*| | |||||||| ||| |||
126 134 149 156 162 169
126 134 149 160 169
will be converted to
Code: Select all
>HI <RI P0I <qII stMI <III >ciI
>HI <AII Ns fcI T <AI >Ea BJI HpII
> 250
* 1>
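The effect of the two patterns can be checked on a few throwaway lines. This is only a sketch: the function name filter_lines and the sample lines are made up for illustration, and the sed expression is the one quoted above.

```bash
#!/bin/bash
# filter_lines is a made-up wrapper around the two-pattern sed command from the post.
filter_lines() {
    # 1st pattern: delete lines made of digits and spaces only
    # 2nd pattern: delete lines without any alphanumeric character
    sed '/^[0-9 ]*$/d ; /^[^0-9a-zA-Z]*$/d'
}

printf '%s\n' '  >HI <RI' ' | |* ||' ' 126 134' '   > 250' | filter_lines
```

Only the first and the last sample line survive. Both patterns also match empty lines, so those are removed as well. The first pattern covers spaces but not tabs; writing the class as [0-9[:blank:]] would include tab-indented number lines too.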
Re: text splitting problem
Posted: Fri Jul 05, 2024 1:08 am
by superhik
MochiMoppel wrote: Thu Jul 04, 2024 9:02 am
@superhik
Thanks for the new variants. Well, they work for Lorem but not for my "real life" string abc *[\ ]* ^[0-9].* $'a b c' $'\n'
(see my post of 2023-11-17)
I usually prefer sed or grep but here they are not only slower than @HerrBert's approach but probably not suitable at all.
There are too many variations that don't work with @HerrBert's approach, like abc *[\ ]* ^[a-z0-9].* $'a b c' $'\n'.
The next script loops through each character. Outside of the bracketed and quoted strings, it replaces each space with a temporary replacement string, _@@@_, which is unlikely to occur in the input. IFS is then set to that replacement string, and another loop prints every element while skipping the empty ones, which come from consecutive spaces outside of the bracketed and quoted strings.
Code: Select all
#!/bin/bash
#input="abc *[\ ]* ^[0-9].* $'a b c' $'\n'"
input="abc *[\ ]* ^[a-z0-9].* $'a b c' $'\n'"
output=""
in_quotes=false
in_brackets=0
for (( i=0; i<${#input}; i++ )); do
char="${input:$i:1}"
if [[ $char == "'" ]]; then
if $in_quotes; then
in_quotes=false
else
in_quotes=true
fi
output+="$char"
elif [[ $char == "[" ]]; then
((in_brackets++))
output+="$char"
elif [[ $char == "]" && $in_brackets -gt 0 ]]; then
((in_brackets--))
output+="$char"
elif [[ $char == " " ]] && [[ $in_quotes == false && $in_brackets -eq 0 ]]; then
output+="_@@@_"
else
output+="$char"
fi
done
OLDIFS=$IFS
IFS="_@@@_"
for w in $output; do
[[ -z "$w" ]] && continue
echo "$w"
done
IFS=$OLDIFS
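One caveat with the marker trick: IFS is a set of single-character separators, not a string, so IFS="_@@@_" splits on every _ and every @ in the text. A possible variant, sketched here with made-up names (split_blocks, words), keeps the same state machine but appends finished words directly to an array, so no marker and no IFS change are needed.

```bash
#!/bin/bash
# split_blocks and the global array "words" are illustrative names, not from the thread.
split_blocks() {
    local input=$1 char word= in_quotes=0 in_brackets=0 i
    words=()
    for (( i=0; i<${#input}; i++ )); do
        char=${input:i:1}
        case $char in
            "'") in_quotes=$(( 1 - in_quotes )) ;;               # toggle quote state
            "[") (( in_brackets++ )) ;;
            "]") (( in_brackets > 0 )) && (( in_brackets-- )) ;;
        esac
        if [[ $char == " " ]] && (( in_quotes == 0 && in_brackets == 0 )); then
            [[ -n $word ]] && words+=("$word")   # skip empty words between repeated spaces
            word=
        else
            word+=$char
        fi
    done
    [[ -n $word ]] && words+=("$word")           # last word, if any
    return 0
}

split_blocks "user_name@host [a b] 'c  d'"
printf '%s\n' "${words[@]}"
```

With the marker approach the first word of this sample would be cut at the underscore and the at sign; here it stays intact, and the double space inside the quotes is preserved.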
Re: text splitting problem
Posted: Fri Jul 05, 2024 8:18 am
by retiredt00
MochiMoppel wrote: Thu Jul 04, 2024 3:24 pm
To remove lines containing only numbers and only non alphanumerics you could try
sed '/^[0-9 ]*$/d ; /^[^0-9a-zA-Z]*$/d' test.txt
Above sed command consists of 2 patterns:
/^[0-9 ]*$/d
deletes lines containing only numbers and spaces (mind the space after 0-9 !)
/^[^0-9a-zA-Z]*$/d
deletes lines containing no alphanumerics ; e.g. | |* || | *|
It worked!
Thanks
Re: text splitting problem
Posted: Fri Jul 05, 2024 2:54 pm
by superhik
Made a small app for this in C.
Let's name it Block Splitter or bsplit.
Save it as bsplit.c
and compile with gcc bsplit.c -o bsplit -std=c11 -Wall
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <sys/select.h>
#include <sys/stat.h>
// Define ssize_t if it's not available
#ifndef HAVE_SSIZE_T
typedef long ssize_t;
#endif
void printHelp(const char *program_name) {
printf("Usage: %s [-b block_start] [-e block_end] [-f file]\n", program_name);
printf("Options:\n");
printf(" -b <block_start> Specify the start character of a block (can be repeated)\n");
printf(" -e <block_end> Specify the end character of a block corresponding to the last specified start character\n");
printf(" -f <file> read input from a file instead of stdin\n");
printf(" -h print out help\n");
}
// Implement a simple getline function if getline is not available
ssize_t custom_getline(char **lineptr, size_t *n, FILE *stream) {
ssize_t rd = 0;
char *buf = NULL;
size_t bufsize = 0;
int c;
if (lineptr == NULL || n == NULL || stream == NULL) {
return -1;
}
buf = *lineptr;
bufsize = *n;
*lineptr = NULL;
*n = 0;
while ((c = fgetc(stream)) != EOF) {
if ((size_t)rd + 1 >= bufsize) { // keep room for the terminating '\0'
bufsize += 128;
buf = realloc(buf, bufsize);
if (buf == NULL) {
return -1;
}
}
buf[rd++] = c;
if (c == '\n') {
break;
}
}
if (rd == 0 || ferror(stream)) {
free(buf);
return -1;
}
// Trim newline character if present
if (buf[rd - 1] == '\n') {
buf[rd - 1] = '\0';
rd--; // Adjust read length
}
buf[rd] = '\0';
*lineptr = buf;
*n = bufsize;
return rd;
}
typedef struct {
char start;
char end;
} BlockDelimiter;
void printWords(const char *input, BlockDelimiter *block_delimiters, int num_blocks) {
bool in_block = false;
char current_word[100]; // Assuming maximum word length
int current_index = 0;
size_t input_len = strlen(input);
for (size_t i = 0; i < input_len; ++i) {
char ch = input[i];
for (int j = 0; j < num_blocks; j++) {
if (ch == block_delimiters[j].start && !in_block) {
in_block = true;
break;
} else if (ch == block_delimiters[j].end && in_block) {
in_block = false;
break;
}
}
if (ch == ' ' && !in_block) {
if (current_index > 0) {
current_word[current_index] = '\0';
printf("%s\n", current_word);
fflush(stdout);
current_index = 0;
}
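} else if (current_index < (int)sizeof(current_word) - 1) {
// Characters beyond the buffer capacity are dropped instead of overflowing it
current_word[current_index++] = ch;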
}
}
// Print the last word if any
if (current_index > 0) {
current_word[current_index] = '\0';
printf("%s\n", current_word);
fflush(stdout);
}
}
int main(int argc, char *argv[]) {
FILE *input_file = NULL;
BlockDelimiter block_delimiters[10]; // Support up to 10 different block delimiters
int num_blocks = 0;
int opt;
while ((opt = getopt(argc, argv, "b:e:f:h")) != -1) {
switch (opt) {
case 'b':
if (num_blocks < 10) {
block_delimiters[num_blocks].start = optarg[0];
}
break;
case 'e':
if (num_blocks < 10) {
block_delimiters[num_blocks].end = optarg[0];
num_blocks++;
}
break;
case 'f':
input_file = fopen(optarg, "r");
if (input_file == NULL) {
perror("Error opening file");
exit(EXIT_FAILURE);
}
break;
case 'h':
printHelp(argv[0]);
exit(0);
default:
printHelp(argv[0]);
exit(EXIT_FAILURE);
}
}
char *input = NULL;
size_t len = 0;
ssize_t rd;
if (input_file) {
while ((rd = custom_getline(&input, &len, input_file)) != -1) {
// Process each line here
if (rd > 0 && input[rd - 1] == '\n') {
input[rd - 1] = '\0'; // Remove newline if present
}
printWords(input, block_delimiters, num_blocks);
}
} else if (argc > optind) {
// Process remaining arguments as input
size_t input_len = 0;
for (int i = optind; i < argc; i++) {
input_len += strlen(argv[i]) + 1; // +1 for space
}
input = malloc(input_len + 1); // +1 for null terminator
if (input == NULL) {
perror("malloc");
exit(EXIT_FAILURE);
}
input[0] = '\0';
for (int i = optind; i < argc; i++) {
strcat(input, argv[i]);
if (i < argc - 1) {
strcat(input, " ");
}
}
rd = strlen(input);
printWords(input, block_delimiters, num_blocks);
} else {
// Check if stdin is connected to a terminal
int fd = fileno(stdin); // Get the file descriptor for stdin
// Use fstat to check if input is available on stdin
struct stat statbuf;
if (fstat(fd, &statbuf) == 0 && (S_ISFIFO(statbuf.st_mode) || S_ISREG(statbuf.st_mode))) {
// Input is available from a pipe or redirection
// stdin is a pipe or a regular file, proceed with reading from it
while ((rd = custom_getline(&input, &len, stdin)) != -1) {
// Process each line here
if (rd > 0 && input[rd - 1] == '\n') {
input[rd - 1] = '\0'; // Remove newline if present
}
printWords(input, block_delimiters, num_blocks);
}
} else {
// No input provided via arguments, file, or stdin
fprintf(stderr, "No input provided via arguments, file, or stdin.\n");
exit(EXIT_FAILURE);
}
}
if (input_file) {
fclose(input_file);
}
free(input);
return 0;
}
Up to 10 blocks can be defined.
Running:
Code: Select all
echo "\"a b c\"1 *[\ ]* ^[a-z0-9].* $'a b c' $'\n'" | ./bsplit -b '[' -e ']' -b "'" -e "'" -b '"' -e'"'
while :; do echo "\"a b c\"1 *[\ ]* ^[a-z0-9].* $'a b c' $'\n'"; sleep 1; done | ./bsplit -b '[' -e ']' -b "'" -e "'" -b '"' -e'"'
./bsplit -b '[' -e ']' -b "'" -e "'" -b '"' -e'"' -f "file"
./bsplit -b '[' -e ']' -b "'" -e "'" -b '"' -e'"' -b '<' -e '>' <<<"\"a b c\"1 *[\ ]* ^[a-z0-9].* $'a b c' $'\n' < a >"
It uses a method similar to the one I posted before (negative-lookahead-like). It's slightly faster than the shell version.
Essentially, bsplit replaces spaces with newlines and has an option to define a range between two characters where the spaces are ignored, so blocks are preserved.
@MochiMoppel it's funny how a simple script can become a full application.
Re: text splitting problem
Posted: Fri Jul 12, 2024 9:55 am
by MochiMoppel
superhik wrote: Fri Jul 05, 2024 1:08 am
There are too many variations that don't work with @HerrBert's approach like abc *[\ ]* ^[a-z0-9].* $'a b c' $'\n'.
Well, *this* one would work. I think you mean the example you use for your script, which uses multiple spaces as word separators. That would indeed be problematic with @HerrBert's approach. For my use case it's irrelevant because my strings never have consecutive spaces. I always "sanitize" them with a string=$(echo $string) command, which removes any leading, trailing and consecutive whitespace.
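One thing to keep in mind with string=$(echo $string): the unquoted expansion is also subject to pathname expansion, so a pattern-like word such as *[\ ]* could be replaced by matching file names. A glob-safe sketch of the same collapsing follows; squeeze_spaces is a hypothetical helper name, and default IFS is assumed.

```bash
#!/bin/bash
# squeeze_spaces is a hypothetical helper: collapses whitespace like
# string=$(echo $string), but without pathname expansion.
squeeze_spaces() {
    local -a words
    read -r -d '' -a words <<< "$1"    # split on IFS whitespace only, backslashes kept
    printf '%s\n' "${words[*]}"        # rejoin with single spaces
}

string="  abc   *[\ ]*  ^[a-z0-9].* "
string=$(squeeze_spaces "$string")
echo "$string"
```

As with the echo version, consecutive spaces inside the [...] and '...' compounds are collapsed too.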
It would be easy though to preserve consecutive whitespace within the [...] and '...' compounds by temporarily replacing double spaces with a letter unlikely to appear in the text. I normally use hex01:
Code: Select all
string="abc *[\ ]* ^[a-z 0-9].* $'a b c' $'\n'"
TMP=$IFS
IFS=' '
for n in ${string//  /$'\x1'}; do # hide double spaces from word splitting
n=${n//$'\x1'/  } # restore the double spaces
[[ $n = *[[\'\]]* ]] && ((i++))
[[ $n = *[[\']*[]\']* ]] && i=0
((i%2)) && echo -n "$n " || echo "$n"
done
IFS=$TMP
@MochiMoppel it's funny how a simple script can become a full application.
Now that we have a solution all we need is a suitable problem.
Re: text splitting problem
Posted: Sat Jul 13, 2024 9:06 am
by superhik
MochiMoppel wrote: Fri Jul 12, 2024 9:55 am
@MochiMoppel it's funny how a simple script can become a full application.
Now that we have a solution all we need is a suitable problem.
The Engineering of Consent, "use of an engineering approach - that is, action based only on thorough knowledge of the situation and on the application of scientific principles and tried practices to the task of getting people to support ideas and programs."