Convert Files to Text
Posted: Tue Nov 10, 2020 6:54 am
I wrote this script because LibreOffice doesn't have an option to write the converted output to stdout. I'm using text conversion as part of an indexer (see post).
Here is some simple code (first stab), which works:
Code: Select all
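#!/bin/bash
# Convert the file given as $1 to plain text and print it on stdout.
if command -v soffice >/dev/null 2>&1; then
    # Use one temp directory name throughout so the cleanup removes what was created.
    tmpdir="./tmp_s243a_convert_$$"
    mkdir -p "$tmpdir"
    soffice --headless --convert-to txt:Text --outdir "$tmpdir" "$1" #2>/dev/null >/dev/null
    # cat with the directory prefix; bare names from ls would be looked up in the current directory.
    cat "$tmpdir"/*
    rm -rf "$tmpdir"
fi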
Possible modifications: allow the filename to be read from stdin, and maybe add an option to also show the converter's standard error output.
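A minimal sketch of those two modifications, written as helper functions so they can be dropped into the script. The names get_input_file, run_quietly and the SHOW_STDERR variable are made up for illustration and are not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: print the file to convert. It uses the first
# argument if one was given, otherwise the first line read from stdin.
get_input_file() {
    local line
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        IFS= read -r line
        printf '%s\n' "$line"
    fi
}

# Hypothetical helper: run a command, discarding its stderr unless
# SHOW_STDERR=1 is set in the environment (an assumed convention).
run_quietly() {
    if [ "${SHOW_STDERR:-0}" = 1 ]; then
        "$@"
    else
        "$@" 2>/dev/null
    fi
}
```

In the script this could be used as infile=$(get_input_file "$1") followed by run_quietly soffice --headless ... "$infile", so that echo report.odt | ./convert.sh and ./convert.sh report.odt behave the same (convert.sh and report.odt are placeholder names).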
I would like to expand this to use other conversion utilities and also use "file" to do some MIME type checking. Here is some draft code (not tested):
Code: Select all
#!/bin/bash
# Draft: convert "$1" to plain text on stdout with whichever converter is installed.
mime=$(file -b --mime-type "$1")   # -b prints only the type, without the filename
if command -v soffice >/dev/null 2>&1; then
    tmpdir="./tmp_s243a_convert_$$"
    mkdir -p "$tmpdir"
    soffice --headless --convert-to txt:Text --outdir "$tmpdir" "$1" #2>/dev/null >/dev/null
    cat "$tmpdir"/*
    rm -rf "$tmpdir"
elif command -v unoconv >/dev/null 2>&1; then
    unoconv --stdout -f text "$1"
elif [[ "$mime" == */rtf ]]; then
    echo "Not implemented yet" >&2
    #Possible utilities:
    #textutil
    #unrtf https://superuser.com/questions/243084/rtf-to-txt-on-unix
    #wv
elif [[ "$mime" == */vnd.oasis.opendocument* ]]; then
    if command -v odf2txt >/dev/null 2>&1; then
        odf2txt "$1" #Not sure if this can handle all open document formats
    fi
elif command -v antiword >/dev/null 2>&1; then #Doesn't work for rtf files which may have a .doc extension.
    antiword "$1"
fi
#--filter=application/msword:'unoconv --stdout -f text'
#http://hitekhedhelp.blogspot.com/2011/08/omega-overview.html
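As a rough sketch of the MIME type checking idea, the choice of converter could be separated from running it by mapping the type to a tool name with a case statement. The function name pick_converter is made up, and the mapping only reflects the tools mentioned in the draft above (unrtf is just one of the untested RTF candidates), so treat it as an assumption rather than a tested setup.

```shell
#!/bin/bash
# Hypothetical helper: print the name of the converter to try for a
# given MIME type. The mapping is an assumption based on the draft above.
pick_converter() {
    case "$1" in
        */rtf)                     echo unrtf ;;     # untested candidate for RTF
        application/msword)        echo antiword ;;  # old binary .doc files
        */vnd.oasis.opendocument*) echo odf2txt ;;   # OpenDocument formats
        *)                         echo soffice ;;   # fall back to LibreOffice
    esac
}
```

It would be called as pick_converter "$(file -b --mime-type "$1")", and the result can then be checked with command -v before running it.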