Convert Files to Text

Posted: Tue Nov 10, 2020 6:54 am
by s243a

I wrote this script because LibreOffice doesn't have an option to write converted output to stdout. I'm using text conversion as part of an indexer (see post).

Here is some simple code (first stab), which works:

Code:

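#!/bin/bash
# Convert the document given as $1 to plain text on stdout via LibreOffice.
if command -v soffice >/dev/null 2>&1; then
  tmpdir=./tmp_s243a_convert_$$   # one name for the scratch directory throughout
  mkdir -p "$tmpdir"
  # Send soffice's own messages to stderr so stdout carries only the text.
  soffice --headless --convert-to txt:Text --outdir "$tmpdir" "$1" >&2
  cat "$tmpdir"/*
  rm -rf "$tmpdir"
fi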

Possible modifications: allow the filename to be provided on stdin, and maybe add an option to also pass through the converter's standard error output.
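
A minimal sketch of both modifications, not part of the script above: file names come from the arguments or, if there are none, from stdin (one per line), and a -e flag lets the converter's error output through. The names to_text and convert_one and the -e flag are made up for illustration; convert_one just cats the file here and stands in for the soffice-based conversion.

```shell
#!/bin/bash
# Placeholder for the real conversion (e.g. the soffice steps); here it just
# prints the file so the sketch runs without LibreOffice installed.
convert_one() {
  cat -- "$1"
}

# to_text [-e] [file ...]
#   -e  (hypothetical flag) show the converter's stderr instead of discarding it
# With no file arguments, file names are read from stdin, one per line.
to_text() {
  show_errors=0
  OPTIND=1
  while getopts "e" opt; do
    case $opt in
      e) show_errors=1 ;;
      *) return 2 ;;
    esac
  done
  shift $((OPTIND - 1))

  # Produce the list of file names, from the arguments or from stdin.
  if [ "$#" -gt 0 ]; then
    printf '%s\n' "$@"
  else
    cat
  fi |
  while IFS= read -r f; do
    # </dev/null keeps the converter from swallowing the rest of the name list.
    if [ "$show_errors" -eq 1 ]; then
      convert_one "$f" </dev/null
    else
      convert_one "$f" </dev/null 2>/dev/null
    fi
  done
}
```

Usage would be something like `to_text -e report.doc` or `find . -name '*.odt' | to_text`.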

I would like to expand this to use other conversion utilities and also use "file" to do some MIME type checking. Here is some draft code (not tested):

Code:

#!/bin/bash
# Draft: use whichever converter is installed and write plain text to stdout.
mime=$(file --brief --mime-type "$1")   # e.g. application/msword
if command -v soffice >/dev/null 2>&1; then
  tmpdir=./tmp_s243a_convert_$$
  mkdir -p "$tmpdir"
  soffice --headless --convert-to txt:Text --outdir "$tmpdir" "$1" >&2
  cat "$tmpdir"/*
  rm -rf "$tmpdir"
elif command -v unoconv >/dev/null 2>&1; then
  unoconv --stdout -f text "$1"
elif [[ $mime == */rtf ]]; then
  echo "Not implemented yet" >&2
  # Possible utilities:
  #   textutil
  #   unrtf https://superuser.com/questions/243084/rtf-to-txt-on-unix
  #   wv
elif [[ $mime == */vnd.oasis.opendocument* ]]; then
  # Assuming the odt2txt tool was meant here.
  if command -v odt2txt >/dev/null 2>&1; then
    odt2txt "$1" # Not sure if this can handle all OpenDocument formats
  fi
elif command -v antiword >/dev/null 2>&1; then # Doesn't work for RTF files, which may have a .doc extension.
  antiword "$1"
fi


#--filter=application/msword:'unoconv --stdout -f text'
#http://hitekhedhelp.blogspot.com/2011/08/omega-overview.html
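
To make the MIME type checking easier to test on its own, the type-to-converter mapping could be pulled into a small function. This is only a sketch: pick_converter is a hypothetical helper name, the converter names are taken from the branches and comments of the draft (with odt2txt assumed for the OpenDocument case), and application/msword is the type from the filter line above.

```shell
#!/bin/bash
# Map a MIME type, as printed by `file --brief --mime-type`, to the name of
# the converter to try. Prints "unknown" when no branch matches.
pick_converter() {
  case $1 in
    */rtf)                     echo unrtf ;;     # RTF, even with a .doc extension
    */vnd.oasis.opendocument*) echo odt2txt ;;   # assumed tool for OpenDocument files
    application/msword)        echo antiword ;;  # old binary Word documents
    *)                         echo unknown ;;
  esac
}
```

It would be called as `pick_converter "$(file --brief --mime-type "$1")"`, and the result checked with `command -v` before running it.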