xapian indexing software for searching

Posted: Sun Nov 08, 2020 11:14 pm
by s243a

I created slackbuilds for the following programs (Compiled in Fatdog810b):

xapian-core-1.4.17-x86_64-1_SBo.tgz
https://slackbuilds.org/repository/14.1 ... pian-core/
*slackbuild modified to use newer source

xapian-omega-1.4.17-x86_64-1_SBo.tgz
https://slackbuilds.org/repository/14.2 ... ian-omega/
*slackbuild modified to use newer source

xapian-bindings-1.4.17-x86_64-2_SBo.tgz
https://slackbuilds.org/repository/14.2 ... -bindings/
*slackbuild modified to use newer source

These programs can be used for things like indexing directories and web pages. They support many document types, such as HTML, PDF and office documents. I tested this and it works.
Create an index with the following command:

Code: Select all

omindex -p --db info --url documents /mnt/data0/Documents

https://www.ibm.com/developerworks/libr ... index.html
https://manpages.ubuntu.com/manpages/ar ... dex.1.html

Query the database as follows:

Code: Select all

quest --db=info redbook

https://www.ibm.com/developerworks/libr ... index.html

Some related links:

https://wiki.python.org/moin/HelpOnXapi ... g_an_index
https://xapian.org/download
https://github.com/xapian/xapian-docspr ... de/python3
https://getting-started-with-xapian.rea ... ample-code
https://xapian.org/docs/

Alternatives to xapian:
https://unix.stackexchange.com/question ... t-indexing
https://www.tecmint.com/count-word-occu ... text-file/
http://swishplusplus.sourceforge.net/
https://web.archive.org/web/20061223111 ... ish-e.org/
https://en.wikipedia.org/wiki/SWISH-E
https://metacpan.org/pod/SWISH
https://www.linuxjournal.com/article/6652


Re: xapian indexing software for searching

Posted: Sun Nov 08, 2020 11:35 pm
by s243a

So here's a usage example. In my Dropbox, I have a number of folders for various job applications, which contain resumes and job ads. I might want to search the documents for a term, such as "CAD", to find one to use as an example cover letter.

First navigate to the folder, and then create an index for it as follows:

Code: Select all

omindex -p --db ~/info_s243a_personal --url dropbox .

dropbox is the first part of the URL after the domain. This is just an abstraction; I should have used a longer path than just "dropbox", because the folder I indexed is nested more deeply in my Dropbox than this. "." is the directory that I'm indexing. I navigated to this folder before running the command, which keeps me from having to type out the whole path.
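
To get the "longer path" mentioned above without typing it by hand, the --url value could be derived from the current directory. This is only a sketch: url_prefix is a helper name I made up, the /root/Dropbox location in the usage comment is an assumed sync folder, and it does not do any of the URL escaping that omindex may apply to file names.

```shell
#!/bin/sh
# url_prefix ROOT DIR [LABEL]   (hypothetical helper, not part of xapian)
# Print a --url value for DIR that keeps its path relative to ROOT.
# LABEL is the first URL component and defaults to "dropbox".
url_prefix() {
    root=${1%/}            # drop one trailing slash, if any
    dir=${2%/}
    label=${3:-dropbox}
    case "$dir" in
        "$root")   printf '%s\n' "$label" ;;
        "$root"/*) printf '%s/%s\n' "$label" "${dir#"$root"/}" ;;
        *)         echo "url_prefix: $dir is not inside $root" >&2
                   return 1 ;;
    esac
}

# Possible use from inside the folder being indexed (paths are assumptions):
#   omindex -p --db ~/info_s243a_personal \
#       --url "$(url_prefix /root/Dropbox "$PWD")" .
```

With this, a folder such as /root/Dropbox/jobs/2020 would be indexed under dropbox/jobs/2020 instead of just dropbox.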

Now I can search for the term as follows:

Code: Select all

quest --db ~/info_s243a_personal CAD

*quest is installed as part of xapian-core
** info_s243a_personal is a directory (i.e. the database) in "~" (i.e. my root user home directory), so the path to it has to be given unless quest is run from "~"

I'm using maestral to access dropbox. Roughly it can be installed as follows:

Code: Select all

python3 -m pip install --upgrade pip
python3 -m pip install --upgrade maestral
python3 -m pip install --upgrade maestral[gui] 

It requires devX to be loaded and may also need some other dependencies. For instance, see:
http://www.murga-linux.com/puppy/viewto ... 26#1046607

The search with xapian was very fast. It was much faster than even using the find command to search just file names, yet it was searching file contents.


Re: xapian indexing software for searching

Posted: Mon Nov 09, 2020 12:24 am
by s243a

So in my example above, I successfully indexed ".pdf" and ".html" documents but not ".doc" documents. I think that to do so I need to specify a filter for the omindex command. For example:

Code: Select all

--filter=application/msword:'abiword --to=txt --to-name=fd://1'

https://xapian.org/docs/omega/overview.html

Some alternative commands:

Code: Select all

soffice --headless --convert-to txt:Text YOUR-DOCUMENT-HERE.DOC

https://ask.libreoffice.org/en/question ... ext-files/
*LibreOffice must not already be running when this command is used

Code: Select all

odt2txt YOUR-DOCUMENT-HERE.DOC

Edit: I think the full indexing command should look like this:

Code: Select all

 omindex --filter=application/msword:'soffice --headless --convert-to txt:Text fd://1' -p --db ~/info_john_personal --url dropbox .

I will post whether or not this works in the next post.
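
One thing to watch with the soffice variant: as far as I know, soffice --convert-to writes the converted file into an output directory rather than to stdout, and fd://1 is AbiWord's way of naming stdout, so soffice will probably not understand it. As I read the omega docs, omindex takes the filter's text from stdout by default and appends the file name to the command, so a small wrapper script may be needed. Below is a minimal sketch of such a wrapper; doc2txt and run_converter are names I made up, and I have not tested it with omindex.

```shell
#!/bin/sh
# Hypothetical filter wrapper: convert one document to plain text on stdout.

# run_converter SRC OUTDIR
# Assumption: LibreOffice writes OUTDIR/<name-without-extension>.txt
run_converter() {
    soffice --headless --convert-to txt:Text --outdir "$2" "$1" >/dev/null 2>&1
}

# doc2txt SRC
# Convert SRC in a temporary directory, print the text, then clean up.
doc2txt() {
    src=$1
    tmpdir=$(mktemp -d) || return 1
    run_converter "$src" "$tmpdir"
    base=$(basename "$src")
    out="$tmpdir/${base%.*}.txt"
    status=1
    if [ -f "$out" ]; then
        cat "$out"
        status=0
    fi
    rm -rf "$tmpdir"
    return "$status"
}

# When saved as an executable script, convert the file named as the
# first argument (omindex is assumed to pass the file name last).
if [ "$#" -gt 0 ]; then
    doc2txt "$1"
fi
```

If this were saved somewhere on the PATH under a name like doc2txt, the filter option would become something like --filter=application/msword:doc2txt, with the rest of the omindex command unchanged.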


Re: xapian indexing software for searching

Posted: Wed Nov 11, 2020 7:55 am
by s243a

The quoted files in the original post will be replaced by the following:

xapian-core-1.4.17-x86_64-1_SBo.tgz
xapian-omega-1.4.17-x86_64-1_SBo.tgz
xapian-bindings-1.4.17-x86_64-2_SBo.tgz

I decided to try a newer version because with the older version the --filter option wasn't using my conversion script to convert rtf files. Instead it was using antiword. I don't have this issue with the newer version.

Here is the command I was testing with:

Code: Select all

omindex -i -v --filter='text/rtf:s243a-convert' -p --db ~/info_john_personal_test --url dropbox . --overwrite

Where s243a-convert is defined at:

In theory, I should also be able to use wildcards in the MIME type and filter by other things such as extension and encoding. I haven't tested this yet with the new package, though.


Re: xapian indexing software for searching

Posted: Fri Apr 09, 2021 2:28 am
by s243a
s243a wrote: Wed Nov 11, 2020 7:55 am

The quoted files in the original post will be replaced by the following:

xapian-core-1.4.17-x86_64-1_SBo.tgz
xapian-omega-1.4.17-x86_64-1_SBo.tgz
xapian-bindings-1.4.17-x86_64-2_SBo.tgz

The following dependency is needed:
chmlib-0.40-x86_64-1_SBo.tgz (download)