Hadoop w/ xapian-delve and AWK (I'm brainstorming my ideas here)



Post by s243a »

I've wanted for quite a while to learn how to use Hadoop to process data (e.g. Hadoop Streaming), and I've also been looking at indexing tools, specifically ones related to Xapian. For example:
xapian-omega: includes quest and omindex
xapian-tools: includes xapian-delve
recoll

Recoll seems to provide better indexing than omindex, and xapian-delve is nice because it can work with indexes generated by either omindex or recoll. Apart from that, the indexes generated by omindex and recoll aren't compatible with each other.

Here is an example of xapian-delve applied to an index generated by recoll:

Code:

xapian-delve -r 52 -d ./info_somebody_personal
Data for record #52:
url=file:///Remote/jobs/Companies/ABB/19-02-22 - Elec Eng/ABB - Cover Letter.doc
mtype=application/msword
fmtime=01578173440
origcharset=CP1252
fbytes=30720
pcbytes=30720
dbytes=1058
sig=307201578173440
abstract=?!#@ ABB. sombody  somewhere                                           February 25, 2019  Dear Hiring Manager  I am applying
filename=ABB - Cover Letter.doc

Term List for record #52: 10 110 2019 25 310 403 4411 4b4 4e8 6 7901 827 D20200104 M202001 Q/Remote/jobs/Companies/ABB/19-02-22 - Elec Eng/ABB - Cover Letter.doc| Tapplication/msword XCFNXXND XCFNXXST XCFNabb XCFNcover XCFNdoc XCFNletter XCFNletter.doc XEdoc XM77e988f31a04d58260c2f2e0974f2a6d XP XP19-02-22 - Elec Eng XPABB XPCompanies XPRemote XPjobs XSFNXXND XSFNXXST XSFNabb XSFNcover XSFNdoc XSFNletter XSFNletter.doc XSFSabb - cover letter.doc XXND XXST Y2020 ab abb add am and applying as c cable calgary cell com company components contact cover dear delivering department doc drawing drawings electrical enclosed eng engineer engineering equipment estimates experience experienced five for have hiring   how hvac i in included incorporated information interconnections into  knowledge last letter letter.doc lighting line manager material mccs me more my ne of on or p p.eng packages panel please position previous project projects requisitions resume review schedule se sincerely single skills st such swgr systems t2e t2g the these to transformers trays types ups value various were which within would years your
[root@Dpupbuster /mnt/sdd4] $ 
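
A dump like the one above is regular enough that a short AWK script can flatten it. What follows is only a rough sketch under my own assumptions: the function name delve_to_tsv is made up, the choice of columns (record number, url, mtype, fbytes) is arbitrary, and it assumes each key=value pair sits on a single line as in the example. It deliberately ignores the term list, since some terms (e.g. the Q and XP ones) contain spaces and can't be split safely on whitespace.

```shell
#!/bin/sh
# Flatten one `xapian-delve -r N -d DB` dump (read from stdin) into a single
# tab-separated line: docid, url, mtype, fbytes.
# delve_to_tsv is a hypothetical helper name, not part of any Xapian tool.
delve_to_tsv() {
    awk '
        # "Data for record #52:" -> remember the record number
        /^Data for record #[0-9]+:/ {
            docid = $0
            sub(/^Data for record #/, "", docid)
            sub(/:.*$/, "", docid)
            next
        }
        # The term list marks the end of the key=value section
        /^Term List for record/ { in_terms = 1; next }
        # key=value lines: split at the first "=" only, values may contain "="
        !in_terms && /^[A-Za-z]+=/ {
            eq = index($0, "=")
            field[substr($0, 1, eq - 1)] = substr($0, eq + 1)
        }
        END {
            if (docid != "")
                printf "%s\t%s\t%s\t%s\n", docid, field["url"], field["mtype"], field["fbytes"]
        }
    '
}

# Example use (needs xapian-delve and the index from the post):
#   xapian-delve -r 52 -d ./info_somebody_personal | delve_to_tsv
```

One tab-separated line per record is the shape that line-oriented tools (and Hadoop Streaming, which reads its input line by line) are most comfortable with.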

My idea is to use AWK (and xapian-delve) to get this information into a form that will be suitable for Hadoop. I like AWK because it is very fast at processing text files. One could parallelize this by splitting the database up into separate pieces (perhaps via some hash). Some links that might be helpful to my idea:

https://hadoop.apache.org/docs/current/ ... aming.html
https://dzone.com/articles/using-awk-and-friends-hadoop
https://stackoverflow.com/questions/166 ... ops-mapper
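
As for splitting things into pieces, here is a rough sketch of what I have in mind. It assumes the records have already been turned into tab-separated lines with the record number in the first column, and it uses "record number modulo N" as the simplest possible stand-in for a hash. The function name split_by_docid and its two arguments are made up for illustration.

```shell
#!/bin/sh
# Split tab-separated records (docid in column 1) read from stdin into
# N part files named PREFIX.0 ... PREFIX.(N-1), using docid modulo N.
# split_by_docid and its arguments are hypothetical names for this sketch.
split_by_docid() {
    parts=$1    # number of pieces to split into
    prefix=$2   # output file name prefix
    awk -F '\t' -v n="$parts" -v prefix="$prefix" '
        # Skip lines whose first column is not a plain number
        $1 !~ /^[0-9]+$/ { next }
        {
            # Pick the part file from the record number
            out = prefix "." ($1 % n)
            print > out
        }
    '
}

# Example use: split records.tsv into 4 pieces part.0 .. part.3
#   split_by_docid 4 part < records.tsv
```

Modulo on the record number spreads consecutive records evenly across the pieces. A real hash of the url would be the thing to try if the split should not depend on the numbering inside one particular index.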
