Erik Frey's project looks interesting, both for the networking techniques used and for its likely efficiency advantages. However, many applications don't need that degree of parallel computing. Even a single processor core can do a lot of computation these days, and the sort algorithm has nice scaling qualities from a computational efficiency standpoint.
I think one can slow their bash script down a lot if it has to invoke a lot of sub-processes for small tasks. I also think there are big efficiency gains when the typical file size that a given process handles is on the order of the block size of the file system.
The above is a bit speculative on my part, but I notice that md5sum is very fast when applied to a whole directory of small files. For example, one might type "md5sum $(ls -1 *.files)", "md5sum $(ls -1f)", or "md5sum $(ls -1 *.list; ls -1 *.md5sums)" as I do in some draft code.
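To get a rough feel for the sub-process overhead I'm speculating about, one can time a single batched md5sum call against a loop that spawns one md5sum per file (this is just an illustrative sketch over whatever small files are in the current directory, not part of my draft code):
Code: Select all
# hypothetical timing comparison: one md5sum process for many files
time md5sum *.list > /dev/null

# versus one md5sum sub-process per file
time for f in *.list; do md5sum "$f" > /dev/null; done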
When working with the amount of metadata used by a typical package manager, the sort and join commands can be very fast without needing parallel processing. I've been playing around with this more batch-style processing for package manager syncing and didn't need a bashreduce-type algorithm for computational speed.
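As a hedged illustration of that batch style (the file names and field layout here are made up, not my actual syncing code), joining two pre-sorted metadata tables on a package-name key is just:
Code: Select all
# hypothetical: merge package versions with package sizes on field 1 (the package name)
sort -t'|' -k1,1 versions.txt > versions.sorted
sort -t'|' -k1,1 sizes.txt    > sizes.sorted
join -t'|' -1 1 -2 1 versions.sorted sizes.sorted > pkg_meta.txt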
However, a bashreduce-like algorithm could be useful for combining partial datasets to produce a complete dataset. This is similar to a join, except the data isn't fully segregated into separate tables that are free of null entries. Here is my basic idea:
1. Say we have a bunch of lines with a common key. Then, for each line with that key, we do the following:
Code: Select all
split(line, fields, OFS)
# keep the first non-empty value seen for each field across
# all lines that share this key
if ( length(pkg) == 0 ){
pkg=fields[1]
}
if ( length(arch) == 0 ){
arch=fields[2]
sub(/^[:]/, "", arch)   # strip the leading ":" from the ":arch" field
}
if ( length(ver) == 0 ){
ver=fields[3]
}
if ( length(pkgfile) == 0 ){
pkgfile=fields[4]
}
if ( length(dir_name) == 0 ){
dir_name=fields[5]
}
if ( length(filelist) == 0 ){
filelist=fields[6]
}
if ( length(md5sum) == 0 ){
md5sum=fields[7]
}
Then, once we've looped through all the data with a common key, we print the fields:
Code: Select all
print pkg, arch, ver, pkgfile, dir_name, filelist, md5sum
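To show the idea in a runnable form, here is a sketch of such a reducer (this is only a sketch, not the more complete code mentioned below; it assumes the shuffled input has one line per key, with instances joined by "@@@@", and that, because the key is kept in each instance, every instance's first "|" field is the package name):
Code: Select all
# field-merging reducer sketch: keep the first non-empty value seen for each field
awk '
BEGIN { OFS = "|" }
{
    n = split($0, recs, "@@@@")
    pkg = arch = ver = pkgfile = dir_name = filelist = md5sum = ""
    for (i = 1; i <= n; i++) {
        split(recs[i], fields, "|")
        if (pkg      == "") pkg      = fields[1]
        if (arch     == "") { arch = fields[2]; sub(/^[:]/, "", arch) }
        if (ver      == "") ver      = fields[3]
        if (pkgfile  == "") pkgfile  = fields[4]
        if (dir_name == "") dir_name = fields[5]
        if (filelist == "") filelist = fields[6]
        if (md5sum   == "") md5sum   = fields[7]
    }
    print pkg, arch, ver, pkgfile, dir_name, filelist, md5sum
}' shuffled_data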
More complete code can be found on pastebin. I think this technique might fall under a generalized version of mapreduce, or at least it is more general than the project I'm following, which is by sorhus:
https://github.com/sorhus/bash-reduce
http://sorhus.github.io/2016/03/27/bash-reduce-part-1/
Sorhus's bash-reduce implementation is easier to understand than Erik Frey's version but likely less efficient than Erik's (this is just speculation). However, as noted above, Sorhus's version is more than sufficient for the current application I'm looking at.
Anyway, a big thing to understand about how Sorhus's implementation works is how data with a common key is grouped together. This is done by a script called shuffle.awk, which unfortunately has a misleading name.
Usually mapreduce-type algorithms are used for numeric computations such as word count. For instance, each word might be mapped to (word, 1), where word is the key and one is the count for a single instance of that key. To get the total count for all words that are the same (i.e. have the same key), you sum up all the separate counts to produce the total count for that word. The shuffle algorithm compresses the representation by having a single line for a given key and then writing each value for each word instance as a separate field.
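As a concrete illustration of that standard word-count case (this is a generic sketch, not part of my package-manager code, and input.txt is just a stand-in name):
Code: Select all
# map: emit "word 1" for every word; shuffle: sort groups equal keys together;
# reduce: sum the counts per key
tr -s ' ' '\n' < input.txt \
  | awk 'NF { print $1, 1 }' \
  | sort \
  | awk '{ count[$1] += $2 } END { for (w in count) print w, count[w] }'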
However, in a non-standard map-reduce-like implementation, we might consider having multiple fields for each key, with each field representing a different type of information. In this case we might need two distinct separators: one to separate the unique fields associated with a given instance of a key, and a different separator to distinguish separate instances of that key. To accommodate this variation on the map-reduce genre of ideas I modified the so-called "shuffle.awk" function.
I also added the ability to repeat the key in the output of shuffle.awk for each instance of the key (the Keep_Key option), even though this is redundant since shuffle already outputs all items with the same key on a single line.
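To make the two separators concrete, here is roughly what one shuffled line could look like with shuffle_FS="|", shuffle_OFS="@@@@" and the key kept in each instance (the field values are invented for illustration; the exact layout comes from my modified shuffle.awk):
Code: Select all
# one line holding two instances of the key "bash": each instance carries its own
# seven "|" separated fields, and the instances are joined by "@@@@"
bash|:amd64||||bash.list|0123abcd0123abcd@@@@bash|:amd64|5.1-2|bash_5.1-2_amd64.deb|||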
The shuffle step happens after the map and before the reduce part of the algorithm. The shuffle step doesn't add much value, since one could alternatively watch for the key to change value in order to know when one is working with a different set of items that share a key. However, it does simplify the reducer somewhat and also makes it easier to parallelize the reduce part of the algorithm.
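For instance (hypothetically; nothing below is part of sorhus's code), since each shuffled line already holds every record for its key, one could split the shuffle output into chunks and reduce each chunk as a background job:
Code: Select all
# hypothetical parallel reduce over pre-shuffled data; reducer.awk is a stand-in name
split -n l/4 shuffled_data chunk_     # 4 line-based chunks, so no key straddles two chunks
for c in chunk_*; do
    awk -f reducer.awk "$c" > "$c.out" &
done
wait
cat chunk_*.out > reduced_data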
If you want to see the output of the shuffle algorithm, you can use an identity function as the reducer. Alternatively, you might use a grep-like function as the reducer to show only lines which have multiple instances of a single key. For a package manager application this could identify when one has multiple versions of the same package installed.
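A sketch of those two reducers (assuming the same "@@@@" instance separator as above; the real identity.awk in sorhus's repo may differ from this):
Code: Select all
# identity reducer: pass each shuffled line through so the grouping can be inspected
awk '{ print }' shuffled_data

# grep-like reducer: keep only keys with more than one instance,
# i.e. lines that still contain the instance separator
awk 'index($0, "@@@@") > 0' shuffled_data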
I should wrap up by showing how I call the bash-reduce-like algorithm. This can be seen at the end of the following code block:
Code: Select all
dpkg)
pfx=3
cd "$a_dir/info"; dir_bname=$(basename "$a_dir")
# hash every package file list, then rewrite the md5sum output into
# pipe-delimited records keyed on the package name
md5sum $(ls -1 *.list; ls -1 *.md5sums) "" \
| sed -r 's#^([^[:space:]]+)([[:space:]]+)([^[:space:]]+.*)([.])([^.]*)$#\3||dpkg|\3\4|\1|'"$bname"'#g' \
| sed -r 's#([^:]+)(:?[^\|]*\|)(.*)#\1|||||\2\3#g' \
| sort > "${outfile}"_"$pfx"_md5 #f1=pkg, f2=:arch, f3=ver, f4=pkgfile, f5=dir_name, f6=filelist, f7=md5sum
cd ..
# map the dpkg "status" metadata into the same pipe-delimited record format
cat status | awk -v db_dir=dpkg "$AWK_fn_map_deb" > "${outfile}"_"$pfx"_db
cd $CWD/..
# concatenate both partial datasets, then let bash-reduce merge the records per key
cat "${outfile}"_"$pfx"_md5 "${outfile}"_"$pfx"_db > "${outfile}"_"$pfx"_data
cd $CWD/..
./bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true ./dpkg_mappers/identity.awk ./dpkg_reducers/LB_reducer.awk "${outfile}"_"$pfx"_data > "${outfile}"_"$pfx"
What I want to do here is combine the info I can glean from the files which list the contents of each package with the metadata given in the "status" file produced by dpkg. The actual call to bash-reduce is here:
Code: Select all
./bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true ./dpkg_mappers/identity.awk ./dpkg_reducers/LB_reducer.awk "${outfile}"_"$pfx"_data > "${outfile}"_"$pfx"
Notice I'm passing some input variables to bash-reduce: "shuffle_OFS", which separates unique instances of items that have the same key, and "shuffle_FS", which separates the unique fields within a single instance of a key. The md5sum of the file that lists the contents of a package might serve as a flag to see whether that file list exists for the package. I could also use it as a check to see if there is a similar list of files in the puppy package manager. If the puppy package manager has the same file with a different hash, I could maybe try to put the two files into the same form so that a single file could be referenced by two hard links.
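As a rough sketch of that hard-link idea (hypothetical paths, and only the easy case where the two lists are already byte-identical):
Code: Select all
# if the dpkg file list and the puppy file list are identical, replace one with a
# hard link to the other so only a single copy is stored on disk
if cmp -s dpkg_lists/bash.list puppy_lists/bash.list; then
    ln -f dpkg_lists/bash.list puppy_lists/bash.list
fi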
One should also note that in this last example some of the processing is done without the bash-reduce algorithm, and as I noted above this is sufficiently fast for my application.