Overview
The goal of ppm-reduce is to apply algorithms similar to map-reduce in order to mine metadata from various package managers such as ppm (the puppy package manager) and dpkg. The metadata will be useful for several applications, such as:
1. Modularizing versions of puppy
2. Slimming versions of puppy
3. Loading and unloading symlinks into a root file system
4. Syncing package managers
These goals are distinct, but the intent is to apply a common methodology to all four.
History
The origins of ppm-reduce lie in sorhus's version of bash-reduce (see post). In sorhus's bash-reduce, the data is first mapped to a key by a user-supplied map function (written in awk), then grouped by key (using shuffle.awk), and finally each group is reduced to a single key-value pair that summarizes the group.
The code to do this is as follows:
Code: Select all
function execute() {
    # $1 is the input file; $2 holds optional extra awk arguments.
    awk $2 "$map" < $1 |    # map: emit one key-value pair per record
    sort -S 1G |            # bring identical keys together
    awk "$shuffle" |        # shuffle: collapse each key's values onto one line
    awk $2 "$reduce" |      # reduce: summarize each group into one pair
    sort -S 1G -k2nr -k1    # order results by value (descending), then key
}
https://github.com/sorhus/bash-reduce/b ... sequential
which can be done in parallel as follows:
Code: Select all
function execute() {
    # $1 is the input file, $2 holds extra awk arguments, and $3 passes
    # options (e.g. job count) through to GNU parallel.
    parallel --gnu --pipe $3 "awk $2 '$map' | sort -S 1G | awk '$shuffle'" < $1 |
    sort -S 1G |            # merge the chunks, bringing identical keys together
    awk "$shuffle" |        # re-shuffle across chunk boundaries
    parallel --gnu --pipe $3 "awk $2 '$reduce'" |
    sort -S 1G -k2nr -k1    # order results by value (descending), then key
}
}
https://github.com/sorhus/bash-reduce/b ... e/parallel
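To make the interface concrete, here is how a classic word count might be written against execute(); the map, shuffle, and reduce programs below (and the placeholder input.txt) are my own illustration, not code from the bash-reduce repository:
Code: Select all
# Illustrative word count (not from bash-reduce itself).
# map: emit each word with a count of 1
map='{ for (i = 1; i <= NF; i++) print $i, 1 }'
# shuffle: collapse sorted "word 1" lines into one "word 1 1 1 ..." line per word
shuffle='$1 != key { if (key != "") print line; key = $1; line = $1 }
         { line = line " " $2 }
         END { if (key != "") print line }'
# reduce: sum the grouped counts for each word
reduce='{ s = 0; for (i = 2; i <= NF; i++) s += $i; print $1, s }'
execute input.txt   # prints "word count" pairs, most frequent first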
The primary computational advantage of map-reduce-style algorithms is that sorting scales well to large data sets, with O(n log n) complexity. Additionally, each stage of the algorithm (map, shuffle, reduce) can be run in parallel; even the sorting itself can be parallelized.
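As an aside on that last point, GNU sort can itself spread the work across threads; a minimal sketch, assuming a GNU coreutils sort:
Code: Select all
# GNU sort can use several threads internally; here a 1G buffer
# and four sort threads (requires GNU coreutils).
sort -S 1G --parallel=4 < input.txt > sorted.txt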
Modifications to the bash-reduce algorithm
Modification #1 (Structured Values)
I modified the algorithm so that each result grouped under a single key can be composed of multiple fields, and so that the grouping key can be one of those fields. One application: each result might be missing data in some of its fields, but if you combine all the results for that key into a single group, you can fill in the missing field(s). This is a generalization of a join algorithm: in a join, each side is assumed to have its own unique fields, whereas an alternative is to have the same fields on both sides of the join, with each side missing some of the values for those fields.
So if for some reason a join isn't suitable (e.g. it doesn't handle unmatched keys well), then perhaps the operation is better thought of as a merge-reduce, which aims to behave like a join but uses an algorithm similar to map-reduce, as sketched below.
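A minimal sketch of such a reducer, assuming records are already sorted by key (field 1) and that missing fields are marked with "-"; the file name merge.awk and the record format are my own illustration, not the actual ppm-reduce code:
Code: Select all
# merge.awk (illustrative): merge all records that share a key,
# filling missing fields ("-") from the other records in the group.
# e.g.  "pkg-a 1.0 -" + "pkg-a - /usr/bin/a"  ->  "pkg-a 1.0 /usr/bin/a"
function flush(    i, line) {
    if (key == "") return
    line = key
    for (i = 2; i <= nf; i++)
        line = line OFS ((i in rec) ? rec[i] : "-")
    print line
}
{
    if ($1 != key) { flush(); key = $1; nf = 0; split("", rec) }
    for (i = 2; i <= NF; i++)
        if ($i != "-") rec[i] = $i
    if (NF > nf) nf = NF
}
END { flush() }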
Modification #2 (Allow Other Types of Mappers and Reducers besides AWK)
While I expect that in most cases I will use awk for the mapper and reducer, I have begun work toward allowing these components to be implemented in other languages. The script now checks whether the file is an executable script (in which case bash-reduce does not specify an interpreter), and I also provided an option for the user to specify an interpreter for files that aren't executable. These changes have currently broken the parallel aspects of the original bash-reduce, but I don't need parallelization at the moment.
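The dispatch could look something like the following sketch; run_stage and its arguments are illustrative names, not the actual ppm-reduce interface:
Code: Select all
# Illustrative sketch of the interpreter-dispatch idea.
# Reads records on stdin, runs the stage, writes results to stdout.
run_stage() {
    local script="$1" interp="$2"
    if [ -x "$script" ]; then
        "$script"              # executable: the shebang picks the interpreter
    elif [ -n "$interp" ]; then
        "$interp" "$script"    # user-supplied interpreter, e.g. python3
    else
        awk -f "$script"       # default: treat the file as an awk program
    fi
}
# e.g.  sort data.txt | run_stage ./reduce.py python3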
Modification #3 (Allow Process Substitution for the Data Input Rather than a File)
In order to allow process substitution, I write the data input directly into a named pipe and keep the pipe alive using nohup. This seems to have a large performance impact when run from a USB 2.0 device. Possible causes:
1. Slow USB 2.0 I/O (nohup seems to write a file called nohup.out to the device; I'd like to disable this).
2. Buffer sizes may need tuning.
3. A call to sleep is causing more things to sleep than I intended.
This is a modification of an idea that I posted in another post, about how I could redirect process substitution to a named pipe and return the path to the named pipe. The current implementation cleans the named pipe up after a call to sleep; redirecting stderr to /dev/null (as in the listing below) should also keep nohup from creating nohup.out.
Code: Select all
#!/bin/bash
set -x
cd "$(dirname "$0")"
source ./ppm-reduce-functions
# Create a uniquely named FIFO next to the script.
tmp_pipe="$(realpath .)/$(mktemp -u tmppipeXXXX)"
mkfifo "$tmp_pipe"
# In the background: feed the input file(s) into the pipe, hold the pipe
# open briefly so a reader can attach, then remove it. The pipe path is
# passed as $0 of the inner shell to avoid quoting problems; stderr is
# discarded so nohup does not create nohup.out.
nohup bash -c 'cat "$@"; sleep 100; rm "$0"' "$tmp_pipe" "$@" > "$tmp_pipe" 2>/dev/null &
# Hand the pipe's path back to the caller, then silence further output.
echo "$tmp_pipe"
exec 1>/dev/null
https://github.com/s243a/ppm_reduce/blo ... azy_cat.sh
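A hypothetical invocation, with dpkg-query standing in for any data source:
Code: Select all
# Hypothetical usage: hand lazy_cat.sh a process substitution and
# read the named pipe it returns as if it were a regular file.
pipe="$(./lazy_cat.sh <(dpkg-query -W))"
sort "$pipe" | head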
Project Status
This project is preliminary enough that people shouldn't try to wade through the complete code to figure out how it works (or is supposed to work). I have only provided a few code examples to illustrate some concepts. Also, no one should try to use this code just yet!