ppm-reduce (Short for Puppy Package Manager. Map Reduce)

s243a · Post by **s243a** » Sat Jan 09, 2021 5:15 am

Overview

The goal of ppm-reduce is to apply algorithms similar to map-reduce in order to mine metadata from various package managers, such as the PPM (Puppy Package Manager) and dpkg. The metadata will be useful for several applications, such as:
1. Modularizing versions of puppy
2. Slimming versions of puppy
3. Loading and Unloading Symlinks into a root file system
4. Syncing package managers

These goals are separate, but the intent is to apply a common methodology to all four goals.

History

The origins of ppm-reduce lie in sorhus's bash-reduce (see post). In sorhus's bash-reduce, data is first mapped to a key using a user-supplied map function (awk based), then grouped by key (using shuffle.awk), and then each group is processed separately to create a single key-value pair that summarizes the group.

The code to do this is as follows:

Code:

function execute() {
    awk $2 "$map" < $1 | \
    sort -S 1G | \
    awk "$shuffle" | \
    awk $2 "$reduce" | \
    sort -S 1G -k2nr -k1
}

https://github.com/sorhus/bash-reduce/b ... sequential

which can be done in parallel as follows:

Code:

function execute() {
    parallel --gnu --pipe $3 "awk $2 '$map' | sort -S 1G | awk '$shuffle'" < $1 | \
    sort -S 1G | \
    awk "$shuffle" | \
    parallel --gnu --pipe $3 "awk $2 '$reduce'" | \
    sort -S 1G -k2nr -k1
}

https://github.com/sorhus/bash-reduce/b ... e/parallel

The primary efficiency advantage of map-reduce-style algorithms is that the grouping is done by sorting, which scales well to large data sets because of its O(n log n) complexity. Additionally, each stage of the algorithm (map, shuffle, reduce) can be done in parallel. Even the sorting can be done in parallel.
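To make the map, sort, shuffle, reduce chain concrete, here is a toy word count wired up the same way as the sequential pipeline above. The three awk programs are stand-ins written for this illustration; they are not sorhus's map, shuffle.awk, or reduce scripts, and the tab-separated record format is an assumption.

```shell
# Toy word count: map -> sort -> shuffle -> reduce.
# All three awk programs are illustrative, not taken from bash-reduce.

# map: emit one "word<TAB>1" pair per input word
map='{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'

# shuffle: the input is sorted, so equal keys are adjacent;
# collapse each run of equal keys into "key<TAB>v1 v2 ..."
shuffle='BEGIN { FS = "\t" }
NR > 1 && $1 != key { print key "\t" vals; vals = "" }
{ key = $1; vals = (vals == "" ? $2 : vals " " $2) }
END { if (NR > 0) print key "\t" vals }'

# reduce: sum the grouped values of each key
reduce='BEGIN { FS = "\t" }
{ n = split($2, v, " "); sum = 0
  for (i = 1; i <= n; i++) sum += v[i]
  print $1 "\t" sum }'

word_count() {
    awk "$map" | LC_ALL=C sort | awk "$shuffle" | awk "$reduce"
}

result=$(printf 'b a\na c a\n' | word_count)
printf '%s\n' "$result"
```

The sort is what brings equal keys together, so the shuffle stage only ever has to compare a record with the one before it.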

Modifications to the bash-reduce algorithm

Modification#1 (Structured Values)

I modified the algorithm so that each result that is grouped by a single key can be composed of multiple fields, and so that the key which the data is grouped by can be one of these fields. An application here is that each result might be missing data in one of the fields, but if you combine all the results in a single group (by key) then you can fill in the missing field(s). This is a generalization of a join algorithm. In a join, each side of the join is assumed to have unique fields. An alternative to a join might be to have all the same fields on each side of the join, but on each side some of the values for these fields are missing.

So if for some reason a join isn't suitable (e.g. because it doesn't handle unmatched keys well), then perhaps it might be better thought of as a merge-reduce, which would aim to be like a join but use an algorithm similar to map-reduce.
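A minimal sketch of what such a merge-reduce step could look like, assuming pipe-separated rows that are already sorted by the key in field 1. The reducer name, the rule "keep the first non-empty value per column", and the sample packages (foo, bar) are all assumptions made for illustration; this is not PKGS_reducer.awk.

```shell
# merge_reduce: for rows sharing field 1 (the key), keep the first non-empty
# value seen in each column. Rows with equal keys must be adjacent.
merge_reduce='BEGIN { FS = "|"; OFS = "|" }
function emit(   i, line) {
    line = key
    for (i = 2; i <= nf; i++) line = line OFS merged[i]
    print line
}
NR > 1 && $1 != key { emit(); split("", merged); nf = 0 }
{
    key = $1
    if (NF > nf) nf = NF
    for (i = 2; i <= NF; i++)
        if (merged[i] == "" && $i != "") merged[i] = $i
}
END { if (NR > 0) emit() }'

# Hypothetical rows: a db row and an md5 row for foo, and an unmatched row for bar.
rows='foo_1.0|i386|1.0|foo_1.0.deb|packages|foo_1.0.files||foo
foo_1.0||||packages|foo_1.0.files|0123456789abcdef0123456789abcdef|
bar_2.0||||packages|bar_2.0.files|fedcba9876543210fedcba9876543210|'

result=$(printf '%s\n' "$rows" | LC_ALL=C sort -t '|' -k1,1 | awk "$merge_reduce")
printf '%s\n' "$result"
```

The two foo rows collapse into one complete row, while the unmatched bar row passes through unchanged instead of being dropped, which is the behaviour a plain inner join would not give.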

Modification#2 (Allow Other types of Mappers and Reducers besides AWK)

While I expect that in most cases I will use AWK for the mapper and reducer, I began some of the work towards allowing other language implementations of these components by checking to see if the file is an executable script (in which case the bash-reduce script will not specify the interpreter), and I also provided an option for the user to specify the interpreter for cases where the file isn't executable. These changes have currently broken the parallel aspects of the original bash-reduce, but I don't need parallelization at the moment.
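The dispatch described here could be sketched roughly as follows. The function name run_stage, its argument order, and the "awk -f" default are assumptions made for this example and do not reflect the actual option names in the modified bash-reduce.

```shell
# run_stage SCRIPT [INTERPRETER]
# If SCRIPT is executable, run it directly and let its shebang choose the
# interpreter; otherwise run it through INTERPRETER (default: "awk -f").
run_stage() {
    script=$1
    interp=${2:-awk -f}
    if [ -x "$script" ]; then
        "$script"
    else
        # $interp is unquoted on purpose so that "awk -f" splits into two words
        $interp "$script"
    fi
}

# Demo: a non-executable awk mapper followed by an executable shell reducer.
tmp=$(mktemp -d)
printf '%s\n' '{ print $1 "|" length($1) }' > "$tmp/mapper.awk"
printf '%s\n' '#!/bin/sh' "sed 's/^/reduced:/'" > "$tmp/reducer.sh"
chmod +x "$tmp/reducer.sh"

result=$(printf 'aa\nb\n' | run_stage "$tmp/mapper.awk" | run_stage "$tmp/reducer.sh")
rm -rf "$tmp"
printf '%s\n' "$result"
```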

Modification#3 (Allow process substitution for the data input rather than a file)

In order to allow process substitution, I write the data input directly into a named pipe and keep the pipe alive using the nohup command. This seems to have a large performance impact when run from a USB 2.0 device. Possible causes:
1. Slow usb 2.0 I/O (nohup seems to write to the device in a file called nohup.out. I'd like to disable this).
2. buffer sizes should be modified
3. A call to sleep is causing more things to sleep than I intended.

This is a modification of an idea that I posted in another post, about how I could redirect process substitution to a named pipe and return the path to the named pipe. The current implementation will clean up the named pipe after a call to sleep. However, nohup.out is not yet cleaned up.

Code:

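#!/bin/bash
set -x
cd "$(dirname "$0")"
source ./ppm-reduce-functions
  # Create a uniquely named FIFO in the script's directory.
  tmp_pipe="$(realpath .)/$(mktemp -u tmppipeXXXX)"
  mkfifo "$tmp_pipe"
  # Stream the inputs into the FIFO in the background; remove it after a delay.
  # Inside the child shell, $0 is the FIFO path and "$@" are the input files.
  nohup bash -c 'cat "$@"; sleep 100; rm "$0"' "$tmp_pipe" "$@" >"$tmp_pipe" &
  echo "$tmp_pipe"
  exec 1>/dev/null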

https://github.com/s243a/ppm_reduce/blo ... azy_cat.sh
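For comparison, here is a sketch of the same idea without nohup or sleep. It assumes the writer only has to outlive the command substitution that captures the pipe's path (not a terminal hangup), and that exactly one reader will open the pipe. The helper name lazy_pipe is hypothetical and this is not the author's lazy_cat.sh.

```shell
# lazy_pipe FILE...: stream the inputs into a fresh FIFO in the background
# and print the FIFO's path. Hypothetical helper for illustration only.
lazy_pipe() {
    dir=$(mktemp -d) || return 1
    fifo=$dir/pipe
    mkfifo "$fifo" || return 1
    # The writer blocks when opening the FIFO until a reader attaches, so no
    # sleep is needed; it removes the FIFO itself once cat has finished.
    # stdout/stderr go to /dev/null so that $(lazy_pipe ...) can return at once.
    ( cat "$@" > "$fifo"; rm -rf "$dir" ) </dev/null >/dev/null 2>&1 &
    printf '%s\n' "$fifo"
}

# Usage: the consumer treats the returned path like an ordinary file.
src=$(mktemp)
printf 'b\na\n' > "$src"
pipe=$(lazy_pipe "$src")
result=$(sort "$pipe")
rm -f "$src"
printf '%s\n' "$result"
```

One trade-off of this variant: if nothing ever reads from the pipe, the background writer stays blocked and the FIFO is never removed.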

Project Status

This project is preliminary enough that people shouldn't try to wade through my complete code and figure out how it works (or is supposed to). I only provided a few code examples to illustrate some concepts. Also no one should try to use my code just yet!

Post by **rockedge** » Sat Jan 09, 2021 1:31 pm

I am working on creating a Puppy Linux that uses the XBPS package manager that is used in Void Linux. It is a standalone package manager and so far I am attempting to compile xbps in a Bionic64.

The final goal is to have a hybrid OS capable of using PPM, Pkg and XBPS on the same system.

Making this an interesting project. The key would be for all three package managers to be in sync.

s243a · Post by **s243a** » Mon Jan 11, 2021 11:50 am

rockedge wrote: ↑Sat Jan 09, 2021 1:31 pm
I am working on creating a Puppy Linux that uses the XBPS package manager that is used in Void Linux. It is a standalone package manager and so far I am attempting to compile xbps in a Bionic64.

The final goal is to have a hybrid OS capable of using PPM, Pkg and XBPS on the same system.

Making this an interesting project. The key would be for all three package managers to be in sync.

That sounds like an interesting project. I think the first step is to document (or link to a doc about) how void XBPS stores the metadata for the packages. This information would be helpful for others that aren't familiar with void XBPS.

s243a · Post by **s243a** » Mon Jan 11, 2021 12:53 pm

I've made some progress here for user-installed packages. First I prepare the input for my modified bash-reduce function to be of this form:

Code:

f1=pkgfile_noExt    |f2=arch        |f3=ver     |f4=pkgfile               |f5=dir_name                   |f6=filelist                     |f7=md5sum                       |pkg
adb_8.1.0+r23-5_i386|               |8.1.0+r23-5|adb_8.1.0+r23-5_i386.deb |packages                      |adb_8.1.0+r23-5_i386.files      |                                |adb
adb_8.1.0+r23-5     |               |           |                         |packages                      |adb_8.1.0+r23-5.files           |00fe8bb85ae24f00fff63bdfbd9464ea|

**Note the spaces are shown to line up the columns and don't exist in the actual output.

The row that carries the md5sum is produced by the md5sum command piped through sed.

Code:

md5sum -- *.files | sed -r 's#^([^[:space:]]+)([[:space:]]+)([^[:space:]].*)([.]files)$#\3||||'"$bname"'|\3\4|\1|#' | sort -t '|' -k1 >> "$outfile"_"$pfx"_md5

The row that carries the version and the package file name is produced by the cut and awk commands.

Code:

AWK_fn_prepend_file_list='BEGIN {FS = "|"}
{
    # input fields (from cut): pkg name|version|package file
    pkg=$1; arch=""; ver=$2
    pkgfile=$3
    pkgfile_noExt=$3
    sub(/\.[^.]+$/, "", pkgfile_noExt)    # strip the extension, e.g. ".deb"
    filelist=pkgfile_noExt "." ext
    md5sum=""                             # left empty here; supplied by the md5 rows
    print pkgfile_noExt "|" arch "|" ver "|" pkgfile "|" db_dir "|" filelist "|" md5sum "|" pkg}'

    cut -f2,3,8 -d'|' --output-delimiter="|" "$a_dir"/user-installed-packages | \
      awk -v ext=files -v db_dir="$bname" "$AWK_fn_prepend_file_list" | sort -t '|' -k1 >> "$outfile"_"$pfx"_db #f1=pkgfile_noExt, f2=arch, f3=ver, f4=pkgfile, f5=db_dir, f6=filelist, f7=md5sum, f8=pkg

The first field is used to group the data by key. My assumption was that the name of the file list (minus the extension) should match the name of the package file (minus the extension). This works in some cases but not in others, because the architecture is sometimes stripped from the name of the file list. In my opinion, stripping the architecture from the file list name is bad because it will result in duplicate names if you install two packages with the same name but different architectures.

As an example of where my assumption worked, here is an example of my bash-reduce"-ish" output:

Code:

archivemount_0.8.7-1+b1|i386|0.8.7-1+b1|archivemount_0.8.7-1+b1_i386.deb|packages|archivemount_0.8.7-1+b1_i386.files|f665d93a636c729b0e0d13cf2a38d7d1|archivemount_0.8.7-1+b1_i386

In this example you can see that the archive, version, and md5sum were successfully merged between the output of the cut command and that of the md5sum command. The bash code that I used to create this was:

Code:

    "$PPPR_ROOT"/bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true \
        "$PPPR_ROOT"/dpkg_mappers/identity.awk "$PPPR_ROOT"/dpkg_reducers/PKGS_reducer.awk \
        <(cat "${outfile}_${pfx}_md5" "${outfile}_${pfx}_db") > "${outfile}_${pfx}"

The two files merged are the one generated by the md5sum command applied to the file lists (i.e. "${outfile}_${pfx}_md5") and the one generated by the cut command on the db file (i.e. "${outfile}_${pfx}_db", built from /var/packages/user-installed-packages).

To see where this doesn't work, you can replace the reducer with the identity function:

Code:

    "$PPPR_ROOT"/bash-reduce -v shuffle_OFS="@@@@" -v shuffle_FS="|" -v Keep_Key=true \
        "$PPPR_ROOT"/dpkg_mappers/identity.awk "$PPPR_ROOT"/core/identity.awk \
        <(cat "${outfile}_${pfx}_md5" "${outfile}_${pfx}_db") > "${outfile}_${pfx}_map_shuffle"

In this example, since both the mapper and the reducer use the identity function, the output is that of the shuffle function alone. Here is an example of where the shuffle failed to combine the keys:

Code:

f1=pkgfile_noExt    |f2=arch        |f3=ver     |f4=pkgfile               |f5=dir_name                   |f6=filelist                     |f7=md5sum                       |pkg
adb_8.1.0+r23-5_i386|               |8.1.0+r23-5|adb_8.1.0+r23-5_i386.deb |packages                      |adb_8.1.0+r23-5_i386.files      |                                |adb
adb_8.1.0+r23-5     |               |           |                         |packages                      |adb_8.1.0+r23-5.files           |00fe8bb85ae24f00fff63bdfbd9464ea|

The reason is that the file list has the architecture stripped but the db entry doesn't.
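One possible workaround, sketched here under assumptions, is a mapper that strips a trailing "_<arch>" from the key and moves it into the arch field, so that both rows end up with the same key. The list of architectures (i386, amd64, all) is a guess and would need extending, and this mapper is not part of ppm-reduce. It also inherits the problem noted earlier: two packages that differ only in architecture would collapse onto one key.

```shell
# strip_arch: remove a trailing "_<arch>" from field 1 and, if field 2 is
# empty, store the architecture there. The arch list is an assumption.
strip_arch='BEGIN { FS = "|"; OFS = "|" }
{
    if (match($1, /_(i386|amd64|all)$/)) {
        if ($2 == "") $2 = substr($1, RSTART + 1)   # keep the arch in field 2
        $1 = substr($1, 1, RSTART - 1)              # key without the arch
    }
    print
}'

# The two rows from the failed shuffle above, without the alignment padding.
rows='adb_8.1.0+r23-5_i386||8.1.0+r23-5|adb_8.1.0+r23-5_i386.deb|packages|adb_8.1.0+r23-5_i386.files||adb
adb_8.1.0+r23-5||||packages|adb_8.1.0+r23-5.files|00fe8bb85ae24f00fff63bdfbd9464ea|'

mapped=$(printf '%s\n' "$rows" | awk "$strip_arch")
# After mapping, both rows share a single key.
result=$(printf '%s\n' "$mapped" | cut -d '|' -f1 | LC_ALL=C sort -u)
printf '%s\n' "$result"
```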

Post by **rockedge** » Mon Jan 11, 2021 2:49 pm

s243a wrote: ↑Mon Jan 11, 2021 11:50 am
That sounds like an interesting project. I think the first step is to document (or link to a doc about) how void XBPS stores the metadata for the packages. This information would be helpful for others that aren't familiar with void XBPS.

Of course! Here are some documents that detail XBPS https://docs.voidlinux.org/xbps/index.html

the source code : https://github.com/void-linux/xbps

Puppy Linux Discussion Forum
