
Scrape Debian Package File List w/ wget & sed

Posted: Fri Dec 25, 2020 4:33 am
by s243a

Say you used wget to download a file list like this:
https://packages.debian.org/stretch/all ... c/filelist

but the downloaded file is in HTML format. Inspecting the HTML, I came up with the following sed expression to extract the file list:

Code:

sed -nr '$! {H};$ {H;x;s#^(.*pfilelist"><pre>)(.*)(\n</pre></div>\n</div> <!-- end inner.*)$#\2#g;p}' filelist

The sed code works as follows. If we aren't on the last line (i.e. $!), then we append the current line (i.e. the pattern space) to the hold space using the H command.

Code:

$! {H}
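
To see the hold-space accumulation on its own, here is a minimal sketch with the substitution left out. It assumes GNU sed; the function name slurp and the three-line input are made up for illustration. Note that H puts a newline in front of each line it appends, so the collected text starts with an empty line.

```shell
#!/bin/sh
# Minimal sketch (assumes GNU sed): collect every input line in the hold
# space, then swap it into the pattern space on the last line and print it.
# "slurp" is a hypothetical helper name, not something from the post above.
slurp() {
    sed -n '$! {H};$ {H;x;p}'
}

# H prepends a newline before each appended line, so the printed text
# begins with an empty line followed by the three input lines.
printf 'a\nb\nc\n' | slurp
```

Because the whole input ends up in the pattern space as one string, a later s command can match across line boundaries, which is what the full expression relies on.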

If we are on the last line (i.e. $), then we append the last line to the hold space as well and exchange the hold space with the pattern space (i.e. the "x" command), so that the pattern space now contains the entire file. Then we use the substitution command to replace the whole content with just the part that matches the file list (the second capture group, \2). Finally, we print the resulting pattern space (with the "p" command).

Code:

$ {H;
     x;
     s#^(.*pfilelist"><pre>)(.*)(\n</pre></div>\n</div> <!-- end inner.*)$#\2#g;
     p}
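
The whole expression can be tried without downloading anything. The sketch below assumes GNU sed and runs the same command on a small hand-written HTML fragment. The fragment, the two package paths in it, and the function name extract_filelist are invented for illustration; the real page contains far more markup and may differ in detail.

```shell
#!/bin/sh
# Hypothetical stand-in for the downloaded page. Only the markers that the
# regular expression looks for are modelled on the post; the rest is made up.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<html><body>
<div id="inner">
<div id="pfilelist"><pre>/usr/bin/example
/usr/share/doc/example/copyright
</pre></div>
</div> <!-- end inner -->
</body></html>
EOF

# Same sed expression as in the post, wrapped in a (hypothetical) function
# that takes the HTML file as its first argument.
extract_filelist() {
    sed -nr '$! {H};$ {H;x;s#^(.*pfilelist"><pre>)(.*)(\n</pre></div>\n</div> <!-- end inner.*)$#\2#g;p}' "$1"
}

# Prints only the lines between <pre> and </pre>.
extract_filelist "$sample"
```

One thing to keep in mind: if the markers are not found, the substitution does nothing and the "p" command prints the whole page, so it is worth checking the output before feeding it to another tool.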

Some useful links:

7.16 Printing the Last Lines
7.7 Text search across multiple lines
6.3 Multiline techniques - using D,G,H,N,P to process multiple lines
4.3 selecting lines by text matching
3.3 Overview of Regular Expression Syntax
6.1 How sed Works

3.2 sed commands summary