Lazy Pipes


While working on files for our paper, I needed to extract some input data from an archive (only some of it, as there are many files), convert it, and then run everything through some pipelines. There are tools such as Nix or Snakemake which let you work with isolated pipelines systematically.

As an example, you can create an attrset whose names are the files inside a zip archive and whose values are derivations that extract the given file. These can be wrapped so that each value is instead a derivation that converts the given file to a target format such as AIG or BLIF. You can then use these in your other derivations.

So another part of the workflow might look like this:

pkgs.runCommandLocal "result" {} ''
    some-tool ${zip.file1.blif} ${zip.file2.blif}
''

Thanks to the powerful property of Nix being lazy, only the extractions and conversions required to generate the final output are ever actually performed. But this setup is not at all ergonomic to work with outside of Nix.
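The laziness doing the work here can be sketched outside of Nix as well. Below is a minimal Python analogy, with hypothetical extract/convert functions standing in for the real commands and memoised thunks standing in for derivations; only the entries the final command actually references are ever built.

```python
from functools import cache

# Hypothetical stand-ins for the real extraction/conversion steps.
def extract(name):
    return f"<contents of {name}>"

def to_blif(name):
    return extract(name) + " as BLIF"

# A lazy "attrset": thunks keyed by file name, memoised like derivations.
@cache
def blif(name):
    return to_blif(name)

# Only the entries referenced here are extracted and converted;
# every other file in the archive is never touched.
result = f"some-tool {blif('file1')} {blif('file2')}"
```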

This is all fine and dandy, but I'd like something a bit more systematic, something which integrates with the system and isn't as hacky.

Something like regular pipes but with laziness, introspection and editing in mind.

If we ever want to see the results of older pipelines, we need to isolate each pipeline's data into some, probably immutable, form so that runs don't overwrite each other. We must also forbid any file access that isn't to a random temporary file.

Theoretically, unzipping specifically could be done with file-system translators; conversions, selections and the like, however, definitely need something more sophisticated.

This also requires the output to be more structured than plain text, as merely cutting off text is insufficient to provide the degree of laziness I desire.

The data model

What many utilities return is effectively a table of values: ls -l returns a table consisting of permissions, owning user, owning group, size, last-modified time and name. Occasionally it is useful to return extra values as well; that is probably best handled in a manner similar to Common Lisp's multiple values, with the default value always being the one returned, unless a special operator is used which returns a map of named values.
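The convention I have in mind could look roughly like the sketch below; the `Values` wrapper, `value` and `values` names are all hypothetical, chosen to mirror Common Lisp's multiple-values discipline: you get the primary value by default, and the full map of named values only when you ask for it.

```python
class Values:
    """A primary value plus a map of named secondary values."""
    def __init__(self, primary, **extra):
        self.primary = primary
        self.extra = extra

def value(v):
    """Default behaviour: only the primary value is seen."""
    return v.primary if isinstance(v, Values) else v

def values(v):
    """The special operator: the full map of named values."""
    if isinstance(v, Values):
        return {"primary": v.primary, **v.extra}
    return {"primary": v}

# A hypothetical `ls -l`-style row: name is primary, the rest named.
row = Values("paper.tex", perms="rw-r--r--", owner="me", size=2048)

value(row)   # just "paper.tex"
values(row)  # the full map, including perms, owner and size
```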

Perhaps an interface analogous to spreadsheets might be useful: immutable lazy sheets where the user may explicitly force a row, column, cell or the entire table. Values can be changed by taking a subset of the data and handing it over to another program. Rank polymorphism would be extremely useful here: an operation applicable to a cell could be applied to a row instead and simply map over it. It might also be useful to have separate functions for true projection and for mapping over a subset.

This all creates a sort of DAG of tables, where the edges are operations constructing one table from another.

Numeric example

Input Foo:

A B
1 2
3 4
(def C (truncate Foo.B 2))

(| Foo C):A + 1

This should take the B column from Foo and divide each cell by 2, then join that new column, now named C, onto Foo while incrementing A in place by 1. If forced, the resulting table looks like this:

A B C
2 2 1
4 4 2

As in Common Lisp, truncate also returns the remainder as a secondary value, so there's a separate sort of (value remainders C) in play here as well, which, because truncate has already been called to evaluate the main table, is materialised too.

C.remainders
0
0
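The whole example can be reproduced eagerly in a few lines of Python, assuming truncate behaves like Common Lisp's (quotient as the primary value, remainder as the secondary one, i.e. Python's divmod):

```python
foo = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]

# (def C (truncate Foo.B 2)): quotients plus remainders per cell.
c = [divmod(row["B"], 2) for row in foo]          # [(1, 0), (2, 0)]

# (| Foo C):A + 1 : join C on, increment A in place.
table = [
    {"A": row["A"] + 1, "B": row["B"], "C": q}
    for row, (q, r) in zip(foo, c)
]
remainders = [r for (q, r) in c]
```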

The | operator is some black magic I'd like to avoid: joining these columns on nothing but their ordering seems risky and error-prone. It also violates the nice intuition of each row being a sort of struct, with operations only adding or removing slots of the given struct, albeit sometimes broadcasting across rows.
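The row-as-struct intuition suggests an alternative that avoids the positional join entirely: define C as a per-row extension computed from the row itself, so misalignment is impossible. A sketch, with hypothetical `extend` and `update` operators of my own naming:

```python
def extend(rows, slot, fn):
    """Add a slot to every row-struct, computed from the row itself,
    so there is no separate column to join back by position."""
    return [{**row, slot: fn(row)} for row in rows]

def update(rows, slot, fn):
    """Rewrite an existing slot, row by row."""
    return [{**row, slot: fn(row[slot])} for row in rows]

foo = [{"A": 1, "B": 2}, {"A": 3, "B": 4}]
step1 = extend(foo, "C", lambda row: row["B"] // 2)
step2 = update(step1, "A", lambda a: a + 1)
```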

File example

(def files (ls #P"."))
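What a lazy ls might amount to, sketched under my own assumptions: the listing of names is cheap and eager, while the per-file metadata cells are thunks that only hit the filesystem when forced.

```python
import os
import stat

def ls(path="."):
    """A lazy `ls`: names are listed eagerly, but the expensive
    per-file metadata is deferred until that cell is forced."""
    rows = []
    for entry in sorted(os.listdir(path)):
        full = os.path.join(path, entry)
        rows.append({
            "name": entry,                               # cheap, eager
            "size": lambda p=full: os.stat(p).st_size,   # lazy thunk
            "mode": lambda p=full: stat.filemode(os.stat(p).st_mode),
        })
    return rows

files = ls(".")
# Forcing files[0]["size"]() stats exactly one file; rows that are
# never inspected never touch the filesystem again.
```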