port concepts from mem #22

@orbisvicis

Mem has a better memoization framework. I think it might be worth considering porting some concepts over. As a long-term project, this is more of a note than a real issue, as right now I don't have time and I doubt anyone else is interested. Overview:

Consider the following (mostly equivalent, from a memoizing standpoint) fbuild functions:

def obj(ctx, target:fbuild.db.DST, source:fbuild.db.SRC):
def obj(ctx, source:fbuild.db.SRC) -> fbuild.db.DST:
  1. Fbuild blurs inputs and outputs. The only requirements to enable determinism are: input path, input contents, and output path. However, fbuild uses: input path, input contents, output path, and output path exists. Not only does this confuse the concept of pure, deterministic functions, it has a major drawback (below).

  2. Fbuild doesn't handle target modification. For example, assume obj copies source->target, in this case test.in->test.out. Consider:

    Initial execution:

    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    $ fbuild
    Copying test.in to test.out...
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    1
    2
    

    That was the initial memoization, so not much to see. Let's try the only condition supported by fbuild's fbuild.db.DST - removing test.out.

    $ rm test.out
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    $ fbuild
    Copying test.in to test.out...
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    1
    2
    

    While the end result is acceptable, unfortunately fbuild had to rerun the obj function. Now let's try modifying test.out. In a real-world scenario, this could easily be an unintended side effect of a build command.

    $ echo "44" >test.out
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    44
    $ fbuild
    $ for i in test*; do echo "$i"; cat "$i"; done
    test.in
    1
    2
    test.out
    44
    

    Well, that's not good at all.

    In fact, whether the target was removed or modified, the memoized function should never be run again. Instead, the target should be restored from the cache if and only if it was modified or removed. Let's compare fbuild to mem:

    • target unmodified: fbuild does not rerun the memoized function. [1/1]
    • target removed: fbuild detects this, but reruns the memoized function. [1/2]
    • target modified: fbuild doesn't detect this. [0/1]

    Like fbuild, mem memoizes function outputs. Now obviously no function should be expected to return a byte-for-byte copy of a file, suitable for pickling. Instead, mem introduces an extra processing step if the output object defines the methods hash, store, and restore. If the output hasn't been memoized, mem will call store(). If it has, mem will first call hash(). If the hash remains unchanged from the cached version, mem does nothing. Otherwise, it calls restore(). For example, this is mem's file class:

    class File:
       def __init__(self, path):
           # notice the file's contents won't be serialized
           self.path = path
    
       def __hash__(self):
           """ checksum of self.path """
    
       def store(self):
           """ store a copy of the file in the build cache """
    
       def restore(self):
           """ restore the file from the build cache """
  3. Fbuild depends on python annotations to memoize file contents. While helpful, it is also obfuscating and confusing. Why not depend on the standard object-oriented paradigm, like mem does? Not only is this expected, it is less verbose, and simpler:

    obj_b(obj_a("file.a", "file.b"))
    @mem.memoize
    def obj_a(source_path_string, target_path_string): 
       # unfortunately, the inputs are python strings, without store()/restore(), and a __hash__() that doesn't depend on contents.
       # so, let's explicitly add a dependency on the path's contents
       mem.add_dep(source_path_string)
       mem.add_dep(target_path_string)
       # process the input, determine the outputs
       output = ...
       return mem.nodes.File(output)
    
    @mem.memoize
    def obj_b(source_path_node):
      # the inputs are already node objects, no need to use mem.add_dep()
      pass

    Now for convenience and backwards compatibility, I do like parameter annotations.

    @mem.memoize
    def obj_a_alternative(source_path_string:fbuild.file.to_node, target_path_string:fbuild.file.to_node):
       pass

    Also why not add notation to prevent certain parameters from being memoized. Mem acknowledges this as a shortcoming of its design, but also notes that it has never needed such functionality:

    @mem.memoize
    def obj_c(source_path_string:fbuild.file.to_node, dont_memoize:fbuild.db.ignore):
       pass
  4. Fbuild ties the build environment (compiler flags) to a complicated data structure (list(tuple(set, dict))) and a complicated class hierarchy. While this simplifies most build targets, the complexity makes edge-cases more difficult to implement. On the other hand, mem provides a much "flatter" hierarchy.

    1. Mem doesn't differentiate between extraneous and required environment (or environment and command-line options). The merged dictionary of both shell environment and specific flags (overrides) can be passed to any build target function decorated with mem.util.with_env:

      @mem.util.with_env(CFLAGS=[])         # only pass-in CFLAGS from the environment
      @mem.memoize
      def obj(target, source, CFLAGS):
          pass
      
      obj(target, source, env={**os.environ, "CFLAGS": "-O3"})

      The decorator ensures that only the required flags are memoized.

    2. Mem provides a single compile operation, and a single link operation. You just need to make sure you pass the correct flags to each operation, depending on your needs:

      • build, program: []
      • build, static: []
      • build, shared: ["-fPIC"]
      • link, program: []
      • link, static: []
      • link, shared: ["-shared"] (at the very minimum)

      Compare to fbuild's over-engineered guess_static and guess_shared with either build_lib or build_exe. Yes, the guess_ function has a secondary use of finding the correct compiler, but the process of deciding static/shared then lib/exe makes the class hierarchy more complicated than it should be. An independent class maintaining a database of compiler flags would be more appropriate.

  5. Support for building a single object from multiple sources (link-time optimization):

    All mem build targets support multiple sources. If the output target is unspecified, instead of compiling an object for each input source, the input sources will be agglomerated (link-time optimization) and a single optimized output target will be produced. Admittedly, because mem is unmaintained, this depends on the outdated '-combine' flag.

  6. Just a tiny nitpick, but I find the term "cache" confusing, as the standard and pythonic term is "memoize".
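
The restore-from-cache flow described in item 2 can be sketched roughly as follows. This is a minimal illustration, not mem's actual code: FileNode, file_digest, copy, stored_hash and the cache_dir layout are hypothetical names chosen for the example, the memo table lives in memory only, and the memoization key is assumed to be just the source path, source contents, and target path.

```python
import hashlib
import os
import shutil


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


class FileNode:
    """Hypothetical file node; only the path (never the contents) is kept in memory."""

    def __init__(self, path, cache_dir):
        self.path = path
        self.cache_dir = cache_dir
        self.stored_hash = None  # digest of the cached copy, set by store()

    def hash(self):
        return file_digest(self.path)

    def store(self):
        """Save a copy of the file in the cache, named after its digest."""
        self.stored_hash = self.hash()
        shutil.copyfile(self.path, os.path.join(self.cache_dir, self.stored_hash))

    def restore(self):
        """Overwrite (or recreate) the file from the cached copy."""
        shutil.copyfile(os.path.join(self.cache_dir, self.stored_hash), self.path)


_memo = {}  # in-memory for brevity; a real build tool would persist this


def copy(source, target, cache_dir):
    """Memoized copy; returns what happened so the flow is easy to observe."""
    # The key uses inputs only: source path, source contents, target path.
    key = (source, file_digest(source), target)
    node = _memo.get(key)
    if node is None:
        shutil.copyfile(source, target)  # the real work, run once per key
        node = FileNode(target, cache_dir)
        node.store()
        _memo[key] = node
        return "built"
    if node.hash() != node.stored_hash:
        node.restore()  # target was removed or modified
        return "restored"
    return "unchanged"
```

Under these assumptions, the first call to copy does the work, and a later call after test.out has been deleted or overwritten puts the cached copy back without re-running the copy step.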

Overall mem feels more pythonic. I have only mentioned its advantages, but in terms of features mem, as an unmaintained project, lags far behind fbuild.

  • path objects

    Fbuild overrides __truediv__ for convenience.

  • logging

    Fbuild provides loggers, though I'm not too impressed. Setting up tee-styled redirections might be better.

  • commands

    Using the provided execute function is required to log command output. Once again, I'm not impressed. I'd rather use the subprocess module directly, and have implicit logging at the program level from tee-style redirections.

  • command line options

    Fbuild uses argparse, but requires the definition of the pre_options function, which is magically loaded. It would be more transparent to explicitly pass the ArgumentParser object to fbuild.

  • installing files

    I'm not sure if install is memoized - haven't checked.

  • configuration testing

    Fbuild really shines here - and I mean really.

  • command-line targets

    This is well implemented - the decorator is sufficient, so interacting directly with argparse isn't necessary.

  • python 3

  • supports many more builders

  • cross platform

    While mem is technically also cross platform by virtue of python, per-platform code must still be written to handle differences of compiler and environment.

I don't see the point of requiring a context object passed around. If the namespace was becoming too polluted, why not put all configuration into a global container object (sub-module)?

...

With a memoization framework like mem's, it would be possible to support an uninstall target. Even more impressive, uninstall would be able to restore files overwritten during installation.
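
As a rough illustration of the uninstall idea, here is a minimal sketch that keeps a backup of anything an install step overwrites and undoes the steps in reverse. It uses neither fbuild's nor mem's API: install, uninstall, the manifest list and the backup_dir layout are all hypothetical, and a real implementation would persist the manifest between runs.

```python
import os
import shutil


def install(source, dest, backup_dir, manifest):
    """Copy source to dest, remembering what (if anything) was overwritten."""
    backup = None
    if os.path.exists(dest):
        # Keep the previous file so uninstall can put it back.
        backup = os.path.join(backup_dir, str(len(manifest)))
        shutil.copyfile(dest, backup)
    shutil.copyfile(source, dest)
    manifest.append((dest, backup))


def uninstall(manifest):
    """Undo installs in reverse order, restoring any overwritten files."""
    while manifest:
        dest, backup = manifest.pop()
        if backup is None:
            os.remove(dest)  # nothing was there before the install
        else:
            shutil.copyfile(backup, dest)  # restore the overwritten file
            os.remove(backup)
```

The manifest here plays the role that the memoization database would play in a mem-style design: it records enough about each output to restore the previous state.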
