Class_EngineDVFileJson

Johann Petrak edited this page Apr 16, 2018 · 8 revisions

Class EngineDVFileJson

This page describes the inner workings of the class EngineDVFileJson, which implements the engine for using external algorithms with dense vectors that are represented in JSON format and kept out of memory (in a file).

Plan for how to change this later

  • Decide on the best way to invoke the scripts on Linux, Windows, and Mac, and implement it
  • May need to be intelligent about finding python

But for now, we simply call the train.sh and apply.sh scripts, which in turn set up everything needed to run the actual wrapper's Python script using the proper python command.
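As an illustration of the "finding python" item in the plan above, one possible approach is to probe the PATH for a few common interpreter names. This is only a sketch in Python, not part of the Java engine: the function name `find_python` and the candidate names are assumptions made for this example, and a real solution would probably also consult the wrapper config.

```python
import shutil
import sys


def find_python(candidates=("python3", "python", "py")):
    """Return the path of the first candidate interpreter found on PATH.

    The candidate names are assumptions for this sketch. If none of them
    is found, fall back to the interpreter running this code.
    """
    for name in candidates:
        path = shutil.which(name)  # None if the name is not on PATH
        if path:
            return path
    return sys.executable
```

`shutil.which` takes PATH (and on Windows also PATHEXT) into account, so the same lookup works on Linux, Windows, and Mac.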

Other TODOs

  • make sure the wrapper config (wrappername.yaml) in the datadir is properly used, including for finding the proper python and for running the shell script using the proper shell
  • make sure the wrapper info (wrapperInfo.yaml) in the wrapper directory is properly used

Protocol of use

Currently (as of 2018-04-16), the invocation protocol for engines is a bit complex. The required protocol depends on the situation the engine gets used in (training versus application).

When training:

  • The engine class gets selected in the PR based on the trainingAlgorithm runtime parameter
  • Engine.createEngine(trainingAlgorithm, algorithmParameters, featureInfo, TargetType, dataDirectory) is called
    • this executes the non-static initializeAlgorithm(algorithm,parms) method (overridden but empty for EngineDVFileJson)
    • then runs method initWhenCreating(directory, algorithm, parms, featureInfo, targetType): for EngineDVFileJson, this essentially creates the instance of the appropriate corpus representation and sets the mode to "adding".
    • creates and initializes the Info instance
    • returns the Engine instance
  • document processing uses the corpus representation retrieved from the engine to add new instances
  • After all documents have been processed, the engine's info gets updated
  • Then engine.trainModel(dataDir, instanceAnnotationType, algoParms) gets called:
    • turns off adding for the corpus representation
    • updates the info
    • copies the whole wrapper software unless already there (based on WRAPPER_NAME)
    • creates the command to invoke the training script, also using the settings in the config file WRAPPER_NAME.yaml which is treated as a key/value map
    • this optionally uses settings shellcmd and shellparms for running the shell script
    • TODO: this should also allow configuring the python path and python location
    • before running the command, sets the environment variable WRAPPER_HOME, which points to a subdirectory of the data directory
    • runs the command
    • updates the info and saves it
    • saves the featureInfo (NOTE: this is currently done again later in the saveEngine method)
  • Finally engine.saveEngine(dataDir) gets called (from base class Engine) which:
    • saves the feature info using featureInfo.save(dir)
    • invokes the engine-specific saveModel(dir) method, which in this case does nothing, since the model gets saved by the scripts we call
    • invokes the engine-specific saveCorpusRepresentation(dir) method, which in this case does nothing, since the corpus representation is already out-of-memory and stored to a file
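The command-building part of trainModel can be sketched as follows. This is a Python illustration of the steps described above, not the engine's actual Java code. The function names, the assumption that the wrapper is copied to a subdirectory named after the wrapper, the splitting of shellparms on whitespace, and the extra arguments passed to the script are all hypothetical. Only the keys shellcmd and shellparms and the variable WRAPPER_HOME come from the description.

```python
import os
import shutil
from pathlib import Path


def ensure_wrapper(wrapper_src, data_dir, wrapper_name):
    """Copy the wrapper software into the data directory unless already there.

    Assumption: the copy lives in a subdirectory named after the wrapper.
    """
    target = Path(data_dir) / wrapper_name
    if not target.exists():
        shutil.copytree(wrapper_src, target)
    return target


def build_command(wrapper_home, script_name, config, extra_args=()):
    """Build the argument list for running a wrapper script.

    config is the key/value map read from the wrapper config file. If
    shellcmd is set, the script is run through that shell, with shellparms
    (split on whitespace, an assumption) placed before the script path.
    """
    cmd = []
    shellcmd = config.get("shellcmd")
    if shellcmd:
        cmd.append(str(shellcmd))
        shellparms = config.get("shellparms")
        if shellparms:
            cmd.extend(str(shellparms).split())
    cmd.append(str(Path(wrapper_home) / script_name))
    cmd.extend(str(arg) for arg in extra_args)
    return cmd


def wrapper_environment(wrapper_home):
    """Return a copy of the current environment with WRAPPER_HOME set."""
    env = dict(os.environ)
    env["WRAPPER_HOME"] = str(wrapper_home)
    return env
```

The resulting list and environment could then be passed to `subprocess.run(cmd, env=env, check=True)`. Using an argument list instead of a single command string avoids quoting problems with paths that contain spaces.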

When applying a model:

  • call engine.loadEngine(datadir, parms) -- this static method in turn performs these steps:
    • load the Info
    • load the FeatureInfo
    • create a new instance of the Engine class (which is stored in the Info)
    • Set the info in the new instance
    • call the engine's initWhenLoading(dir, parms) method. This is NOT overridden by the EngineDVFileJson class and calls:
      • the engine-specific loadModel(dir, parms) method, which is overridden:
        • runs loadAndSetCorpusRepresentation(dir) (NOTE: this is a harmless duplicate, see below)
        • if not already there, copies the wrapper software
        • builds the command for running the application script (similar to train script, just different name)
        • starts the script as a process that the engine then communicates with
      • runs the engine-specific loadAndSetCorpusRepresentation(dir) method
      • creates the algorithm instance
      • calls the engine's initializeAlgorithm(algorithm, parms) method -- overridden but does nothing
  • processes all the documents, calling engine.applyModel(...). This is overridden to do the following for each instance in the document:
    • convert the annotation to JSON
    • send the JSON to the process
    • get back the JSON from the process
    • convert what we get back to model application instances and collect them
    • return all the model application instances
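The exchange with the process in applyModel can be sketched as a request/reply loop. This is a minimal Python illustration, not the engine's Java code. The one-JSON-document-per-line framing, the message fields `values` and `output`, and the stand-in child script are all assumptions made for this example; the real wrapper script defines its own format.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the wrapper's application script: it reads one
# JSON document per line from stdin and answers with one JSON line on stdout.
CHILD_SCRIPT = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    reply = {"output": sum(request["values"])}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


def apply_to_instances(instances, command=None):
    """Send each instance as a JSON line to the process and collect the replies."""
    if command is None:
        command = [sys.executable, "-c", CHILD_SCRIPT]
    process = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    results = []
    try:
        for instance in instances:
            # Convert to JSON and send it to the process.
            process.stdin.write(json.dumps(instance) + "\n")
            process.stdin.flush()
            # Get the JSON reply back and convert it.
            reply_line = process.stdout.readline()
            if not reply_line:
                raise RuntimeError("process closed its output unexpectedly")
            results.append(json.loads(reply_line))
    finally:
        # Closing stdin lets the child's read loop end, then we wait for it.
        process.stdin.close()
        process.wait()
        process.stdout.close()
    return results
```

Flushing after every write matters on both sides: without it, a request or reply can sit in a buffer while the other side blocks waiting for it.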