PET Development and Software Architecture

annaeg edited this page Oct 29, 2014 · 16 revisions

Overview: architecture, TODO list

Architecture description

General architecture description

TODO: the understandable version ;-)

PERICLES Extraction Tool architecture abstract

Detailed architecture description

PERICLES Extraction Tool architecture detailed

TODO: remove references to other deliverable sections

  1. The user can give some basic configuration commands as arguments at start. To use the same commands automatically for every tool start, it is possible to write these commands into a configuration file. The tool provides a command line interface (CLI) and a graphical user interface (GUI) for the interaction with the user; the use of the GUI is optional, as it is possible to execute the tool without the need of a graphical desktop. The user has full control over the extraction process and can decide which extraction modules to use, which files to consider and which (subsets of the) extracted information to keep.

  2. The Extraction Controller Builder is responsible for building the Extraction Controller, the “heart” of the application, based on the given user commands. It is designed following the builder design pattern and is only executed once for building a single instance at tool start.
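
As a rough illustration, the builder pattern described above could look like the following sketch; all class, method, and option names here are hypothetical, not PET's actual API:

```java
// Hypothetical sketch of the Extraction Controller Builder (all names assumed).
// The builder collects the user's start-up commands, then build() is executed
// exactly once at tool start to create the single controller instance.
public class ExtractionControllerBuilder {
    private boolean useGui = true;       // GUI is optional; CLI-only is possible
    private String configFile = null;    // optional configuration file

    public ExtractionControllerBuilder gui(boolean useGui) {
        this.useGui = useGui;
        return this;
    }

    public ExtractionControllerBuilder configFile(String path) {
        this.configFile = path;
        return this;
    }

    public ExtractionController build() {
        return new ExtractionController(useGui, configFile);
    }

    public static class ExtractionController {
        public final boolean useGui;
        public final String configFile;

        ExtractionController(boolean useGui, String configFile) {
            this.useGui = useGui;
            this.configFile = configFile;
        }
    }

    public static void main(String[] args) {
        ExtractionController controller = new ExtractionControllerBuilder()
                .gui(false)
                .configFile("pet.conf")
                .build();
        System.out.println(controller.useGui + " " + controller.configFile);
    }
}
```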

  3. The Extraction Controller is the main controlling class of the application. It has access to all other controllers and is responsible for the application flow. At tool start it initializes all other controllers, and shuts them down at the end of the tool's execution. All specialized controller components communicate with other controllers exclusively through the Extraction Controller, so this main controller is responsible for updating the states of other components and for serving as an intermediary.

  4. Profiles are created by the user to organize the extraction components, e.g. based on purpose of use. They contain a list of configured Extraction Modules, and a list of Extraction Result Collections to collect the information extracted by the modules and to keep references to the important files to which this information relates. Both components are described in the following two points.

  5. Extraction Modules implement the techniques that designate how and when information is extracted. They provide different algorithm implementations for the system environments of different operating systems. There are three different kinds of Extraction Modules: file-dependent, file-independent and daemons. File-dependent modules take a file path as argument and extract information that is valid only for this file, whereas file-independent modules extract environment information that is valid for all files within the environment. Daemon modules, on the other hand, don't extract information, but instead monitor the environment for the occurrence of designated events. Customized modules for extracting specialized information or for monitoring specific events can also be developed and easily plugged into the application. A class template is provided to support developers for this purpose.
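
The three module kinds could be modelled as a small interface hierarchy. This is only an illustrative sketch with assumed names, not PET's actual class template:

```java
import java.nio.file.Path;
import java.util.Map;

// Hypothetical interfaces for the three module kinds (names assumed).
interface ExtractionModule {
    String getModuleName();
}

// File-dependent: takes a file path and extracts information valid only for that file.
interface FileDependentModule extends ExtractionModule {
    Map<String, String> extract(Path file);
}

// File-independent: extracts environment information valid for all files.
interface FileIndependentModule extends ExtractionModule {
    Map<String, String> extract();
}

// Daemon: does not extract, but monitors the environment for designated events.
interface DaemonModule extends ExtractionModule {
    void start();
    void stop();
}

// Example file-independent module: reads one environment fact from system properties.
public class OsNameModule implements FileIndependentModule {
    public String getModuleName() { return "os-name"; }

    public Map<String, String> extract() {
        return Map.of("os.name", System.getProperty("os.name"));
    }

    public static void main(String[] args) {
        System.out.println(new OsNameModule().extract());
    }
}
```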

  6. Extraction Result Collections are the data structures that keep the extracted information. Each collection belongs to one of two sub-classes: Environment or Part. An Environment collects all extracted file-independent information that belongs to a Profile; each Profile has exactly one Environment instance. Parts keep the extracted information that is valid only for a specific file, together with a path to this file. They can be seen as the file part of a Digital Object, but, in order to increase flexibility, we intentionally did not implement a Digital Object as a data structure.
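
A minimal sketch of these data structures, with hypothetical field and class layout inferred from the description above:

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the result collection classes (layout assumed).
abstract class ExtractionResultCollection {
    final Map<String, String> results = new HashMap<>();

    void addResult(String key, String value) {
        results.put(key, value);
    }
}

// One Environment per Profile: file-independent information.
class Environment extends ExtractionResultCollection {}

// One Part per observed file: file-specific information plus the file's path.
class Part extends ExtractionResultCollection {
    final Path path;
    Part(Path path) { this.path = path; }
}

public class ProfileSketch {
    final Environment environment = new Environment();
    final List<Part> parts = new ArrayList<>();

    public static void main(String[] args) {
        ProfileSketch profile = new ProfileSketch();
        profile.environment.addResult("os.name", System.getProperty("os.name"));
        Part part = new Part(Path.of("report.pdf"));
        part.addResult("mime-type", "application/pdf");
        profile.parts.add(part);
        System.out.println(profile.parts.get(0).path);
    }
}
```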

  7. The Profile Controller manages all Profiles and Profile Templates, which can be used for the fast creation of preconfigured Profiles. Existing Profiles can be exported as Profile Templates, so that they can be passed to other PET users.

  8. The Module Controller searches (with the help of Java reflection) for all available module classes and creates a list of generic extraction modules from which Extraction Module instances are created for Profiles. After their creation, most Extraction Modules have to be configured before they can be executed.
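
Reflection-based discovery of this kind can be sketched as follows. The class names and the discovery entry point are assumptions; the real Module Controller scans PET's own module packages:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: load candidate classes by name and instantiate those
// that implement the module interface, as a reflection-based controller might.
public class ModuleControllerSketch {
    interface ExtractionModule { String getModuleName(); }

    public static class OsNameModule implements ExtractionModule {
        public String getModuleName() { return "os-name"; }
    }

    // Candidate class names would normally come from scanning a package.
    static List<ExtractionModule> discover(List<String> classNames) {
        List<ExtractionModule> modules = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> cls = Class.forName(name);
                if (ExtractionModule.class.isAssignableFrom(cls)) {
                    modules.add((ExtractionModule) cls.getDeclaredConstructor().newInstance());
                }
            } catch (ReflectiveOperationException e) {
                // Skip classes that cannot be loaded or instantiated.
            }
        }
        return modules;
    }

    public static void main(String[] args) {
        List<ExtractionModule> modules =
                discover(List.of("ModuleControllerSketch$OsNameModule"));
        System.out.println(modules.get(0).getModuleName());
    }
}
```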

  9. The Extractor is responsible for executing the Extraction Modules and for saving the Extraction Results into the right Extraction Result Collections. It supports two extraction modes: (a) a snapshot extraction that executes each Extraction Module of each Profile to capture the current information state, and (b) a continuous extraction mode in which the Event Controller initiates a new extraction whenever an event is detected by the environment monitoring daemons (the File Monitor and the Daemon Modules).
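
The snapshot mode, for example, amounts to running every module of every Profile once and merging the results into the Profile's collection. A minimal sketch with assumed names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the snapshot extraction mode (names assumed).
public class ExtractorSketch {
    interface ExtractionModule {
        String getModuleName();
        Map<String, String> extract();
    }

    static class Profile {
        final List<ExtractionModule> modules;
        final Map<String, String> environment = new HashMap<>();
        Profile(List<ExtractionModule> modules) { this.modules = modules; }
    }

    // Snapshot mode: capture the current information state of all profiles.
    static void snapshot(List<Profile> profiles) {
        for (Profile profile : profiles) {
            for (ExtractionModule module : profile.modules) {
                profile.environment.putAll(module.extract());
            }
        }
    }

    public static void main(String[] args) {
        ExtractionModule osModule = new ExtractionModule() {
            public String getModuleName() { return "os-name"; }
            public Map<String, String> extract() {
                return Map.of("os.name", System.getProperty("os.name"));
            }
        };
        Profile profile = new Profile(List.of(osModule));
        snapshot(List.of(profile));
        System.out.println(profile.environment.keySet());
    }
}
```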

  10. The Event Controller receives all Events detected by the monitoring daemons and controls the event handling. It uses a queue to handle the events in the order in which they emerge.
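
FIFO event handling of this kind is commonly built on a thread-safe queue, since daemons submit events from their own threads. A hypothetical sketch using `java.util.concurrent`; the class and method names are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the Event Controller's FIFO event handling.
public class EventControllerSketch {
    record Event(String source, String description) {}

    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();

    // Called by monitoring daemons when they detect an event.
    public void submit(Event event) {
        queue.add(event);
    }

    // Handles the next queued event; events are processed in arrival order.
    public Event handleNext() {
        return queue.poll(); // null if no event is pending
    }

    public static void main(String[] args) {
        EventControllerSketch controller = new EventControllerSketch();
        controller.submit(new Event("file-monitor", "report.pdf modified"));
        controller.submit(new Event("daemon", "usb device attached"));
        System.out.println(controller.handleNext().description());
    }
}
```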

  11. Monitoring daemons are the File Monitor (see 12) and the Daemon Modules (see 5).

  12. The File Monitor is responsible for observing the files added to the Profiles for changes. If a modification to one of the files is detected, a new extraction is initiated for all modules related to this file. In case of a file deletion, all Profiles that include this file as a Part are informed, and will remove the file from their list. Unlike the exchangeable Daemon Modules, the File Monitor is an inherent component of the application.
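
One simple way to detect such changes is to poll each observed file's last-modified timestamp; the real File Monitor may well use a different mechanism (e.g. `java.nio.file.WatchService`). A sketch under that assumption:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: detect changes to observed files by comparing each
// file's last-modified time against the previously recorded one.
public class FileMonitorSketch {
    private final Map<Path, Long> lastModified = new HashMap<>();

    public void observe(Path file) {
        lastModified.put(file, modTime(file));
    }

    // Returns true when an observed file changed since the last check, or was deleted.
    public boolean poll(Path file) {
        if (!Files.exists(file)) {
            lastModified.remove(file);   // deletion: Profiles drop the Part
            return true;
        }
        long now = modTime(file);
        boolean changed = now != lastModified.get(file);
        lastModified.put(file, now);
        return changed;                  // change: re-run the file's modules
    }

    private static long modTime(Path file) {
        try {
            return Files.getLastModifiedTime(file).toMillis();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pet", ".txt");
        FileMonitorSketch monitor = new FileMonitorSketch();
        monitor.observe(file);
        System.out.println(monitor.poll(file)); // false: unchanged
        Files.delete(file);
        System.out.println(monitor.poll(file)); // true: deletion detected
    }
}
```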

  13. The Configuration Saver saves the state of the application to configuration files at the end of the tool's execution, and loads the state at the next start of the tool. The Profiles are saved together with all added files and their configured modules. Furthermore, the current extraction mode and general usage options are saved.

  14. The Storage Controller allows generic access to the exchangeable Storage. It provides methods for saving and loading extracted information to and from the Storage.

  15. The Storage saves and loads metadata using modular storage support. Three storage interfaces are currently implemented: the default, a simple flat filesystem storage with JSON mapping; one using elasticsearch [75]; and a third using mapdb [76].
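
The exchangeable Storage could be expressed as a small interface behind the Storage Controller. The sketch below uses `java.util.Properties` files instead of the JSON mapping of the actual default backend, purely to stay dependency-free; all names are assumptions:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical exchangeable Storage interface (names assumed).
interface Storage {
    void save(String id, Map<String, String> results) throws IOException;
    Map<String, String> load(String id) throws IOException;
}

// Flat filesystem backend: one key-value file per result collection.
public class FlatFileStorage implements Storage {
    private final Path directory;

    public FlatFileStorage(Path directory) { this.directory = directory; }

    public void save(String id, Map<String, String> results) throws IOException {
        Properties props = new Properties();
        props.putAll(results);
        try (Writer out = Files.newBufferedWriter(directory.resolve(id))) {
            props.store(out, null);
        }
    }

    public Map<String, String> load(String id) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(directory.resolve(id))) {
            props.load(in);
        }
        Map<String, String> results = new HashMap<>();
        props.forEach((k, v) -> results.put((String) k, (String) v));
        return results;
    }

    public static void main(String[] args) throws IOException {
        Storage storage = new FlatFileStorage(Files.createTempDirectory("pet-storage"));
        storage.save("part-1", Map.of("mime-type", "application/pdf"));
        System.out.println(storage.load("part-1").get("mime-type"));
    }
}
```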

  16. PET works together with an information encapsulation tool, also developed during the PERICLES project, to encapsulate the extracted information together with its related files in a sheer curation scenario.

  17. The weighted graphs described in chapter 6.3 could be implemented to suggest information to be extracted based on the use cases.

Ideas for further PET developments

In this section we collect a list of issues and feature ideas for further PET developments, guided by close discussion with stakeholders. Although it will likely be impossible to implement all of these in the course of PERICLES, we hope to come back to the list later in the project and to inspire the open source community to contribute. We therefore plan to maintain an updated list of ideas on the PET repository website [A link will be added, once PET is published].

  • Address the creation context of Digital Objects by adding newly created Digital Objects based on environment events detected by the Environment Monitoring Daemons. We already tested the concept with the Directory Monitoring Daemon by adding the newly created files of an observed directory after getting a file creation event.
  • An option to trigger the execution of defined Extraction Modules repeatedly at a configurable time interval, for example every 5 minutes, instead of only in case of an environment event.
  • Explore the possibility to support extraction using the PREMIS 3 vocabulary and LRM dependency definitions.
  • Implementation of the investigated weighted graphs and integration into PET. The FreeMind [81] open source tool (MIT license) could be used to visualize the weighted graph.
  • Automated inference of dependencies from the monitored environment events.
  • Extraction Module configuration Templates could be developed, similar to the Profile Templates, to export and ship single configured modules.
  • GUI refactoring: the Help and Event tabs are Profile-independent and should be shown outside the Profile area.
  • Further development of the “General Native Command Module”, which allows the execution of customized terminal commands as an Extraction Module. Support for extracting parameters from the command output would be useful.
  • The Information Tree is the main GUI display for extraction results. Two other display methods are available, both of which allow filtering the information by the Extraction Module that was used to extract it. It would be useful to enable such filtering for the Information Tree as well. The same “Combo Box” for selecting the Extraction Module could be used for all three displays.
  • Some intelligent redundancy management for the extraction results could be implemented.
  • Some operating system files could be naturally excluded from extraction and monitoring, for example using DF databases such as the National Software Reference Library (NSRL) Hashsets and Diskprints. This would be helpful for handling large amounts of files.
  • Currently a “Part” is the data structure that represents an important file to be investigated during the extraction process. This concept could be extended to also support directories as “Parts”, which would allow all files created in a directory in the future to be included in the investigation.
  • The configuration of Extraction Modules could be supported by the CLI. At the moment the user has to modify the JSON configuration file directly to avoid using the GUI.
  • At the moment there could be conflicts if two Extraction Modules get the same name. A UUID for Extraction Modules would be useful to avoid this.
  • Consider whether there is a good method to show the extracted information at the CLI. The problem is hard to solve because of the large amount of extracted information.
  • The Extraction Modules could be extended by a variable that indicates their current state. A state could indicate problems such as “Further configuration needed”.
  • Extensively test on Windows. (The current version was developed and mainly tested on Linux and OS X, with limited testing on Windows.)
  • The configuration of Extraction Modules is currently possible by using a GUI editor for manipulating a JSON file. An improvement would be the generation of a configuration GUI.
  • Currently all existing Extraction Modules are loaded into the default profile at the first tool start. This is good for presentations, but less so for real usage. It would be better to load no modules and to display a message telling the user that Extraction Modules should be added to the profile.
  • Develop further Extraction Modules:
      • Exif information extraction
      • Maven / Ant dependency extraction
      • Extraction of installed and used drivers
      • Extraction of information about installed programming languages (already exists for Java)
      • IDE (integrated development environment) information extraction
      • IDE project information extraction
      • Dependency extraction from IDEs
      • Extraction of provenance and comments from version control systems
  • Include the TIMBUS Extractors as modules to enable the extraction of business activity context information