PET Development and Software Architecture
This section describes the software architecture of the PERICLES Extraction Tool. Have a look at the general architecture description for an abstract overview. The detailed architecture description provides further information for software developers.

Extraction Modules define how information is extracted from the system environment. They include customized algorithms and system calls for different operating systems, and they can use external tools and libraries. Modules can be configured to fit custom needs. During an information extraction run, the tool's Extractor executes the Extraction Modules, which return the extracted information.
PET manages lists of Digital Object-Parts, each of which represents an important file within the tool. They are organized into Profiles by the user or by templates that support specific purposes. Profiles also hold a set of configured Extraction Modules, which extract the SEI related to the Profile's Digital Object-Parts.
The last components shown in the abstract architecture image are the Environment Monitor Daemons. They observe the computer system environment for designated events and, when such an event occurs, trigger extraction by other Extraction Modules or other event handling processes. A more detailed schema of the PET architecture follows.

1. The user can pass basic configuration commands as arguments at start. To apply the same commands automatically at every start, they can be written to a configuration file. The tool provides a command line interface (CLI) and a graphical user interface (GUI) for user interaction. The GUI is optional, so the tool can run without a graphical desktop. The user has full control over the extraction process and decides which extraction modules to use, which files to consider, and which (subsets of the) extracted information to keep.
2. The Extraction Controller Builder builds the Extraction Controller, the “heart” of the application, from the given user commands. It follows the builder design pattern and runs only once, at tool start, to build a single instance.
3. The Extraction Controller is the main controlling class of the application. It has access to all other controllers and is responsible for the application flow. At tool start it initializes all other controllers, and it shuts them down at the end of the tool's execution. The specialized controllers communicate with each other exclusively via the Extraction Controller, so this main controller acts as the intermediary and is responsible for updating the states of the other components.
4. Profiles are created by the user to organize the extraction components, for example by purpose. Each Profile contains a list of configured Extraction Modules and a list of Extraction Result Collections, which collect the information extracted by the modules and keep references to the important files that this information relates to. Both components are described in the following two points.
5. Extraction Modules implement the techniques that determine how and when information is extracted. They provide different implementations of algorithms for the system environments of different operating systems. There are three kinds of Extraction Modules: file-dependent, file-independent, and daemons. File-dependent modules take a file path as an argument and extract information that is valid only for that file, whereas file-independent modules extract environment information that is valid for all files within the environment. Daemon modules do not extract information; instead, they monitor the environment for designated events. Customized modules for extracting specialized information or monitoring specific events can be developed and plugged into the application. A class template is provided to support developers in doing so.
6. Extraction Result Collections are the data structures that hold the extracted information. Each collection belongs to one of two sub-classes: Environment or Part. An Environment collects all extracted file-independent information that belongs to a Profile; each Profile has exactly one Environment. A Part holds the extracted information that is valid only for a specific file, together with the path to that file. Parts can be seen as file-parts of a Digital Object but, to increase flexibility, we intentionally did not implement a Digital Object as a data structure.
7. The Profile Controller manages all Profiles and Profile Templates. Templates allow the fast creation of a preconfigured Profile. Existing Profiles can be exported as Profile Templates so that they can be passed to other PET users.
8. The Module Controller searches for all available module classes using Java reflection and creates a list of generic extraction modules, from which Extraction Module instances are created for Profiles. After creation, most Extraction Modules have to be configured before they can be executed.
9. The Extractor is responsible for executing the Extraction Modules and for saving the Extraction Results into the right Extraction Result Collections. It supports two extraction modes: (a) a snapshot extraction, which executes each Extraction Module of each Profile to capture the current information state, and (b) a continuous extraction, in which the Event Controller initiates a new extraction whenever the environment monitoring daemons (the File Monitor and the Daemon Modules) detect an event.
10. The Event Controller receives all Events detected by the monitoring daemons and controls the event handling. It uses a queue to handle the events in the order in which they occur.
11. Monitoring daemons are the File Monitor (see item 12) and the Daemon Modules (see item 5).
12. The File Monitor watches the files that have been added to Profiles for changes. If a modification to one of the files is detected, a new extraction is initiated for all modules related to this file. If a file is deleted, all Profiles that include this file as a Part are informed and remove the file from their lists. Unlike the exchangeable daemon modules, the File Monitor is an inherent component of the application.
13. The Configuration Saver saves the state of the application to configuration files at the end of the tool's execution and loads it at the next start. Profiles are saved with all their added files and with their modules, including the module configurations. The current extraction mode and general usage options are saved as well.
14. The Storage Controller allows generic access to the exchangeable Storage. It provides methods for saving and loading extracted information to and from the Storage.
15. Storage: saves and loads metadata through modular storage support. Three storage interfaces are currently implemented: the default, a simple flat-filesystem storage with JSON mapping; one using Elasticsearch; and one using MapDB.
16. PET works together with an information encapsulation tool, also developed during the PERICLES project, to encapsulate the extracted information together with its related files in a sheer curation scenario.
17. The weighted graphs described in chapter 6.3 could be implemented to suggest which information to extract, based on the use cases.

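
The exchangeable Storage behind the Storage Controller, described in the component list above, can be pictured as an interface with interchangeable implementations. The following is a minimal sketch, not PET's actual code: the names `Storage`, `InMemoryStorage`, and `StorageController`, as well as the string-keyed `save` and `load` methods, are hypothetical, and an in-memory map stands in for the filesystem, Elasticsearch, and MapDB backends.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of an exchangeable storage backend. All names are hypothetical. */
final class StorageSketch {

    /** Assumed contract that every storage backend would implement. */
    interface Storage {
        void save(String key, String serializedResult);

        /** Returns null if nothing is stored under the key. */
        String load(String key);
    }

    /** Stand-in backend that keeps everything in memory. */
    static final class InMemoryStorage implements Storage {
        private final Map<String, String> entries = new HashMap<>();

        @Override
        public void save(String key, String serializedResult) {
            entries.put(key, serializedResult);
        }

        @Override
        public String load(String key) {
            return entries.get(key);
        }
    }

    /** Gives callers generic access to whichever backend is configured. */
    static final class StorageController {
        private Storage storage;

        StorageController(Storage storage) {
            this.storage = storage;
        }

        /** Swaps the backend without changing any caller. */
        void setStorage(Storage storage) {
            this.storage = storage;
        }

        void save(String key, String serializedResult) {
            storage.save(key, serializedResult);
        }

        String load(String key) {
            return storage.load(key);
        }
    }

    public static void main(String[] args) {
        StorageController controller = new StorageController(new InMemoryStorage());
        controller.save("profile-1/environment", "{\"os\":\"linux\"}");
        System.out.println(controller.load("profile-1/environment"));
    }
}
```

Because callers only see the controller, a different backend could be plugged in through `setStorage` without touching the Extractor or the Profiles.
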
The user runs the tool for the first time without defining any start commands. A screenshot of the PET tool with its modules can be seen in the figure below. Because no commands were given, the Extraction Controller Builder builds a default Extraction Controller, which initializes the other controllers and the user interfaces (CLI and GUI). The user sees a default Profile in the GUI, but creates a new, empty Profile for their own purposes.
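
The builder step can be sketched as follows. This is an illustration of the builder design pattern under stated assumptions, not PET's implementation: the class names, the `gui` and `continuousExtraction` options, and the default values are all hypothetical.

```java
/** Sketch of the builder step. All class, method, and option names are hypothetical. */
final class BuilderSketch {

    /** Stand-in for the main controller; here it only records the chosen options. */
    static final class ExtractionController {
        final boolean guiEnabled;
        final boolean continuousExtraction;

        // Private: code outside this file has to go through the Builder.
        private ExtractionController(Builder builder) {
            this.guiEnabled = builder.guiEnabled;
            this.continuousExtraction = builder.continuousExtraction;
        }
    }

    /** Collects the start commands, then builds the controller. */
    static final class Builder {
        // Assumed defaults, used when the tool starts without commands.
        private boolean guiEnabled = true;
        private boolean continuousExtraction = false;

        Builder gui(boolean enabled) {
            this.guiEnabled = enabled;
            return this;
        }

        Builder continuousExtraction(boolean enabled) {
            this.continuousExtraction = enabled;
            return this;
        }

        ExtractionController build() {
            return new ExtractionController(this);
        }
    }

    public static void main(String[] args) {
        // No start commands: the defaults apply.
        ExtractionController byDefault = new Builder().build();
        // Start commands for a headless, continuously extracting run.
        ExtractionController headless =
                new Builder().gui(false).continuousExtraction(true).build();
        System.out.println(byDefault.guiEnabled + " " + headless.guiEnabled);
    }
}
```
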

Using the GUI, the user adds files that are parts of important Digital Objects to that Profile. Internally, the following process is executed: the paths of the files to be added are passed to the Profile Controller, where they are validated and each is wrapped in a Part data structure. These Parts are added to the selected Profile, and the interfaces are then updated.
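
A minimal sketch of this step is shown below. The text does not say what the validation checks, so the rule used here (the path must point to an existing regular file) is an assumption, and the class and method names are hypothetical. The interface update is left out.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch of adding files to a Profile. All names are hypothetical. */
final class AddFilesSketch {

    /** Holds the path of one file; extraction results would be attached later. */
    static final class Part {
        final Path file;

        Part(Path file) {
            this.file = file;
        }
    }

    static final class Profile {
        final List<Part> parts = new ArrayList<>();
    }

    static final class ProfileController {

        /** Wraps every valid path in a Part and returns how many were added. */
        int addFiles(Profile profile, List<Path> candidates) {
            int added = 0;
            for (Path candidate : candidates) {
                // Assumed validation rule: only existing regular files are accepted.
                if (Files.isRegularFile(candidate)) {
                    profile.parts.add(new Part(candidate.toAbsolutePath()));
                    added++;
                }
            }
            return added;
        }
    }
}
```

Invalid paths are skipped silently in this sketch; a real tool would more likely report them back to the user interface.
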
In the next step, the user adds Extraction Modules that fit the use case, using the corresponding GUI dialog. Internally, the following steps take place: to display the dialog, the GUI requests the list of available Extraction Modules. The list is provided by the Module Controller, which searched for all Extraction Module classes at tool start and created a set of Generic Modules. After the user selects which modules to create, Extraction Module instances are created from the Generic Modules and added to the selected Profile.
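
The relationship between Generic Modules and module instances can be sketched as follows. To keep the example self-contained, the class discovery is replaced by an explicit list of classes; only the instantiation uses reflection. `ChecksumModule`, `OsInfoModule`, and all other names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of creating module instances from generic modules. All names are hypothetical. */
final class ModuleControllerSketch {

    /** Assumed common base type of all modules. */
    abstract static class ExtractionModule {
        String name() {
            return getClass().getSimpleName();
        }
    }

    // Two made-up module classes used as examples.
    static class ChecksumModule extends ExtractionModule { }

    static class OsInfoModule extends ExtractionModule { }

    static final class Profile {
        final List<ExtractionModule> modules = new ArrayList<>();
    }

    static final class ModuleController {
        // The generic modules: one class per available module, keyed by simple name.
        private final Map<String, Class<? extends ExtractionModule>> generic = new LinkedHashMap<>();

        /** The classes are passed in here; discovering them at start is not shown. */
        ModuleController(List<Class<? extends ExtractionModule>> moduleClasses) {
            for (Class<? extends ExtractionModule> moduleClass : moduleClasses) {
                generic.put(moduleClass.getSimpleName(), moduleClass);
            }
        }

        static ModuleController withExampleModules() {
            List<Class<? extends ExtractionModule>> classes = new ArrayList<>();
            classes.add(ChecksumModule.class);
            classes.add(OsInfoModule.class);
            return new ModuleController(classes);
        }

        /** What a GUI dialog would request in order to list the choices. */
        List<String> availableModules() {
            return new ArrayList<>(generic.keySet());
        }

        /** Creates a fresh instance via reflection and adds it to the Profile. */
        ExtractionModule createInstance(String name, Profile profile) {
            Class<? extends ExtractionModule> type = generic.get(name);
            if (type == null) {
                throw new IllegalArgumentException("Unknown module: " + name);
            }
            try {
                ExtractionModule module = type.getDeclaredConstructor().newInstance();
                profile.modules.add(module);
                return module;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot instantiate " + name, e);
            }
        }
    }
}
```

Each call to `createInstance` yields a separate object, so two Profiles can hold the same kind of module with different configurations.
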
Most Extraction Modules need to be configured before they can be executed. The user browses the configurations of the added modules and adjusts them to fit the scenario. Each configuration is saved as an instance of the Module Configuration class.
Now everything is configured for the first extraction. The user decides to run a single snapshot extraction to get an overview of the environment information. The Extractor therefore executes all Extraction Modules of the Profile. The file-independent modules return more general information, which is stored in the Environment class of the Profile. Each file-dependent module is executed once for each Part of the Profile, because the information it returns depends on the file that the Part represents. Each extraction result is saved into the related Part class and, after the extraction run, the Storage Controller serializes the results and saves them to the currently configured Storage. Daemon modules are ignored during a snapshot extraction. After the extraction run, the GUI is updated and the user can browse the results.
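
The dispatch logic of a snapshot extraction can be sketched as below. The `Kind` enum, the `extract` signature, and the use of plain maps for the Environment and Part results are assumptions made for illustration; saving to the Storage and updating the GUI are omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a snapshot extraction run. All names are hypothetical. */
final class SnapshotSketch {

    enum Kind { FILE_DEPENDENT, FILE_INDEPENDENT, DAEMON }

    interface ExtractionModule {
        String name();

        Kind kind();

        /** filePath is null when a file-independent module is executed. */
        String extract(String filePath);
    }

    static final class Part {
        final String filePath;
        final Map<String, String> results = new LinkedHashMap<>();

        Part(String filePath) {
            this.filePath = filePath;
        }
    }

    static final class Profile {
        final List<ExtractionModule> modules = new ArrayList<>();
        final List<Part> parts = new ArrayList<>();
        // One Environment per Profile, modelled here as a map of results.
        final Map<String, String> environment = new LinkedHashMap<>();
    }

    static final class Extractor {
        void snapshot(Profile profile) {
            for (ExtractionModule module : profile.modules) {
                switch (module.kind()) {
                    case FILE_INDEPENDENT:
                        // Executed once; the result goes to the Environment.
                        profile.environment.put(module.name(), module.extract(null));
                        break;
                    case FILE_DEPENDENT:
                        // Executed once per Part; each result goes to its Part.
                        for (Part part : profile.parts) {
                            part.results.put(module.name(), module.extract(part.filePath));
                        }
                        break;
                    case DAEMON:
                        // Daemon modules are ignored during a snapshot.
                        break;
                }
            }
        }
    }

    /** Builds a dummy module that returns a descriptive string instead of real data. */
    static ExtractionModule module(final String name, final Kind kind) {
        return new ExtractionModule() {
            public String name() { return name; }

            public Kind kind() { return kind; }

            public String extract(String filePath) {
                return filePath == null ? name + " result" : name + " result for " + filePath;
            }
        };
    }
}
```
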
The user decides to start the continuous extraction mode to capture information during a working session in which the files added to the tool will be altered. The user closes the GUI and starts working. The tool components act as follows: first, the Extractor starts the daemon Extraction Modules of the Profile, which begin monitoring the computer system environment. The File Monitor is also started and observes the Profile's files for alterations or deletions. When a monitoring component detects an event that it wants to report, it creates an Event and passes it to the Event Controller, which handles the Event and triggers an update extraction. After the working session, the user stops the continuous extraction and browses the results.
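
The queue-based event handling can be sketched as follows. This is a single-threaded simplification with hypothetical names: the monitoring components are represented only by calls to `report`, and the triggered update extraction is recorded as a string instead of being executed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Sketch of first-in, first-out event handling. All names are hypothetical. */
final class EventQueueSketch {

    static final class Event {
        final String source;   // which monitoring component reported the event
        final String filePath; // the file the event refers to

        Event(String source, String filePath) {
            this.source = source;
            this.filePath = filePath;
        }
    }

    static final class EventController {
        private final Queue<Event> queue = new ArrayDeque<>();

        // Stand-in for real update extractions: one entry per handled event.
        final List<String> triggeredExtractions = new ArrayList<>();

        /** Called by a monitoring component when it detects something. */
        void report(Event event) {
            queue.add(event);
        }

        /** Handles all queued events in the order in which they were reported. */
        void handlePending() {
            Event event;
            while ((event = queue.poll()) != null) {
                triggeredExtractions.add(event.source + ":" + event.filePath);
            }
        }
    }

    public static void main(String[] args) {
        EventController controller = new EventController();
        controller.report(new Event("FileMonitor", "report.pdf"));
        controller.report(new Event("DaemonModule", "notes.txt"));
        controller.handlePending();
        System.out.println(controller.triggeredExtractions);
    }
}
```

If daemons report from their own threads, a thread-safe queue such as `java.util.concurrent.LinkedBlockingQueue` would be the natural replacement for `ArrayDeque`.
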
The user considers the created Profile useful and wants a colleague to use it as well. The user therefore exports the Profile as a Profile Template and sends the generated template files to the colleague, who can then import them to create the same Profile on their own PET installation. Internally, the Configuration Saver serializes all Extraction Modules of the Profile and their Module Configurations to JSON objects, with the aid of the Jackson API, and saves them to files in a template directory. The Profile Controller of the other PET installation uses the Configuration Saver to deserialize the Extraction Modules and recreate the Profile. The Profile's files are not saved to Profile Templates, as they are probably not present in other environments, so the other user has to add their own files before using the Profile.
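
What a template does and does not carry can be sketched as below. The JSON serialization with Jackson is deliberately left out so that the example needs no external library; the sketch only shows that modules and their configurations are copied while file paths are not. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the template round trip without the JSON step. All names are hypothetical. */
final class TemplateSketch {

    static final class ConfiguredModule {
        final String moduleName;
        final Map<String, String> configuration;

        ConfiguredModule(String moduleName, Map<String, String> configuration) {
            this.moduleName = moduleName;
            // Copy, so that template and Profile do not share mutable state.
            this.configuration = new LinkedHashMap<>(configuration);
        }
    }

    static final class Profile {
        final String name;
        final List<ConfiguredModule> modules = new ArrayList<>();
        final List<String> filePaths = new ArrayList<>();

        Profile(String name) {
            this.name = name;
        }
    }

    /** A template has modules and configurations, but no files. */
    static final class ProfileTemplate {
        final String name;
        final List<ConfiguredModule> modules = new ArrayList<>();

        ProfileTemplate(String name) {
            this.name = name;
        }
    }

    static ProfileTemplate exportTemplate(Profile profile) {
        ProfileTemplate template = new ProfileTemplate(profile.name);
        for (ConfiguredModule module : profile.modules) {
            template.modules.add(new ConfiguredModule(module.moduleName, module.configuration));
        }
        // profile.filePaths is intentionally not copied.
        return template;
    }

    static Profile importTemplate(ProfileTemplate template) {
        Profile profile = new Profile(template.name);
        for (ConfiguredModule module : template.modules) {
            profile.modules.add(new ConfiguredModule(module.moduleName, module.configuration));
        }
        return profile; // the receiving user adds their own files afterwards
    }
}
```
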
Finally, the user shuts PET down but expects to get the same configuration back at the next tool start. At shutdown, the Extraction Controller initiates the shutdown of each tool component and saves all configurations. The Profile Controller saves each Profile, using the Configuration Saver to save all Extraction Modules and all Parts into the Profile's output directory. The Extraction Controller also uses the Configuration Saver to save more general tool options, such as whether the GUI is running and whether the continuous extraction mode is enabled. All of these are loaded at the next start.
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. FP7-601138 PERICLES.
<img src="https://github.com/pericles-project/pet/blob/master/wiki-images/PERICLES%20logo_black.jpg" width="200" align="right"/>