Introduction
Catmandu is a data processing toolkit developed as part of the LibreCat project. Catmandu provides a command line client and a suite of Perl modules to ease the import, storage, retrieval, export and transformation of data.
With Catmandu we want to make it easier to extract and transform data in formats such as JSON, YAML, CSV, MARC, MAB, OAI-PMH, Z39.50, SRU, RDF and many more. Using a small transformation language called the Fix language, we want to ease the communication between domain specialists and programmers about how data should be transformed.
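To give a first taste of the Fix language (covered in depth later in this wiki), a small fix script might look like this; the field names are hypothetical, while the three functions shown (upcase, add_field, remove_field) are core fixes:

upcase('title')
add_field('source', 'Catmandu demo')
remove_field('internal_notes')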
Combine Catmandu modules with web application frameworks such as PSGI/Plack, document stores such as MongoDB and full-text indexes such as Solr to create a rapid development environment for digital library services such as institutional repositories and search engines.
In other words, Catmandu facilitates processing of open data, big data, data science, data journalism, and other buzzwords!
Catmandu is used in the LibreCat project for several applications.
More than 60 Catmandu projects are available at GitHub under the LibreCat organization.
When you create a search engine, one of your first tasks will be to import data from various sources, map the fields to a common data model and post it to a full-text search engine. Perl modules such as WebService::Solr or ElasticSearch provide easy access to your favorite document stores, but you keep writing a lot of boilerplate code to create the connections, massage the incoming data into the correct format, and validate, upload and index the data in the database. Next morning you are asked to provide a fast dump of records into an Excel worksheet. After some fixes are applied you are asked to upload it into your database. Again you hit Emacs or Vi and write an ad-hoc script. In our LibreCat group we saw this workflow over and over. We tried to abstract this problem into a set of Perl tools which can work with library data such as MARC, Dublin Core and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora. In data warehouses these processes are called ETL: Extract, Transform, Load. Many tools currently exist for ETL processing but none address typical library data models and services.
As programmers, we would like to reuse our code and algorithms as easily as possible. In fast application development you typically want to copy and paste parts of existing code into a new project. In Catmandu we use a functional style of programming to keep our code tight, clean and suitable for copying and pasting. When working with library data models we use native Perl hashes and arrays to pass data around. In this way we adhere to the rationale of Alan J. Perlis: "It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." Our functions are all based on a few primary data structures on which we define many functions (map, count, each, first, take, ...).
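For instance, every Catmandu importer is such an iterable data structure; a minimal sketch, assuming a local file records.json exists:

use Catmandu::Importer::JSON;

my $importer = Catmandu::Importer::JSON->new(file => 'records.json');

# each passes every record, a plain Perl hash, to a callback;
# the same iterable interface also offers map, count, first, take, ...
$importer->each(sub {
    my $record = shift;
    print $record->{title} // '', "\n";
});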
Working with native Perl hashes and arrays, we would like an easy mechanism to store and index this data in a database of choice. In the past it was a nuisance to create database schemas and indexes to store and search your data. Certainly in institutional repositories this can be an ongoing job for a programmer because the metadata schemas are not fixed in time. Any new report will require you to add new data fields and new relations for which you need to change your database schema. With the introduction of schemaless databases the storage of complex records is really easy. Create a Perl hash, execute the function 'add' and your record is stored in the database. Execute 'get' to load a Perl hash from the database into memory. With our ElasticSearch plugin we can even provide you with a CQL style query language for retrieval.
use Catmandu::Store::ElasticSearch;

my $store = Catmandu::Store::ElasticSearch->new(index_name => 'demo');

my $obj = { name => { last => 'Bond' , full => 'James Bond' } , occupation => 'Secret Agent' };
$store->bag->add($obj);

# Search the store with a CQL query and print each matching record
$store->bag->search(cql_query => 'name.last = Bond')->each(sub {
    my $obj = shift;
    printf "%s\n" , $obj->{name}->{full};
});
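Since 'add' returns the stored record including its generated _id, loading it back into memory with 'get' is straightforward (a sketch continuing the snippet above):

# add() returns the record, now including its generated _id
my $saved = $store->bag->add($obj);

# get() loads the record back from the store as a plain Perl hash
my $copy = $store->bag->get($saved->{_id});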
See Installation for installation and Concepts for basic terms.
Most of the Catmandu processing doesn't require you to write any Perl code. With the command line tools you can store data files into databases, index your data, export data in various formats and perform basic data cleanup operations.
Say you have a YAML file test.yml like:
---
first: Charly
last: Parker
job: Artist
---
first: Albert
last: Einstein
job: Physicist
---
first: Bruce
last: Wayne
job: Superhero
...
and you are required to transform it into JSON. Using the 'catmandu' command you can do this with these options:
$ catmandu convert YAML to JSON < test.yml
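The output is streaming JSON, one record per line. For the test.yml above it will look roughly like this (the key order may vary):

{"first":"Charly","last":"Parker","job":"Artist"}
{"first":"Albert","last":"Einstein","job":"Physicist"}
{"first":"Bruce","last":"Wayne","job":"Superhero"}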
Basically you connect a YAML importer to a JSON exporter.
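In Perl the same pipeline takes only a few lines; a minimal sketch:

use Catmandu;

# Connect a YAML importer to a JSON exporter, just like the one-liner above
my $importer = Catmandu->importer('YAML', file => 'test.yml');
my $exporter = Catmandu->exporter('JSON');

$exporter->add_many($importer);
$exporter->commit;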
Need some fancy export? Then use the Template exporter which uses a template file like 'test.xml.tt' below to render the output.
<foo>
<first>[% first %]</first>
<last>[% last %]</last>
<job>[% job %]</job>
</foo>
To run the catmandu command you need to provide 'Template' as the exporter to write into and a full path to the template file (without the .tt extension). Note that optional arguments for Importers and Exporters can be provided with the --from-[NAME], --into-[NAME] syntax:
$ catmandu convert YAML to Template --template `pwd`/test.xml < test.yml
Which produces the output:
<foo>
<first>Charly</first>
<last>Parker</last>
<job>Artist</job>
</foo>
<foo>
<first>Albert</first>
<last>Einstein</last>
<job>Physicist</job>
</foo>
<foo>
<first>Bruce</first>
<last>Wayne</last>
<job>Superhero</job>
</foo>
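The same export is available from Perl; a sketch (Catmandu::Exporter::Template is a separate distribution, and the template path here is hypothetical):

use Catmandu;

# Render each record through the Template Toolkit file test.xml.tt;
# as on the command line, the path is given without the .tt extension
my $exporter = Catmandu->exporter('Template', template => '/path/to/test.xml');
$exporter->add({ first => 'Charly', last => 'Parker', job => 'Artist' });
$exporter->commit;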
Using these command line tools, indexing data also becomes very easy. Boot up ElasticSearch and run the command below to index the test.yml file:
$ catmandu import YAML to ElasticSearch --index_name demo < test.yml
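Under the hood this connects a YAML importer to the bag of an ElasticSearch store; in Perl the equivalent is roughly:

use Catmandu;

my $importer = Catmandu->importer('YAML', file => 'test.yml');
my $store    = Catmandu->store('ElasticSearch', index_name => 'demo');

# add_many consumes the importer and indexes every record
$store->bag->add_many($importer);
$store->bag->commit;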
To show the results from your hard work we can export all the records from the ElasticSearch store:
$ catmandu export ElasticSearch --index_name demo to JSON
{"first":"Albert","_id":"3A07B0F8-0973-11E2-98F8-F84380C42756","last":"Einstein","job":"Physicist"}
{"first":"Charly","_id":"3A0792D0-0973-11E2-8724-A22A2812F5B2","last":"Parker","job":"Artist"}
{"first":"Bruce","_id":"3A07B5EE-0973-11E2-97BF-E053E6A92BE5","last":"Wayne","job":"Superhero"}
We can be even lazier by creating a catmandu.yml file containing the connection parameters for the ElasticSearch store:
---
store:
  Demo:
    package: ElasticSearch
    options:
      index_name: demo
Using the configuration file above, indexing YAML data can be done like this:
$ catmandu import YAML to Demo < test.yml
And exporting all data can be done like this:
$ catmandu export Demo
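The named store from catmandu.yml is available to Perl code too; a sketch, assuming the catmandu.yml above sits in the current directory where Catmandu looks for it:

use Catmandu;

# Load the 'Demo' store defined in catmandu.yml
my $store = Catmandu->store('Demo');
printf "%d records indexed\n", $store->bag->count;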
For Catmandu stores that support a query language, exporting data can be very powerful using the '--query' option. E.g. we can export all records about 'Einstein' from our ElasticSearch store using:
$ catmandu export Demo --query "Einstein"
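In Perl this query-limited export corresponds to the searcher method on a searchable bag (a sketch, reusing the Demo store):

use Catmandu;

# searcher() returns a lazy iterator over all records matching the query
my $hits = Catmandu->store('Demo')->bag->searcher(query => 'Einstein');
$hits->each(sub {
    print Catmandu->export_to_string($_[0], 'JSON');
});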
If you would like to learn more about these commands, proceed to the chapter about the command line client. If you are interested in writing web applications, then please proceed to the chapter about Dancer & Catmandu.