
Catmandu is a data processing toolkit developed as part of the LibreCat project. Catmandu provides a command line client and a suite of modules to ease the import, storage, retrieval, export and transformation of data.

With Catmandu we want to make it easier to extract and transform data from formats and protocols such as JSON, YAML, CSV, MARC, MAB, OAI-PMH, Z39.50, SRU, RDF and many more. Using a small transformation language called Fix, we want to ease the communication between domain specialists and programmers about how data should be transformed.

Combine Catmandu modules with web application frameworks such as PSGI/Plack, document stores such as MongoDB, and full-text indexes such as ElasticSearch to create a rapid development environment for digital library services such as institutional repositories and search engines.

In other words, Catmandu facilitates processing of open data, big data, data science, data journalism, and other buzzwords!

Overview

Where do we use it?

Catmandu is used in the LibreCat project for several applications.

We have more than 60 Catmandu projects available at GitHub LibreCat.

Why do we use it?

Extract, Transform and Load

When you create a search engine, one of your first tasks is to import data from various sources, map the fields to a common data model and post it to a full-text search engine. Modules such as WebService::Solr or ElasticSearch provide easy access to your favorite document stores, but you keep writing a lot of boilerplate code to create the connections, massage the incoming data into the correct format, and validate, upload and index the data in the database. The next morning you are asked to provide a fast dump of records into an Excel worksheet. After some fixes are applied you are asked to upload it into your database. Again you hit Emacs or Vi and write an ad-hoc script. In our LibreCat group we saw this workflow over and over. We abstracted this problem into a set of tools which can work with library data formats such as MARC, Dublin Core and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora. In data warehouses these processes are called ETL: Extract, Transform, Load. Many tools currently exist for ETL processing, but none address typical library data models and services.

Copy and Paste

As programmers, we would like to reuse our code and algorithms as easily as possible. In rapid application development you typically want to copy and paste parts of existing code into a new project. In Catmandu we use a functional style of programming to keep our code tight, clean and suitable for copying and pasting. When working with library data models we use native Perl hashes and arrays to pass data around. In this way we adhere to the rationale of Alan J. Perlis: "It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." Our functions are all based on a few primary data structures on which we define many functions (map, count, each, first, take, ...).
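
For example, a minimal sketch using Catmandu::ArrayIterator (which ships with Catmandu core): every iterator, whatever its source, shares the same handful of functions.

use Catmandu::ArrayIterator;

# Items are plain Perl hashes; lists of them are wrapped in iterators.
my $it = Catmandu::ArrayIterator->new([
    { name => 'Charly' },
    { name => 'Albert' },
    { name => 'Bruce'  },
]);

print $it->count, "\n";                       # 3
print $it->first->{name}, "\n";               # Charly
$it->each(sub { print $_[0]->{name}, "\n" }); # visit every item
my $pair = $it->take(2);                      # new iterator over the first two items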

Schemaless databases

In the past it was a nuisance to create database schemas and indexes to store and search your data. Certainly in institutional repositories this can be an ongoing job for a programmer, because the metadata schemas are not fixed in time. Any new report will require you to add new data fields and new relations, for which you need to change your database schema. With the introduction of schemaless databases, storing complex records becomes really easy. With our ElasticSearch Store we can even provide you with a CQL-style query language for retrieval.
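
For example, a minimal sketch (assuming a local ElasticSearch server) of storing a record without defining any schema first:

use Catmandu;

# Records are plain hashes; no schema is declared anywhere.
my $bag = Catmandu->store('ElasticSearch', index_name => 'demo')->bag;

$bag->add({ first => 'Albert', last => 'Einstein', job => 'Physicist' });
$bag->commit;   # flush to the index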

Before you start

See Installation for installation and Concepts for basic terms.

Command line

Most Catmandu processing doesn't require you to write any code. With our command line tools you can store data files in databases, index your data, export data in various formats and perform basic data cleanup operations.

convert

For example, if you have a YAML file test.yml like:

---
first: Charly
last: Parker
job: Artist
---
first: Albert
last: Einstein
job: Physicist
---
first: Bruce
last: Wayne
job: Superhero
...

and you need to transform it into JSON. Using the catmandu command you can do this in one line:

$ catmandu convert YAML to JSON < test.yml

Basically you connect a YAML importer to a JSON exporter (see Concepts if you don't know what importers and exporters are).
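
The same pipeline can be written in a few lines of Perl; a minimal sketch using the Catmandu API:

use Catmandu;

# Connect a YAML importer to a JSON exporter and pipe all items through.
my $importer = Catmandu->importer('YAML', file => 'test.yml');
my $exporter = Catmandu->exporter('JSON');    # writes to STDOUT by default

$exporter->add_many($importer);
$exporter->commit;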

Need some fancy export? Then use the Template exporter which uses a template file like 'test.xml.tt' below to render the output.

<foo>
 <first>[% first %]</first>
 <last>[% last %]</last>
 <job>[% job %]</job>
</foo>

To run the catmandu command you need to provide 'Template' as the exporter and the full path to the template file (without the .tt extension):

$ catmandu convert YAML to Template --template `pwd`/test.xml < test.yml

This produces the output:

<foo>
 <first>Charly</first>
 <last>Parker</last>
 <job>Artist</job>
</foo>
<foo>
 <first>Albert</first>
 <last>Einstein</last>
 <job>Physicist</job>
</foo>
<foo>
 <first>Bruce</first>
 <last>Wayne</last>
 <job>Superhero</job>
</foo>
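
The same render step can be scripted; a sketch assuming test.xml.tt sits in the current working directory:

use Catmandu;
use Cwd;

# The Template exporter wants the full path to the template (without .tt).
my $exporter = Catmandu->exporter('Template', template => getcwd . '/test.xml');

$exporter->add_many(Catmandu->importer('YAML', file => 'test.yml'));
$exporter->commit;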

import

With these command line tools, indexing data also becomes very easy. Boot up an ElasticSearch server and run the command below to index the test.yml file:

$ catmandu import YAML to ElasticSearch --index_name demo < test.yml

To show the results from your hard work we can export all the records from the ElasticSearch store:

$ catmandu export ElasticSearch --index_name demo to JSON
{"first":"Albert","_id":"3A07B0F8-0973-11E2-98F8-F84380C42756","last":"Einstein","job":"Physicist"}
{"first":"Charly","_id":"3A0792D0-0973-11E2-8724-A22A2812F5B2","last":"Parker","job":"Artist"}
{"first":"Bruce","_id":"3A07B5EE-0973-11E2-97BF-E053E6A92BE5","last":"Wayne","job":"Superhero"}

We can even be lazier by creating a catmandu.yml file containing the connection parameters for the ElasticSearch server:

---
store:
  Demo:
    package: ElasticSearch
    options:
      index_name: demo

Using the configuration file above, indexing YAML data can be done like this:

$ catmandu import YAML to Demo < test.yml
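
The shortcut also works from Perl once the configuration file is loaded; a minimal sketch:

use Catmandu;

# Catmandu->load reads catmandu.yml from the default search paths.
Catmandu->load;

my $bag = Catmandu->store('Demo')->bag;   # resolves to the ElasticSearch store above
$bag->add_many(Catmandu->importer('YAML', file => 'test.yml'));
$bag->commit;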

export

And exporting all data can be done like this:

$ catmandu export Demo

For Catmandu stores that support a query language, exporting data can be very powerful using the '--query' option. E.g. we can export all records about 'Einstein' from our ElasticSearch store using:

$ catmandu export Demo --query "Einstein"
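
From Perl, the same query goes through the store's search interface; a hedged sketch, assuming the searcher method provided by Catmandu::Searchable stores:

use Catmandu;

Catmandu->load;

# searcher() returns a lazy iterator over all matching records.
my $hits = Catmandu->store('Demo')->bag->searcher(query => 'Einstein');

my $exporter = Catmandu->exporter('JSON');
$exporter->add_many($hits);
$exporter->commit;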

cleanup

The previous examples were all pretty easy. In real life we, as librarians, have to deal with data formats such as MARC, MAB, Dublin Core, RDF and more. With Catmandu you can process all these formats with the same commands as shown above.

Were you given a MARC file and asked to extract all the titles from it? With our command line tools that can be as easy as:

$ catmandu convert MARC --fix 'marc_map("245","title") remove_field(record)' < marc.mrc

This will produce a JSON output containing all the titles. The '--fix' option on the command line will execute these fixes:

marc_map("245","title") 
remove_field(record)

The Fix language is a tiny language we invented to manipulate data. Fixes should be easy to read or create even if you don't have any programming skills!
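
Fixes are not limited to the command line; a minimal sketch of running the same fixes from Perl with Catmandu::Fix:

use Catmandu;
use Catmandu::Fix;

# Compile the fixes once, then apply them to items.
my $fixer = Catmandu::Fix->new(fixes => [
    'marc_map("245","title")',
    'remove_field(record)',
]);

my $item = Catmandu->importer('MARC', file => 'marc.mrc')->first;
print $fixer->fix($item)->{title}, "\n";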

When you know how to extract data from a MARC file, it is very easy to import it into ElasticSearch:

$ catmandu import MARC --fix 'marc_map("245","title") remove_field(record)' to Demo < marc.mrc

Notice how we made use of the 'Demo' shortcut as explained above.


  • If you would like to learn more about these commands, then proceed to the chapter about the command line client.
  • If you need to cleanup your data, then proceed to the chapter about the Fixes.
  • If you are interested in writing web applications, then please proceed to the chapter about Dancer & Catmandu.
  • Or, follow our advent calendar