traject is a Ruby-based data transformation framework, designed
especially with an eye toward indexing MARC (library bibliographic) data into Solr.
This is a sample traject project based on production code running the
indexing process for the HathiTrust catalog. It covers much of the standard
functionality, but you'll still definitely want to read the traject
documentation for fuller explanations of what's going on and a
complete description of the out-of-the-box macros available.
You should just be able to clone this repository and then run
bundle install
That should install all the things you need, including traject
itself. Then take a peek around.
- A simple, stand-alone configuration file that just pulls out id, title, and author from a set of MARC-XML records is at simplest_possible_traject_config.rb
- A much more extensive indexing file (extracted from the code used to index the HathiTrust catalog) is at index.rb
- Sample configuration files that specify readers and writers are in the appropriate subdirectories.
Note that the file writer/solr.rb is just an example; you'll need to customize it to actually talk to your Solr installation.
And of course your actual indexing code will need to produce the fields
expected by your own Solr's schema.conf.
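As a rough illustration of what that customization involves, a Solr writer configuration might look something like the sketch below. The URL and core name are placeholders invented for this example, and the writer/solr.rb shipped in this repository may be organized differently.

```ruby
# Hypothetical customization of writer/solr.rb; adjust for your own installation.
settings do
  # Send documents to Solr using traject's JSON-based Solr writer.
  provide "writer_class_name", "Traject::SolrJsonWriter"

  # Placeholder URL: replace the host, port, and core name with your own.
  provide "solr.url", "http://localhost:8983/solr/my_core"

  # Ask the writer to send a commit to Solr when the run finishes.
  provide "solr_writer.commit_on_close", "true"
end
```

Every field your indexing code emits still has to exist in the target Solr schema (or be covered by a dynamic field there), or Solr will reject the documents.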
The file simplest_possible_traject_config.rb can be run on the included, 20-record MARC-XML file very simply as:
traject -c simplest_possible_traject_config.rb sample_data/20.xml
That will read in the file, pull out the id/title/author, and dump the results using DebugWriter
to a file named debug_output.txt.
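For orientation, a stand-alone configuration of this kind might look roughly like the following. This is a sketch written for illustration, not a copy of simplest_possible_traject_config.rb; the MARC tags and subfields chosen for title and author are assumptions, so check the real file for what it actually extracts.

```ruby
# Illustrative sketch only; the real simplest_possible_traject_config.rb may differ.
require 'traject'
require 'traject/macros/marc21'
extend Traject::Macros::Marc21   # makes extract_marc available (harmless if already included)

settings do
  # Read MARC-XML rather than binary MARC.
  provide "reader_class_name", "Traject::MarcReader"
  provide "marc_source.type", "xml"

  # Write human-readable output to a file instead of sending it to Solr.
  provide "writer_class_name", "Traject::DebugWriter"
  provide "output_file", "debug_output.txt"
end

# Field specifications below are assumptions made for this example.
to_field "id",     extract_marc("001", first: true)
to_field "title",  extract_marc("245abc", trim_punctuation: true)
to_field "author", extract_marc("100abcd:110abcd:111abc")
```

Each to_field line names an output field and says where in the MARC record its values come from; the settings block decides how records are read and where the results go.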
We can do the exact same run, but show off using multiple configuration files, with:
traject -c reader/marc-xml.rb -c simplest_possible_traject_config.rb -c writer/debug.rb sample_data/20.xml
Again, the data will be in debug_output.txt, as configured in the writer/debug.rb file.
This time we'll use the more complete sample index file in index.rb:
traject -c reader/marc-xml.rb -c writer/debug.rb -c index.rb sample_data/20.xml
Look through the index.rb file and the debug_output.txt file to see how the translation works.
traject takes a variety of options, many of which can be seen by simply running traject --help. Two of the most important are:
- -c configfile: Load a configuration file, such as those found under reader/ and writer/ in this repository.
- -s setting.name=value: Provide a command-line equivalent to the provide command used to control settings in a configuration file.
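To make the relationship between -s and provide concrete: as I understand traject's settings, provide only sets a key that has not been set yet, and command-line settings are applied before any configuration file is loaded, so a value given with -s wins over a provide for the same key. The toy class below models that first-wins behavior in plain Ruby. ToySettings is a hypothetical stand-in written for this example, not traject's own settings class.

```ruby
# Toy model of first-wins settings; NOT traject's real implementation.
class ToySettings
  def initialize
    @values = {}
  end

  # Set the key only if nothing has set it yet (first caller wins).
  def provide(key, value)
    @values[key] = value unless @values.key?(key)
    self
  end

  def [](key)
    @values[key]
  end
end

settings = ToySettings.new

# Stands in for `-s log.level=debug` on the command line (applied first).
settings.provide("log.level", "debug")

# Stands in for `provide "log.level", "info"` inside a config file (applied later).
settings.provide("log.level", "info")

# Only the config file sets this one, so its value is used.
settings.provide("output_file", "debug_output.txt")

puts settings["log.level"]
puts settings["output_file"]
```

The practical upshot is that configuration files can carry sensible defaults while a one-off run overrides any of them from the command line.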
Here are a few examples:
# Run a MARC binary file and get debug output to stdout
traject --debug-mode -c index.rb /path/to/file.mrc
# Ditto, but put the debug information in a file
traject --debug-mode -c index.rb -s output_file=debug.out /path/to/file.mrc
# ...and look at debug.out to see what happened
# Use a configuration file to get better control over the debug information
traject -c index.rb -c writer/debug.rb /path/to/file.mrc
# Ditto, but read in a MARC-XML file
traject -c index.rb -c reader/marc-xml.rb -c writer/debug.rb /path/to/file.xml
# Dump results to a json file for later processing / examination
traject -c index.rb -c reader/marc-xml.rb -c writer/json.rb /path/to/file.xml
# Send things to solr, but turn logging to debug level
traject -c index.rb -c reader/marc-xml.rb -c writer/solr.rb -s log.level=debug /path/to/file.xml