Collectionspace Migration Tools

🔥

Neither CollectionSpace (the organization) nor LYRASIS offers support for collectionspace_migration_tools.

This application is created and maintained for internal LYRASIS staff use only. Many of its design decisions are based on:

our CS hosting architecture
the assumption that it will be used by data migration experts on migration projects where the target CS instance is not in active use

This means:

it is highly unlikely anyone outside LYRASIS will be able to clone this repository and use the tool as-is
using this tool on a CS instance in active use is dangerous to the data integrity of that instance

However, we have made this code available in the spirit of open-source and transparency, in case any of it might be informative for CS institutions/users who wish to build their own tooling for working with CS data at scale.

Table of Contents

Requirements
One-time setup
Per-instance setup
Test your setup
Usage
Development
Tests
License
Security
Code of Conduct
Contributions
Versions
Credits

See doc/decisions.adoc for more info/background on some of the decisions made

Requirements

Ruby 3.1.0.
Docker and docker-compose.
- Running the required docker-compose command will by default set up two instances of Redis on ports 6380 and 6381. The port numbers used are configurable as described below
The user of this tool will have set up the Github and AWS CLI as detailed in our team technical documentation.
For each CS instance this tool will be used to manipulate:
- The user of this tool must be able to connect directly to the CollectionSpace database and perform read-only/select queries. Should be handled by correct assignment of user to the AWS MigrationsRole by Hosting team.
- If you plan to do batch ingests, there must be an AWS S3 bucket (Fast Import Bucket) set up for ingesting into the instance, and the user of this tool must have permission to add and delete objects to that bucket, as well as list objects in the bucket. Should be handled by correct assignment of user to the AWS MigrationsRole by Hosting team.

One-time setup

Add `bin` to PATH (optional)

To avoid having to prepend thor, rspec and other commands with bundle exec, add this repo’s ./bin to your PATH.

If you do not do this step, and running a command you see in the documentation fails, try prepending `bundle exec ` to the command.

Redis config

This should "just work" without you having to do anything, but you might want to change it if the ports being used for Redis conflict with something you use for other work.

If you want to change the Redis ports, you need to update them in two places:

the ./docker-compose.yml file (which builds the Redis instances and makes them locally accessible via the given ports)
the ./redis.yml file (which tells the application which port/Redis instance to use for storing RefNames vs. CSIDs)

Nothing in ./redis.yml is sensitive, as it’s all just on your local machine.

System config

Use the commented sample_system_config.yml as an example or starting point.

Location of file

You have three options for where to put your system config file. These will be checked in the following order. The first one found will be used:

Custom filename and location indicated in COLLECTIONSPACE_MIGRATION_TOOLS_SYSTEM_CONFIG environment variable. You can set this environment variable per-session or permanently in your shell/terminal config.
~/.config/collectionspace_migration_tools/system_config.yml
system_config.yml file in the base directory of this repository

File contents/settings

csv_chunk_size: The use/purpose of reading CSVs in chunks is explained in Faster Parsing CSV With Parallel Processing. Each chunk is sent to a parallel worker for processing. A chunk with more rows will take longer to process, but I have not investigated the tradeoff between queueing up/passing on more chunks vs. larger chunks.
max_threads: Maximum number of threads that will be spun up for a given parallel process run in threads.
max_processes: Maximum number of processes that will be spun up for a given parallel process run in processes.
aws_profile: The name of the second AWS profile you created (the one with role_arn and source_profile specified.

ℹ️

The default settings seem to be working ok for not-gigantic migration projects on my DTS-issued Macbook Pro, but I have not yet done much testing to figure out optimal settings for these. I assume if things are running super slowly, try upping max_threads/max_processes. If your system is too strained, lower max_threads/max_processes. I confess I’m not entirely sure if it thread vs process makes a difference in terms of system resource usage, but it seemed like a good idea to separate them in case this mattered.

💡	You can find what uses threads vs. processes by searching this codebase for `CMT.config.system.max_threads` and `CMT.config.system.max_processes`.

Per-instance setup

Use the commented sample_client_config.yml as an example or starting point.

See client config management documentation for more details.

You will need to have:

The tenant name for the instance. These can be found via CHIA. Do bin/console, and then CHIA.tenant_names to get a list of current tenant names.
Record mappers for the instance, downloaded from the cspace-config-untangler repo. Instances without their own UI plugin use community-supported profile mappers for the latest release of CollectionSpace. Mappers for hosted clients with their own UI plugins are in Lyrasis-hosted profiles.
The domain profile and profile version used by the site. These should match the file name prefixes of the mappers for the instance.
If planning to ingest into the instance, the name of the S3 Fast Import bucket for the CS instance (currently we need to request that Mark set this bucket up)
If planning to map data into CS XML, an unused Redis db number. Do thor config:redis_dbs to see which Redis dbs are already in use.

Test your setup

Once you have done the one-time config and set up at least one instance, you can verify that your AWS access works by doing the following in this repo’s base directory:

thor config switch #{instance config filename without .yml on the end}

bin/console
CMT::Build::S3Client.call

If you get Success(#<Aws::S3::Client>), good. If you get a Failure, something is not right.

Usage

Ensure desired config is in place (See One-time setup and Per-instance setup sections above)

cd into repository root

docker-compose up -d (Starts Redis instances. The -d puts docker-compose into the background, so you can use the terminal for other things)

thor list (to see available commands)

Run available commands as necessary.

❗	Most of the commands for routine workflow usage are under `thor batch` and `thor batches`. See workflow overview documentation for details.

docker-compose down (Stops and closes Redis containers. The Redis volumes are NOT removed, so your cached data should still be available next time you run docker-compose up -d.)

Development

You can also use the IRB console for direct access to all objects:

bin/console

💡	If you make changes to code while you are in the console, running `CMT.reload!` will reload the application without you needing to exit and restart console. This doesn’t always work to pick up all changes, but saves a lot of time anyway.

Tests

To test, run:

rspec

At least initially, a lot of the functionality around database connections, querying, and anything that relies on a database call is not covered in automated tests. This is mainly because I did not have time to figure out how to test that stuff in a meaningful way without exposing data that needs to be kept private.

Credits

Built by Kristina Spurgin with design/infrastructure input from Mark Cooper
Project scaffold built with Rubysmith.

Name		Name	Last commit message	Last commit date
Latest commit History 648 Commits
benchmark		benchmark
bin		bin
config		config
data		data
doc		doc
lib		lib
spec		spec
utils		utils
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitignore		.gitignore
.reek.yml		.reek.yml
.rubocop.yml		.rubocop.yml
.ruby-version		.ruby-version
CHANGELOG.adoc		CHANGELOG.adoc
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
Guardfile		Guardfile
LICENSE.adoc		LICENSE.adoc
README.adoc		README.adoc
Rakefile		Rakefile
Thorfile		Thorfile
client_batch_config.json		client_batch_config.json
current_config		current_config
docker-compose.yml		docker-compose.yml
redis.yml		redis.yml
sample_client_config.yml		sample_client_config.yml
sample_system_config.yml		sample_system_config.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Collectionspace Migration Tools

Requirements

One-time setup

Add `bin` to PATH (optional)

Redis config

System config

Location of file

File contents/settings

Per-instance setup

Test your setup

Usage

Development

Tests

License

Security

Code of Conduct

Contributions

Versions

Credits

About

Uh oh!

Releases 20

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

License

lyrasis/collectionspace_migration_tools

Folders and files

Latest commit

History

Repository files navigation

Collectionspace Migration Tools

Requirements

One-time setup

Add bin to PATH (optional)

Redis config

System config

Location of file

File contents/settings

Per-instance setup

Test your setup

Usage

Development

Tests

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Uh oh!

Languages

Add `bin` to PATH (optional)