Farms to Freeways Corpus Tools

This repository documents how to build a language corpus from the Farms to Freeways history project data.

The data are archived at Western Sydney University and are available in an Omeka repository.

Peter Sefton exported the data into an RO-Crate, using this process.

These tools work on the resulting RO-Crate.

Install

Run:

npm install

Usage

Create a file named make_run.sh containing the following:

#!/usr/bin/env bash

make BASE_DATA_DIR=<base path> \
 REPO_OUT_DIR=/opt/storage/oni/ocfl \
 REPO_SCRATCH_DIR=/opt/storage/oni/scratch-ocfl \
 BASE_TMP_DIR=./storage/temp \
 NAMESPACE=farms-to-freeways-example-dataset \
 DATA_DIR=<base path>/farms_to_freeways

Replace each <base path> placeholder with the appropriate location for your local installation.

Running this file using bash make_run.sh (or appropriate command) will generate an RO-Crate for the corpus.
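As a minimal sketch, the same invocation can be wrapped in a function so that the base path is supplied once rather than edited in two places. The function name run_corpus_make and the MAKE override are hypothetical and not part of this repository; the variable values are copied from the example above.

```shell
#!/usr/bin/env bash
# run_corpus_make is a hypothetical helper, not part of this repository.
# It runs make with the variables from make_run.sh, deriving both
# BASE_DATA_DIR and DATA_DIR from the single base path given as $1.
# Set MAKE to substitute another make binary (assumption for illustration).
run_corpus_make() {
  local base="$1"
  "${MAKE:-make}" \
    BASE_DATA_DIR="${base}" \
    REPO_OUT_DIR=/opt/storage/oni/ocfl \
    REPO_SCRATCH_DIR=/opt/storage/oni/scratch-ocfl \
    BASE_TMP_DIR=./storage/temp \
    NAMESPACE=farms-to-freeways-example-dataset \
    DATA_DIR="${base}/farms_to_freeways"
}
```

After sourcing the file, call it with your own base path as the only argument, for example run_corpus_make "$HOME/data". Quoting the path keeps locations that contain spaces intact.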

Making CSV files from PDF transcripts

This work has already been done and is not automated; the notes below record how it was done.

The transcripts in the Omeka repository are in PDF format, and speaker turns are indicated only by bold-face text.

There are some plain text versions available but they don't have speaker turns indicated.

To extract text from the PDF files in the repo, first convert them to SVG with LibreOffice.

On a Mac, this command creates SVG files in the working directory:

find farms-to-freeways/ -name "*.pdf" -exec /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to svg {} \;

Move these into an svgfiles directory:

mv *.svg svgfiles/
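The conversion and move steps above could also be combined by passing LibreOffice's --outdir option, so the SVG files are written straight into svgfiles/. This is a sketch rather than the process that was actually used: the function name convert_pdfs and the SOFFICE variable are hypothetical, and the default path is the Mac location from the command above.

```shell
#!/usr/bin/env bash
# Path to the LibreOffice binary; override SOFFICE on other platforms.
# (SOFFICE is a hypothetical variable introduced for this sketch.)
SOFFICE="${SOFFICE:-/Applications/LibreOffice.app/Contents/MacOS/soffice}"

# convert_pdfs is a hypothetical helper, not part of this repository.
# It converts every PDF found under $1 to SVG, writing output into $2.
convert_pdfs() {
  local src_dir="$1" out_dir="$2" pdf
  mkdir -p "$out_dir"
  # -print0 / read -d '' keeps file names with spaces intact.
  find "$src_dir" -name '*.pdf' -print0 |
    while IFS= read -r -d '' pdf; do
      # </dev/null stops the converter from consuming find's output.
      "$SOFFICE" --headless --convert-to svg --outdir "$out_dir" "$pdf" </dev/null
    done
}
```

For example, convert_pdfs farms-to-freeways/ svgfiles would replace both the find and mv commands. As with the original commands, PDFs that share a file name in different subdirectories would end up writing to the same output name.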

Run svg2csv to create CSV files in csvfiles/:

node svg2csv.js
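Since this step is run by hand, a quick check that every SVG produced a CSV may be useful. The sketch below assumes that svg2csv.js names each CSV after its input file (for example, x.svg becomes x.csv); that naming is an assumption and has not been checked against the script. The function name missing_csvs is hypothetical.

```shell
#!/usr/bin/env bash
# missing_csvs is a hypothetical helper, not part of this repository.
# It prints the base name of every .svg in $1 that has no same-named
# .csv in $2. ASSUMPTION: svg2csv.js keeps the input base name.
missing_csvs() {
  local svg_dir="$1" csv_dir="$2" svg name
  for svg in "$svg_dir"/*.svg; do
    [ -e "$svg" ] || continue          # the glob matched nothing
    name="$(basename "$svg" .svg)"
    [ -f "$csv_dir/$name.csv" ] || echo "$name"
  done
}
```

Running missing_csvs svgfiles csvfiles prints nothing when every SVG has a matching CSV.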

Copy the CSV files to CloudStor:

rsync -ruvi csvfiles/* ~/cloudstor/atap-repo-misc/farms_to_freeways_csv_files/

Convert the metadata file from a plain-old crate to being a corpus

Assuming there is a copy of the Farms to Freeways data, as exported from Omeka, in CloudStor, run the script:

make BASE_DATA_DIR=/farms-to-freeways/data REPO_OUT_DIR=/your/ocfl-repo BASE_TMP_DIR=/your/temp
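Because the command takes several directories, a short pre-flight check may catch a mistyped path before the run starts. This is a sketch only: require_dirs is a hypothetical helper, and which directories must exist beforehand is an assumption, since the Makefile may create the output and temp directories itself.

```shell
#!/usr/bin/env bash
# require_dirs is a hypothetical helper, not part of this repository.
# It reports each argument that is not an existing directory on stderr
# and returns 1 if any were missing, 0 otherwise.
require_dirs() {
  local missing=0 dir
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, require_dirs /farms-to-freeways/data && make BASE_DATA_DIR=/farms-to-freeways/data REPO_OUT_DIR=/your/ocfl-repo BASE_TMP_DIR=/your/temp runs make only when the input directory is present.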

How to run your own oni

See oni/README.md for instructions.
