
Data Storage

All raw and processed data is stored on Pawsey Supercomputing Centre's Acacia S3 storage, within the "csiem" bucket. Raw data is stored in the data-lake directory, and all processed data is stored in the data-warehouse directory.

Data Lake

A Data Lake is simply a centralised store of raw, disparate datasets, whether structured or unstructured. Centralising and cataloguing this raw data allows customised ETL processes to be constructed and tailored to each analytics use case, rather than forcing the data into a one-size-fits-all structure.

Data Warehouse

The Data Warehouse is the store of boutique, customised data products produced through the batch ETL (Extract, Transform and Load) pipelines. Each folder within the data-warehouse directory contains processed data in a different format, based on end-user requirements.

All data usage outside of the ETL pipelines must be carried out on products within the warehouse. This ensures both the efficiency and repeatability of any script or data product produced downstream, and provides a consistent data pathway for data validation.

There are currently three separate directories within the data-warehouse:

  • csv
  • mat
  • marvl

CSV

The csv directory contains data that has been directly imported from the data lake and standardised against the variable catalogue. Data is separated by site, variable and, in some cases, sampling campaign.
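
As a minimal sketch (assuming Python with pandas, a file already retrieved locally from data-warehouse/csv, and a purely hypothetical filename), a product can be loaded like this:

    # Minimal sketch: load a warehouse CSV product with pandas.
    # The filename is hypothetical -- substitute the actual site/variable
    # file retrieved from data-warehouse/csv.
    import pandas as pd

    df = pd.read_csv("example_site_temperature.csv")
    print(df.head())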

MAT

The mat directory contains data stored as MATLAB .mat files; a sketch for loading these files in Python follows the list below.

  • seaf.mat: All data variables in the units found in the variable catalogue
  • cockburn.mat: Subset of variables required for the TFV-AED model, converted into the units in the TFV catalogue.
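
As a minimal sketch (assuming Python with SciPy, a file already downloaded locally, and that the file is saved in a pre-v7.3 MAT format), the variables in cockburn.mat can be inspected like this:

    # Minimal sketch: inspect the variables stored in a downloaded .mat file.
    # Assumes a pre-v7.3 MAT format; files saved as v7.3 are HDF5-based and
    # would need h5py instead of scipy.io.
    from scipy.io import loadmat

    data = loadmat("cockburn.mat")
    print([k for k in data.keys() if not k.startswith("__")])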

MARVL

The marvl directory contains data in the default txt format for the SEAF platform.

Data Access

An Access Key ID and Secret Access Key are required to access the data. They can be obtained by contacting Brendan Busch (brendan.busch@uwa.edu.au).

Data can be accessed in a variety of ways, as the Pawsey storage is based upon Amazon's S3 storage protocol. Below is a how-to for WinSCP.
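
For scripted access, any S3-compatible client can be pointed at the same endpoint. As a minimal sketch (assuming Python with boto3, the projects.pawsey.org.au endpoint and path-style addressing described in the WinSCP steps below, placeholder credentials, and a hypothetical object key), the bucket can be listed and a file downloaded like this:

    # Minimal sketch of programmatic access via boto3 (an assumption; any
    # S3-compatible client should work). Credentials are placeholders.
    import boto3
    from botocore.config import Config

    s3 = boto3.client(
        "s3",
        endpoint_url="https://projects.pawsey.org.au",
        aws_access_key_id="<ACCESS_KEY_ID>",
        aws_secret_access_key="<SECRET_ACCESS_KEY>",
        config=Config(s3={"addressing_style": "path"}),
    )

    # List the top-level prefixes of the csiem bucket
    # (expected: data-lake/ and data-warehouse/).
    resp = s3.list_objects_v2(Bucket="csiem", Delimiter="/")
    for p in resp.get("CommonPrefixes", []):
        print(p["Prefix"])

    # Download a processed product from the warehouse (hypothetical key shown).
    s3.download_file("csiem", "data-warehouse/mat/cockburn.mat", "cockburn.mat")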

WinSCP

  1. Open WinSCP.
  2. A login window should pop up; otherwise select Session -> Start New Session.
  3. Select New Site -> File Protocol -> Amazon S3.
  4. In Host Name enter: projects.pawsey.org.au
  5. Port Number: 443 (should be the default).
  6. Enter the Access Key ID and Secret Access Key in the fields below (see Brendan Busch for access).
  7. In the Advanced menu -> Environment -> S3, change the URL style to ‘Path’.

To create a bucket, follow the instructions at the link below, noting the naming conventions:

https://support.pawsey.org.au/documentation/display/US/Acacia+-+Introduction
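
As a minimal sketch (reusing the boto3 client configured in the Data Access sketch above, with a hypothetical bucket name), a bucket can also be created programmatically via the standard S3 API:

    # Minimal sketch: create a new bucket via the S3 API, reusing the boto3
    # client ("s3") configured earlier. The bucket name is hypothetical;
    # check Pawsey's naming conventions at the link above first.
    s3.create_bucket(Bucket="my-new-bucket")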
