This repository contains code for a scripted process that reviews published and unpublished datasets in a Dataverse instance and produces reports which identify datasets to be considered for deaccessioning.

TexasDigitalLibrary/dv-data-retention-reviewer

dv-data-retention-reviewer

This repository contains code for a scripted process that reviews published and unpublished datasets in a Dataverse instance and produces reports which identify datasets to be considered for deaccessioning.

It was developed specifically to support data retention decision making in the Texas Data Repository (https://dataverse.tdl.org/), but it is designed to be adaptable to other Dataverse installations.

Instructions for using the dv-data-retention-reviewer

This scripted process is designed to be run locally using Python 3. If you do not already have Python 3 on your machine, you can download it at https://www.python.org/downloads/.

To get the code onto your machine, either clone the repository (if you are comfortable using git) or download a ZIP containing the files in the repository.

The recommended approach is to use a code editor such as VSCode to edit the .env config file and to run the script with your installed Python 3 interpreter through the VSCode Python extension. Instructions for running a Python script in VSCode can be found at https://code.visualstudio.com/docs/python/run.

To run the code successfully, you should have admin privileges for every dataverse that you want to run the script against, and you will need a valid Dataverse API key associated with your account in the Dataverse instance. Instructions for obtaining and using a Dataverse API key can be found at https://guides.dataverse.org/en/latest/api/getting-started.html.
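As an illustration of how a Dataverse API key is typically used, the sketch below builds (without sending) a Dataverse Search API request that carries the key in the X-Dataverse-key header. It is not taken from this repository's code: the function name build_search_request and the placeholder key are hypothetical, and the script itself may call different endpoints.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(api_host: str, api_key: str, subtree: str) -> Request:
    """Build, but do not send, a Search API request for datasets in one dataverse.

    Hypothetical helper for illustration; not part of dv-data-retention-reviewer.
    """
    query = urlencode({"q": "*", "type": "dataset", "subtree": subtree})
    url = f"{api_host.rstrip('/')}/api/search?{query}"
    # Dataverse accepts the API token in the X-Dataverse-key request header.
    return Request(url, headers={"X-Dataverse-key": api_key})


# Placeholder values: substitute your own host, key, and dataverse alias.
request = build_search_request("https://dataverse.tdl.org", "xxxxxxxx-xxxx", "utexas")
print(request.full_url)
```

Passing the key in a header rather than in the URL keeps it out of most request logs. To actually send the request you would pass it to urllib.request.urlopen.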

Configuring the .env file

Once the repository is cloned or downloaded, rename the .env.template file to .env and edit it, replacing the example values provided by default with the correct values for the institution the script will be run for. Take care to preserve the JSON formatting of the .env file: the Python scripts in the repository depend on the parameters defined in it, and they will not function properly if the formatting is broken.
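Because the .env file must remain valid JSON, it can help to check it before running the script. The sketch below is one way to do that, assuming the file holds a single JSON object; the function name parse_config and the choice of which keys to treat as required are assumptions for illustration, not part of the repository.

```python
import json

# Assumption: these two parameters must always be present.
REQUIRED_KEYS = {"dataverse_api_key", "dataverse_api_host"}


def parse_config(text: str) -> dict:
    """Parse the contents of a JSON-formatted .env file and check required keys."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Typical causes: a trailing comma, a missing quote, or single quotes.
        raise ValueError(f".env is not valid JSON: {err}") from err
    if not isinstance(config, dict):
        raise ValueError(".env must contain a single JSON object")
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f".env is missing parameters: {sorted(missing)}")
    return config


# Example with placeholder values standing in for the file's contents.
sample = '{"dataverse_api_key": "xxxx", "dataverse_api_host": "https://dataverse.tdl.org"}'
config = parse_config(sample)
print(config["dataverse_api_host"])
```

To check a real file, read it first, for example with pathlib.Path(".env").read_text(encoding="utf-8"), and pass the result to parse_config.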

Explanation of configurable parameters

  • dataverse_api_key: "", the personal API token for the Dataverse instance, generated by the user who will run the script
  • dataverse_api_host: "", the base URL of the Dataverse instance, e.g. "https://dataverse.tdl.org"
  • showdatasetdetails: "True" or "False", default = "True", determines whether dataset metadata is recorded in log files
  • showdataretentionscoredetails: "True" or "False", default = "True", determines whether data retention score details are included in generated reports
  • institutionaldataverse: the name of an individual institution's dataverse within a multi-institutional Dataverse installation, e.g. "utexas" could be used to identify UT Austin datasets within the Texas Data Repository
  • unpublisheddatasetreviewthresholdinyears: 1, how many years a dataset can remain unpublished in a Dataverse instance before it is identified as needing review for potential deaccessioning; all unpublished datasets less than this many years old are listed as not needing review
  • publisheddatasetreviewthresholdinyears: 2, how many years a dataset can remain published in a Dataverse instance before it is identified as needing review for potential deaccessioning; all published datasets less than this many years old are listed as not needing review
  • unpublisheddatasetreviewthresholdingb: 1, how large (in GB) an unpublished dataset can be before it is identified as needing review for potential deaccessioning; all unpublished datasets smaller than this many GB are listed as not needing review, even if they exceed the age threshold for unpublished datasets defined above
  • publisheddatasetreviewthresholdingb: 2, how large (in GB) a published dataset can be before it is identified as needing review for potential deaccessioning; all published datasets smaller than this many GB are listed as not needing review, even if they exceed the age threshold for published datasets defined above
  • mitigatingfactormincitationcount: 1, the minimum number of citations used to determine whether a dataset that exceeds the age and size thresholds should still be retained and not considered for deaccessioning
  • mitigatingfactormindownloadcount: 1, the minimum number of downloads used to determine whether a dataset that exceeds the age and size thresholds should still be retained and not considered for deaccessioning
  • mitigatingfactorfundedresearch: "True" or "False", default = "True", determines whether datasets with associated grant funding information in their metadata should still be retained and not considered for deaccessioning, even if they exceed the age and size thresholds set above
  • processunpublisheddatasets: "True" or "False", default = "True", determines whether unpublished datasets are processed by the script; this can be set to "False" if unpublished datasets will not be considered for removal
  • processpublisheddatasets: "True" or "False", default = "True", determines whether published datasets are processed by the script; this can be set to "False" if published datasets will not be considered for removal
  • processdeaccessioneddatasets: "True" or "False", default = "True", determines whether deaccessioned datasets are processed by the script; this can be set to "False" if you do not want the script to spend time gathering information about deaccessioned datasets
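The way the age threshold, size threshold, and mitigating factors interact can be sketched as a single decision function. This is only one reading of the parameter descriptions above, not the repository's implementation: the Dataset class, the needs_review function, the use of "at least the minimum" for citation and download counts, and applying every mitigating factor to any dataset are all assumptions. The actual script also reports a data retention score, which is not modeled here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dataset:
    """Hypothetical record holding only the fields the thresholds refer to."""
    created: date
    size_gb: float
    citations: int = 0
    downloads: int = 0
    has_funding: bool = False


def needs_review(ds: Dataset, today: date, threshold_years: float,
                 threshold_gb: float, min_citations: int, min_downloads: int,
                 funded_is_mitigating: bool) -> bool:
    """Return True if the dataset would be flagged for deaccessioning review."""
    age_years = (today - ds.created).days / 365.25
    if age_years < threshold_years:
        return False  # too recent to need review
    if ds.size_gb < threshold_gb:
        return False  # small datasets are exempt even when old
    # Mitigating factors (assumed: meeting any one is enough to retain).
    if ds.citations >= min_citations or ds.downloads >= min_downloads:
        return False
    if funded_is_mitigating and ds.has_funding:
        return False
    return True


# Example using the published-dataset values listed above (2 years, 2 GB).
old_large = Dataset(created=date(2020, 1, 1), size_gb=5.0)
flagged = needs_review(old_large, date(2024, 1, 1), 2, 2, 1, 1, True)
print(flagged)
```

Under this reading, a dataset is flagged only when it is both older and larger than the thresholds and no mitigating factor applies.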

Contact

For any questions about this repository, please contact the UT Austin Research Data Services team, which has led initial development of this tool, by sending an email to utl-rds@austin.utexas.edu.
