dv-data-retention-reviewer

This repository contains code for a scripted process that reviews published and unpublished datasets in a Dataverse instance and produces reports which identify datasets to be considered for deaccessioning.

It was developed specifically to support data retention decision making in the Texas Data Repository (https://dataverse.tdl.org/) but is designed to be adaptable for other Dataverse installations.

Instructions for using the dv-data-retention-reviewer

This scripted process is designed to be run locally using Python 3. If you do not already have Python 3 on your machine, you can download it at https://www.python.org/downloads/. To download the code to your machine, either clone the repository (if you are comfortable using git) or download a ZIP containing the files in the repository.

The recommended approach is to use a code editor such as VSCode to edit the .env config file and to run the script with your installed Python 3 interpreter through the VSCode Python extension. Instructions for running a Python script in VSCode can be found at https://code.visualstudio.com/docs/python/run.

To run the code successfully, you should have admin privileges for all of the repositories in the Dataverse instance that you want to run the script against, and you will need a valid Dataverse API key associated with your account in that instance. Instructions for obtaining and using a Dataverse API key can be found at https://guides.dataverse.org/en/latest/api/getting-started.html.
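Before running the full script, it can be useful to confirm that your API key is accepted by the instance. The following is a minimal sketch, not part of this repository: the function names build_whoami_request and fetch_current_user are hypothetical, and it assumes the standard Dataverse native API endpoint /api/users/:me with the X-Dataverse-key header described in the Dataverse API guide.

```python
import json
import urllib.request


def build_whoami_request(api_host, api_key):
    """Build (but do not send) a request for the account tied to an API key.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    url = api_host.rstrip("/") + "/api/users/:me"
    # Dataverse reads the API token from the X-Dataverse-key header.
    return urllib.request.Request(url, headers={"X-Dataverse-key": api_key})


def fetch_current_user(api_host, api_key, timeout=30):
    """Send the request and return the decoded JSON response.

    urllib raises urllib.error.HTTPError if the server rejects the key.
    """
    request = build_whoami_request(api_host, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_current_user("https://dataverse.tdl.org", "<your key>") should return account details if the key is valid and raise an error if it is not.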

Configuring the .env file

Once this repo is cloned locally, rename the .env.template file to .env and replace the example values provided in the file with the correct values for the institution for which the script will be run. Take care to preserve the JSON formatting of the .env file, because the Python scripts in the repository depend on the parameters defined there.
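A quick way to catch a formatting mistake is to parse the edited file before running the script. The sketch below assumes that the .env file holds a single JSON object and that dataverse_api_key and dataverse_api_host must be non-empty. The function names parse_config and load_config are hypothetical and may not match how the script itself reads the file.

```python
import json

# Parameters assumed to be mandatory for any run (an assumption of this sketch).
REQUIRED_KEYS = ("dataverse_api_key", "dataverse_api_host")


def parse_config(text):
    """Parse JSON config text and check the connection parameters are set."""
    config = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(config, dict):
        raise ValueError("expected a single JSON object of parameters")
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing or empty parameters: " + ", ".join(missing))
    return config


def load_config(path=".env"):
    """Read the config file from disk and validate it with parse_config."""
    with open(path, encoding="utf-8") as handle:
        return parse_config(handle.read())
```

If load_config(".env") returns without raising, the file is at least valid JSON with the two connection parameters filled in.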

Explanation of configurable parameters

dataverse_api_key: "", the personal Dataverse API token generated by the user of the script
dataverse_api_host: "", the base URL of the Dataverse instance, e.g. "https://dataverse.tdl.org"
showdatasetdetails: "True" or "False", default = "True", determines if dataset metadata should be recorded in log files
showdataretentionscoredetails: "True" or "False", default = "True", determines if data retention score details should be included in generated reports
institutionaldataverse: "", the name of an individual dataverse within a multi-institutional Dataverse installation, e.g. "utexas" could be used to identify UT Austin datasets within the Texas Data Repository
unpublisheddatasetreviewthresholdinyears: 1, the threshold for how long a dataset can remain unpublished in a Dataverse instance before being identified as needing review for potential deaccessioning - all unpublished datasets less than this many years old will be listed as not needing review
publisheddatasetreviewthresholdinyears: 2, the threshold for how long a dataset can remain published in a Dataverse instance before being identified as needing review for potential deaccessioning - all published datasets less than this many years old will be listed as not needing review
unpublisheddatasetreviewthresholdingb: 1, the threshold for how large an unpublished dataset can be in a Dataverse instance before being identified as needing review for potential deaccessioning - all unpublished datasets smaller than this many GB will be listed as not needing review even if they exceed the age threshold for unpublished datasets defined above
publisheddatasetreviewthresholdingb: 2, the threshold for how large a published dataset can be in a Dataverse instance before being identified as needing review for potential deaccessioning - all published datasets smaller than this many GB will be listed as not needing review even if they exceed the age threshold for published datasets defined above
mitigatingfactormincitationcount: 1, the minimum number of citations at which a dataset exceeding the age and size thresholds should still be retained and not considered for deaccessioning
mitigatingfactormindownloadcount: 1, the minimum number of downloads at which a dataset exceeding the age and size thresholds should still be retained and not considered for deaccessioning
mitigatingfactorfundedresearch: "True" or "False", default = "True", determines if datasets which have associated grant funding information in their metadata should still be retained and not considered for deaccessioning even if they exceed the age and size thresholds set above
processunpublisheddatasets: "True" or "False", default = "True", determines if unpublished datasets should be processed by the script - this can be set to "False" if unpublished datasets will not be considered for removal
processpublisheddatasets: "True" or "False", default = "True", determines if published datasets should be processed by the script - this can be set to "False" if published datasets will not be considered for removal
processdeaccessioneddatasets: "True" or "False", default = "True", determines if deaccessioned datasets should be processed by the script - this can be set to "False" if you do not want the script to spend time gathering information about deaccessioned datasets
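To illustrate how the thresholds and mitigating factors described above could combine, here is a minimal sketch of the review rule. It is an interpretation of the parameter descriptions, not the script's actual implementation. The names needs_review, as_bool and CONFIG are hypothetical, the values in CONFIG are the example values from the list above, and the sketch assumes that a dataset exactly at a threshold counts as exceeding it.

```python
# Example values taken from the parameter list; adjust for your institution.
CONFIG = {
    "unpublisheddatasetreviewthresholdinyears": 1,
    "publisheddatasetreviewthresholdinyears": 2,
    "unpublisheddatasetreviewthresholdingb": 1,
    "publisheddatasetreviewthresholdingb": 2,
    "mitigatingfactormincitationcount": 1,
    "mitigatingfactormindownloadcount": 1,
    "mitigatingfactorfundedresearch": "True",
}


def as_bool(value):
    """Interpret the "True"/"False" strings used in the config file."""
    return str(value).strip().lower() == "true"


def needs_review(config, *, published, age_years, size_gb,
                 citations=0, downloads=0, funded=False):
    """Return True if a dataset should be flagged for deaccessioning review."""
    prefix = "published" if published else "unpublished"
    # Datasets younger than the age threshold never need review.
    if age_years < config[prefix + "datasetreviewthresholdinyears"]:
        return False
    # Datasets smaller than the size threshold are exempt even when old.
    if size_gb < config[prefix + "datasetreviewthresholdingb"]:
        return False
    # Mitigating factors: enough citations, enough downloads, or funding.
    if citations >= config["mitigatingfactormincitationcount"]:
        return False
    if downloads >= config["mitigatingfactormindownloadcount"]:
        return False
    if funded and as_bool(config["mitigatingfactorfundedresearch"]):
        return False
    return True
```

With these example values, a three-year-old published dataset of 5 GB with no citations, downloads, or funding information would be flagged, while the same dataset with a single citation would not.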

Contact

For any questions about this repository, please contact the UT Austin Research Data Services team, which has led initial development of this tool, by sending an email to utl-rds@austin.utexas.edu.
