Code For PDX Open Data Project FAQ
This is an introduction to the CodeForPdx Open Data project for prospective volunteers.
Currently, data gathered by Portland's various bureaus (traffic sensor data, crime data, environmental data, etc.) is stored in disparate sources and has varying formats. By creating metadata with a consistent format, datasets can be stored and examined systematically. Much of this data will also be made available to the public for analysis and the improvement of the city.
We propose a web application to serve as the entry point for new datasets and to prepare them for publication to an open (public) site. The long-term goal is to add tools supporting data quality, integrity, privacy, and data descriptors to the data publication pipeline.
Another important goal is to connect data governance to users and stakeholders.
Code for PDX and volunteers will help maintain and upgrade the framework. Volunteers will also develop tools and features for data curation and assessment. Our current implementation uses Django for the front end and back end with Postgres, includes an API, and runs in Docker containers.
A bureau data manager uploads a weekly dataset that consistently contains quality or integrity errors.
The website will ingest data uploaded from files or URLs. We currently support the comma-separated value (CSV) format but plan to offer JSON ingestion as well. Metadata is captured from user input during the upload process.
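The ingestion step described above might look like the following sketch. The function name, metadata fields, and sample data are illustrative assumptions, not the project's actual schema; only the standard library is used here for clarity.

```python
import csv
import io

def ingest_csv(text, metadata):
    """Parse CSV text and attach user-supplied metadata.

    Hypothetical sketch: field names ("title", "bureau") are assumed,
    not taken from the project's real upload form.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        "metadata": {
            "title": metadata.get("title", ""),
            "bureau": metadata.get("bureau", ""),
            "columns": reader.fieldnames,   # captured from the file itself
            "row_count": len(rows),
        },
        "rows": rows,
    }

# Example upload: a tiny made-up traffic-sensor file.
sample = "id,speed\n1,35\n2,42\n"
dataset = ingest_csv(sample, {"title": "Traffic sensors", "bureau": "PBOT"})
```

In the real application the text would come from an uploaded file or a fetched URL, and the metadata dictionary from the web form.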
At this stage we assess the raw dataset and its metadata for integrity, quality, and privacy, and generate a report that can be used to remedy any problems.
The integrity tests assess whether the data is properly formatted (syntactically correct).
The quality tests check whether each value conforms to the datatype declared for its column and is properly formatted. Values should also make sense in context (data typing and semantics).
The privacy tests assure that the data is free from privacy restrictions. These checks will probably be coded in Python with NumPy and pandas.
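The three assessment passes could be sketched as below. The schema format, the returned report shape, and the list of restricted columns are all assumptions for illustration; the project's actual checks may differ.

```python
def assess(rows, schema, pii_columns):
    """Run integrity, quality, and privacy checks on parsed rows.

    Illustrative sketch only: `schema` maps column name -> type caster,
    `pii_columns` is an assumed list of privacy-restricted column names.
    """
    report = {"integrity": [], "quality": [], "privacy": []}
    for i, row in enumerate(rows):
        # Integrity: every expected column is present (syntactic check).
        missing = [col for col in schema if col not in row]
        if missing:
            report["integrity"].append((i, missing))
            continue
        # Quality: each value parses as its declared datatype.
        for col, caster in schema.items():
            try:
                caster(row[col])
            except ValueError:
                report["quality"].append((i, col, row[col]))
    # Privacy: flag any column known to hold restricted data.
    report["privacy"] = [c for c in pii_columns if rows and c in rows[0]]
    return report

# A row whose "speed" value cannot be cast to int fails the quality pass.
rows = [{"id": "1", "speed": "35"}, {"id": "2", "speed": "fast"}]
report = assess(rows, {"id": int, "speed": int}, ["name", "ssn"])
```

The resulting report drives the remediation step: an empty report means the dataset can move on toward publication.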
We do not yet know what our role in the publishing process at The City Of Portland will be. Our current implementation includes a dataset API for published datasets. It can also be used to view the data for those who have an account. To be truly open data, we would have to allow anonymous viewers access to the data.
A bureau data manager uploads the dataset and fills in the metadata form. The data satisfies the quality assessment. A data package is constructed for the data and the dataset status is updated to indicate it can be published.
A bureau data manager uploads the dataset and fills in the metadata form. The data fails the quality assessment. The data manager is contacted to help make corrections. When the data quality has improved, the process is repeated.
A bureau data manager uploads the dataset and fills in the metadata form. The data fails the quality assessment, the integrity assessment, or both. From experience, we know how to transform that data and have written scripts that always run for this data source.
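Per-source cleanup scripts like those described in this last scenario could be organized as a small registry that runs automatically before reassessment. Everything here, the source name, the transform, and the sample quirk, is made up for illustration.

```python
# Registry mapping a data source name to its cleanup functions.
TRANSFORMS = {}

def transform(source):
    """Decorator registering a cleanup function for a known source."""
    def register(fn):
        TRANSFORMS.setdefault(source, []).append(fn)
        return fn
    return register

@transform("traffic-sensors")
def strip_units(rows):
    # Hypothetical known quirk of this feed: speeds arrive as "35 mph".
    for row in rows:
        row["speed"] = row["speed"].replace(" mph", "")
    return rows

def clean(source, rows):
    """Apply every registered transform for this source, in order."""
    for fn in TRANSFORMS.get(source, []):
        rows = fn(rows)
    return rows

rows = clean("traffic-sensors", [{"speed": "35 mph"}])
```

After the registered scripts run, the dataset would be assessed again as in the normal pipeline.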
https://www.smartcitypdx.com/open-data-program
Datasets are typically submitted as comma-separated value (CSV) files. Along with the CSV files, one or more metadata files must be included as part of the dataset. More detail about the metadata can be found in the PORTLAND OPEN DATA HANDBOOK.
Possible targets for the open data project include http://www.civicapps.org/datasets