Code For PDX Open Data Project FAQ
This is an introduction to the CodeForPdx Open Data project for prospective volunteers.
Currently, data gathered by Portland's various bureaus (traffic sensor data, crime data, environmental data, etc.) is stored in disparate sources and has varying formats. By creating metadata with a consistent format, datasets can be stored and examined systematically. Much of this data will also be made available to the public for analysis and the improvement of the city.
We propose a web application to serve as the entry point for new datasets and to prepare them for publication to an open (public) site. The long-term goal is to add tools supporting data quality, integrity, privacy, and data descriptors to the data publication pipeline.
Another important goal is to connect data governance to users and stakeholders.
Code for PDX and volunteers will help maintain and upgrade the framework. Volunteers will also develop tools and features for data curation and assessment. Our current implementation uses Django for the front end and back end with Postgres, includes an API, and runs in Docker containers.
A bureau data manager uploads a weekly dataset that consistently contains quality or integrity errors.
The website will ingest data uploaded from files or URLs. We currently support the comma-separated value (CSV) format but plan to offer JSON ingestion as well. Metadata is captured from user input during the upload process.
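The ingestion step described above might look like the following sketch. The function name, metadata fields, and sample data are illustrative assumptions, not the project's actual schema; only the standard library is used here for clarity.

```python
import csv
import io

def ingest_csv(text, metadata):
    """Parse CSV text and attach user-supplied metadata.

    Hypothetical sketch: field names ("title", "bureau") are assumed,
    not taken from the project's real upload form.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        "metadata": {
            "title": metadata.get("title", ""),
            "bureau": metadata.get("bureau", ""),
            "columns": reader.fieldnames,   # captured from the file itself
            "row_count": len(rows),
        },
        "rows": rows,
    }

# Example upload: a tiny made-up traffic-sensor file.
sample = "id,speed\n1,35\n2,42\n"
dataset = ingest_csv(sample, {"title": "Traffic sensors", "bureau": "PBOT"})
```

In the real application the text would come from an uploaded file or a fetched URL, and the metadata dictionary from the web form.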
At this stage we assess the raw dataset and its metadata for integrity, quality, and privacy, and generate a report that can be used to remedy any problems.
The integrity tests assess whether the data is properly formatted (syntactically correct).
The quality tests check whether each value conforms to the datatype declared for its column and is properly formatted. Values should also make sense in context (data typing and semantics).
The privacy tests assure that the data is free from privacy restrictions. These checks will probably be coded in Python with NumPy and pandas.
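The three assessment passes could be sketched as below. The schema format, the returned report shape, and the list of restricted columns are all assumptions for illustration; the project's actual checks may differ.

```python
def assess(rows, schema, pii_columns):
    """Run integrity, quality, and privacy checks on parsed rows.

    Illustrative sketch only: `schema` maps column name -> type caster,
    `pii_columns` is an assumed list of privacy-restricted column names.
    """
    report = {"integrity": [], "quality": [], "privacy": []}
    for i, row in enumerate(rows):
        # Integrity: every expected column is present (syntactic check).
        missing = [col for col in schema if col not in row]
        if missing:
            report["integrity"].append((i, missing))
            continue
        # Quality: each value parses as its declared datatype.
        for col, caster in schema.items():
            try:
                caster(row[col])
            except ValueError:
                report["quality"].append((i, col, row[col]))
    # Privacy: flag any column known to hold restricted data.
    report["privacy"] = [c for c in pii_columns if rows and c in rows[0]]
    return report

# A row whose "speed" value cannot be cast to int fails the quality pass.
rows = [{"id": "1", "speed": "35"}, {"id": "2", "speed": "fast"}]
report = assess(rows, {"id": int, "speed": int}, ["name", "ssn"])
```

The resulting report drives the remediation step: an empty report means the dataset can move on toward publication.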
We do not yet know what our role in the publishing process at The City Of Portland will be. Our current implementation includes a dataset API for published datasets. It can also be used to view the data for those who have an account. To be truly open data, we would have to allow anonymous viewers access to the data.
A bureau data manager uploads the dataset and fills in the metadata form. The data satisfies the quality assessment. A data package is constructed for the data and the dataset status is updated to indicate it can be published.
A bureau data manager uploads the dataset and fills in the metadata form. The data fails the quality assessment. The data manager is contacted to help make corrections. When the data quality has improved, the process is repeated.
A bureau data manager uploads the dataset and fills in the metadata form. The data fails the quality assessment, the integrity assessment, or both. From experience, we know how to transform that data and have written scripts that always run for this data source.
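Per-source cleanup scripts like those described in this last scenario could be organized as a small registry that runs automatically before reassessment. Everything here, the source name, the transform, and the sample quirk, is made up for illustration.

```python
# Registry mapping a data source name to its cleanup functions.
TRANSFORMS = {}

def transform(source):
    """Decorator registering a cleanup function for a known source."""
    def register(fn):
        TRANSFORMS.setdefault(source, []).append(fn)
        return fn
    return register

@transform("traffic-sensors")
def strip_units(rows):
    # Hypothetical known quirk of this feed: speeds arrive as "35 mph".
    for row in rows:
        row["speed"] = row["speed"].replace(" mph", "")
    return rows

def clean(source, rows):
    """Apply every registered transform for this source, in order."""
    for fn in TRANSFORMS.get(source, []):
        rows = fn(rows)
    return rows

rows = clean("traffic-sensors", [{"speed": "35 mph"}])
```

After the registered scripts run, the dataset would be assessed again as in the normal pipeline.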
https://www.smartcitypdx.com/open-data-program
Datasets are typically submitted as comma-separated value (CSV) files. Along with the CSV files, one or more metadata files must be included as part of the dataset. More detail about the metadata can be found in the PORTLAND OPEN DATA HANDBOOK.
Possible targets for the open data project include http://www.civicapps.org/datasets