aarbizu/baseballdatabank

 
 

Chadwick Baseball Databank

Forked from https://github.com/chadwickbureau/baseballdatabank, with added automation to spin up Docker-hosted databases (MySQL or PostgreSQL) and load them with the data from the Chadwick CSV files.

For the README associated with the parent repo, see https://github.com/chadwickbureau/baseballdatabank.

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see: http://creativecommons.org/licenses/by-sa/3.0/

How to Use

  1. Install Docker Desktop if you haven't already.

  2. Pick either MySQL or postgres. YMMV, but on my MBP the postgres image is smaller and loads faster. Both work.

  3. Change into the appropriate directory and initialize the db. For example, for postgres, once Docker is installed and running:

    cd pg-scripts
    sh init_pg_statsdb.sh
    

    This gives you a running Docker container, named bb-stats by default, built from an image. You can stop and start it with the stop_db.sh and start_db.sh scripts in the same directory as the "init" script you ran.

    You don't have to worry about the .sql files or the run_.sh files; they are built into the image as glue code for loading or tearing down the db.

  4. Start using the dbs! If you are familiar with databases and SQL, you can connect directly to the running container and submit queries. For example:

    docker exec -it bb-stats /bin/bash -c "psql -U postgres stats"
    
    # OR for mysql
    
    docker exec -it bb-stats /bin/bash -c "mysql stats" 
    

    If you want to connect via Python or some other client, the ports are at their defaults unless you change them yourself: 3306 for MySQL, 5432 for postgres.
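For scripted rather than interactive use, the same clients can run a single statement and exit. The following is a minimal sketch, assuming the default container name bb-stats and the stats database shown above. The helper names default_port and query_cmd are hypothetical and not part of this repo, and SELECT 1 is only a connectivity check, not a query against the loaded tables.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with this repo).

# Container name; bb-stats is the default mentioned in the steps above.
CONTAINER="${CONTAINER:-bb-stats}"

# Map an engine name to its default port, as listed in the README.
default_port() {
    case "$1" in
        mysql)    echo 3306 ;;
        postgres) echo 5432 ;;
        *)        echo "unknown engine: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) a docker command that executes one SQL statement
# non-interactively. The statement must not contain double quotes,
# because it is wrapped in them here.
query_cmd() {
    engine="$1"
    sql="$2"
    case "$engine" in
        postgres)
            printf 'docker exec -i %s psql -U postgres -d stats -c "%s"\n' "$CONTAINER" "$sql" ;;
        mysql)
            printf 'docker exec -i %s mysql -e "%s" stats\n' "$CONTAINER" "$sql" ;;
        *)
            echo "unknown engine: $engine" >&2; return 1 ;;
    esac
}
```

To run the generated command, pipe it to a shell, for instance `query_cmd postgres 'SELECT 1' | sh`. Printing the command first makes it easy to inspect before anything touches the container.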

Baseball Databank

Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.

This work is licensed by Chadwick Baseball Bureau under the Creative Commons Attribution-ShareAlike 3.0 Unported License. For details see http://creativecommons.org/licenses/by-sa/3.0/

About this data

  • This is a legacy resource. Data in this format has been circulated by various people for many years, and there are many applications and users who have tools which take data in this format. It is maintained by Chadwick Baseball Bureau to support compatibility with those tools and programs. As such, the schema is not open to amendments, either in terms of the scope of coverage or in terms of the data categories available.
  • This is a free resource. Statistical data will be updated once at some point during the MLB offseason. To borrow the slogan used by ProMods, "It's ready when it's ready." New releases will be announced via our Twitter account at @chadwickbureau. We, politely, will not be able to respond to any enquiries as to when new versions of the data will be released.
  • These data are maintained wholly by Chadwick Baseball Bureau, for the benefit of the community. Users who require data of a different scope, in a different format, and/or with more specific schedules for updates are encouraged to enquire about our various licensing options.

Using or citing this data

We repeat, this is a legacy resource intended for backwards compatibility only. It is suitable for casual or exploratory use, as a convenient dataset for students to practice their data skills, and so forth.

It is not suitable for use as the basis for any kind of publication. The legacy parts of this data are not maintained, most likely contain errors, and definitely do not reflect many of the latest revisions to the historical record.

Researchers wanting a dataset that is suitable for research or publication purposes should contact Chadwick Baseball Bureau for enquiries.

Organisation of the files

There are three directories in the repository.

  • core/ contains the databank itself. These files are automatically produced from our larger dataset.
  • contrib/ contains files which are manually maintained by others using the same identifier system as the core. We bundle these for the convenience of the community.
  • upstream/ contains files used to construct the databank.
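For a quick look at what one of these directories holds, a short shell sketch can list each CSV with its row count and print a file's column names. This assumes every file has a header row on line 1, ends with a newline, and has no line breaks inside quoted fields; the function names csv_summary and csv_columns are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for exploring the CSV directories.

# Print "<file name><TAB><data rows>" for each CSV in a directory
# (default: core). The header line is excluded from the count.
csv_summary() {
    dir="${1:-core}"
    for f in "$dir"/*.csv; do
        [ -e "$f" ] || continue      # the glob matched nothing
        rows=$(($(wc -l < "$f") - 1))
        printf '%s\t%s\n' "$(basename "$f")" "$rows"
    done
}

# Print the column names of one CSV, one per line.
# Carriage returns are stripped in case the file uses CRLF line endings.
csv_columns() {
    head -n 1 "$1" | tr -d '\r' | tr ',' '\n'
}
```

From the repository root, `csv_summary core` gives an overview of the databank files, and `csv_columns` on any one of them shows the fields to expect once the data is loaded.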

Maintenance and sources

Most of the data in the Databank is provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com). The data differ from the data the Bureau provides to its clients in that they contain less detail, are updated less frequently, and are provided on an as-is basis.

The Databank is historically based in part on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.

The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.

Enquiries and suggested revisions

Enquiries and suggested revisions to the data can be posted in the issue tracker at https://github.com/chadwickbureau/baseballdatabank/issues.

Files in core/ are all generated by scripts. As such they are not edited manually (and therefore pull requests should not be submitted against these files).

Files in upstream/ are manually-maintained files which contain information specific to constructing the Databank. As they are maintained manually, it is valid to submit pull requests containing corrections or additions to these files.
