Juriscraper is a scraper library started several years ago that gathers judicial opinions, oral arguments, and PACER data in the American court system. It is currently able to scrape:
- a variety of pages and reports within the PACER system
- opinions from all major appellate Federal courts
- opinions from all state courts of last resort except for Georgia (typically their "Supreme Court")
- oral arguments from all appellate federal courts that offer them
Juriscraper is part of a two-part system. The second part is your code, which calls Juriscraper. Your code is responsible for calling a scraper, downloading and saving its results. A reference implementation of the caller has been developed and is in use at CourtListener.com. The code for that caller can be found here. There is also a basic sample caller included in Juriscraper that can be used for testing or as a starting point when developing your own.
Some of the design goals for this project are:
- extensibility to support video, oral argument audio, etc.
- extensibility to support geographies (US, Cuba, Mexico, California)
- Mime type identification through magic numbers
- Generalized architecture with minimal code repetition
- XPath-based scraping powered by lxml's html parser
- return all meta data available on court websites (caller can pick what it needs)
- no need for a database
- clear log levels (DEBUG, INFO, WARN, CRITICAL)
- friendly as possible to court websites
First step: Install Python 3.9+, then:
On Ubuntu based distributions/Debian Linux:
sudo apt-get install libxml2-dev libxslt-dev libyaml-dev
On Arch based distributions:
sudo pacman -S libxml2 libxslt libyaml
On macOS with Homebrew <https://brew.sh>:
brew install libyaml
pip install juriscraper
You can set an environment variable for where you want to stash your logs (this can be skipped, and /var/log/juriscraper/debug.log will be used as the default if it exists on the filesystem):
export JURISCRAPER_LOG=/path/to/your/log.txt
Some websites are too difficult to crawl without some sort of automated WebDriver. For these, Juriscraper either uses a locally-installed copy of geckodriver or can be configured to connect to a remote webdriver. If you prefer the local installation, you can download Selenium FireFox Geckodriver:
# choose OS compatible package from: # https://github.com/mozilla/geckodriver/releases/tag/v0.26.0 # un-tar/zip your download sudo mv geckodriver /usr/local/bin
If you prefer to use a remote webdriver, like Selenium's docker image, you can configure it with the following variables:
WEBDRIVER_CONN
: Use this to set the connection string to your remote
webdriver. By default, this is local
, meaning it will look for a local
installation of geckodriver. Instead, you can set this to something like,
'http://YOUR_DOCKER_IP:4444/wd/hub'
, which will switch it to using a remote
driver and connect it to that location.
SELENIUM_VISIBLE
: Set this to any value to disable headless mode in your
selenium driver, if it supports it. Otherwise, it defaults to headless.
For example, if you want to watch a headless browser run, you can do so by starting selenium with:
docker run \ -p 4444:4444 \ -p 5900:5900 \ -v /dev/shm:/dev/shm \ selenium/standalone-firefox-debug
That'll launch it on your local machine with two open ports. 4444 is the
default on the image for accessing the webdriver. 5900 can be used to connect
via a VNC viewer, and can be used to watch progress if the SELENIUM_VISIBLE
variable is set.
Once you have selenium running like that, you can do a test like:
WEBDRIVER_CONN='http://localhost:4444/wd/hub' \ SELENIUM_VISIBLE=yes \ python sample_caller.py -c juriscraper.opinions.united_states.state.kan_p
Kansas's precedential scraper uses a webdriver. If you do this and watch selenium, you should see it in action.
We welcome contributions! If you'd like to get involved, please take a look at our CONTRIBUTING.md guide for instructions on setting up your environment, running tests, and more.
Juriscraper is licensed under the permissive BSD license.