File downloaders rework #229

@EvilDrPurple

Context

Our file downloaders could use a rework. They seem overly complex and support only a few file types, with various modules calling each other in a required order that is unclear. There are also defunct scripts littered about. I believe a much more straightforward approach is possible, and it would go a long way toward helping people understand how and when to use our util modules. During work on #227, I found that the following will download any file type when given a download URL:

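import requests

# stream=True avoids loading the whole file into memory at once
r = requests.get(url, stream=True)
r.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
with open(file_path, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=8192):  # the default chunk size is 1 byte
        fd.write(chunk)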

SEE: downloaders.py, get_files.py, muckrock_scraper.py
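
As a rough sketch of how the snippet above could become a single reusable entry point, it might be wrapped in one function. The name `download_file`, its parameters, and the defaults for `chunk_size` and `timeout` are assumptions for illustration only; nothing like this exists in the repo yet, and the existing scrapers may need a different signature.

```python
from pathlib import Path

import requests


def download_file(url, file_path, chunk_size=8192, timeout=30):
    """Stream the resource at `url` to `file_path` and return the path written.

    Hypothetical helper: the name, parameters, and defaults are illustrative.
    """
    destination = Path(file_path)
    # Create the parent directory so callers do not have to.
    destination.parent.mkdir(parents=True, exist_ok=True)

    # stream=True defers downloading the body until iter_content() is consumed.
    with requests.get(url, stream=True, timeout=timeout) as response:
        # Raise on 4xx/5xx rather than writing an error page to disk.
        response.raise_for_status()
        with open(destination, "wb") as fd:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fd.write(chunk)
    return destination


# Example call (placeholder URL and path):
# download_file("https://example.com/report.pdf", "data/report.pdf")
```

A scraper would then only need the download URL and a target path, with no dependency on the order in which other util modules are called.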

Requirements

  • Should be simple and easy for people to understand how to consume the module(s) and how they work
  • Should be clear what modules, in what order, and when they should be called
  • Should not break any existing functionality of scrapers or other util scripts

Docs

  • Docs related to the file downloaders and util scripts should be updated where necessary
  • New docs should be written to explain how to use and consume the file downloaders

Open questions

  • Understanding the current code, deciding what functionality to keep, and untangling it will likely be time-consuming
  • Perhaps think about keeping an entire pipeline of functionality in one folder for organizational purposes
