
More output options #56

@Chaphasilor

Description


Is your feature request related to a problem? Please describe.

We are currently running into the problem that we have very large (3GB+) JSON files generated by ODD, but we can't process them because we don't have enough RAM to parse the JSON.
I personally love JSON, but the format seems ill-suited for this task: it isn't streamable, so the whole document has to be parsed into memory at once.
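To illustrate the pain: a standard parse has to materialize the entire document, and working around that requires an incremental parser plus knowledge of the exact document layout. A minimal sketch using the `ijson` library, assuming (hypothetically, this is not the actual ODD schema) a top-level `files` array whose entries carry a `url` field:

```python
import ijson  # incremental JSON parser; avoids loading the whole document at once

# NOTE: the "files" array and "url" field are assumptions about the layout,
# not the real ODD output schema.
with open("scan.json", "rb") as f:
    for entry in ijson.items(f, "files.item"):
        print(entry["url"])
```

Even then, the parser still has to tokenize the full multi-GB file and every consumer has to hard-code the document structure; a line-oriented format would sidestep both problems.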

Now, you might ask: why not just use the .txt file? The problem is that it is only created after the scan has finished, including the file size estimations. After scanning a large OD for ~6h yesterday, I had a couple million links, with over 10M links still queued for file size estimation. The actual URLs were already there, but the only way to save them was to hit J and save them as JSON.

Describe the solution you'd like

There are multiple features that would be useful for very large ODs:

  • add a key command to prematurely save the .txt file
    this should be no problem at all and is simply a missing option/command at this point
  • adopt a new file format that supports streaming parsers
    think JSON Lines, CSV, whatever (see the sketch after this list)
    it might also be a good idea to restructure the meta info of the scan and files in order to remove duplicate info and make the output files smaller and easier to work with
  • while we're at it, an option to save the reddit output as well as the error logs to separate files would also be appreciated! :D
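
For context on why a line-oriented format helps, here is a minimal sketch of consuming a JSON Lines export with bounded memory; the `scan.jsonl` filename and the `url` field are hypothetical, just to show the shape:

```python
import json

def iter_records(path):
    """Yield one record at a time from a JSON Lines file."""
    # Each line is a self-contained JSON object, so memory use is bounded
    # by a single record instead of the whole multi-GB scan.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Hypothetical usage: pull out just the URLs without holding everything in RAM.
for record in iter_records("scan.jsonl"):
    print(record.get("url"))
```

The same property works on the write side: records can be appended as they are discovered, which would also cover the "save prematurely" case above.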

@MCOfficer and I would be glad to discuss the new file structure further, if you're so inclined :)
