Is your feature request related to a problem? Please describe.
We are currently running into the problem that we have very large (3GB+) JSON files generated by ODD, but can't process them because we don't have enough RAM to parse the JSON.
I personally love JSON, but it seems like the format is not well-suited for the task (it's not streamable).
Now, you might ask: why don't you just use the .txt file? The problem is that it is only created after the scan has finished, including file size estimations. After scanning a large OD for ~6h yesterday, I had a couple million links, with over 10M links still queued for file size estimation. The actual URLs were already there, but the only way to save them was by hitting `J` to save as JSON.
Describe the solution you'd like
There are multiple features that would be useful for very large ODs:
- add a key command to prematurely save the `.txt` file; this should be no problem at all and is simply a missing option/command at this point
- adopt a new file format that supports streaming parsers - think `jsonlines`, `csv`, whatever (see the sketch after this list). It might also be a good idea to restructure the meta info of the scan and files in order to remove duplicate info and make the output files smaller and easier to work with
- while we're at it, an option for saving the reddit output as well as error logs to a separate file would also be appreciated! :D
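To illustrate why a line-based format helps, here is a minimal sketch of what consuming a JSON Lines output could look like. The file name and field names (`url`, `size`) are made up for the example and are not ODD's actual schema; the point is simply that each record is parsed independently, so memory use stays bounded no matter how large the output file grows.

```python
import json

def iter_records(path):
    """Yield one scan record at a time; memory use is bounded by the longest line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)

if __name__ == "__main__":
    # Hypothetical file name and fields, purely for illustration.
    total_size = 0
    for record in iter_records("scan-output.jsonl"):
        total_size += record.get("size") or 0
    print(f"total estimated size: {total_size} bytes")
```

The same property works in the other direction too: the scanner could append one line per discovered URL as it goes, so a partial (or interrupted) scan still leaves a usable file behind.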
@MCOfficer and I would be glad to discuss the new file structure further, if you're so inclined :)