Speed up process_map function by using data.table #68
The `process_map` function felt a bit slow, so I rewrote parts of it to try and speed it up. Based on the `profvis` profiling output, I replaced the `dplyr`-style slicing with `data.table` methods, but kept the output exactly the same (column order, sorting).
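To illustrate the kind of rewrite involved, here is a minimal sketch of translating a grouped `dplyr` aggregation into `data.table`. The toy event log and its columns (`case_id`, `activity`) are placeholders for illustration, not the actual internals of `process_map`:

```r
library(data.table)

# Toy event log standing in for the real input.
eventlog <- data.frame(
  case_id  = c(1, 1, 2, 2, 3),
  activity = c("A", "B", "A", "C", "B")
)

# dplyr-style version (the shape of code that was replaced):
# eventlog |>
#   dplyr::group_by(activity) |>
#   dplyr::summarize(n = dplyr::n_distinct(case_id)) |>
#   dplyr::arrange(activity)

# data.table equivalent: aggregate per group, then sort,
# keeping the same columns and ordering as before.
dt <- as.data.table(eventlog)
dt[, .(n = uniqueN(case_id)), by = activity][order(activity)]
```

The point of the translation is that `data.table` does the grouping and aggregation in optimized C code, which is where the profiling showed most of the time being spent.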
I used `microbenchmark` to run `process_map(data)` 50 times and report how much less time (median time) the new version took to run compared to the old. The resulting speed-up depends on the data set:

- `sepsis` takes ~33% less time
- `patients` takes ~34% less time
- `hospital_billing` takes ~70% less time (from 1525 ms to 459 ms)
- `traffic_fines` takes ~72% less time (from 1457 ms to 395 ms)

On my real-life dataset containing almost 200,000 rows, `process_map(data, frequency('absolute-case'))` takes ~30% less time to run (from 32 s to 21 s). Note that the benchmark on the real-life dataset was run on a different device than the example datasets.
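For reference, the benchmark harness looked roughly like the sketch below. `process_map_old()` and `process_map_new()` are hypothetical names standing in for the pre- and post-rewrite versions of the function (in practice the two package versions would be loaded separately); the example logs come from `eventdataR`:

```r
library(microbenchmark)
library(eventdataR)  # provides the sepsis, patients, ... example logs

bm <- microbenchmark(
  old = process_map_old(sepsis),  # hypothetical: pre-rewrite version
  new = process_map_new(sepsis),  # hypothetical: data.table version
  times = 50
)

# Fraction of median time saved by the new version (~0.33 for sepsis).
s <- summary(bm)
1 - s$median[s$expr == "new"] / s$median[s$expr == "old"]
```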