Skip to content

Spark random forest issues #19

@szilard

Description

@szilard

This is to collaborate on some issues with Spark RF also addressed by @jkbradley in comments to this post http://datascience.la/benchmarking-random-forest-implementations/ (see comments by Joseph Bradley). cc: @mengxr

Please see “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M row training set and the test set linked in from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (pre- 1-hot encoding) and does the 1-hot encoding in Spark. Also, if random forest 1.4 API can use data frames, I guess we should use that for the training. Can you please provide code for that too.

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that does that already? I would be happy to check it out.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions