
Convert To Svm

Benoit Favre edited this page Mar 14, 2014 · 2 revisions

Convert examples from icsiboost to svm_light file format

The icsiboost file format splits the information across two kinds of files: the .names file, which contains the list of labels and a description of feature namespaces, types and generators, and the data files (such as .data and .test), which contain the examples. The svm_light format assumes that labels are specified as natural numbers (1... n). Features are denoted as pairs of an id (1... n) and a value (real number). Each example consists of a label followed by a list of features in ascending order of id. This format is compatible with the mlcomp project.
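As an illustration of the svm_light layout described above, here is a minimal sketch in Python 3 that formats one example as a line. The function name `format_svm_light` and the sample ids and values are hypothetical; they are not taken from icsiboost_to_svm.py.

```python
def format_svm_light(label, features):
    """Format one example as an svm_light line.

    label:    class id as a natural number (1... n)
    features: dict mapping a feature id (1... n) to a real value
    (Hypothetical helper, for illustration only.)
    """
    parts = [str(label)]
    # svm_light expects the features in ascending order of id.
    for feature_id in sorted(features):
        parts.append("%d:%g" % (feature_id, features[feature_id]))
    return " ".join(parts)


line = format_svm_light(2, {7: 1.0, 3: 0.5, 12: 1.0})
print(line)  # 2 3:0.5 7:1 12:1
```

Each feature appears as `id:value`, and the label comes first on the line.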

A script is provided to convert data from the icsiboost format to the svm_light format.

USAGE: python2 icsiboost_to_svm.py --names <names-file> [options] <input_files>
OPTIONS:
  --names <names-file>    feature type specification file
  -N <text_expert>        choose a text expert between fgram, ngram and sgram
  -W <ngram_length>       specify window length of text expert
  --keep-empty-examples   empty examples are removed by default
  -o <feature-dict>       output feature dictionary
  -i <feature-dict>       input feature dictionary
  -e <output-extension>   extension of output files (by concatenation, defaults to .svm)
  --cutoff <freq>         ignore nominal features occurring infrequently
  --ignore <columns>      ignore a comma separated list of columns
  --ignore-regex <regex>  ignore columns that match a given regex
  --only <columns>        use only a comma separated list of columns
  --only-regex <regex>    use only columns that match a given regex

Pass the same -N, -W and --cutoff options that you pass to icsiboost. The script then generates the corresponding ngram/sgram/fgram features and drops infrequent ones. By default, examples left without any feature after the cutoff are skipped (they do not appear in the output); use --keep-empty-examples to retain them. If you supply an input feature dictionary (-i), the cutoff is not applied and features that are not in the dictionary are ignored.
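The cutoff and empty-example behaviour can be sketched roughly as follows in Python 3. This is not the script's actual code: the function names, the `>=` comparison against the cutoff, and the order in which ids are assigned are all assumptions made for illustration.

```python
from collections import Counter


def build_dictionary(examples, cutoff):
    """Map each sufficiently frequent feature name to an id (1... n).

    Assumption: a feature is kept when it occurs at least `cutoff` times,
    and ids are assigned in sorted order of feature name.
    """
    counts = Counter(name for example in examples for name in example)
    kept = sorted(name for name, count in counts.items() if count >= cutoff)
    return {name: index + 1 for index, name in enumerate(kept)}


def map_examples(examples, dictionary, keep_empty=False):
    """Replace feature names by ids, ignoring names absent from the dictionary."""
    mapped = []
    for example in examples:
        ids = sorted({dictionary[name] for name in example if name in dictionary})
        # Examples left with no features are dropped unless keep_empty is set.
        if ids or keep_empty:
            mapped.append(ids)
    return mapped


examples = [["color:red", "size:big"], ["color:red"], ["shape:round"]]
dictionary = build_dictionary(examples, cutoff=2)
print(dictionary)                          # {'color:red': 1}
print(map_examples(examples, dictionary))  # [[1], [1]]
```

Reusing an existing dictionary corresponds to calling only `map_examples`: no counting takes place, and unseen features are silently dropped.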

Note that boostexter scored text features are not supported.

Example:

python2 icsiboost_to_svm.py --names adult.names adult.data adult.test -o features.dict

This produces adult.data.svm and adult.test.svm, which contain the examples in svm_light format, and features.dict, which contains pairs of "namespace:feature id".
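To inspect the generated .svm files, a line can be parsed back with a few lines of Python 3. This is a minimal sketch: `parse_svm_light_line` is a hypothetical helper, the sample line is made up rather than taken from adult.data.svm, and comments or other optional parts of the svm_light format are not handled.

```python
def parse_svm_light_line(line):
    """Parse one svm_light line into (label, [(feature_id, value), ...]).

    Hypothetical helper, for illustration only.
    """
    fields = line.split()
    label = int(fields[0])
    features = []
    for field in fields[1:]:
        feature_id, value = field.split(":", 1)
        features.append((int(feature_id), float(value)))
    return label, features


label, features = parse_svm_light_line("2 3:0.5 7:1 12:1")
print(label)     # 2
print(features)  # [(3, 0.5), (7, 1.0), (12, 1.0)]
```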
