This repository was archived by the owner on Feb 16, 2024. It is now read-only.

File Formats

Benoit Favre edited this page Mar 14, 2014 · 2 revisions

Description of the file formats used by icsiboost.

The file formats are mostly compatible with BoosTexter.

Column description: stem.names

This file defines the classes, the column names, and the type of weak learner to generate for each column. Every line ends with a period. The first line contains a comma-separated list of class names. Each following line contains a column definition: a name, a colon, and a type. Column names consist of letters, underscores and digits (other characters can be used, but they may cause parse errors in the model file). If the type is "text", the data is split into words (on spaces) and binary n-gram experts are generated (the type and length of the n-grams can be selected with the -N and -W command line options). If the type is "continuous", thresholding experts are generated. They are ternary because they distinguish three cases: the feature is above the threshold, below it, or unknown. If the type is "ignore", the column is ignored (this is NOT compatible with BoosTexter). The type can also be a comma-separated list of nominal values. This type is deprecated and generates the same experts as the "text" type.

class1, class2, class3.
column1: text.
column2: text.
column3: text.
column4: continuous.
column5: continuous.
column6: continuous.
column7: continuous.
column8: ignore.
column9: ignore.
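As an illustration of the layout described above, here is a minimal Python sketch that reads the contents of a names file. It is not part of icsiboost: the function name `parse_names` is hypothetical, and the sketch assumes that every line ends with a period and that the column name never contains a colon.

```python
def parse_names(text):
    """Parse the contents of a .names file into (classes, columns)."""
    # Keep non-empty lines and drop the period that terminates each one.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    lines = [ln[:-1] if ln.endswith(".") else ln for ln in lines]

    # First line: comma-separated list of class names.
    classes = [c.strip() for c in lines[0].split(",")]

    # Remaining lines: "name: type" (split on the first colon only).
    columns = []
    for ln in lines[1:]:
        name, _, col_type = ln.partition(":")
        columns.append((name.strip(), col_type.strip()))
    return classes, columns


example = """class1, class2, class3.
column1: text.
column4: continuous.
column8: ignore.
"""
classes, columns = parse_names(example)
print(classes)   # ['class1', 'class2', 'class3']
print(columns)   # [('column1', 'text'), ('column4', 'continuous'), ('column8', 'ignore')]
```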

Starting at r104, you can specify the type of text expert and its length on a per-column basis in the names file (previously set globally with -N ngram -W 3...). For example:

class1, class2, class3.
column1: text: expert_type=ngram expert_length=5 cutoff=3.
column2: text.
column3: text: cutoff=10.
column4: text: expert_type=sgram expert_length=10.
...

cutoff=n is equivalent to the --cutoff option and removes textual features that appear fewer than n times. Options that are not given for a column default to the values set on the command line.
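The per-column syntax can be split into a name, a type and a set of key=value options. The following Python sketch shows one way to do it. The helper `parse_column_definition` is hypothetical and is not taken from the icsiboost sources. It assumes options are separated by spaces and keeps their values as strings.

```python
def parse_column_definition(line):
    """Split 'name: type: key=value ...' into (name, type, options)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period

    # At most three parts: name, type, and an optional option string.
    parts = [p.strip() for p in body.split(":", 2)]
    name, col_type = parts[0], parts[1]

    options = {}
    if len(parts) == 3:
        for item in parts[2].split():
            key, _, value = item.partition("=")
            options[key] = value  # values are kept as strings
    return name, col_type, options


print(parse_column_definition(
    "column1: text: expert_type=ngram expert_length=5 cutoff=3."))
print(parse_column_definition("column2: text."))
```

A caller would fall back to the command line values for any key missing from the returned dictionary.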

Example files: stem.{data,dev,test}

The example files contain one instance per line. Each line consists of columns separated by commas and ends with a period. Columns must appear in the same order as they are defined in the .names file. The last column is the true class (or nothing at test time). Words in "text" columns are separated by spaces. Unknown values are represented by a question mark. The .data file contains the training examples, the .dev file contains the development set and the .test file contains the test set.

word1 word2 word3, word4, word5, 0.1, 0.2, 0.3, 0.4, garbage1, garbage2, class1.
word1 word3, ?, word6, 0.6, 0.2, ?, 0.4, garbage3, garbage4, class2.

An example can have multiple class labels, separated by spaces in the class column:

word1 word2 word3, word4, word5, 0.1, 0.2, 0.3, 0.4, garbage1, garbage2, class1 class4.
word1 word3, ?, word6, 0.6, 0.2, ?, 0.4, garbage3, garbage4, class2 class7 class8.
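To make the line format concrete, here is a small Python sketch that converts one example line into feature values and labels. The function `parse_example` is hypothetical. It assumes the column types are already known from the names file, that field values contain no commas, and that any field after the declared columns is the class column.

```python
def parse_example(line, column_types):
    """Parse one example line into (features, labels)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    fields = [f.strip() for f in body.split(",")]

    # The field after the declared columns holds the space-separated labels.
    n = len(column_types)
    labels = fields[n].split() if len(fields) > n else []

    features = []
    for raw, col_type in zip(fields, column_types):
        if col_type == "ignore":
            continue                      # ignored columns are skipped
        if raw == "?":
            features.append(None)         # unknown value
        elif col_type == "continuous":
            features.append(float(raw))
        else:
            features.append(raw.split())  # text: list of words
    return features, labels


types = ["text"] * 3 + ["continuous"] * 4 + ["ignore"] * 2
line = ("word1 word3, ?, word6, 0.6, 0.2, ?, 0.4, "
        "garbage3, garbage4, class2 class7 class8.")
features, labels = parse_example(line, types)
print(features)
print(labels)
```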

Model file: stem.shyp

The model file contains the definitions of the weak classifiers along with their contributions to the final score. See the articles and the source code for more details. Here are the layouts of a text classifier and of a continuous classifier. The contribution to each class is a space-separated list of floating point values in the same order as the class definitions in the names file. Note that weak classifiers are listed in the order in which they were found during training.

   weight Text:SGRAM:column_name:token_value

contribution to each class if absent or unknown

contribution to each class if present


   weight Text:THRESHOLD:column_name:

contribution to each class if unknown

contribution to each class if below threshold

contribution to each class if above threshold

threshold_value

For example, the first two classifiers from adult.shyp:

100 <- number of iterations

   1.000000000000 Text:SGRAM:marital-status:0

label is ">50K" label is "<=50K"
-1.3300887032 1.3300887032  <- marital-status != "Married-civ-spouse" (item 0 in names file definition)

-0.1066994013 0.1066994013  <- marital-status == "Married-civ-spouse"


   1.000000000000 Text:THRESHOLD:capital-gain:

0.0000000000 0.0000000000   <- capital-gain is unknown

-0.1099545814 0.1099545814  <- capital-gain <= 7073.5

2.6628799617 -2.6628799617  <- capital-gain > 7073.5

7073.5000000000   <- the actual threshold
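The following Python sketch shows how the two classifiers above could vote on an example with marital-status "Never-married" and capital-gain 0. It is an illustration under assumptions, not the icsiboost implementation: the function names are hypothetical, the weight is assumed to multiply each contribution, and any normalization applied to the final score is not covered.

```python
def sgram_vote(present, if_absent, if_present):
    """Per-class contributions of a Text:SGRAM weak classifier."""
    return if_present if present else if_absent


def threshold_vote(value, threshold, if_unknown, if_below, if_above):
    """Per-class contributions of a Text:THRESHOLD weak classifier."""
    if value is None:
        return if_unknown  # unknown feature value
    return if_above if value > threshold else if_below


def add_votes(total, weight, vote):
    """Accumulate a weighted vote (assumption: weight scales the vote)."""
    return [t + weight * v for t, v in zip(total, vote)]


# Class order follows the example: ">50K", then "<=50K".
scores = [0.0, 0.0]
scores = add_votes(scores, 1.0, sgram_vote(
    False,                              # not "Married-civ-spouse"
    [-1.3300887032, 1.3300887032],
    [-0.1066994013, 0.1066994013]))
scores = add_votes(scores, 1.0, threshold_vote(
    0.0, 7073.5,                        # capital-gain = 0
    [0.0, 0.0],
    [-0.1099545814, 0.1099545814],
    [2.6628799617, -2.6628799617]))
print(scores)  # roughly [-1.44, 1.44], so "<=50K" is favored
```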

The model can be packed: identical classifiers are averaged and reweighted to reduce the number of classification steps at test time. This is BoosTexter's default behavior, but you should not pack your model if you want to analyze or use the individual training steps (an unpacked model preserves the order of the classifiers). The number on the first line of a .shyp file is the number of weak learners (i.e. iterations) to load at test time. It is not necessarily the actual number of weak learners in the file (it can be affected by the --optimal-iterations option), and it can be overridden with the -n option.

Classification mode, short output

The predictions for an example are output on one line. There are two groups of values, each with as many values as there are classes in the .names file, in the same order as in the .names file. The first group is one binary flag per class giving the reference activation of that class (if available). The second group contains the actual prediction scores. In the single-label case, the label with the highest score is predicted. In the multi-label case, several classes can be predicted: all classes with a positive score.

0 1 -0.018425116509 0.018425116509
0 1 -0.004071426535 0.004071426535
1 0 -0.001862755005 0.001862755005
1 0 0.014314053147 -0.014314053147
0 1 -0.037851334231 0.037851334231
0 1 -0.015030095760 0.015030095760
0 1 -0.011415782398 0.011415782398
1 0 0.004168640079 -0.004168640079
0 1 -0.014499579535 0.014499579535
0 1 -0.007088537326 0.007088537326
1 0 0.001905840098 -0.001905840098
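A line of short output can be split back into its two groups and turned into a decision with a few lines of Python. This is a sketch with hypothetical function names. It assumes the reference flags are present and that the number of classes is known from the .names file.

```python
def parse_short_output(line, num_classes):
    """Split one short-output line into (reference flags, scores)."""
    values = line.split()
    reference = [int(v) for v in values[:num_classes]]
    scores = [float(v) for v in values[num_classes:2 * num_classes]]
    return reference, scores


def predict_single_label(scores):
    """Index of the class with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def predict_multi_label(scores):
    """Indices of all classes with a positive score."""
    return [i for i, s in enumerate(scores) if s > 0]


reference, scores = parse_short_output(
    "1 0 -0.001862755005 0.001862755005", 2)
predicted = predict_single_label(scores)
print(reference, predicted)              # [1, 0] 1
print(reference[predicted] == 1)         # False: this example is misclassified
```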

Classification mode, long output

All the features of the example are output along with their names, followed by the correct labels (correct label = ...) and then the scores and decisions. * marks a reference label and > marks a decision. When the decision is right, both appear together: *>.

age: 25
workclass: Private
fnlwgt: 226802
education: 11th
education-num: 7
marital-status: Never-married
occupation: Machine-op-inspct
relationship: Own-child
race: Black
sex: Male
capital-gain: 0
capital-loss: 0
hours-per-week: 40
native-country: United-States
correct label = <=50K
   -0.018425 : >50K
*>  0.018425 : <=50K

age: 38
workclass: Private
fnlwgt: 89814
education: HS-grad
education-num: 9
marital-status: Married-civ-spouse
occupation: Farming-fishing
relationship: Husband
race: White
sex: Male
capital-gain: 0
capital-loss: 0
hours-per-week: 50
native-country: United-States
correct label = <=50K
   -0.004071 : >50K
*>  0.004071 : <=50K
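The score lines of the long output can be recognized with a regular expression. The Python sketch below is based only on the sample output shown above: the pattern and the function `parse_score_line` are hypothetical and may need adjusting for other outputs (for example, scores in scientific notation).

```python
import re

# Optional "*" (reference), optional ">" (decision), score, ":", label.
SCORE_LINE = re.compile(r"^(\*?)(>?)\s*(-?\d+(?:\.\d+)?)\s*:\s*(.+)$")


def parse_score_line(line):
    """Return (is_reference, is_decision, score, label), or None."""
    match = SCORE_LINE.match(line.rstrip())
    if match is None:
        return None  # feature line, "correct label" line, or blank line
    star, arrow, score, label = match.groups()
    return star == "*", arrow == ">", float(score), label


print(parse_score_line("   -0.018425 : >50K"))
print(parse_score_line("*>  0.018425 : <=50K"))
print(parse_score_line("age: 25"))
```

Feature lines such as "age: 25" do not match because the text before the colon is not a number.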