SOM
(syntax changed as of version 1.2)
This program implements the well-known Kohonen Self-Organizing Map. It maps a set of high-dimensional input vectors onto a two-dimensional grid. For more theoretical information, please see the following reference:
Pattern-Recognition and classification of Images of Biological Macromolecules using Artificial Neural Networks.
Marabini R., Carazo, J.M. Biophys J. 66:(6) 1804-1814 Jun 1994.
$ classify_som ...
Parameters
- `` The input data file (raw file). It should be a text file in which each row represents a data item and each column represents a variable. It should have the following format:
    3 1000
    12 34 54
    -12 45 76
    ...
    32 45 76
The first line indicates the dimension of the vectors (in this case 3) and the number of vectors (in this case 1000). Please note that vector components (variables) are separated by spaces. Additionally, the last column can be used as a label for the vector. Example:
    3 1000
    12 34 54 labelA
    -12 45 76 labelB
    ...
    32 45 76 labelN
- `` The output code vectors file. This parameter will set the base name for the generated output files. SOM produces several files with different information and all of them will use this name but with different extensions. The generated files will be:
-
[basename].cod
Resulting code vectors. The generated code vectors file follows the same format as the input data, except that some extra information is stored in the first line of the file. Example:
    3 rect 10 7 gaussian
    11 31 52 labelA
    -10 43 71 labelB
    ...
    29 39 71 labelN
The first line indicates the dimension of the vectors (in this case 3), the topology of the map (in this case rectangular), the XY dimensions (in this case 10x7) and the "gaussian" label, which is only there to be fully compatible with Kohonen's SOM_PAK package (http://www.cis.hut.fi/research/som_lvq_pak.shtml).
-
[basename].inf
Information file about the parameters used and the resulting quantization error. It will look like this:
    Kohonen SOM algorithm
    Input data file : g0u.dat
    Input code vectors file : g0u.cod
    Code vectors output file : g0u.cod
    Algorithm information output file : g0u.inf
    Number of feature vectors: 2457
    Number of variables: 15
    Horizontal dimension (Xdim) = 10
    Vertical dimension (Ydim) = 5
    Hexagonal topology
    Gaussian neighborhood function
    Initial learning rate (alpha) = 0.1
    Initial neighborhood radius (radius) = 10
    Total number of iterations = 20000
    Input data not normalized
    Quantization error : 12.9864
-
[basename].his
Information about the number of input vectors assigned to each code vector. It is like a histogram of the resulting code vectors. The file contains two columns: the first is the number of the code vector and the second is the number of input vectors assigned to it.
-
[basename].err
Average quantization error for each code vector. The file contains two columns: the first is the number of the code vector and the second is the average quantization error for that code vector.
- `` The input code vectors file. This parameter is optional and is useful when the code vectors are to be initialized with a set of predefined values. A typical case is running the algorithm several times, with the output of one run used as the input to the next.
- `` Save a file for each code vector containing a list of the indexes of the input vectors that were assigned to it. Example: if a 10x7 map is used, then 70 files named [basename].[Codevector Index] (`basename.0`, `basename.1`, etc.) will be generated.
- `` Horizontal size of the map
- `` Vertical size of the map
- `` Rectangular Topology (Default)
- `` Hexagonal Topology. The following picture will help in understanding the differences between both topologies and the map axis convention:
    Xdim is ------>
    HEXAGONAL:
     O O O O O O O O O
    O O O & & & O O O
     O O & @ @ & O O O
    O O & @ + @ & O O
     O O & @ @ & O O O
    O O O & & & O O O
     O O O O O O O O O
    RECTANGULAR:
    O O O O O O O O O
    O O O O & O O O O
    O O O & @ & O O O
    O O & @ + @ & O O
    O O O & @ & O O O
    O O O O & O O O O
    O O O O O O O O O
- `` Gaussian neighborhood learning kernel. (Default)
- `` Bubble neighborhood learning kernel. The two kernel parameters define the way the neighbors of the winning neuron are updated during the training process. For details see:
T. Kohonen, Self-Organizing Maps, Second Edition, Springer-Verlag (1997).
- `` Initial learning rate value (default = 0.1). The learning rate starts at this value and is decreased during training
- `` Initial neighborhood radius (default = max(xdim, ydim)). This is the initial neighborhood radius, which determines the set of neighbors that are updated along with the winning node during training. It is decreased during training. By default it is the larger of the two map dimensions, so the neighborhood initially covers the whole map
- `` Use truly randomized codevectors. The code vectors are initialized to real random values
- `` Iterations number (Default = 10000)
- `` Normalize input data (Default = No)
- `` Information level while running:
- `` No information (default)
- `` Progress bar with the elapsed time and estimated time to finish
- `` Code vectors changes between iterations
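The input data format described in the parameter list above can be produced from a script. The following is a minimal sketch in Python, assuming only what this page states: a header line with the vector dimension and the number of vectors, then one space-separated vector per line with an optional trailing label. The function name `format_som_input` is hypothetical and is not part of the SOM program.

```python
def format_som_input(vectors, labels=None):
    """Return the text of an input data file in the format described above.

    Hypothetical helper, not part of classify_som.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    if labels is not None and len(labels) != len(vectors):
        raise ValueError("need exactly one label per vector")

    # Header line: dimension of the vectors, then number of vectors.
    lines = [f"{dim} {len(vectors)}"]
    for i, vec in enumerate(vectors):
        fields = [str(x) for x in vec]
        if labels is not None:
            fields.append(str(labels[i]))  # optional label in the last column
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


text = format_som_input(
    [[12, 34, 54], [-12, 45, 76], [32, 45, 76]],
    labels=["labelA", "labelB", "labelN"],
)
print(text, end="")
```

Writing the returned text to a file such as `test.dat` gives a file that can be passed as the input data file.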
Example 1: Maps a set of data stored in "test.dat" file into a 10x7 hexagonal map
$ classify_som -i test.dat -o test -xdim 10 -ydim 7 -hexa
In this case the following parameters are used (those not given on the command line take their default values):
Input data file : test.dat
Output file name : test
Horizontal dimension (Xdim) = 10
Vertical dimension (Ydim) = 7
Hexagonal topology
Gaussian neighborhood function
Initial learning rate (alpha) = 0.1
Initial neighborhood radius (radius) = 10
Total number of iterations = 10000
verbosity level = 0
Do not normalize input data
So, we are going to generate a 10x7 (`-xdim 10` and `-ydim 7`) output map using 10000 iterations (`-iter 10000`). A hexagonal topology is going to be used (`-hexa`). In this case a Gaussian neighborhood function is used (`-gaussian`) with an initial learning rate of 0.1 (`-alpha 0.1`) and an initial neighborhood radius of 10 (`-radius 10`). No textual information will be shown in the output console (`-verb 0`).
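To make the roles of the learning rate, the radius and the neighborhood function more concrete, here is a minimal sketch of a single Kohonen update step in Python. It rests on assumptions: this page only says that alpha and radius "decrease during training", so the linear schedule below is an assumption, and the kernel formulas are the textbook Gaussian and bubble definitions rather than code taken from this program. All function names are hypothetical.

```python
import math


def linear_decay(initial, step, total_steps):
    """Assumed schedule: shrink linearly from `initial` towards 0."""
    return initial * (1.0 - step / total_steps)


def gaussian_kernel(grid_dist, radius):
    """Textbook Gaussian neighborhood: weight falls off smoothly with distance."""
    return math.exp(-(grid_dist * grid_dist) / (2.0 * radius * radius))


def bubble_kernel(grid_dist, radius):
    """Textbook bubble neighborhood: full weight inside the radius, none outside."""
    return 1.0 if grid_dist <= radius else 0.0


def update(code_vector, sample, alpha, weight):
    """Move a code vector towards the sample by alpha * weight."""
    return [c + alpha * weight * (x - c) for c, x in zip(code_vector, sample)]


# Values from Example 1: alpha = 0.1, radius = 10, 10000 iterations.
step, total = 5000, 10000
alpha = linear_decay(0.1, step, total)
radius = linear_decay(10, step, total)
weight = gaussian_kernel(2.0, radius)  # a node 2 grid units from the winner
print(update([0.0, 0.0], [1.0, 2.0], alpha, weight))
```

With the bubble kernel every node inside the radius moves by the same fraction, while with the Gaussian kernel nodes further from the winner move less.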
As a result, the SOM application will generate the following output files:
-
test.cod
The final code vector file in the format described above
-
test.inf
Information file about the parameters used and the resulting quantization error
-
test.his
Information about the number of input vectors assigned to each code vector. It is like a histogram
-
test.err
Average quantization error for each code vector
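The code vector file can be read back for further processing. The sketch below, in Python, assumes the `.cod` layout described earlier on this page: a header line with dimension, topology, X and Y sizes and the kernel label, followed by one code vector per line with an optional label. The function name `parse_cod` and the returned dictionary keys are hypothetical.

```python
def parse_cod(text):
    """Parse the text of a .cod file into a header and a list of code vectors.

    Hypothetical helper, assuming the layout described on this page.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    dim = int(header[0])
    result = {
        "dim": dim,
        "topology": header[1],           # e.g. "rect"
        "xdim": int(header[2]),
        "ydim": int(header[3]),
        "kernel": header[4] if len(header) > 4 else None,
        "vectors": [],
        "labels": [],
    }
    for ln in lines[1:]:
        fields = ln.split()
        result["vectors"].append([float(x) for x in fields[:dim]])
        # Anything after the first `dim` fields is treated as the label.
        result["labels"].append(fields[dim] if len(fields) > dim else None)
    return result


sample = "3 rect 10 7 gaussian\n11 31 52 labelA\n-10 43 71 labelB\n"
cod = parse_cod(sample)
print(cod["topology"], cod["xdim"], cod["ydim"], len(cod["vectors"]))
```

For a real file, the text would come from something like `open("test.cod").read()`.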
Example 2: Maps a set of data stored in "test.dat" file into a 10x7 rectangular map with other initialization values
$ classify_som -i test.dat -o test -xdim 10 -ydim 7 -rect -alpha 0.5 -radius 5 -norm -bubble -verb 1 -saveclusters
In this case the following parameters are used:
Input data file : test.dat
Output file name : test
Horizontal dimension (Xdim) = 10
Vertical dimension (Ydim) = 7
Rectangular topology
Bubble neighborhood function
Initial learning rate (alpha) = 0.5
Initial neighborhood radius (radius) = 5
Total number of iterations = 10000
verbosity level = 1
Normalize input data
In this case we are going to generate a 10x7 (`-xdim 10` and `-ydim 7`) output map using 10000 iterations (`-iter 10000`). A rectangular topology is going to be used (`-rect`). A bubble neighborhood function is used (`-bubble`) with an initial learning rate of 0.5 (`-alpha 0.5`) and an initial neighborhood radius of 5 (`-radius 5`). A progress bar and the elapsed/estimated time will be shown in the output console (`-verb 1`). Since the `-saveclusters` parameter is used, a list of the input data assigned to each code vector is stored in the `test.0` to `test.69` files.
The following files are going to be generated:
-
test.cod
The final code vector file in the format described above
-
test.inf
Information file about the parameters used and the resulting quantization error
-
test.his
Information about the number of input vectors assigned to each code vector. It is like a histogram
-
test.err
Average quantization error for each code vector
-
test.0 to test.69
Each file is a list of the input data vectors assigned to the corresponding code vector
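The relationship between the cluster lists, the histogram and the per-code-vector error can be illustrated with a short Python sketch. It is only an illustration under assumptions: it assigns each input vector to its nearest code vector using Euclidean distance and averages that distance per code vector, which is the usual definition but is not confirmed by this page as the exact computation the program performs. The function name `assign_to_code_vectors` is hypothetical.

```python
import math


def assign_to_code_vectors(data, codes):
    """Return (clusters, avg_errors) for the given data and code vectors.

    clusters[k]   -> indexes of the input vectors closest to code vector k
    avg_errors[k] -> mean Euclidean distance of those vectors to code vector k
    Assumes Euclidean distance; hypothetical helper.
    """
    clusters = [[] for _ in codes]
    totals = [0.0 for _ in codes]
    for idx, vec in enumerate(data):
        dists = [math.dist(vec, code) for code in codes]
        best = min(range(len(codes)), key=dists.__getitem__)
        clusters[best].append(idx)
        totals[best] += dists[best]
    avg_errors = [t / len(c) if c else 0.0 for t, c in zip(totals, clusters)]
    return clusters, avg_errors


data = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
codes = [[0.0, 0.0], [10.0, 10.0]]
clusters, avg_errors = assign_to_code_vectors(data, codes)
histogram = [len(c) for c in clusters]  # counterpart of the .his columns
print(clusters, histogram, avg_errors)
```

Here `clusters` plays the role of the per-code-vector files, `histogram` the role of the `.his` file and `avg_errors` the role of the `.err` file.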
--Main.AlfredoSolano - 26 Jan 2007