SVM-OOPS is an implementation of Vapnik's Support Vector Machine for the problem of binary classification. The software supports the standard linear L1-loss support vector machines, using an exact reformulation that is better suited to interior point methods. See the papers at the bottom of this page for more details.
It is based on techniques developed by Kristian Woodsend and Jacek Gondzio, and it is a central output of Kristian's PhD under Jacek's supervision.
The algorithm is designed for large-scale classification problems in which the number of samples greatly exceeds the number of features. Running on a single processor, it can efficiently train on problems involving tens or hundreds of thousands of support vectors, while larger problems can be solved using clustered computing.
SVM-OOPS relies on the OOPS interior point solver, whose source is not openly available. The program is therefore distributed here as a binary for Linux, suitable for running on a single PC with multicore processors. The program is free for academic use. Please see the References section for how to cite it.
SVM-OOPS is currently at version 0.5, released on 16 July 2009.
Download the tar file by clicking on the above link, and unpack the files. The tar file contains the binary program and a readme file.
If you are looking for a parallel version of the software, see the parallel section below.
The same program handles the training phase, the test phase, and the conversion of data files to a binary format. Run the program from the command line, using the format:
svmoops [options] <training data file>
The program will be in training mode unless you use one of these flags in the options:
By default, SVM-OOPS will expect its own binary format for the data file. Use these options to specify another supported format (see section on data file formats).
Options that control SVM training are:
Options that are useful during the training phase are:
<training file>.model
A typical command line to train on the file a1a (from the Adult data set) would be:
./svmoops -c 0.5 -sl a1a
The program will create a model file a1a.model in the same directory.
During the test phase, use these command line options:
The program expects test data files to be in the same format as training data files (see below). In particular, note that target labels are expected. They will be used to compare against predicted labels, and to give accuracy results.
To test the accuracy of the model a1a.model against the data in file a1a.t, the command line would be:
./svmoops -test -o a1a.model -sl a1a.t
SVM-OOPS supports data files in two text formats, and also its own native binary format.
You can use the file format of SVMlight and LibSVM for training and test files. Add the option -sl in the command line parameters. If the data is completely dense, it is slightly more efficient to use the option -sld.
The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target> .=. {+1,-1}
<feature> .=. <integer>
<value> .=. <float>
Unfortunately, this format does not explicitly state how many features are in the data, so the program has to discover the number while reading through the file. A dataset may not contain the full set of features (for example, when the highest-numbered features never appear in a test file), yet SVM-OOPS requires that the number of features in the dataset exactly matches the model. To force the number of features to a higher value, use the -sized option and specify the number of features with -m and the number of samples with -n. In this example, we force the program to consider 123 features and 1605 samples:
./svmoops -sl -sized -m 123 -n 1605 a1a
This option is only really useful for the SVMlight/LibSVM file format. Additional features will be set to zero.
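The feature-count issue can be seen by parsing the format directly. The sketch below is an independent illustration in Python, not the SVM-OOPS reader; the function names are hypothetical, and it assumes feature indices start at 1 and that any line beginning with # is a comment.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, Dict[int, float]]


def parse_svmlight_line(line: str) -> Sample:
    """Parse '<target> <feature>:<value> ...' into (target, {feature: value})."""
    tokens = line.split()
    target = int(float(tokens[0]))  # accepts '+1', '-1', '1.0'
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return target, features


def read_svmlight(text: str) -> Tuple[List[Sample], int]:
    """Return the samples and the highest feature index seen in the text."""
    samples: List[Sample] = []
    max_feature = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        target, features = parse_svmlight_line(line)
        samples.append((target, features))
        if features:
            max_feature = max(max_feature, max(features))
    return samples, max_feature


if __name__ == "__main__":
    train = "# toy data\n+1 3:1 11:0.5\n-1 2:1\n"
    test = "-1 2:1 5:0.25\n"
    print(read_svmlight(train)[1])  # highest index seen in the training text
    print(read_svmlight(test)[1])   # highest index seen in the test text
```

In the toy data, the training text reaches feature 11 while the test text only reaches feature 5, which is the kind of mismatch that -sized is meant to resolve.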
The SVMTorch file format is also supported. This format is suitable for dense data. Specify the option -st on the command line.
The first line contains the number of samples, some whitespace, then the number of tokens on each line (which will be one more than the number of features).
<first line> .=. <number of samples n> <number of features m + 1>
Each of the following lines represents one training sample and has the following format:
<line> .=. <feature 1 value> ... <feature m value> <target>
<feature value> .=. <float>
<target> .=. {+1,-1}
Feature values are separated by whitespace. Order matters: the i-th value on a line is the value of feature i. The target value, which comes last, denotes the class of each sample.
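As an illustration of the layout just described, the following sketch writes dense data in this format. The function name `format_svmtorch` is hypothetical, and the sketch assumes that targets are the integers +1 or -1 and that every row has the same number of features.

```python
from typing import List, Sequence


def format_svmtorch(rows: Sequence[Sequence[float]], targets: Sequence[int]) -> str:
    """Render dense samples as text: a header line, then one line per sample."""
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    n = len(rows)
    m = len(rows[0]) if rows else 0
    # Header: number of samples, then tokens per line (features plus the target).
    lines: List[str] = [f"{n} {m + 1}"]
    for row, target in zip(rows, targets):
        if len(row) != m:
            raise ValueError("all rows must have the same number of features")
        if target not in (1, -1):
            raise ValueError("targets must be +1 or -1")
        values = " ".join(repr(float(v)) for v in row)
        lines.append(f"{values} {target:+d}")  # target is written last
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_svmtorch([[0.5, 1.0], [2.0, 0.0]], [1, -1]), end="")
```

With two samples of two features each, the header line reads "2 3": two samples, three tokens per line.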
SVM-OOPS also supports a binary data format, which is much more efficient for reading in data than the text formats above. It makes sense to use this format if you are doing repeated training runs.
The training data is represented using two files: <train_file>.dat containing the feature data, and <train_file>.lab containing labels.
The .lab file is human-readable. The first line contains the number of samples, some whitespace, then the number of features.
<first line> .=. <number of samples n> <number of features m>
Each of the following lines contains the class label for one training sample:
<line> .=. <label>
<label> .=. {+1,-1}
Keeping the labels in a separate file means that, for data originally in multiple classes, the labels can be reworked to create binary classification problems without rewriting the feature data.
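Reworking the labels in this way can be sketched as follows. The helper `one_vs_rest_lab` is hypothetical; it assumes the .lab layout described above (a header line followed by one label per line) and produces the text for a one-against-the-rest problem, leaving the .dat file untouched.

```python
from typing import Hashable, List, Sequence


def one_vs_rest_lab(class_labels: Sequence[Hashable],
                    positive_class: Hashable,
                    num_features: int) -> str:
    """Build .lab text where positive_class maps to +1 and all others to -1."""
    # Header: number of samples, then number of features.
    lines: List[str] = [f"{len(class_labels)} {num_features}"]
    for label in class_labels:
        lines.append("+1" if label == positive_class else "-1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Four samples from a three-class problem; class 2 becomes the positive class.
    print(one_vs_rest_lab([0, 2, 1, 2], positive_class=2, num_features=123), end="")
```

Writing one such file per class gives a set of binary problems that all share the same feature data.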
The .dat file contains the feature data. Use the program SVM-OOPS to create a file in this format from one of the supported text formats. For example, to convert from a file in SVMlight format:
./svmoops -convert -sl a1a
The output will be the files <train_file>.dat and <train_file>.lab in the same directory.
To use the resulting files, call SVM-OOPS and provide it with the .lab file name as the training file. The program will look for the corresponding .dat file in the same directory. For example:
./svmoops a1a.lab
SVM-OOPS is designed to work on MPI and shared-memory parallel systems. The parallel version of SVM-OOPS is available for academic research, but as it may require the source code, and it will almost certainly need more support, we are not making it available through anonymous download. If you want to use this version, please contact us so we can discuss terms and help you through the installation process.
To contact the author, Kristian Woodsend, please email k.woodsend@ed.ac.uk.
Please contact me in particular if:
Even if all you do is download the software and find it useful, then I'd be delighted to know. It'll encourage me to continue developing it.
If you use SVM-OOPS in your published research, these are the relevant papers to cite.
If you use the serial version, please cite:
For the parallel version of the software, please use: