Introduction

SVM-OOPS is an implementation of Vapnik's Support Vector Machine for binary classification. The software supports standard linear L1-loss support vector machines, using an exact reformulation that is better suited to interior point methods. See the papers at the bottom of this page for more details.
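For background, the underlying optimization problem is the standard linear L1-loss (hinge-loss) SVM training problem. The exact IPM-friendly reformulation used by SVM-OOPS is described in the papers referenced at the bottom of this page; the formulation below is only the usual primal starting point, with C the loss penalty parameter:

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w^{\top} x_i + b \right) \;\ge\; 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

Here each x_i is a feature vector, y_i its +1/-1 label, and the slack variables xi_i absorb margin violations penalized linearly (hence "L1-loss").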

It is based on techniques developed by Kristian Woodsend and Jacek Gondzio, and it is a central output of Kristian's PhD under Jacek's supervision.

The algorithm is designed for solving large-scale classification problems, where the number of samples greatly exceeds the number of features. Running on a single processor, problems involving tens or hundreds of thousands of support vectors can be trained efficiently, while larger problems can be solved using clustered computing. The software is based on the OOPS interior point solver.

Features

Download

The program makes use of the OOPS interior point solver, and the source for this is not openly available. Therefore the program is distributed here as a binary for Linux, suitable for running on a single PC with multicore processors. The program is free for academic use. Please see the References section for how to cite it.

SVM-OOPS is currently at version 0.5, released on 16 July 2009.

Download the tar file by clicking on the above link, and unpack the files. The tar file contains the binary program and a readme file.

If you are looking for a parallel version of the software, see the parallel section below.

How to use this program

The same program covers the training phase, the test phase, and conversion of data files to a binary format. Run the program from the command line, using the format:

svmoops [options] <training data file>

The program runs in training mode unless you use one of these flags in the options:

-convert
Convert the input data file into SVM-OOPS binary format
-test
Test the data file based on the model
-v
Show version and exit

By default, SVM-OOPS will expect its own binary format for the data file. Use these options to specify another supported format (see section on data file formats).

-sl
Input file is in SVMlight / LibSVM format
-sld
Input file is in SVMlight format but guaranteed to be dense
-st
Input file is in SVMTorch format

Training phase

Options that control SVM training are:

-c float
Value for loss penalty parameter. Default: 1
-cneg float
Multiplication factor for the loss penalty parameter C on negative labels. Default: 1
-n int
Number of data samples to use from the training file, if you do not want to use them all

Options that are useful during the training phase are:

-e float
Duality gap tolerance for terminating the optimization. Default: 1e-3
-it int
Maximum number of IPM iterations. Default: 100
-o filename
Filename for the model file. Default: <training file>.model
-omp int
Number of OMP threads. Default: as many as there are processor cores.
-sv
Write out the support vector dual variable values. Default: do not write out these values.

A typical command line to train on the file a1a (from the Adult data set) would be:

./svmoops -c 0.5 -sl a1a

The program will create a model file a1a.model in the same directory.

Test phase

During the test phase, use these command line options:

-test
Sets the program in test mode
-o filename
Filename for model
-to filename
Filename to write the predictions

The program expects test data files to be in the same format as training data files (see below). In particular, note that target labels are expected. They will be used to compare against predicted labels, and to give accuracy results.

To test the accuracy of the model a1a.model against the data in file a1a.t, the command line would be:

./svmoops -test -o a1a.model -sl a1a.t
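If you want to recompute accuracy yourself from a predictions file written with -to, a small script can do it. The sketch below is not part of SVM-OOPS; it assumes (the exact -to output format is not documented here) that the predictions file contains one predicted label per line, and it takes the true labels from the first token of each line of an SVMlight-format test file:

```python
def read_labels(path):
    """Read +1/-1 labels: the first whitespace-separated token of each
    non-empty, non-comment line. Works for SVMlight-format data files,
    and for a predictions file with one label per line."""
    labels = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            labels.append(int(line.split()[0]))
    return labels

def accuracy(true_labels, predicted):
    """Fraction of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(true_labels, predicted))
    return hits / len(true_labels)
```

For example, accuracy(read_labels("a1a.t"), read_labels("a1a.t.pred")) would compare the test targets against a hypothetical predictions file a1a.t.pred.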

Data file formats

SVM-OOPS supports data files in two text formats, and also its own native binary format.

SVMlight / LibSVM text format

You can use the file format of SVMlight and LibSVM for training and test files. Add the option -sl in the command line parameters. If the data is completely dense, it is slightly more efficient to use the option -sld.

The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value>
<target> .=. {+1,-1}
<feature> .=. <integer>
<value> .=. <float>

Unfortunately, this format does not state explicitly how many features are in the data; the program has to discover the number while reading through the file. It is therefore possible that a dataset will not exhibit the full set of features, whereas SVM-OOPS requires that the number of features in the dataset exactly matches the model. To force the number of features to a higher value, use the -sized option together with -m (number of features) and -n (number of samples). In the following example, we force the program to consider 123 features and 1605 samples.

./svmoops -sl -sized -m 123 -n 1605 a1a

This option is only really useful for the SVMlight/LibSVM file format. Additional features will be set to zero.
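Because the feature count must be discovered from the file itself, it can be useful to scan a dataset before training. The following Python sketch (not part of SVM-OOPS) reads a file in the format described above and reports the number of samples and the largest feature index seen, which you can compare against the value you pass to -m:

```python
def scan_svmlight(path):
    """Return (sample count, max feature index) of an SVMlight/LibSVM file.

    Lines starting with # are treated as comments; each remaining line is
    <target> <feature>:<value> ... with integer feature indices.
    """
    n_samples, max_feature = 0, 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            tokens = line.split()
            n_samples += 1                      # first token is the target
            for tok in tokens[1:]:
                feature, _value = tok.split(":")
                max_feature = max(max_feature, int(feature))
    return n_samples, max_feature
```

If scan_svmlight reports a smaller maximum feature index than the model expects, that is exactly the situation where -sized -m ... -n ... is needed.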

SVMTorch format

File format of SVMTorch is also supported. This format is suitable for dense data. Specify the option -st in the command line.

The first line contains the number of samples, some whitespace, then the number of tokens on each line (which will be one more than the number of features).

<first line>:<number of samples n> <number of features m + 1>

Each of the following lines represents one training sample and has the following format:

<line> .=. <feature 1 value> ... <feature m value> <target>
<feature value> .=. <float>
<target> .=. {+1,-1}

Feature values are separated by whitespace and must appear in the same order for every sample. The target value denotes the class of each sample.
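As an illustration of the layout described above, here is a short Python sketch (not part of SVM-OOPS; the function name is invented) that writes a dense dataset as an SVMTorch-format file:

```python
def write_svmtorch(path, X, y):
    """Write dense data in SVMTorch format.

    X is a list of feature rows (each a list of m floats); y holds the
    corresponding +1/-1 targets. The header line gives the number of
    samples and the number of tokens per line (features + 1 for the target).
    """
    n = len(X)
    m = len(X[0]) if X else 0
    with open(path, "w") as f:
        f.write(f"{n} {m + 1}\n")
        for row, target in zip(X, y):
            f.write(" ".join(str(v) for v in row) + f" {int(target)}\n")
```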

SVM-OOPS binary format

SVM-OOPS also supports a binary data format, which is much more efficient for reading in data than the text formats above. It makes sense to use this format if you are doing repeated training runs.

The training data is represented using two files: a .lab file containing the class labels, and a .dat file containing the feature data.

The .lab file is human-readable. The first line contains the number of samples, some whitespace, then the number of features.

<first line>:<number of samples n> <number of features m>

Each of the following lines contains the class label for one training sample:

<line> .=. <label>
<label> .=. {+1,-1}

The idea is that, for data originally labelled with multiple classes, the labels in this small file can be remapped to create binary classification problems without touching the much larger .dat file.
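As a sketch of that idea, the following Python snippet (not part of SVM-OOPS; the function name and signature are invented for illustration) writes a .lab file from a multi-class label list, mapping one chosen class to +1 and all others to -1. The feature count is passed in explicitly, since it must match the companion .dat file:

```python
def write_lab(path, labels, n_features, positive_class):
    """Write an SVM-OOPS .lab file for a one-vs-rest binary problem.

    Header line: <number of samples> <number of features>.
    Each following line is +1 (for positive_class) or -1 (all other classes).
    """
    with open(path, "w") as f:
        f.write(f"{len(labels)} {n_features}\n")
        for lab in labels:
            f.write("+1\n" if lab == positive_class else "-1\n")
```

Re-running write_lab with a different positive_class gives each one-vs-rest subproblem in turn, all sharing the same .dat file.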

The .dat file contains the feature data. Use the program SVM-OOPS to create a file in this format from one of the supported text formats. For example, to convert from a file in SVMlight format:

./svmoops -convert -sl a1a

The output will be the files <train_file>.dat and <train_file>.lab in the same directory.

To use the resulting files, call SVM-OOPS and provide it with the .lab file name as the training file. The program will look for the corresponding .dat file in the same directory. For example:

./svmoops a1a.lab

Parallel implementation

SVM-OOPS is designed to work on MPI and shared-memory parallel systems. The parallel version of SVM-OOPS is available for academic research, but because it may require access to the source code, and will almost certainly need more support, we are not making it available through anonymous download. If you want to use this version, please contact us so we can discuss terms and help you through the installation process.

Contact details

To contact the author, Kristian Woodsend, please email me at k.woodsend@ed.ac.uk.

Please contact me in particular if:

Even if all you do is download the software and find it useful, I'd be delighted to know. It will encourage me to continue developing it.

References

If you use SVM-OOPS in your published research, these are the relevant papers to cite.

If you use the serial version, please cite:

K. Woodsend and J. Gondzio. Exploiting separability in large-scale linear support vector machine training. Technical Report MS 07-002, School of Mathematics. Published in Computational Optimization and Applications.

For the parallel version of the software, please use:

K. Woodsend and J. Gondzio. Hybrid MPI/OpenMP parallel support vector machine training. Technical Report ERGO 09-001, School of Mathematics. Published in the Journal of Machine Learning Research.