Class StringKernel
java.lang.Object
weka.classifiers.functions.supportVector.Kernel
weka.classifiers.functions.supportVector.StringKernel
- All Implemented Interfaces:
Serializable
,CapabilitiesHandler
,OptionHandler
,RevisionHandler
,TechnicalInformationHandler
Implementation of the subsequence kernel (SSK) as
described in [1] and of the subsequence kernel with lambda pruning (SSK-LP)
as described in [2].
For more information, see
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of Machine Learning Research. 2:419-444.
F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA. Wien, Austria. BibTeX:
Lambda pruning is parametrized by the maximum lambda exponent. It is a good idea to choose that value to be about 3 or 4 times the subsequence length as a rule of thumb. YMMV.
For details and qualitative experiments about SSK, see [1]
For details about lambda pruning and performance comparison of SSK and SSK-LP (SSK with lambda pruning), see [2] Note that the complexity estimation in [2] assumes no caching of intermediate results, which has been implemented in the meantime and greatly improves the speed of the SSK without lambda pruning.
For more information, see
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of Machine Learning Research. 2:419-444.
F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA. Wien, Austria. BibTeX:
@article{Lodhi2002, author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins}, journal = {Journal of Machine Learning Research}, pages = {419-444}, title = {Text Classification using String Kernels}, volume = {2}, year = {2002}, HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html} } @techreport{Kleedorfer2005, address = {Wien, Austria}, author = {F. Kleedorfer and A. Seewald}, institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence}, number = {TR-2005-13}, title = {Implementation of a String Kernel for WEKA}, year = {2005} }Valid options are:
-D Enables debugging output (if available) to be printed. (default: off)
-P <0|1> The pruning method to use: 0 = No pruning 1 = Lambda pruning (default: 0)
-C <num> The size of the cache (a prime number). (default: 250007)
-IC <num> The size of the internal cache (a prime number). (default: 200003)
-L <num> The lambda constant. Penalizes non-continuous subsequence matches. Must be in (0,1). (default: 0.5)
-ssl <num> The length of the subsequence. (default: 3)
-ssl-max <num> The maximum length of the subsequence. (default: 9)
-N Use normalization. (default: no)
Theory
Overview
The algorithm computes a measure of similarity between two texts based on the number and form of their common subsequences, which need not be contiguous. This method can be parametrized by specifying the subsequence length k, the penalty factor lambda, which penalizes non-contiguous matches, and optional 'lambda pruning', which takes maxLambdaExponent,m
, as
parameter. Lambda pruning causes very 'stretched' substring matches not to be
counted, thus speeding up the computation. The functionality of SSK and
SSK-LP is explained in the following using simple examples.
Explanation & Examples
for all of the following examples, we assume these parameter values:k=2 lambda=0.5 m=8 (for SSK-LP examples)
SSK
Example 1
SSK(2,"ab","axb")=0.5^5 = 0,03125There is one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is computed by raising lambda to the power of L, where L is the length of the subsequence match in the one string plus the length of the subsequence match in the other, in our case:
ab axb L= 2 + 3 = 5hence, the kernel yields 0.5^5 = 0,03125
Example 2
SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0,09375Here, we also have one subsequence of the length of 2 that both strings have in common, "ab". The result of SSK is actually computed by summing over all values computed for each occurrence of a common subsequence match. In this example, there are two possible cases:
ab abb -- -- L=4 -- - - L=5we have two matches, one of the length of 2+2=4, one of the length of 2+3=5, so we get the result 0.5^5 + 0.5^4 = 0,09375.
SSK-LP
Without lambda pruning, the string kernel finds *all* common subsequences of the given length, whereas with lambda pruning, common subsequence matches that are too much stretched in both strings are not taken into account. It is argued that the value yielded for such a common subsequence is too low (lambda ^(length[match_in_s] + length[match_in_t]
) . Tests have
shown that a tremendous speedup can be achieved using this technique while
suffering from very little quality loss. Lambda pruning is parametrized by the maximum lambda exponent. It is a good idea to choose that value to be about 3 or 4 times the subsequence length as a rule of thumb. YMMV.
Example 3
Without lambda pruning, one common subsequence, "AB" would be found in the following two strings. (With k=2)SSK(2,"ab","axb")=0.5^14 = 0,00006103515625lambda pruning allows for the control of the match length. So, if m (the maximum lambda exponent) is e.g. 8, these two strings would yield a kernel value of 0:
with lambda pruning: SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0 without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0,00006103515625This is because the exponent for lambda (=the length of the subsequence match) would be 14, which is > 8. In Contrast, the next result is > 0
m=8 SSK-LP(2,8,"AxxB","AyyB")=0.5^8 = 0,00390625because the lambda exponent would be 8, which is just accepted by lambda pruning.
Normalization
When the string kernel is used for its main purpose, as the kernel of a support vector machine, it is not normalized. The normalized kernel can be switched on by -F (feature space normalization) but is much slower. Like most unnormalized kernels, K(x,x) is not a fixed value, see the next example.Example 4
SSK(2,"ab","ab")=0.5^4 = 0.0625 SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478SSK is evaluated twice, each time for two identical strings. A good measure of similarity would produce the same value in both cases, which should indicate the same level of similarity. The value of the normalized SSK would be 1.0 in both cases. So for the purpose of computing string similarity the normalized kernel should be used. For SVM the unnormalized kernel is usually sufficient.
Complexity of SSK and SSK-LP
The time complexity of this method (without lambda pruning and with an infinitely large cache) isO(k*|s|*|t|)Lambda Pruning has a complexity (without caching) of
O(m*binom(m,k)^2*(|s|+n)*|t|)
k... subsequence length (ssl) s,t... strings |s|... length of string s binom(x,y)... binomial coefficient (x!/[(x-y)!y!]) m... maxLambdaExponent (ssl-max)Keep in mind that execution time can increase fast for long strings and big values for k, especially if you don't use lambda pruning. With lambda pruning, computation is usually so fast that switching on the cache leads to slower computation because of setup costs. Therefore caching is switched off for lambda pruning.
For details and qualitative experiments about SSK, see [1]
For details about lambda pruning and performance comparison of SSK and SSK-LP (SSK with lambda pruning), see [2] Note that the complexity estimation in [2] assumes no caching of intermediate results, which has been implemented in the meantime and greatly improves the speed of the SSK without lambda pruning.
Notes for usage within Weka
Only instances of the following form can be processed using string kernels:+----------+-------------+---------------+ |attribute#| 0 | 1 | +----------+-------------+---------------+ | content | [text data] | [class label] | +----------------------------------------+ ... or ... +----------+---------------+-------------+ |attribute#| 0 | 1 | +----------+---------------+-------------+ | content | [class label] | [text data] | +----------------------------------------+
- Version:
- $Revision: 14512 $
- Author:
- Florian Kleedorfer (kleedorfer@austria.fm), Alexander K. Seewald (alex@seewald.at)
- See Also:
-
Field Summary
Modifier and TypeFieldDescriptionstatic final int
Pruning method: Lambda See [2] for details.static final int
Pruning method: No Pruningstatic final Tag[]
Pruning methods -
Constructor Summary
ConstructorDescriptiondefault constructorStringKernel
(Instances data, int cacheSize, int subsequenceLength, double lambda, boolean debug) creates a new StringKernel object. -
Method Summary
Modifier and TypeMethodDescriptionvoid
buildKernel
(Instances data) builds the kernel with the given data.Returns the tip text for this propertyvoid
clean()
Frees the memory used by the kernel.double
Computes the result of the kernel function for two instances.int
Gets the size of the cacheReturns the Capabilities of this kernel.int
Gets the size of the internal cachedouble
Gets the lambda constant used in the string kernelint
Returns the maximum length of the subsequenceString[]
Gets the current settings of the Kernel.Gets the method used for pruning.Returns the revision string.int
Returns the length of the subsequenceReturns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.boolean
Returns whether normalization is used.Returns a string describing the kernelReturns the tip text for this propertyReturns the tip text for this propertyReturns an enumeration describing the available options.Returns the tip text for this propertydouble
normalizedKernel
(char[] s, char[] t) evaluates the normalized kernel between s and t.int
Returns the number of dot product cache hits.int
numEvals()
Returns the number of kernel evaluation performed.Returns the tip text for this propertyvoid
setCacheSize
(int value) Sets the size of the cache to use (a prime number)void
setInternalCacheSize
(int value) sets the size of the internal cache for intermediate results.void
setLambda
(double value) Sets the lambda constant used in the string kernelvoid
setMaxSubsequenceLength
(int value) Sets the maximum length of the subsequence.void
setOptions
(String[] options) Parses a given list of options.void
setPruningMethod
(SelectedTag value) Sets the method used to for pruning.void
setSubsequenceLength
(int value) Sets the length of the subsequence.void
setUseNormalization
(boolean value) Sets whether to use normalization.Returns the tip text for this propertydouble
unnormalizedKernel
(char[] s, char[] t) evaluates the unnormalized kernel between s and t.Returns the tip text for this propertyMethods inherited from class weka.classifiers.functions.supportVector.Kernel
debugTipText, forName, getChecksTurnedOff, getDebug, getDoNotCheckCapabilities, makeCopies, makeCopy, setChecksTurnedOff, setDebug, setDoNotCheckCapabilities
-
Field Details
-
PRUNING_NONE
public static final int PRUNING_NONEPruning method: No Pruning- See Also:
-
PRUNING_LAMBDA
public static final int PRUNING_LAMBDAPruning method: Lambda See [2] for details.- See Also:
-
TAGS_PRUNING
Pruning methods
-
-
Constructor Details
-
StringKernel
public StringKernel()default constructor -
StringKernel
public StringKernel(Instances data, int cacheSize, int subsequenceLength, double lambda, boolean debug) throws Exception creates a new StringKernel object. Initializes the kernel cache and the 'lambda cache', i.e. the precalculated powers of lambda from lambda^2 to lambda^MAX_POWER_OF_LAMBDA- Parameters:
data
- the dataset to usecacheSize
- the size of the cachesubsequenceLength
- the subsequence lengthlambda
- the lambda valuedebug
- whether to output debug information- Throws:
Exception
- if something goes wrong
-
-
Method Details
-
globalInfo
Returns a string describing the kernel- Specified by:
globalInfo
in classKernel
- Returns:
- a description suitable for displaying in the explorer/experimenter gui
-
getTechnicalInformation
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.- Specified by:
getTechnicalInformation
in interfaceTechnicalInformationHandler
- Returns:
- the technical information about this class
-
listOptions
Returns an enumeration describing the available options.- Specified by:
listOptions
in interfaceOptionHandler
- Overrides:
listOptions
in classKernel
- Returns:
- an enumeration of all the available options.
-
setOptions
Parses a given list of options. Valid options are:-D Enables debugging output (if available) to be printed. (default: off)
-P <0|1> The pruning method to use: 0 = No pruning 1 = Lambda pruning (default: 0)
-C <num> The size of the cache (a prime number). (default: 250007)
-IC <num> The size of the internal cache (a prime number). (default: 200003)
-L <num> The lambda constant. Penalizes non-continuous subsequence matches. Must be in (0,1). (default: 0.5)
-ssl <num> The length of the subsequence. (default: 3)
-ssl-max <num> The maximum length of the subsequence. (default: 9)
-N Use normalization. (default: no)
- Specified by:
setOptions
in interfaceOptionHandler
- Overrides:
setOptions
in classKernel
- Parameters:
options
- the list of options as an array of strings- Throws:
Exception
- if an option is not supported
-
getOptions
Gets the current settings of the Kernel.- Specified by:
getOptions
in interfaceOptionHandler
- Overrides:
getOptions
in classKernel
- Returns:
- an array of strings suitable for passing to setOptions
-
pruningMethodTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setPruningMethod
Sets the method used to for pruning.- Parameters:
value
- the pruning method to use.
-
getPruningMethod
Gets the method used for pruning.- Returns:
- the pruning method to use.
-
setCacheSize
public void setCacheSize(int value) Sets the size of the cache to use (a prime number)- Parameters:
value
- the size of the cache
-
getCacheSize
public int getCacheSize()Gets the size of the cache- Returns:
- the cache size
-
cacheSizeTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setInternalCacheSize
public void setInternalCacheSize(int value) sets the size of the internal cache for intermediate results. Memory consumption is about 16x this amount in bytes. Only use when lambda pruning is switched off.- Parameters:
value
- the size of the internal cache
-
getInternalCacheSize
public int getInternalCacheSize()Gets the size of the internal cache- Returns:
- the cache size
-
internalCacheSizeTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setSubsequenceLength
public void setSubsequenceLength(int value) Sets the length of the subsequence.- Parameters:
value
- the length
-
getSubsequenceLength
public int getSubsequenceLength()Returns the length of the subsequence- Returns:
- the length
-
subsequenceLengthTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setMaxSubsequenceLength
public void setMaxSubsequenceLength(int value) Sets the maximum length of the subsequence.- Parameters:
value
- the maximum length
-
getMaxSubsequenceLength
public int getMaxSubsequenceLength()Returns the maximum length of the subsequence- Returns:
- the maximum length
-
maxSubsequenceLengthTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setLambda
public void setLambda(double value) Sets the lambda constant used in the string kernel- Parameters:
value
- the lambda value to use
-
getLambda
public double getLambda()Gets the lambda constant used in the string kernel- Returns:
- the current lambda constant
-
lambdaTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
setUseNormalization
public void setUseNormalization(boolean value) Sets whether to use normalization. Each time this value is changed, the kernel cache is cleared.- Parameters:
value
- whether to use normalization
-
getUseNormalization
public boolean getUseNormalization()Returns whether normalization is used.- Returns:
- true if normalization is used
-
useNormalizationTipText
Returns the tip text for this property- Returns:
- tip text for this property suitable for displaying in the explorer/experimenter gui
-
eval
Computes the result of the kernel function for two instances. If id1 == -1, eval use inst1 instead of an instance in the dataset.- Specified by:
eval
in classKernel
- Parameters:
id1
- the index of the first instance in the datasetid2
- the index of the second instance in the datasetinst1
- the instance corresponding to id1 (used if id1 == -1)- Returns:
- the result of the kernel function
- Throws:
Exception
- if something goes wrong
-
clean
public void clean()Frees the memory used by the kernel. (Useful with kernels which use cache.) This function is called when the training is done. i.e. after that, eval will be called with id1 == -1. -
numEvals
public int numEvals()Returns the number of kernel evaluation performed. -
numCacheHits
public int numCacheHits()Returns the number of dot product cache hits.- Specified by:
numCacheHits
in classKernel
- Returns:
- the number of dot product cache hits, or -1 if not supported by this kernel.
-
normalizedKernel
public double normalizedKernel(char[] s, char[] t) evaluates the normalized kernel between s and t. See [1] for details about the normalized SSK.- Parameters:
s
- first input stringt
- second input string- Returns:
- a double indicating their distance, or similarity
-
unnormalizedKernel
public double unnormalizedKernel(char[] s, char[] t) evaluates the unnormalized kernel between s and t. See [1] for details about the unnormalized SSK.- Parameters:
s
- first input stringt
- second input string- Returns:
- a double indicating their distance, or similarity
-
getCapabilities
Returns the Capabilities of this kernel.- Specified by:
getCapabilities
in interfaceCapabilitiesHandler
- Overrides:
getCapabilities
in classKernel
- Returns:
- the capabilities of this object
- See Also:
-
buildKernel
builds the kernel with the given data.- Overrides:
buildKernel
in classKernel
- Parameters:
data
- the data to base the kernel on- Throws:
Exception
- if something goes wrong, e.g., the data does not consist of one string attribute and the class
-
getRevision
Returns the revision string.- Specified by:
getRevision
in interfaceRevisionHandler
- Overrides:
getRevision
in classKernel
- Returns:
- the revision
-