ValueError using recursive feature elimination for SVM with rbf kernel in scikit-learn


I'm trying to use recursive feature elimination (RFE) in scikit-learn but keep getting the error ValueError: coef_ is only available when using a linear kernel. I want to perform feature selection for a support vector classifier (SVC) with an RBF kernel. This example from the website executes fine:

print(__doc__)

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                       n_redundant=2, n_repeated=0, n_classes=8,
                       n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
          scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (accuracy)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()

However, simply changing the kernel type from linear to rbf, as follows, produces the error:

print(__doc__)

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification
from sklearn.metrics import zero_one_loss

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                       n_redundant=2, n_repeated=0, n_classes=8,
                       n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="rbf")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2),
          scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Number of features selected")
pl.ylabel("Cross validation score (accuracy)")
pl.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
pl.show()

This seems like it could be a bug, but if anyone can spot something I'm doing wrong, that would be great. I'm running Python 2.7.6 with scikit-learn version 0.14.1.

Thanks for the help!


Answers:


This seems to be the expected outcome. RFECV requires the estimator to expose a coef_ attribute, which it uses as the feature importances:

estimator : object

A supervised learning estimator with a fit method that updates a coef_ attribute that holds the fitted parameters. Important features must correspond to high absolute values in the coef_ array.
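To make the role of coef_ concrete, here is a rough sketch of a single elimination step done by hand. It is not scikit-learn's actual implementation; the small binary dataset and the squared-weight score are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small illustrative dataset (an assumption, not the one from the question)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=0)

# A linear kernel exposes one weight per feature in coef_
svc = SVC(kernel="linear").fit(X, y)

# coef_ has shape (1, n_features) for this binary problem; squaring and
# summing over rows gives one non-negative score per feature
importances = (svc.coef_ ** 2).sum(axis=0)

# One elimination step: drop the feature with the smallest score
weakest = int(np.argmin(importances))
X_reduced = np.delete(X, weakest, axis=1)

print("Dropping feature %d" % weakest)
print("Remaining shape: %s" % (X_reduced.shape,))
```

RFE repeats this kind of step, refitting on the reduced data each time, which is why it needs an estimator that reports per-feature weights.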

With an RBF kernel, the SVC is no longer a linear model, so the coef_ attribute is unavailable, according to the documentation:

coef_

array, shape = [n_class-1, n_features]

Weights assigned to the features (coefficients in the primal problem). This is only available in the case of linear kernel.
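A quick way to check this yourself is to fit the same data with both kernels and try to read coef_. The helper has_coef below is a hypothetical name introduced for this sketch. It catches both ValueError (what 0.14.1 reports in the question) and AttributeError (which, as far as I know, more recent releases raise instead).

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small illustrative dataset (an assumption, not the one from the question)
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=0)


def has_coef(estimator):
    """Return True if the fitted estimator exposes coef_ (hypothetical helper)."""
    try:
        estimator.coef_
    except (AttributeError, ValueError):
        # The exception type depends on the scikit-learn version
        return False
    return True


results = {}
for kernel in ("linear", "rbf"):
    svc = SVC(kernel=kernel).fit(X, y)
    results[kernel] = has_coef(svc)
    print("%s kernel exposes coef_: %s" % (kernel, results[kernel]))
```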

The error is raised by SVC (source) when RFECV tries to access coef_ and the kernel is not linear.
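If you still want an RBF classifier, one possible workaround is to let a linear SVC drive the elimination and then train the RBF SVC on the features it keeps. This is only a sketch under stated assumptions: it uses the sklearn.model_selection and sklearn.pipeline modules from newer scikit-learn releases (not the sklearn.cross_validation API of 0.14.1), a smaller dataset than the one in the question, and step names ("select", "clf") made up for the example. Features ranked highly by a linear model are not guaranteed to be the best ones for an RBF model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small illustrative dataset (an assumption, not the one from the question)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=2)

# The linear SVC provides coef_ for the elimination; the RBF SVC is only
# trained on the features that survive it
pipe = Pipeline([
    ("select", RFECV(estimator=SVC(kernel="linear"), step=1, cv=cv,
                     scoring="accuracy")),
    ("clf", SVC(kernel="rbf")),
])

# Selection runs inside each outer fold, so the score is not inflated by
# choosing features on the held-out data
scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")

# Refit on all data to see how many features the selector keeps
pipe.fit(X, y)
n_selected = pipe.named_steps["select"].n_features_

print("Features kept by the linear selector: %d" % n_selected)
print("RBF accuracy per fold: %s" % scores)
```

Wrapping both steps in a pipeline keeps the feature selection inside the cross-validation loop, at the cost of extra fitting time.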