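Cross-Validation: Python Application
This post covers the cross-validation functions in Sklearn:
- data splitting functions
- cross-validation functions
Some cross-validation functions are meant to be used together with hyperparameter-tuning methods such as grid search; those are not covered here. Package versions:
- Sklearn 0.22
- Pandas 0.23.4
- Numpy 1.17.4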
Data splitting
train_test_split
Randomly splits the data into two parts. Start with the help text:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score  # needed by the cross_val_score calls later in this post
help(train_test_split)
Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets

    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.

    train_size : float, int, or None, (default=None)
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.

    stratify : array-like or None (default=None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.

    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.

        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]

    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]

    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]
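The help text describes train_test_split as a wrapper around ``next(ShuffleSplit().split(X, y))``. The following is a minimal sketch that checks this on toy data. The array, the seed and the variable names are made up for illustration, and the comparison assumes the default shuffle=True with no stratify, so that both calls draw the same permutation from the same seed.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Toy data, purely illustrative
X = np.arange(20)

# One-liner split
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# The same split spelled out with ShuffleSplit (assumption: same seed and test_size)
splitter = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X))

print(np.array_equal(X_train, X[train_idx]))
print(np.array_equal(X_test, X[test_idx]))
```

ShuffleSplit returns index arrays rather than the data itself, which is what the cross-validation functions later in this post consume.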
Build the test data

data = datasets.load_iris()
irisDf = pd.DataFrame(data=data.data, columns=data.feature_names)
# Add the target column
irisDf["target"] = data.target
# Map the numeric codes to species names
irisDf["species"] = irisDf.target.map({0: "Setosa", 1: "Versicolour", 2: "Virginica"})
# Rename the columns
irisDf.columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "target", "species"]
irisDf.head()
|   | sepalLength | sepalWidth | petalLength | petalWidth | target | species |
|---|-------------|------------|-------------|------------|--------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 | Setosa |
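A brief explanation of the parameters:
- test_size: a float is the proportion of the data to put in the test set; an int is the absolute number of test samples
- random_state: the random seed, which makes the result reproducible
- stratify: an array-like; the split is done by stratified sampling on this array, usually to keep the class ratio (for example positive vs. negative samples) the same after the split. From the docs: "If not None, data is split in a stratified fashion, using this as the class labels."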
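A small sketch of the first two parameters on made-up data (the arrays and the seed value are assumptions for illustration only): a float and an int test_size can describe the same split size, and a fixed random_state returns the same split on every call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# float: proportion of the data that goes to the test set
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
# int: absolute number of test samples
_, X_test_abs, _, _ = train_test_split(X, y, test_size=5, random_state=0)
print(X_test_frac.shape, X_test_abs.shape)

# Same seed, same split
first = train_test_split(X, y, test_size=0.25, random_state=0)
second = train_test_split(X, y, test_size=0.25, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```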
X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("Class distribution in the training set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())

X_train.shape: (105, 4)
X_test.shape: (45, 4)
Class distribution in the training set:
1    40
2    33
0    32
Name: target, dtype: int64
Class distribution in the test set:
0    18
2    17
1    10
Name: target, dtype: int64
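Here the class distribution in the two subsets no longer matches the full dataset, where each class has 50 samples. The stratify parameter fixes this: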
X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
print("X_train.shape:", X_train.shape)
print("X_test.shape:", X_test.shape)
print("Class distribution in the training set:")
print(y_train.value_counts())
print("Class distribution in the test set:")
print(y_test.value_counts())

X_train.shape: (105, 4)
X_test.shape: (45, 4)
Class distribution in the training set:
2    35
1    35
0    35
Name: target, dtype: int64
Class distribution in the test set:
2    15
1    15
0    15
Name: target, dtype: int64
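The iris data is balanced, so the effect of stratify is mild there. It matters more with imbalanced labels. Below is a minimal sketch on made-up data with a 90/10 class ratio (the arrays and the seed are assumptions for illustration): with stratify=y the 10% positive rate is preserved in both subsets, while the plain random split may drift from it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 negatives, 10 positives
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Plain random split: the number of positives in the test set is left to chance
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Stratified split: each subset keeps the 90/10 ratio
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print("positives in test set, no stratify:", int(y_test_plain.sum()))
print("positives in test set, stratified: ", int(y_test_strat.sum()))
print("positives in train set, stratified:", int(y_train_strat.sum()))
```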
Cross-validation evaluation
cross_val_score
Again, start with the documentation:
help(cross_val_score)
Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.

    X : array-like
        The data to fit. Can be for example a list, or an array.

    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).

    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)`` which should return only
        a single value.

        Similar to :func:`cross_validate`
        but only a single metric is permitted.

        If None, the estimator's default scorer (if available) is used.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

        - None, to use the default 5-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validation strategies that can be used here.

        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.

    n_jobs : int or None, optional (default=None)
        The number of CPUs to use to do the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
        for more details.

    verbose : integer, optional
        The verbosity level.

    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.

    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:

            - None, in which case all the jobs are immediately
              created and spawned. Use this for lightweight and
              fast-running jobs, to avoid delays due to on-demand
              spawning of the jobs

            - An int, giving the exact number of total jobs that are
              spawned

            - A string, giving an expression as a function of n_jobs,
              as in '2*n_jobs'

    error_score : 'raise' or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised.
        If a numeric value is given, FitFailedWarning is raised. This parameter
        does not affect the refit step, which will always raise the error.

    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.

    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]

    See Also
    ---------
    :func:`sklearn.model_selection.cross_validate`:
        To run cross-validation on multiple metrics and also to return
        train scores, fit times and score times.

    :func:`sklearn.model_selection.cross_val_predict`:
        Get predictions from each split of cross-validation for diagnostic
        purposes.

    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.
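Main parameters:
- X, y: the feature data and the target variable
- scoring: the model evaluation metric, e.g. "roc_auc", "accuracy", "f1"
- cv: the cross-validation splitting strategy; see KFold for details
- groups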
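According to the help text, groups only has an effect together with a "Group" splitter such as GroupKFold. A minimal sketch on synthetic data follows; the data, the six group ids and the fold count are all hypothetical, chosen only to show that samples sharing a group id never end up on both sides of a split.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 4))          # synthetic features
y = np.tile([0, 1], 30)               # alternating labels, both classes in every group
groups = np.repeat(np.arange(6), 10)  # 6 hypothetical groups of 10 samples each

cv = GroupKFold(n_splits=3)

# Check that no group appears in both the train and the test part of a fold
no_overlap = all(
    not (set(groups[train_idx]) & set(groups[test_idx]))
    for train_idx, test_idx in cv.split(X, y, groups)
)
print(no_overlap)

clf = svm.SVC(kernel="linear", C=1)
scores = cross_val_score(clf, X, y, groups=groups, cv=cv)
print(scores.shape)
```

Since the features here are random noise, the scores themselves carry no meaning; only the splitting behaviour is of interest.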
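There seems to be an open question here:
- How is reproducibility guaranteed? The function does not appear to take a random seed parameter.
- Is the seed meant to be set on the cv generator that is passed in?
- Looking at the source, StratifiedKFold and KFold are created internally, so no random seed is specified.
- Other people's code sometimes sets a global random seed; maybe that works?
- Focusing on the big picture for now; this question can be reviewed later.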
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores

array([0.96658312, 1.        , 0.96658312, 0.96658312, 1.        ])
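One way to approach the reproducibility question raised above, offered as a sketch rather than a definitive answer: KFold and StratifiedKFold default to shuffle=False, so with an integer cv the folds should already be deterministic. If shuffled folds are wanted, a splitter can be built explicitly with shuffle=True and a fixed random_state and passed as cv. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# Integer cv: the internal splitter does not shuffle, so repeated calls agree
plain_a = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
plain_b = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(np.array_equal(plain_a, plain_b))

# Shuffled folds with a fixed seed set on the splitter itself
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shuffled_a = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
shuffled_b = cross_val_score(clf, X, y, cv=cv, scoring="f1_macro")
print(np.array_equal(shuffled_a, shuffled_b))
```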
cross_validate
How this function differs from cross_val_score:
- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores

{'fit_time': array([0.00199509, 0.0009973 , 0.00200558, 0.00099754, 0.0009973 ]),
 'score_time': array([0.00099778, 0.00099826, 0.0019834 , 0.00199461, 0.00199437]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
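The optional training scores and fitted estimators mentioned above are switched on with return_train_score and return_estimator. A minimal sketch follows; collecting the per-fold arrays into a pandas DataFrame is just one convenient way to summarise them, not something the library requires.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1, random_state=0)

results = cross_validate(
    clf, X, y, cv=5,
    scoring=["precision_macro", "recall_macro"],
    return_train_score=True,   # adds train_<metric> entries
    return_estimator=True,     # adds the estimator fitted on each fold
)
print(sorted(results.keys()))

# Per-fold arrays (everything except the estimators) averaged over the folds
per_fold = pd.DataFrame({k: v for k, v in results.items() if k != "estimator"})
summary = per_fold.mean()
print(summary)
```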
2020-03-01, Qixia District, Nanjing