Cross-Validation: Python Applications - FinClip Official Site


Reader contribution 771 2022-09-02


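This post introduces the cross-validation functions in Sklearn:

Data-splitting functions

Cross-validation functions

Some cross-validation functions are meant to be used together with hyperparameter tuning methods such as grid search; those are not covered here. Package versions:

Sklearn 0.22

Pandas 0.23.4

Numpy 1.17.4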

Data splitting

train_test_split

This function randomly splits the data into two parts. Start with its help output:

import pandas as pd
import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets

    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.

    Read more in the :ref:`User Guide `.

    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.

    test_size : float, int or None, optional (default=None)
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also None, it will
        be set to 0.25.

    train_size : float, int, or None, (default=None)
        If float, should be between 0.0 and 1.0 and represent the
        proportion of the dataset to include in the train split. If
        int, represents the absolute number of train samples. If None,
        the value is automatically set to the complement of the test size.

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.

    shuffle : boolean, optional (default=True)
        Whether or not to shuffle the data before splitting. If shuffle=False
        then stratify must be None.

    stratify : array-like or None (default=None)
        If not None, data is split in a stratified fashion, using this as
        the class labels.

    Returns
    -------
    splitting : list, length=2 * len(arrays)
        List containing train-test split of inputs.

        .. versionadded:: 0.16
            If the input is sparse, the output will be a
            ``scipy.sparse.csr_matrix``. Else, output type is the same as the
            input type.

    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import train_test_split
    >>> X, y = np.arange(10).reshape((5, 2)), range(5)
    >>> X
    array([[0, 1],
           [2, 3],
           [4, 5],
           [6, 7],
           [8, 9]])
    >>> list(y)
    [0, 1, 2, 3, 4]

    >>> X_train, X_test, y_train, y_test = train_test_split(
    ...     X, y, test_size=0.33, random_state=42)
    ...
    >>> X_train
    array([[4, 5],
           [0, 1],
           [6, 7]])
    >>> y_train
    [2, 0, 3]
    >>> X_test
    array([[2, 3],
           [8, 9]])
    >>> y_test
    [1, 4]

    >>> train_test_split(y, shuffle=False)
    [[0, 1, 2], [3, 4]]

Build the test data

data = datasets.load_iris()
irisDf = pd.DataFrame(data=data.data, columns=data.feature_names)
# Add the target column
irisDf["target"] = data.target
# Recode the numeric target as species names
irisDf["species"] = irisDf.target.map({0: "Setosa", 1: "Versicolour", 2: "Virginica"})
# Rename the columns
irisDf.columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "target", "species"]
irisDf.head()

	sepalLength	sepalWidth	petalLength	petalWidth	target	species
0	5.1	3.5	1.4	0.2	0	Setosa
1	4.9	3.0	1.4	0.2	0	Setosa
2	4.7	3.2	1.3	0.2	0	Setosa
3	4.6	3.1	1.5	0.2	0	Setosa
4	5.0	3.6	1.4	0.2	0	Setosa

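A brief explanation of the parameters:

test_size: a float is the proportion of the data placed in the test set; an int is the absolute number of test samples.

random_state: sets the random seed so that the split is reproducible.

stratify: array-like. When given, the split is stratified by these labels, usually to keep the ratio of positive to negative samples the same after the split. From the documentation: "If not None, data is split in a stratified fashion, using this as the class labels."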
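The parameter descriptions above can be checked with a small sketch. The toy arrays below are made up purely for illustration and are not part of the original post; the sketch only shows how a float versus an int test_size behaves and that a fixed random_state reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): 20 samples, 2 features, balanced labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# A float test_size is a proportion: 0.25 * 20 = 5 test samples.
_, X_te_float, _, _ = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
n_test_float = len(X_te_float)

# An int test_size is an absolute number of test samples.
_, X_te_int, _, _ = train_test_split(X_toy, y_toy, test_size=4, random_state=0)
n_test_int = len(X_te_int)

# Repeating the call with the same random_state gives the same split.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=4, random_state=0)
same_split = np.array_equal(X_te_int, X_te_again)

print(n_test_float, n_test_int, same_split)  # 5 4 True
```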

X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution of the training set:")
print(y_train.value_counts())
print("Class distribution of the test set:")
print(y_test.value_counts())

 X_train.shape:  (105, 4)
 X_test.shape:  (45, 4)
Class distribution of the training set:
1    40
2    33
0    32
Name: target, dtype: int64
Class distribution of the test set:
0    18
2    17
1    10
Name: target, dtype: int64

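Here the class distribution of each subset no longer matches that of the full dataset (50 samples per class). Passing stratify fixes this: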

X = irisDf.iloc[:, [0, 1, 2, 3]]
y = irisDf.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
print(" X_train.shape: ", X_train.shape, "\n", "X_test.shape: ", X_test.shape)
print("Class distribution of the training set:")
print(y_train.value_counts())
print("Class distribution of the test set:")
print(y_test.value_counts())

 X_train.shape:  (105, 4)
 X_test.shape:  (45, 4)
Class distribution of the training set:
2    35
1    35
0    35
Name: target, dtype: int64
Class distribution of the test set:
2    15
1    15
0    15
Name: target, dtype: int64
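One way to make the effect of stratify easier to read is to compare class proportions rather than raw counts. The sketch below is not from the original post: class_proportions is a hypothetical helper, and the data is loaded with load_iris(return_X_y=True) so that the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)

def class_proportions(labels):
    """Hypothetical helper: fraction of samples per class, ordered by label."""
    counts = np.bincount(labels)
    return counts / counts.sum()

# Same split settings as in the post, without and with stratification.
_, _, _, y_test_plain = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=123)
_, _, _, y_test_strat = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=123, stratify=y_iris)

print("full data :", class_proportions(y_iris))
print("plain     :", class_proportions(y_test_plain))
print("stratified:", class_proportions(y_test_strat))
```

With stratification the test-set proportions should equal those of the full data (one third per class for iris), whereas the plain split is free to drift.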

Cross-validation evaluation

cross_val_score

Again, start with the documentation:

help(cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
    Evaluate a score by cross-validation

    Read more in the :ref:`User Guide `.

    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.

    X : array-like
        The data to fit. Can be for example a list, or an array.

    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.

    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set. Only used in conjunction with a "Group" :term:`cv`
        instance (e.g., :class:`GroupKFold`).

    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentation) or
        a scorer callable object / function with signature
        ``scorer(estimator, X, y)`` which should return only
        a single value.

        Similar to :func:`cross_validate`
        but only a single metric is permitted.

        If None, the estimator's default scorer (if available) is used.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:

        - None, to use the default 5-fold cross validation,
        - integer, to specify the number of folds in a `(Stratified)KFold`,
        - :term:`CV splitter`,
        - An iterable yielding (train, test) splits as arrays of indices.

        For integer/None inputs, if the estimator is a classifier and ``y`` is
        either binary or multiclass, :class:`StratifiedKFold` is used. In all
        other cases, :class:`KFold` is used.

        Refer :ref:`User Guide ` for the various
        cross-validation strategies that can be used here.

        .. versionchanged:: 0.22
            ``cv`` default value if None changed from 3-fold to 5-fold.

    n_jobs : int or None, optional (default=None)
        The number of CPUs to use to do the computation.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors. See :term:`Glossary `
        for more details.

    verbose : integer, optional
        The verbosity level.

    fit_params : dict, optional
        Parameters to pass to the fit method of the estimator.

    pre_dispatch : int, or string, optional
        Controls the number of jobs that get dispatched during parallel
        execution. Reducing this number can be useful to avoid an
        explosion of memory consumption when more jobs get dispatched
        than CPUs can process. This parameter can be:

        - None, in which case all the jobs are immediately
          created and spawned. Use this for lightweight and
          fast-running jobs, to avoid delays due to on-demand
          spawning of the jobs

        - An int, giving the exact number of total jobs that are
          spawned

        - A string, giving an expression as a function of n_jobs,
          as in '2*n_jobs'

    error_score : 'raise' or numeric
        Value to assign to the score if an error occurs in estimator fitting.
        If set to 'raise', the error is raised.
        If a numeric value is given, FitFailedWarning is raised. This parameter
        does not affect the refit step, which will always raise the error.

    Returns
    -------
    scores : array of float, shape=(len(list(cv)),)
        Array of scores of the estimator for each run of the cross validation.

    Examples
    --------
    >>> from sklearn import datasets, linear_model
    >>> from sklearn.model_selection import cross_val_score
    >>> diabetes = datasets.load_diabetes()
    >>> X = diabetes.data[:150]
    >>> y = diabetes.target[:150]
    >>> lasso = linear_model.Lasso()
    >>> print(cross_val_score(lasso, X, y, cv=3))
    [0.33150734 0.08022311 0.03531764]

    See Also
    ---------
    :func:`sklearn.model_selection.cross_validate`:
        To run cross-validation on multiple metrics and also to return
        train scores, fit times and score times.

    :func:`sklearn.model_selection.cross_val_predict`:
        Get predictions from each split of cross-validation for diagnostic
        purposes.

    :func:`sklearn.metrics.make_scorer`:
        Make a scorer from a performance metric or loss function.

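Main parameters:

X, y: the feature data and the target variable, respectively.

scoring: the model evaluation metric, such as "roc_auc", "accuracy", or "f1".

For details, see KFold.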

groups
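A minimal sketch of how scoring and groups might be passed is given here. The grouping is a made-up assumption for illustration (iris has no natural groups), and GroupKFold is used because, per the help text, groups is only consulted by a "Group" splitter.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GroupKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# scoring selects the metric: plain accuracy on 5 folds.
acc = cross_val_score(clf, X_iris, y_iris, cv=5, scoring="accuracy")

# Hypothetical grouping: 15 artificial groups of 10 samples each.
# GroupKFold keeps all samples of a group in the same fold.
groups = np.arange(len(y_iris)) % 15
grouped = cross_val_score(
    clf, X_iris, y_iris,
    groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring="accuracy",
)

print(acc.shape, grouped.shape)  # (5,) (5,)
```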

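A few things about cross_val_score still seem unclear:

How is reproducibility guaranteed? The function does not appear to take a random-seed parameter.

Can the seed be specified by passing in a cv generator?

Judging from the source code, StratifiedKFold and KFold are called internally, so no random seed is specified there.

In other people's code I have seen a global random seed being set; perhaps that works?

Focusing on the big picture for now; I will review this question later when there is time.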
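One possible answer to the reproducibility question, sketched under the assumption that an explicit splitter is acceptable: build a StratifiedKFold with shuffle=True and a fixed random_state, and pass it as cv. As far as I can tell, when cv is an integer the internally created splitter does not shuffle at all, so that case is deterministic as well. The variable names below are illustrative and not from the original post.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_iris, y_iris = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1)

# Explicit splitter: folds are shuffled, but with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
run1 = cross_val_score(clf, X_iris, y_iris, cv=cv, scoring="f1_macro")
run2 = cross_val_score(clf, X_iris, y_iris, cv=cv, scoring="f1_macro")
reproducible = np.array_equal(run1, run2)

# Integer cv: no shuffling, so repeated calls use identical folds.
run3 = cross_val_score(clf, X_iris, y_iris, cv=5, scoring="f1_macro")
run4 = cross_val_score(clf, X_iris, y_iris, cv=5, scoring="f1_macro")
deterministic = np.array_equal(run3, run4)

print(reproducible, deterministic)  # True True
```

Passing the seed through the splitter keeps the randomness local to the split, which is usually easier to reason about than a global seed.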

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores

array([0.96658312, 1. , 0.96658312, 0.96658312, 1. ])

cross_validate

How this function differs from cross_val_score:

It allows specifying multiple metrics for evaluation.

It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.

from sklearn.model_selection import cross_validate
scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, X, y, scoring=scoring)
scores

{'fit_time': array([0.00199509, 0.0009973 , 0.00200558, 0.00099754, 0.0009973 ]),
 'score_time': array([0.00099778, 0.00099826, 0.0019834 , 0.00199461, 0.00199437]),
 'test_precision_macro': array([0.96969697, 1.        , 0.96969697, 0.96969697, 1.        ]),
 'test_recall_macro': array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])}
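The "optionally training scores" part of the description can be illustrated with return_train_score=True. This is a sketch rather than part of the original post; it only inspects which keys the returned dict contains, since timings and scores vary by machine and version.

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_validate

X_iris, y_iris = load_iris(return_X_y=True)
clf = svm.SVC(kernel="linear", C=1, random_state=0)

results = cross_validate(
    clf, X_iris, y_iris,
    cv=5,
    scoring=["precision_macro", "recall_macro"],
    return_train_score=True,  # also report scores on the training folds
)

# One array of length 5 (one entry per fold) is stored under each key.
result_keys = sorted(results)
print(result_keys)
# ['fit_time', 'score_time', 'test_precision_macro', 'test_recall_macro',
#  'train_precision_macro', 'train_recall_macro']

mean_test_recall = results["test_recall_macro"].mean()
```

Comparing the train_* and test_* entries fold by fold is a quick way to spot overfitting.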


Written 2020-03-01 in Qixia District, Nanjing
