Determining the number of clusters in a data set

Reader submission 1132 2022-08-30

This article discusses how to determine the number of clusters when clustering a data set.

1. The role of k

Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster.

2. Common methods

2.1 Rule of thumb

[1] $k \approx \sqrt{n/2}$, where $n$ is the number of data points.
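As a quick illustration, the rule can be evaluated directly. The helper name `rule_of_thumb_k` and the choice to round to the nearest integer (with a floor of 1) are assumptions made here; the rule itself only gives an approximate value.

```python
import math


def rule_of_thumb_k(n: int) -> int:
    """Return k ~ sqrt(n / 2), rounded to the nearest integer and at least 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Rounding is an assumption of this sketch; the rule is only approximate.
    return max(1, round(math.sqrt(n / 2)))


print(rule_of_thumb_k(200))  # sqrt(200 / 2) = sqrt(100) = 10
```

The result is best treated as a starting point for the more data-driven methods that follow.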

2.2 The Elbow Method

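Plot the percentage of variance explained against the number of clusters, and choose the k at which adding another cluster stops giving a marked improvement (the "elbow" of the curve).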
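A minimal sketch of how such a curve might be computed, using only the standard library. The tiny k-means implementation, its farthest-point initialisation, the synthetic three-blob data, and all function names are assumptions made for illustration; in practice a library implementation of k-means would normally be used.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans_wss(points, k, iters=50):
    """Lloyd's algorithm; returns the within-cluster sum of squares (WSS)."""
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def explained_variance_curve(points, k_max):
    """Fraction of total variance explained for k = 1 .. k_max."""
    total = sum(sq_dist(p, mean(points)) for p in points)
    return [1.0 - kmeans_wss(points, k) / total for k in range(1, k_max + 1)]


# Synthetic data: three well-separated blobs (made up for this example).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

curve = explained_variance_curve(points, k_max=6)
for k, value in enumerate(curve, start=1):
    print(k, round(value, 3))
```

On data like this the curve rises steeply up to k = 3 and is nearly flat afterwards, which is the elbow one would read off the plot.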

2.3 Information Criterion Approach

[2][3] If the clustering model can be written as a likelihood function, consider using the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the Deviance information criterion (DIC). [4] gives an example for k-means.
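For reference, the standard definitions of the first two criteria can be written as below. The symbols are notation introduced here rather than taken from [2]-[4]: $p_k$ is the number of free parameters of the model with $k$ clusters, $n$ is the number of data points, and $\hat{L}_k$ is the maximised likelihood of that model.

```latex
\mathrm{AIC}(k) = 2\,p_k - 2\ln\hat{L}_k ,
\qquad
\mathrm{BIC}(k) = p_k \ln n - 2\ln\hat{L}_k
```

With these sign conventions, one fits the model for each candidate k and prefers the k with the smallest criterion value.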

2.4 An Information Theoretic Approach

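[5] Rate distortion theory has been applied to choosing k, minimizing error while maximizing efficiency according to information-theoretic standards. The strategy runs a standard clustering algorithm on the input data for every k from 1 to n to produce a distortion curve, transforms that curve by a negative power chosen according to the dimensionality of the data, and finally picks the k at which the transformed curve shows the largest jump.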
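The transform-and-jump step can be sketched as follows, assuming the distortion curve has already been computed by some clustering algorithm. The function name, the made-up distortion values, and the exponent p/2 (p being the data dimension, a commonly quoted default for this method) are assumptions of this sketch rather than details given in the text above.

```python
def jump_method(distortions, dim):
    """Pick k from a distortion curve.

    distortions[i] is the (positive) distortion obtained with k = i + 1 clusters.
    Returns (best_k, jumps), where jumps[i] is the jump at k = i + 1.
    """
    power = dim / 2.0  # assumed transformation power, based on the dimension
    # Transformed curve, with the value for k = 0 defined as 0.
    transformed = [0.0] + [d ** (-power) for d in distortions]
    jumps = [transformed[k] - transformed[k - 1] for k in range(1, len(transformed))]
    best_k = max(range(len(jumps)), key=jumps.__getitem__) + 1
    return best_k, jumps


# Hypothetical distortion values for k = 1 .. 5 on 2-dimensional data.
example_distortions = [10.0, 6.0, 1.0, 0.9, 0.85]
best_k, jumps = jump_method(example_distortions, dim=2)
print(best_k)  # 3: the largest jump occurs when going from k = 2 to k = 3
```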

2.5 Choosing k Using the Silhouette

[6][7]

The silhouette of a datum is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster.
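A minimal sketch of the usual silhouette formula, s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function name, the toy points, and the convention of scoring singleton clusters as 0 are assumptions of this example; it expects at least two clusters.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value of every point (needs >= 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # assumed convention for singleton clusters
            continue
        # Distance of p to itself is 0, so dividing by len(own) - 1 gives
        # the mean distance to the *other* members of its cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data: two obvious groups, scored under a sensible and a poor labelling.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

good = sum(silhouette_values(toy_points, good_labels)) / len(toy_points)
bad = sum(silhouette_values(toy_points, bad_labels)) / len(toy_points)
print(round(good, 3), round(bad, 3))
```

To choose k, one would compute the mean silhouette for the clustering obtained at each candidate k and prefer the k with the highest mean.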

2.6 Cross-validation

[8]

2.7 Finding the Number of Clusters in Text Databases

[9] Matrix $D \in \mathbb{R}^{n \times m}$
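The text above stops at the definition of D. As recalled from the cover-coefficient literature cited in [9], a rough estimate of the number of clusters for a document-term matrix is the product of its two dimensions divided by the number of non-zero entries t. The sketch below assumes that formula, and the function name and toy matrix are hypothetical; it should be checked against [9] before being relied on.

```python
def estimate_num_clusters(D):
    """Rough estimate: (rows * columns) / t, with t the count of non-zero entries.

    D is a document-term matrix given as a list of equal-length rows.
    """
    rows, cols = len(D), len(D[0])
    t = sum(1 for row in D for x in row if x != 0)
    if t == 0:
        raise ValueError("D must contain at least one non-zero entry")
    return rows * cols / t


# Toy matrix: two groups of documents using two disjoint vocabularies.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
print(estimate_num_clusters(D))  # 4 * 6 / 12 = 2.0
```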

2.8 Analyzing the Kernel Matrix

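Unlike the previous methods, which require a clustering to be computed beforehand, [10] obtains the number of clusters directly from the data itself:
1. Form the kernel matrix (the data are mapped into a high-dimensional space in which they become linearly separable).
2. Compute the eigenvalue decomposition of the kernel matrix.
3. Analyze the eigenvalues and eigenvectors.
4. Plot them and look for the elbow.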
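The steps above can be sketched with NumPy as follows. This is not the procedure of [10] itself: the RBF kernel, the value of `gamma`, the synthetic data, and the crude "larger than half the top eigenvalue" threshold used in place of reading the elbow off a plot are all assumptions made for illustration.

```python
import numpy as np


def kernel_eigenvalues(X, gamma=0.5):
    """Eigenvalues of an RBF kernel matrix, sorted in descending order."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Pairwise squared Euclidean distances; clip tiny negatives from rounding.
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)              # step 1: kernel matrix
    return np.linalg.eigvalsh(K)[::-1]   # step 2: eigenvalues (K is symmetric)


# Synthetic data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in blob_centers])

eig = kernel_eigenvalues(X)
# Steps 3-4: inspect the leading eigenvalues; a sharp drop suggests the elbow.
print(np.round(eig[:6], 2))
n_large = int(np.sum(eig > 0.5 * eig[0]))  # crude stand-in for reading the plot
print(n_large)
```

With well-separated groups the kernel matrix is close to block diagonal, so one dominant eigenvalue appears per group and the drop after them marks the elbow.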

References:
[1] Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
[2] David J. Ketchen, Jr. & Christopher L. Shook (1996). “The application of cluster analysis in Strategic Management Research: An analysis and critique”. Strategic Management Journal 17 (6): 441–458.
[3] Cyril Goutte, Peter Toft, Egill Rostrup, Finn Årup Nielsen, Lars Kai Hansen (March 1999). “On Clustering fMRI Time Series”. NeuroImage 9 (3): 298–310.
[4] Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot & Egill Rostrup (2001). “Feature-Space Clustering for fMRI Meta-Analysis”. Human Brain Mapping 13 (3): 165–183.
[5] Catherine A. Sugar and Gareth M. James (2003). “Finding the number of clusters in a data set: An information theoretic approach”. Journal of the American Statistical Association 98 (January): 750–763.
[6] Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Computational and Applied Mathematics 20: 53–65.
[7] R. Lleti, M.C. Ortiz, L.A. Sarabia, M.S. Sánchez (2004). “Selecting Variables for k-Means Cluster Analysis by Using a Genetic Algorithm that Optimises the Silhouettes”. Analytica Chimica Acta 515: 87–100.
[8] “Finding the Right Number of Clusters in kMeans and EM Clustering: v-Fold Cross-Validation”. Electronic Statistics Textbook. StatSoft. 2010. Retrieved 2010-05-03.
[9] Can, F.; Ozkarahan, E. A. (1990). “Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases”. ACM Transactions on Database Systems 15 (4): 483.
[10] Honarkhah, M. and Caers, J. (2010). “Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling”. Mathematical Geosciences 42 (5): 487–517.