Conjecture—Scalding下可扩展的机器学习框架-FinClip官网

Conjecture—Scalding下可扩展的机器学习框架

网友投稿 540 2022-10-26

Conjecture—Scalding下可扩展的机器学习框架

Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL. The goal of this project is to enable the development of statistical models as viable components in a wide range of product settings. Applications include classification and categorization, recommender systems, ranking, filtering, and regression (predicting real-valued numbers). Conjecture has been designed with a primary emphasis on flexibility and can handle a wide variety of inputs. Integration with Hadoop and scalding enable seamless handling of extremely large data volumes, and integration with established ETL processes. Predicted labels can either be consumed directly by the web stack using the dataset loader, or models can be deployed and consumed by live web code. Currently, binary classification (assigning one of two possible labels to input data points) is the most mature component of the Conjecture package.

Tutorial

There are a few stages involved in training a machine learning model using Conjecture.

Create Training Data

We represent the training data as "feature vectors" which are just mappings of feature names to real values. In this case we represent them as a java map of strings to doubles (although we have a class StringKeyedVector which provides convenience methods for feature vector construction). We also need the true label of each instance, which we represent as 0 and 1 (the mapping of these binary labels to e.g., "male" and "female" is up to the user). We construct BinaryLabeledInstances, which are just wrappers for a feature vector and a label.

val bl = new BinaryLabeledInstance(0.0)bl.addTerm("bias", 1.0)bl.addTerm("some_feature", 0.5)

Training a Classifier

Classifiers are essentially trained by presenting the labeled instances to them. There are several kinds of linear classifiers we implement, among them:

Logistic regression,Perceptron,MIRA (a large margin perceptron model),Passive aggressive.

These models all have several options, such as learning rate, regularization parameters and so on. We supply reasonable defaults for these parameters although they can be changed readily. To train a linear model simply call the update function with the labeled instance:

val p = new LogisticRegression()p.update(bl)

In order to make this procedure tractable for large datasets, we provided scalding wrappers for the training. These operate by training several small models on mappers, then aggregating them into a final complete model on the reducers. This wrapper is called like so:

new BinaryModelTrainer(args) .train(instances, 'instance, 'model) .write(SequenceFile("model")) .map('model -> 'model){ x : UpdateableBinaryModel => new com.google.gson.Gson.tojson(x) } .write(Tsv("model_json"))

This code segment will train a model using a pipe called instances which has a field called instance which contains the BinaryLabeledInstance objects. It produces a pipe with a single field containing the completed model, which can then be written to disk.

This class uses the command line args object from scalding, in order to let you set some options on the command line. Some useful options are:

Argument	Possible values	Default	Meaning
--model	mira, logistic_regression, passive_aggressive	passive_aggressive	The type of model to use.
--iters	1, 2, 3...	1	The number of iterations of training to perform.
--zero_class_prob, --one_class_prob	[0, 1]	1

To see all the command line options, see the BinaryModelTrainer class.

Evaluating a Classifier

It is important to get a sense of the performance you can expect out of your classifier on unseen data. In order to do this we recommend to use cross validation. In essence, your input set of instances is split up into testing and training portions (multiple different ways), then a classifier is trained on each training portion, and evaluated (against the true labels which are present) using the testing portion. This is all wrapped up in a class called BinaryCrossValidator, it is used like so:

new BinaryCrossValidator(args, 5) .crossValidate(instances, 'instance) .write(Tsv("model_xval"))

This class also takes the command line arguments, which it passes to a model trainer for each fold. This allows the specification of options to the cross validated models on the command line. The output contains statistics about the performance of the model as well as the confusion matrices for each fold.

A script is included which cross validates a logistic regression model on the iris dataset.

标签：js

敏捷交付如何驱动企业在快速变化的市场中获胜

540 2022-10-26

Conjecture—Scalding下可扩展的机器学习框架

前端框架选型是企业提升开发效率与用户体验的关键因素

大屏前端框架如何推动企业数据可视化与用户体验的革新

敏捷交付如何驱动企业在快速变化的市场中获胜

最近发表

更多内容

小程序SDK

Finclip技术文档

小程序开发

小程序容器

小程序框架

Finclip小程序平台

Finclip用户投稿

车联网

推荐文章

小程序SDK是什么意思？小程序sdk和插件有什么区别？

小程序支付功能怎么实现？

企业app开发流程是什么？

app运营模式有哪些？

小程序多端引流怎么做？

小程序生态分析的机会和威胁

Flutter入门这一篇效率文章就够了

原生与跨平台解决方案分析,跨平台软件开发技术方案

热更新技术：让软件更新变得更加轻松快速

解决方案

银行解决方案

证券解决方案

互联网解决方案

政企OA解决方案

科技解决方案

loT解决方案

信任解决方案

热评文章

AppCan:基于混合模式的移动应用开发,移动混合模

Hybrid App混合模式开发的了解

小程序容器技术助力券商数字营销突围，小程序容器化的意

用mpvue开发微信小程序基础知识（vue.js开发

小程序多端框架全面测评对比，强烈推荐！

券商app架构 - 解析券商应用程序的构建与设计