Box Office Prediction with Spark Random Forest

Reader submission, 2023-07-28

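Introduction

We have recently been working with film-industry data. Box office prediction is an important part of data modeling in this domain, so we built a box office prediction model on our film data.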

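Earlier work

Our first approach treated the problem as regression and used GBDT regression trees: we trained regression trees on the successive residuals and combined them as an ensemble. The factors considered were film genre, Douban rating, director influence, actor influence, and production company. The results were not satisfactory. Accuracy was 80% when a prediction counted as correct if it fell within a ±0.3 band around the true value, and the predictions fluctuated widely and were hard to interpret.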

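Later improvements

Looking back at that failure, we identified the following problems:

1. Too few influencing factors, which made modeling difficult.

2. Box office figures span a very wide range (from 1 million to 1 billion) and are unevenly distributed, with most films concentrated at the hundred-million level, so regression is a poor fit.

3. The sample is small and unbalanced; films at the million level are over-represented, which skews the predictions.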

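We then redefined the input format, that is, the influencing factors, as follows:

Line 1: Film title

Line 2: Box office (the prediction target, in units of 10,000)

Line 3: Genre

Line 4: Running time (in minutes)

Line 5: Release date (by month)

Line 6: Format (typically 2D, 3D, or IMAX)

Line 7: Country of production

Line 8: Director influence (measured by the director's average box office, in units of 10,000)

Line 9: Actor influence (measured by the average box office of all actors, in units of 10,000)

Line 10: Production company influence (measured by the average box office of all production companies, in units of 10,000)

Line 11: Distributor influence (measured by the average box office of all distributors, in units of 10,000)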

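We collected 1,058 films released between 2005 and 2017 from China, Japan, the United States, and the United Kingdom. Because the task is now treated as classification, box office was divided into the following tiers:

[Table: box office tiers]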
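The tier boundaries themselves are not given in the text, so the sketch below only illustrates the idea of turning a continuous box office figure into a class label. The cut-offs in `UPPER_BOUNDS` and the class name `BoxOfficeTiers` are hypothetical, chosen to span the 1 million to 1 billion range mentioned earlier; they are not the author's actual tiers.

```java
public class BoxOfficeTiers {
    // HYPOTHETICAL upper bounds, in units of 10,000 (same unit as the input data).
    // 1_000 = 10 million, 10_000 = 100 million, 100_000 = 1 billion.
    private static final double[] UPPER_BOUNDS = {1_000, 10_000, 100_000};

    /** Returns the index of the first tier whose upper bound exceeds the value. */
    public static int toTier(double boxOffice) {
        for (int tier = 0; tier < UPPER_BOUNDS.length; tier++) {
            if (boxOffice < UPPER_BOUNDS[tier]) {
                return tier;
            }
        }
        // At or above the last bound: top tier.
        return UPPER_BOUNDS.length;
    }

    public static void main(String[] args) {
        double[] samples = {350, 4_200, 56_000, 230_000};
        for (double s : samples) {
            System.out.println(s + " -> tier " + toTier(s));
        }
    }
}
```

With real data, the cut-offs would be chosen so that each tier holds enough films to train on.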

Before building the model, we converted the data to libsvm format and then trained a random forest model on it.
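For readers unfamiliar with the format: each libsvm line is a label followed by `index:value` pairs, with 1-based indices in ascending order and zero-valued features usually omitted. The following is a minimal sketch of writing one such line; the class name `LibSvmLine` and the feature layout in `main` are assumptions for illustration, since the article does not show how its fields were encoded.

```java
public class LibSvmLine {
    /** Formats one sample as "<label> <index>:<value> ...", 1-based, zeros skipped. */
    public static String format(int label, double[] features) {
        StringBuilder sb = new StringBuilder().append(label);
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0) {
                // libsvm indices start at 1, Java arrays at 0.
                sb.append(' ').append(i + 1).append(':').append(features[i]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical feature vector: two one-hot flags, running time, a flag, an influence score.
        double[] features = {0, 1, 110, 0, 13698.87};
        System.out.println(format(2, features));
    }
}
```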

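A random forest consists of many decision trees. Each tree is grown with a randomized procedure, so the trees are generated independently of one another. The class the model outputs is the mode of the classes output by the individual trees. Each tree is a multiclass classification tree, and splits are chosen using information entropy. The flow of a random forest is shown in the figure below:

[Figure: random forest flow chart]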
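The two ideas in the paragraph above, entropy as the split criterion and the mode of the per-tree outputs as the final class, can be sketched in a few lines. This is an illustration only, with a hypothetical class name `ForestBasics`; it is not Spark's implementation, which may aggregate tree outputs differently.

```java
import java.util.HashMap;
import java.util.Map;

public class ForestBasics {
    /** Shannon entropy (base 2) of a set of class labels; lower means purer. */
    public static double entropy(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / labels.length;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    /** Mode of the per-tree votes; ties go to the smallest label. */
    public static int majorityVote(int[] votes) {
        if (votes.length == 0) {
            throw new IllegalArgumentException("no votes");
        }
        Map<Integer, Integer> counts = new HashMap<>();
        for (int vote : votes) {
            counts.merge(vote, 1, Integer::sum);
        }
        int best = -1;
        int bestCount = -1;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int label = e.getKey();
            int count = e.getValue();
            if (count > bestCount || (count == bestCount && label < best)) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println("entropy of an even two-class split: " + entropy(new int[] {0, 0, 1, 1}));
        System.out.println("forest output: " + majorityVote(new int[] {2, 3, 2, 0, 2}));
    }
}
```

A split is good when the child nodes have lower entropy than the parent, weighted by how many samples land in each child.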

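We used the random forest provided by spark-mllib. Because films grossing over 1 billion are relatively rare, we oversampled to balance the class distribution. The training code is as follows: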

import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassificationModel;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

// Method of the author's predictor class (Spark 1.x DataFrame API).
// The fields trainFile, testFile and resultList, and the Result mapper
// (presumably a Function<Row, String>), belong to that class and are not shown here.
public void predict() throws IOException {
    SparkConf conf = new SparkConf().setAppName("SVM").setMaster("local");
    conf.set("spark.testing.memory", "2147480000");
    SparkContext sc = new SparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    // Load the libsvm-format training and test files as DataFrames.
    DataFrame trainData = sqlContext.read().format("libsvm").load(this.trainFile);
    DataFrame testData = sqlContext.read().format("libsvm").load(this.testFile);

    // Index labels, adding metadata to the label column.
    // The indexer is fit on the training data, so every label must occur there.
    StringIndexerModel labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel")
        .fit(trainData);

    // Automatically identify categorical features and index them.
    // Features with more than 4 distinct values are treated as continuous.
    VectorIndexerModel featureIndexer = new VectorIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(4)
        .fit(trainData);

    // Alternative to a separate test file: hold out 10% of one dataset.
    // DataFrame[] splits = trainData.randomSplit(new double[] {0.9, 0.1});
    // trainData = splits[0];
    // testData = splits[1];

    // Train a RandomForest model with 20 trees.
    // Note: impurity defaults to "gini"; call .setImpurity("entropy")
    // to use the entropy criterion described in the text.
    RandomForestClassifier rf = new RandomForestClassifier()
        .setLabelCol("indexedLabel")
        .setFeaturesCol("indexedFeatures")
        .setNumTrees(20);

    // Convert indexed labels back to original labels.
    IndexToString labelConverter = new IndexToString()
        .setInputCol("prediction")
        .setOutputCol("predictedLabel")
        .setLabels(labelIndexer.labels());

    // Chain indexers and forest in a Pipeline.
    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter});

    // Train the model. This also runs the indexers.
    PipelineModel model = pipeline.fit(trainData);

    // Make predictions on the test set and display example rows.
    DataFrame predictions = model.transform(testData);
    predictions.select("predictedLabel", "label", "features").show(200);

    // Compare (prediction, true label) and compute the test error.
    MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
        .setLabelCol("indexedLabel")
        .setPredictionCol("prediction")
        .setMetricName("precision");
    double accuracy = evaluator.evaluate(predictions);
    System.out.println("Test Error = " + (1.0 - accuracy));

    // Stage 2 of the pipeline is the fitted forest; toDebugString() prints its trees.
    RandomForestClassificationModel rfModel = (RandomForestClassificationModel) (model.stages()[2]);
    // System.out.println("Learned classification forest model:\n" + rfModel.toDebugString());

    // Collect the predicted labels as strings.
    DataFrame resultDF = predictions.select("predictedLabel");
    JavaRDD<Row> resultRow = resultDF.toJavaRDD();
    JavaRDD<String> result = resultRow.map(new Result());
    this.resultList = result.collect();
    for (String one : resultList) {
        System.out.println(one);
    }
}
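The listing above does not show the oversampling step mentioned before it. A minimal sketch of random oversampling, assuming it is applied to the lines of the libsvm training file before Spark loads them, might look like this. The class name `RandomOversampler` and the balance-to-largest-class policy are assumptions; the article does not say how far the rare classes were boosted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class RandomOversampler {
    /**
     * Duplicates randomly chosen rows of each smaller class until every class
     * has as many rows as the largest one. The label is the first token of a line.
     */
    public static List<String> oversample(List<String> libsvmLines, long seed) {
        Map<String, List<String>> byLabel = new HashMap<>();
        for (String line : libsvmLines) {
            String label = line.trim().split("\\s+")[0];
            byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(line);
        }

        int target = 0;
        for (List<String> rows : byLabel.values()) {
            target = Math.max(target, rows.size());
        }

        Random rng = new Random(seed);
        List<String> balanced = new ArrayList<>();
        for (List<String> rows : byLabel.values()) {
            balanced.addAll(rows);
            // Sample with replacement to fill the gap up to the target size.
            for (int i = rows.size(); i < target; i++) {
                balanced.add(rows.get(rng.nextInt(rows.size())));
            }
        }
        return balanced;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("0 1:1.0");
        lines.add("0 1:2.0");
        lines.add("0 1:3.0");
        lines.add("3 1:9.0");
        oversample(lines, 42L).forEach(System.out::println);
    }
}
```

Oversampling should touch only the training set; duplicating rows in the test set would distort the reported accuracy.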

Below is one of the decision trees from the trained forest:

Tree 16 (weight 1.0):
  If (feature 10 in {0.0})
    If (feature 48 <= 110.0)
      If (feature 86 <= 13698.87)
        If (feature 21 in {0.0})
          If (feature 54 in {0.0})
            Predict: 0.0
          Else (feature 54 not in {0.0})
            Predict: 1.0
        Else (feature 21 not in {0.0})
          Predict: 0.0
      Else (feature 86 > 13698.87)
        If (feature 21 in {0.0})
          If (feature 85 <= 39646.9)
            Predict: 2.0
          Else (feature 85 > 39646.9)
            Predict: 3.0
        Else (feature 21 not in {0.0})
          Predict: 3.0
    Else (feature 48 > 110.0)
      If (feature 85 <= 15003.3)
        If (feature 9 in {0.0})
          If (feature 54 in {0.0})
            Predict: 0.0
          Else (feature 54 not in {0.0})
            Predict: 2.0
        Else (feature 9 not in {0.0})
          Predict: 2.0
      Else (feature 85 > 15003.3)
        If (feature 65 in {0.0})
          If (feature 85 <= 66065.0)
            Predict: 3.0
          Else (feature 85 > 66065.0)
            Predict: 2.0
        Else (feature 65 not in {0.0})
          Predict: 3.0
  Else (feature 10 not in {0.0})
    If (feature 51 in {0.0})
      If (feature 85 <= 6958.4)
        If (feature 11 in {0.0})
          If (feature 50 <= 1.0)
            Predict: 1.0
          Else (feature 50 > 1.0)
            Predict: 0.0
        Else (feature 11 not in {0.0})
          Predict: 0.0
      Else (feature 85 > 6958.4)
        If (feature 5 in {0.0})
          If (feature 4 in {0.0})
            Predict: 3.0
          Else (feature 4 not in {0.0})
            Predict: 1.0
        Else (feature 5 not in {0.0})
          Predict: 2.0
    Else (feature 51 not in {0.0})
      If (feature 48 <= 148.0)
        If (feature 0 in {0.0})
          If (feature 6 in {0.0})
            Predict: 2.0
          Else (feature 6 not in {0.0})
            Predict: 0.0
        Else (feature 0 not in {0.0})
          If (feature 50 <= 4.0)
            Predict: 2.0
          Else (feature 50 > 4.0)
            Predict: 3.0
      Else (feature 48 > 148.0)
        If (feature 9 in {0.0})
          If (feature 49 <= 3.0)
            Predict: 2.0
          Else (feature 49 > 3.0)
            Predict: 0.0
        Else (feature 9 not in {0.0})
          If (feature 36 in {0.0})
            Predict: 3.0
          Else (feature 36 not in {0.0})
            Predict: 2.0
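To read a dump like this, start at the root and follow the "If" branch when the condition holds and the "Else" branch otherwise, until a "Predict" leaf is reached. Feature indices in the dump are 0-based, and the predicted values are the indexed labels produced by the label indexer, not the original tier labels. The sketch below hand-encodes only the subtree found under "feature 10 in {0.0}" and "feature 48 > 110.0"; the class `TreeWalk` and its `Node` type are hypothetical helpers, not part of Spark.

```java
import java.util.function.Predicate;

public class TreeWalk {
    /** A node is a leaf when test is null, otherwise an internal split. */
    public static final class Node {
        final Predicate<double[]> test; // true -> "If" branch, false -> "Else" branch
        final Node ifBranch;
        final Node elseBranch;
        final double prediction;

        Node(double prediction) {
            this.test = null;
            this.ifBranch = null;
            this.elseBranch = null;
            this.prediction = prediction;
        }

        Node(Predicate<double[]> test, Node ifBranch, Node elseBranch) {
            this.test = test;
            this.ifBranch = ifBranch;
            this.elseBranch = elseBranch;
            this.prediction = Double.NaN;
        }
    }

    /** Walks from the given node down to a leaf and returns its prediction. */
    public static double predict(Node node, double[] x) {
        while (node.test != null) {
            node = node.test.test(x) ? node.ifBranch : node.elseBranch;
        }
        return node.prediction;
    }

    /** The subtree under "feature 10 in {0.0}" / "feature 48 > 110.0" from the dump. */
    public static Node sampleSubtree() {
        Node low = new Node(x -> x[9] == 0.0,
                new Node(x -> x[54] == 0.0, new Node(0.0), new Node(2.0)),
                new Node(2.0));
        Node high = new Node(x -> x[65] == 0.0,
                new Node(x -> x[85] <= 66065.0, new Node(3.0), new Node(2.0)),
                new Node(3.0));
        return new Node(x -> x[85] <= 15003.3, low, high);
    }

    public static void main(String[] args) {
        double[] film = new double[87]; // features 0..86, all zero by default
        film[85] = 20000.0;             // hypothetical value for feature 85
        System.out.println("indexed label: " + predict(sampleSubtree(), film));
    }
}
```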

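Afterword

The model's average prediction accuracy is 80%. Compared with the earlier approach, though, it is far more disciplined, and its results are easier to interpret. To improve the predictions further, more factors could be considered, for example: whether the film has a predecessor in a series; word-of-mouth scores on film websites; trailer view counts; read counts of related Weibo posts; and the Baidu Index.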
