2022-09-02
Netflix Prize data
Dataset from Netflix's competition to improve their recommendation algorithm
Context
Netflix held the Netflix Prize open competition for the best algorithm to predict user ratings for films. The grand prize was $1,000,000 and was won by BellKor's Pragmatic Chaos team. This is the dataset that was used in that competition.
Content
This comes directly from the README:
TRAINING DATASET FILE DESCRIPTION
The file "training_set.tar" is a tar of a directory containing 17770 files, one
per movie. The first line of each file contains the movie id followed by a
colon. Each subsequent line in the file corresponds to a rating from a customer
and its date in the following format:
CustomerID,Rating,Date
- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.
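As a minimal sketch of how one per-movie file in this format could be read, the Python below parses the "MovieID:" header line and the rating rows that follow it. The names `Rating` and `parse_training_file` are illustrative and do not come from the README, and the function takes an iterable of lines rather than a path so that it makes no assumption about how the tar archive was unpacked.

```python
from datetime import date
from typing import Iterable, List, NamedTuple, Tuple


class Rating(NamedTuple):
    """One row of a per-movie training file (illustrative type)."""
    customer_id: int
    rating: int
    day: date


def parse_training_file(lines: Iterable[str]) -> Tuple[int, List[Rating]]:
    """Parse one per-movie training file given as an iterable of lines."""
    it = iter(lines)
    header = next(it).strip()
    if not header.endswith(":"):
        raise ValueError(f"expected a 'MovieID:' header, got {header!r}")
    movie_id = int(header[:-1])

    ratings = []
    for raw in it:
        raw = raw.strip()
        if not raw:
            continue  # tolerate a trailing blank line
        customer, stars, day = raw.split(",")
        # Dates are YYYY-MM-DD, which date.fromisoformat accepts directly.
        ratings.append(Rating(int(customer), int(stars), date.fromisoformat(day)))
    return movie_id, ratings
```

Passing an open file object works as well, since iterating over a text file yields its lines.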
MOVIES FILE DESCRIPTION
Movie information in "movie_titles.txt" is in the following format:
MovieID,YearOfRelease,Title
- MovieIDs do not correspond to actual Netflix movie ids or IMDB movie ids.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of the corresponding DVD, not necessarily its theatrical release.
- Title is the Netflix movie title and may not correspond to titles used on other sites. Titles are in English.
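A line of "movie_titles.txt" could be parsed along the following lines. This is only a sketch: `Movie` and `parse_movie_line` are made-up names, and two behaviours are assumptions rather than facts stated in the README. First, a title may itself contain commas, so the line is split on the first two commas only. Second, a year field that is not a plain number is treated as missing.

```python
from typing import NamedTuple, Optional


class Movie(NamedTuple):
    """One row of movie_titles.txt (illustrative type)."""
    movie_id: int
    year: Optional[int]
    title: str


def parse_movie_line(line: str) -> Movie:
    """Parse a single 'MovieID,YearOfRelease,Title' line."""
    # Split on the first two commas only, so commas inside the title survive.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a non-numeric year field means the year is unknown.
    parsed_year = int(year) if year.isdigit() else None
    return Movie(int(movie_id), parsed_year, title)
```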
QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION
The qualifying dataset for the Netflix Prize is contained in the text file
"qualifying.txt". It consists of lines indicating a movie id, followed by a
colon, and then customer ids and rating dates, one per line for that movie id.
The movie and customer ids are contained in the training set. Of course the
ratings are withheld. There are no empty lines in the file.
MovieID1:
CustomerID11,Date11
CustomerID12,Date12
…
MovieID2:
CustomerID21,Date21
CustomerID22,Date22
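The block structure above (a "MovieID:" header followed by one "CustomerID,Date" line per rating) can be flattened into one record per requested prediction. The generator below is a sketch under that reading of the format; `iter_qualifying` is an illustrative name, not something defined by the competition.

```python
from datetime import date
from typing import Iterable, Iterator, Optional, Tuple


def iter_qualifying(lines: Iterable[str]) -> Iterator[Tuple[int, int, date]]:
    """Yield (movie_id, customer_id, rating_date) in file order."""
    movie_id: Optional[int] = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # the README says there are none, but be tolerant
        if raw.endswith(":"):
            # A header line starts a new movie block.
            movie_id = int(raw[:-1])
            continue
        if movie_id is None:
            raise ValueError("customer line found before any 'MovieID:' header")
        customer, day = raw.split(",")
        yield movie_id, int(customer), date.fromisoformat(day)
```

Keeping the records in file order matters, because the submitted predictions have to follow the same order.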
For the Netflix Prize, your program must predict all the ratings the customers
gave the movies in the qualifying dataset based on the information in the
training dataset.
The format of your submitted prediction file follows the movie and customer id,
date order of the qualifying dataset. However, your predicted rating takes the
place of the corresponding customer id (and date), one per line.
For example, if the qualifying dataset looked like:
111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07
then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0
which predicts that customer 3245 would have rated movie 111 3.0 stars on the
19th of December, 2005, that customer 5666 would have rated it slightly higher
at 3.4 stars on the 23rd of December, 2005, etc.
You must make predictions for all customers for all movies in the qualifying
dataset.
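The layout of a prediction file described above can be produced with a small generator like the one sketched here. `format_predictions` and its `predict` argument are hypothetical names; `predict` stands in for whatever model is being evaluated. Printing one decimal place simply mirrors the README example and is an assumption, not a stated requirement.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def format_predictions(
    pairs: Iterable[Tuple[int, int]],
    predict: Callable[[int, int], float],
) -> Iterator[str]:
    """Yield prediction-file lines for (movie_id, customer_id) pairs.

    The pairs must already be in the order of the qualifying dataset.
    """
    current_movie: Optional[int] = None
    for movie_id, customer_id in pairs:
        if movie_id != current_movie:
            # Emit the 'MovieID:' header once per movie block.
            current_movie = movie_id
            yield f"{movie_id}:"
        # The predicted rating takes the place of the customer id and date.
        yield f"{predict(movie_id, customer_id):.1f}"
```

Writing the file is then a matter of joining the yielded lines with newlines.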
THE PROBE DATASET FILE DESCRIPTION
To allow you to test your system before you submit a prediction set based on the
qualifying dataset, we have provided a probe dataset in the file "probe.txt".
This text file contains lines indicating a movie id, followed by a colon, and
then customer ids, one per line for that movie id.
MovieID1:
CustomerID11
CustomerID12
…
MovieID2:
CustomerID21
CustomerID22
Like the qualifying dataset, the movie and customer id pairs are contained in
the training set. However, unlike the qualifying dataset, the ratings (and
dates) for each pair are contained in the training dataset.
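Because the probe pairs also appear in the training data, their true ratings can be looked up directly. The sketch below assumes the training ratings have already been loaded into a dictionary keyed by (movie id, customer id); that dictionary and the name `probe_ratings` are illustrative choices, not part of the README.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def probe_ratings(
    probe_lines: Iterable[str],
    training: Dict[Tuple[int, int], int],
) -> List[Tuple[int, int, int]]:
    """Return (movie_id, customer_id, rating) for every pair in probe.txt."""
    found = []
    movie_id: Optional[int] = None
    for raw in probe_lines:
        raw = raw.strip()
        if not raw:
            continue
        if raw.endswith(":"):
            movie_id = int(raw[:-1])
            continue
        if movie_id is None:
            raise ValueError("customer line found before any 'MovieID:' header")
        customer_id = int(raw)
        # Raises KeyError if a probe pair is missing from the training data.
        found.append((movie_id, customer_id, training[(movie_id, customer_id)]))
    return found
```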
If you wish, you may calculate the RMSE of your predictions against those
ratings and compare your RMSE against the Cinematch RMSE on the same data. See
http://netflixprize.com/faq#probe for that value.
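RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. A minimal sketch of the calculation follows; the function name `rmse` and its error handling are illustrative choices.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean squared error between two equally long rating sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one rating is required")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))
```

A lower value means the predictions are closer to the true ratings, and a perfect prediction set scores 0.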