disk.frame - a high-performance, parallel, disk-based framework for processing data larger than memory (RAM)

Introduction

How do I manipulate tabular data that doesn’t fit into Random Access Memory (RAM)?

Use {disk.frame}!

In a nutshell, {disk.frame} makes use of two simple ideas:

- split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder, and
- provide a convenient API to manipulate these chunks

{disk.frame} performs a similar role to distributed systems such as Apache Spark, Python's Dask, and Julia's JuliaDB.jl, but for medium data: datasets that are too large for RAM yet not quite large enough to qualify as big data.

Installation

You can install the released version of {disk.frame} from CRAN with:

install.packages("disk.frame")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("xiaodaigh/disk.frame")

On some platforms, such as SageMaker, you may need to explicitly specify a repo, like this:

install.packages("disk.frame", repo="https://cran.rstudio.com")

Vignettes and articles

Please see these vignettes and articles about {disk.frame}

- Quick start: {disk.frame}, which replicates the sparklyr vignette for manipulating the nycflights13 flights data
- Ingesting data into {disk.frame}, which lists some common ways of creating disk.frames
- {disk.frame} can be more epic!, which shows some ways of loading large CSVs and the importance of srckeep
- Group-by: the various types of group-bys
- Custom one-stage group-by functions: how to define custom one-stage group-by functions
- Fitting GLMs (including logistic regression), which introduces the dfglm function for fitting generalized linear models
- Using data.table syntax with disk.frame
- disk.frame concepts
- Benchmark 1: disk.frame vs Dask vs JuliaDB

Common questions

a) What is {disk.frame} and why create it?

{disk.frame} is an R package that provides a framework for manipulating larger-than-RAM structured tabular data on disk efficiently. The reason one would want to manipulate data on disk is that it allows arbitrarily large datasets to be processed by R. In other words, we go from “R can only deal with data that fits in RAM” to “R can deal with any data that fits on disk”. See the next section.

b) How is it different to data.frame and data.table?

A data.frame in R is an in-memory data structure, which means that R must load the data in its entirety into RAM. A corollary of this is that only data that can fit into RAM can be processed using data.frames. This places significant restrictions on what R can process with minimal hassle.

In contrast, {disk.frame} provides a framework to store and manipulate data on the hard drive. It does this by loading only a small part of the data, called a chunk, into RAM, processing that chunk, writing out the results, and repeating with the next chunk. This chunking strategy is widely applied in other packages to enable processing large amounts of data in R; for example, see chunked, arkdb, and iotools.
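To make the strategy concrete, here is a minimal sketch of the chunking pattern in base R, without {disk.frame}. The file name big.csv, the chunk size of 100,000 lines, and the filtering step are all hypothetical:

# read a large CSV one chunk at a time; only one chunk is ever in RAM
con <- file("big.csv", "r")
header <- readLines(con, n = 1)
repeat {
  lines <- readLines(con, n = 100000)            # load one chunk into RAM
  if (length(lines) == 0) break
  chunk <- read.csv(text = c(header, lines))     # parse the chunk
  result <- chunk[!is.na(chunk[[1]]), ]          # process the chunk
  write.csv(result, tempfile(fileext = ".csv"),  # write out the results
            row.names = FALSE)
}
close(con)

{disk.frame} automates this loop, stores the chunks in a fast binary format, and processes them in parallel.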

Furthermore, data.frames in R have a row limit of 2^31; hence an alternative approach is needed to apply R to datasets larger than that. The chunking mechanism in {disk.frame} provides such an avenue, enabling data manipulation beyond the 2^31 row limit.

c) How is {disk.frame} different to previous “big” data solutions for R?

R has many packages that can deal with larger-than-RAM datasets, including ff and bigmemory. However, ff and bigmemory restrict the user to primitive data types such as double, which means they do not support character (string) and factor types. In contrast, {disk.frame} makes use of data.table::data.table and data.frame directly, so all data types are supported. Also, {disk.frame} strives to provide an API that is as similar to data.frame's as possible. {disk.frame} supports many dplyr verbs for manipulating disk.frames.

Additionally, {disk.frame} supports parallel data operations using infrastructures provided by the excellent future package to take advantage of multi-core CPUs. Further, {disk.frame} uses state-of-the-art data storage techniques such as fast data compression, and random access to rows and columns provided by the fst package to provide superior data manipulation speeds.
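As an illustration of the storage layer, here is a hedged sketch of the random access that {fst} provides for a single chunk; the file name chunk1.fst is made up for the example:

library(fst)
library(nycflights13)

# write one chunk to disk with fast compression
write_fst(as.data.frame(flights), "chunk1.fst", compress = 50)

# random access: read just two columns and a row range,
# without loading the rest of the file into RAM
read_fst("chunk1.fst", columns = c("origin", "dest"), from = 1, to = 10)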

d) How does {disk.frame} work?

{disk.frame} works by breaking large datasets into smaller individual chunks and storing the chunks in fst files inside a folder. Each chunk is a fst file containing a data.frame/data.table. One could reconstruct the original large dataset by loading all the chunks into RAM and row-binding them into one large data.frame. Of course, in practice this isn't always possible, which is why the data is kept as smaller individual chunks.
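A minimal sketch of this layout, using the nycflights13 flights data; the variable name flights.df and the chunk count of four are illustrative:

library(disk.frame)
setup_disk.frame()

# a disk.frame is just a folder of fst files, one file per chunk
flights.df <- as.disk.frame(nycflights13::flights, nchunks = 4)
attr(flights.df, "path")              # the folder that holds the chunks
list.files(attr(flights.df, "path"))  # e.g. "1.fst" "2.fst" "3.fst" "4.fst"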

{disk.frame} makes it easy to manipulate the underlying chunks by implementing dplyr functions/verbs and other convenient functions (e.g. the cmap(a.disk.frame, fn, lazy = F) function, which applies the function fn to each chunk of a.disk.frame in parallel), so that a disk.frame can be manipulated in a similar fashion to an in-memory data.frame.
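For example, a hedged sketch of cmap, continuing with the flights.df disk.frame from the sketch above:

# apply a user-defined function to each chunk in parallel; here we take
# the first row of every chunk and collect the row-bound result into RAM
cmap(flights.df, ~ .x[1, ]) %>% collect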

e) How is {disk.frame} different from Spark, Dask, and JuliaDB.jl?

Spark is primarily a distributed system that also works on a single machine. Dask is a Python package that is most similar to {disk.frame}, and JuliaDB.jl is a Julia package. All three can distribute work over a cluster of computers. However, {disk.frame} currently cannot distribute data processes over many computers, and is, therefore, single machine focused.

In R, one can access Spark via sparklyr, but that requires a Spark cluster to be set up. On the other hand, {disk.frame} requires zero setup apart from running install.packages("disk.frame") or devtools::install_github("xiaodaigh/disk.frame").

Finally, Spark can only apply functions that are implemented for Spark, whereas {disk.frame} can use any function in R including user-defined functions.

Example usage

Set-up {disk.frame}

{disk.frame} works best if it can process multiple data chunks in parallel. The best way to set up {disk.frame} so that each CPU core runs a background worker is by using

setup_disk.frame()
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)

setup_disk.frame() sets up background workers equal to the number of CPU cores; note that, by default, hyper-threaded cores are counted as one, not two.

Alternatively, one may specify the number of workers using setup_disk.frame(workers = n).

Quick-start

suppressPackageStartupMessages(library(disk.frame))
library(nycflights13)

# this will set up disk.frame's parallel backend with number of workers
# equal to the number of CPU cores (hyper-threaded cores are counted as one, not two)
setup_disk.frame()
#> The number of workers available for disk.frame is 6

# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)

# convert the flights data.frame to a disk.frame
# optionally, you may specify an outdir; otherwise, the chunks are stored in a temporary folder
flights.df <- as.disk.frame(nycflights13::flights)
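As a sketch of that optional outdir, one might write the chunks to a chosen folder instead; the folder name tmp_flights.df is made up for the example:

# store the chunks in a named folder and overwrite any existing content
flights.df <- as.disk.frame(
  nycflights13::flights,
  outdir = file.path(tempdir(), "tmp_flights.df"),
  overwrite = TRUE
)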

Example: dplyr verbs

dplyr verbs

{disk.frame} aims to support as many dplyr verbs as possible. For example

flights.df %>%
  filter(year == 2013) %>%
  mutate(origin_dest = paste0(origin, dest)) %>%
  head(2)
#>   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1 2013     1   1      517            515         2      830            819
#> 2 2013     1   1      533            529         4      850            830
#>   arr_delay carrier flight tailnum origin dest air_time distance hour minute
#> 1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
#> 2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
#>             time_hour origin_dest
#> 1 2013-01-01 05:00:00      EWRIAH
#> 2 2013-01-01 05:00:00      LGAIAH
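Note that dplyr verbs on a disk.frame build up a lazy pipeline. A minimal sketch of materializing the result, with a hypothetical delay filter:

# the verbs are recorded lazily; collect() runs the pipeline on every
# chunk in parallel and row-binds the results into one data.table in RAM
delayed_flights <- flights.df %>%
  filter(dep_delay > 60) %>%
  select(year, month, day, dep_delay)

collect(delayed_flights)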

Group-by

Starting from {disk.frame} v0.3.0, there is group_by support for a limited set of functions. For example:

result_from_disk.frame = iris %>%
  as.disk.frame %>%
  group_by(Species) %>%
  summarize(
    mean(Petal.Length),
    sumx = sum(Petal.Length/Sepal.Width),
    sd(Sepal.Width/Petal.Length),
    var(Sepal.Width/Sepal.Width),
    l = length(Sepal.Width/Sepal.Width + 2),
    max(Sepal.Width),
    min(Sepal.Width),
    median(Sepal.Width)
  ) %>%
  collect

The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please report a bug.
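A hedged sketch of such a check, comparing one of the exact aggregations above against the same pipeline run on the in-memory iris data.frame:

# the same group-by on the in-memory data, for comparison
result_from_data.frame <- iris %>%
  group_by(Species) %>%
  summarize(sumx = sum(Petal.Length/Sepal.Width))

# exact aggregations such as sum should agree to numerical precision;
# estimated ones (e.g. median) may differ slightly
all.equal(result_from_disk.frame$sumx, result_from_data.frame$sumx)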

List of supported group-by functions

If a function you like is missing, please make a feature request here. One limitation is that functions that depend on the ordering of a column (e.g. median and quantile) can only be computed using estimated methods; a short example follows the table.

Function    Exact/Estimate  Notes
min         Exact
max         Exact
mean        Exact
sum         Exact
length      Exact
n           Exact
n_distinct  Exact
sd          Exact
var         Exact           var(x) only; cor and cov support planned
any         Exact
all         Exact
median      Estimate
quantile    Estimate        one quantile only
IQR         Estimate
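A hedged sketch of one of the estimated functions; per the table, quantile accepts a single probability per call:

# median and quantile are estimated in the one-stage group-by framework
iris %>%
  as.disk.frame %>%
  group_by(Species) %>%
  summarize(q75 = quantile(Sepal.Length, 0.75)) %>%
  collect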

Example: data.table syntax

library(data.table)
#>
#> Attaching package: 'data.table'
#> The following object is masked from 'package:purrr':
#>
#>     transpose
#> The following objects are masked from 'package:dplyr':
#>
#>     between, first, last

suppressWarnings(
  grp_by_stage1 <- flights.df[
    keep = c("month", "distance"),  # this analysis only requires "month" and "distance", so only load those columns
    month <= 6,
    .(sum_dist = sum(distance)),
    .(qtr = ifelse(month <= 3, "Q1", "Q2"))
  ]
)
grp_by_stage1
#>    qtr sum_dist
#> 1:  Q1 27188805
#> 2:  Q1   953578
#> 3:  Q1 53201567
#> 4:  Q2  3383527
#> 5:  Q2 58476357
#> 6:  Q2 27397926

The result grp_by_stage1 is a data.table, so we can finish off the two-stage aggregation using data.table syntax:

grp_by_stage2 = grp_by_stage1[, .(sum_dist = sum(sum_dist)), qtr]
grp_by_stage2
#>    qtr sum_dist
#> 1:  Q1 81343950
#> 2:  Q2 89257810

Basic info

To find out where the disk.frame is stored on disk:

# where is the disk.frame stored
attr(flights.df, "path")
#> [1] "C:\\Users\\RTX2080\\AppData\\Local\\Temp\\RtmpsnJlFJ\\file3d3ce978e3.df"

A number of data.frame functions are implemented for disk.frame

# get first few rows
head(flights.df, 1)
#>    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1: 2013     1   1      517            515         2      830            819
#>    arr_delay carrier flight tailnum origin dest air_time distance hour minute
#> 1:        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
#>              time_hour
#> 1: 2013-01-01 05:00:00

# get last few rows
tail(flights.df, 1)
#>    year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1: 2013     9  30       NA            840        NA       NA           1020
#>    arr_delay carrier flight tailnum origin dest air_time distance hour minute
#> 1:        NA      MQ   3531  N839MQ    LGA  RDU       NA      431    8     40
#>              time_hour
#> 1: 2013-09-30 08:00:00

# number of rows
nrow(flights.df)
#> [1] 336776

# number of columns
ncol(flights.df)
#> [1] 19


Current Priorities

The work priorities at this stage are

- Bugs
- Urgent feature implementations that can improve an awful user experience
- More vignettes covering every aspect of disk.frame
- Comprehensive tests
- Comprehensive documentation
- More features

Blogs and other resources

Title | Language | Author | Date | Description
25 days of disk.frame | English | ZJ | 2019-12-01 | 25 tweets about {disk.frame}
https://researchgate-/post/What_is_the_Maximum_size_of_data_that_is_supported_by_R-datamining | English | Knut Jägersberg | 2019-11-11 | Great answer on using disk.frame
{disk.frame} is epic | English | Bruno Rodriguez | 2019-09-03 | It's about loading a 30G file into {disk.frame}
My top 10 R packages for data analytics | English | Jacky Poon | 2019-09-03 | {disk.frame} was number 3
useR! 2019 presentation video | English | Dai ZJ | 2019-08-03 |
useR! 2019 presentation slides | English | Dai ZJ | 2019-08-03 |
Split-apply-combine for Maximum Likelihood Estimation of a linear model | English | Bruno Rodriguez | 2019-10-06 | {disk.frame} used in helping to create a maximum likelihood estimation program for linear models
Emma goes to useR! 2019 | English | Emma Vestesson | 2019-07-16 | The first mention of {disk.frame} in a blog post
深入对比数据科学工具箱：Python3 和 R 之争(2020版) (an in-depth comparison of the Python 3 and R data science toolkits, 2020 edition) | Chinese | Harry Zhu | 2020-02-16 | Mentions disk.frame

Interested in learning {disk.frame} in a structured course?

Please register your interest at:

https://leanpub.com/c/taminglarger-than-ramwithdiskframe

Open Collective

If you like {disk.frame} and want to speed up its development, or perhaps you have a feature request, please consider sponsoring {disk.frame} on Open Collective.

Backers

Thank you to all our backers!

Sponsor and back {disk.frame}

Support {disk.frame} development by becoming a sponsor. Your logo will show up here with a link to your website.

Contact me for consulting

Do you need help with machine learning and data science in R, Python, or Julia? I am available for Machine Learning/Data Science/R/Python/Julia consulting! Email me

Non-financial ways to contribute

Do you wish to give back to the open-source community in non-financial ways? Here are some ways you can contribute:

- Write a blog post about your use of {disk.frame}. I would love to learn more about how {disk.frame} has helped you
- Tweet or post on social media (e.g. LinkedIn) about {disk.frame} to help promote it
- Bring attention to typos and grammatical errors by correcting them and making a PR, or simply by raising an issue here
- Star the {disk.frame} GitHub repo
- Star any repo that {disk.frame} depends on, e.g. {fst} and {future}

Related Repos

https://github.com/xiaodaigh/disk.frame-fannie-mae-example
https://github.com/xiaodaigh/disk.frame-vs
https://github.com/xiaodaigh/disk.frame.ml
https://github.com/xiaodaigh/courses-larger-than-ram-data-manipulation-with-disk-frame

Live Stream of {disk.frame} development

https://youtube.com/playlist?list=PL3DVdT3kym4fIU5CO-pxKtWhdjMVn4XGe
