ripyr: a utility for quickly and automatically inferring information from data streams

User submission · 520 views · 2022-10-31

ripyr

Work in progress.

Overview

A utility for inferring information quickly from streams of data. It uses modern Python async tooling to process large datasets efficiently. Currently there are two supported source types:

CSVDisk: a CSV on disk, backed by the standard Python CSV reader
JSONDisk: a file of one-JSON-blob-per-row data, without newlines anywhere in the JSON itself
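To make the two source shapes concrete, here is a minimal sketch of how rows from each kind of file could be read lazily with the standard library. This is not ripyr's implementation; the function names iter_csv_rows and iter_json_rows are hypothetical, and the sketch takes open text streams rather than filenames so that it stays self-contained.

```python
import csv
import io
import json
from typing import Dict, Iterator, TextIO


def iter_csv_rows(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line (hypothetical helper)."""
    # csv.DictReader reads lazily, so only one row is held in memory at a time.
    yield from csv.DictReader(stream)


def iter_json_rows(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per line of a one-JSON-object-per-line file (hypothetical helper)."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# In-memory streams stand in for files on disk.
csv_rows = list(iter_csv_rows(io.StringIO("A,B\n1,x\n2,y\n")))
json_rows = list(iter_json_rows(io.StringIO('{"A": 1, "B": "x"}\n{"A": 2, "B": "y"}\n')))
print(csv_rows)
print(json_rows)
```

The one-object-per-line restriction is what makes the JSON case streamable: each line can be parsed on its own without loading the whole file.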

For a given source, you can apply any of a number of metrics:

categorical: approximate cardinality, based on a bloom filter
dates: date format inference
inference: type inference
numeric: count, min, max, histogram
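As an illustration of the categorical metric, the sketch below estimates cardinality from the fill ratio of a Bloom filter, using the standard estimate n ≈ -(m/k) · ln(1 - X/m) for m bits, k hash functions, and X set bits. It is an assumption about how such a metric could work, not ripyr's actual code; the class name BloomCardinality and its parameters are hypothetical.

```python
import hashlib
import math
from typing import Iterator


class BloomCardinality:
    """Approximate distinct count from a Bloom filter's fill ratio (illustrative only)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit: wasteful, but keeps the sketch easy to read.
        self.bits = bytearray(num_bits)

    def _positions(self, value: str) -> Iterator[int]:
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def estimate(self) -> float:
        set_bits = sum(self.bits)
        if set_bits >= self.num_bits:
            return float("inf")  # filter saturated; no estimate possible
        ratio = set_bits / self.num_bits
        return -(self.num_bits / self.num_hashes) * math.log(1 - ratio)


# 10,000 values drawn from 25 distinct categories.
metric = BloomCardinality()
for i in range(10000):
    metric.add(f"category-{i % 25}")
estimate = metric.estimate()
print(estimate)
```

Memory use depends only on the filter size, not on the number of rows seen, and the estimate is a float rather than an exact integer count.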

For small datasets this isn't all that fast; Pandas performs really well there. But because of the async/streaming nature of the library, it maintains a very low memory footprint regardless of dataset size. For extremely large files, this means that ripyr is not only more stable but also faster.
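The low memory footprint follows from updating each metric one value at a time instead of loading the whole dataset. A minimal sketch of that idea for count, min, and max is shown here; it is synchronous for brevity, and the names RunningStats and summarize are hypothetical rather than part of ripyr.

```python
from typing import Iterable, Optional


class RunningStats:
    """Count, min, and max over a stream, using constant memory (illustrative only)."""

    def __init__(self) -> None:
        self.count = 0
        self.min: Optional[float] = None
        self.max: Optional[float] = None

    def update(self, value: float) -> None:
        self.count += 1
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)


def summarize(values: Iterable[float]) -> RunningStats:
    stats = RunningStats()
    for value in values:  # each value is discarded after the update
        stats.update(value)
    return stats


# A generator expression stands in for a large file: no list is ever built.
stats = summarize((i % 1000) / 10 for i in range(100000))
print(stats.count, stats.min, stats.max)
```

Only three numbers are kept no matter how many rows pass through, which is why memory use does not grow with file size.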

A secondary benefit of the library is its neat declarative syntax.

Example

A simple example parsing a CSV off of disk:

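import json

# StreamingColCleaner, CSVDiskSource, and the metric classes come from ripyr (import paths not shown).

# Declare the source to read from.
cleaner = StreamingColCleaner(source=CSVDiskSource(filename='sample.csv'))

# Attach metrics: one to every column, then specific ones per column.
cleaner.add_metric_to_all(CountMetric())
cleaner.add_metric('B', [CountMetric(), CardinalityMetric()])
cleaner.add_metric('C', [CardinalityMetric(), MaxMetric()])
cleaner.add_metric('date', DateFormat())

# Stream through the file once, then print the collected report.
cleaner.process_source()
print(json.dumps(cleaner.report(), indent=4, sort_keys=True))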

Which will quickly give you a report that looks like:

{
    "columns": [
        "C",
        "D",
        "A",
        "B",
        "date"
    ],
    "metrics": {
        "A": {
            "count": 10000
        },
        "B": {
            "approximate_cardinality": 25.794076259392007,
            "count": 10000
        },
        "C": {
            "approximate_cardinality": null,
            "count": 10000,
            "max": 0.9999592447150767
        },
        "D": {
            "count": 10000
        },
        "date": {
            "count": 10000,
            "estimated_schema": "{Between 1.0-12.0}/{Between 1.0-31.0}/{Between 1970.0-2017.0} {Between 1.0-12.0}:{Between 0.0-59.0} {AM|PM}"
        }
    }
}

Written in Python 3.5+ with async/await and type annotations.
