2022-10-31
ripyr: a utility for quickly and automatically inferring information from data streams
ripyr
Work in progress.
Overview
A utility for inferring information quickly from streams of data. It uses modern Python async tooling to process large datasets efficiently. Currently there are two supported source types:
- CSVDisk: a CSV on disk, backed by the standard Python CSV reader
- JSONDisk: a file of one-JSON-blob-per-row data, with no newlines anywhere inside the JSON itself
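The one-JSON-blob-per-row format described above can be read lazily, one line at a time, which is what makes it suitable for streaming. The following is a minimal sketch using only the standard library; `iter_json_rows` is a hypothetical helper name for illustration and is not part of ripyr's API.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_json_rows(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line, without loading the whole file."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# An open file object is an iterable of lines, so this also works as:
#     with open("data.jsonl") as f:
#         for row in iter_json_rows(f): ...
rows = list(iter_json_rows(['{"a": 1}\n', '\n', '{"a": 2, "b": "x"}\n']))
```

Because each row sits on exactly one line, no JSON parser state has to be carried between lines. That is presumably why the format forbids newlines inside the JSON.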
For a given source, you can apply any of a number of metrics:
- categorical
  - approximate cardinality, based on a bloom filter
- dates
  - date format inference
- inference
  - type inference
- numeric
  - count
  - min
  - max
  - histogram
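To illustrate how a bloom filter can yield an approximate cardinality, here is a minimal sketch. It uses the standard estimate n ≈ -(m/k) · ln(1 - X/m), where m is the number of bits, k the number of hash functions, and X the number of bits set. The class name, the default sizes, and the SHA-256-based hashing are assumptions made for illustration; ripyr's actual implementation may differ.

```python
import hashlib
import math
from typing import Iterator


class BloomCardinality:
    """Hypothetical bloom-filter cardinality estimator (not ripyr's own class)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, value: str) -> Iterator[int]:
        # Derive k bit positions by salting the value with a seed.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def estimate(self) -> float:
        set_bits = sum(self.bits)
        if set_bits == self.m:
            return float("inf")  # filter is saturated; no estimate possible
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)


bloom = BloomCardinality()
for i in range(10000):
    bloom.add(f"category-{i % 25}")  # 10000 rows, 25 distinct values
estimate = bloom.estimate()
```

Memory use is fixed by the number of bits, not by the number of rows or distinct values seen, and the estimate is generally not a whole number.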
For small datasets this isn't all that fast; Pandas performs really well there. But because of the async/streaming nature of the library, it maintains a very low memory footprint regardless of dataset size. For extremely large files, this means that ripyr is not only more stable but also faster.
A secondary benefit of the library is its neat declarative syntax.
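The low memory footprint follows from processing one row at a time and keeping only running aggregates. The sketch below shows that pattern with an async generator and the standard `csv` module. The names `stream_rows` and `running_stats` are hypothetical and do not come from ripyr. It assumes Python 3.7+ for `asyncio.run` (async generators themselves need 3.6+).

```python
import asyncio
import csv
import io
from typing import AsyncIterator, Dict, Iterable


async def stream_rows(lines: Iterable[str]) -> AsyncIterator[Dict[str, str]]:
    """Yield CSV rows one at a time; only the current row is held in memory."""
    for row in csv.DictReader(lines):
        yield row
        await asyncio.sleep(0)  # hand control back to the event loop


async def running_stats(rows: AsyncIterator[Dict[str, str]], column: str) -> Dict[str, float]:
    """Keep count, min, and max as running values instead of storing the column."""
    count, lo, hi = 0, float("inf"), float("-inf")
    async for row in rows:
        value = float(row[column])
        count += 1
        lo = min(lo, value)
        hi = max(hi, value)
    return {"count": count, "min": lo, "max": hi}


# A small in-memory stand-in for a file on disk.
sample = io.StringIO("A,C\nx,0.25\ny,0.75\nz,0.5\n")
stats = asyncio.run(running_stats(stream_rows(sample), "C"))
```

The state kept here is three numbers, whatever the number of rows. A whole-file approach such as loading everything into a DataFrame grows with the size of the file instead.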
Example
A simple example parsing a CSV off of disk:
    import json

    # StreamingColCleaner, CSVDiskSource, and the metric classes come from ripyr;
    # the original example does not show their import lines.
    cleaner = StreamingColCleaner(source=CSVDiskSource(filename='sample.csv'))
    cleaner.add_metric_to_all(CountMetric())
    cleaner.add_metric('B', [CountMetric(), CardinalityMetric()])
    cleaner.add_metric('C', [CardinalityMetric(), MaxMetric()])
    cleaner.add_metric('date', DateFormat())
    cleaner.process_source()
    print(json.dumps(cleaner.report(), indent=4, sort_keys=True))
Which will quickly give you a report that looks like:
    {
        "columns": [
            "C",
            "D",
            "A",
            "B",
            "date"
        ],
        "metrics": {
            "A": {
                "count": 10000
            },
            "B": {
                "approximate_cardinality": 25.794076259392007,
                "count": 10000
            },
            "C": {
                "approximate_cardinality": null,
                "count": 10000,
                "max": 0.9999592447150767
            },
            "D": {
                "count": 10000
            },
            "date": {
                "count": 10000,
                "estimated_schema": "{Between 1.0-12.0}/{Between 1.0-31.0}/{Between 1970.0-2017.0} {Between 1.0-12.0}:{Between 0.0-59.0} {AM|PM}"
            }
        }
    }
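The `estimated_schema` string in the report suggests that each numeric field of the date is summarized by the range of values observed, and each alphabetic field by the set of words seen. One way such a summary could be built in a single streaming pass is sketched below. This is an assumption about the general idea, not ripyr's actual algorithm; `RangeSchema` and its methods are hypothetical names, and the sketch prints integer ranges where the report shows floats.

```python
import re
from typing import Any, Dict, List, Optional

# Split a string into runs of digits, runs of letters, and runs of anything else.
TOKEN = re.compile(r"\d+|[A-Za-z]+|[^\dA-Za-z]+")


class RangeSchema:
    """Hypothetical streaming summary of a date-like column's shape."""

    def __init__(self) -> None:
        self.slots: Optional[List[Dict[str, Any]]] = None

    @staticmethod
    def _start(tok: str) -> Dict[str, Any]:
        if tok.isdigit():
            return {"kind": "num", "lo": int(tok), "hi": int(tok)}
        if tok.isalpha():
            return {"kind": "word", "seen": {tok}}
        return {"kind": "sep", "text": tok}

    @staticmethod
    def _merge(slot: Dict[str, Any], tok: str) -> None:
        if slot["kind"] == "num" and tok.isdigit():
            slot["lo"] = min(slot["lo"], int(tok))
            slot["hi"] = max(slot["hi"], int(tok))
        elif slot["kind"] == "word" and tok.isalpha():
            slot["seen"].add(tok)
        elif slot["kind"] == "sep" and slot["text"] == tok:
            pass  # same separator as before, nothing to update
        else:
            raise ValueError("value does not match the shape seen so far")

    def update(self, text: str) -> None:
        tokens = TOKEN.findall(text)
        if self.slots is None:
            self.slots = [self._start(t) for t in tokens]
            return
        if len(tokens) != len(self.slots):
            raise ValueError("value does not match the shape seen so far")
        for slot, tok in zip(self.slots, tokens):
            self._merge(slot, tok)

    def describe(self) -> str:
        parts = []
        for slot in self.slots or []:
            if slot["kind"] == "num":
                parts.append("{Between %d-%d}" % (slot["lo"], slot["hi"]))
            elif slot["kind"] == "word":
                parts.append("{" + "|".join(sorted(slot["seen"])) + "}")
            else:
                parts.append(slot["text"])
        return "".join(parts)


schema = RangeSchema()
for value in ["1/31/1970 12:00 AM", "12/1/2017 1:59 PM"]:
    schema.update(value)
summary = schema.describe()
```

As with the other metrics, the state kept is proportional to the number of fields in one value, not to the number of rows.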
Written in Python 3.5+ with async/await and type annotations.