CodeBuff smart formatter: an AI-based universal code formatter


User submission, 2022-10-26


CodeBuff smart formatter

Abstract

Code formatting is not particularly exciting but many researchers would consider it either unsolved or not well-solved. The two well-established solutions are:

1. Build a custom program that formats code for a specific language with ad hoc techniques, typically subject to parameters such as "always put a space between operators".
2. Define a set of formal rules that map input patterns to layout instructions such as "line these expressions up vertically".

Both techniques are painful and finicky.

This repository is a step towards what we hope will be a universal code formatter that uses machine learning to look for patterns in a corpus and to format code using those patterns.

Whoa! It appears to work. The academic paper, "Towards a Universal Code Formatter through Machine Learning", was accepted to SLE 2016. Sample output is in the paper and in the next section.

Sample output

All input is completely squeezed of whitespace and newlines, so only the output really matters when examining CodeBuff results. You can check out the output dir for leave-one-out formatting of the various corpora. Here are some sample formatting results.

SQL

SELECT *
FROM DMartLogging
WHERE DATEPART(day, ErrorDateTime) = DATEPART(day, GetDate())
      AND DATEPART(month, ErrorDateTime) = DATEPART(month, GetDate())
      AND DATEPART(year, ErrorDateTime) = DATEPART(year, GetDate())
ORDER BY ErrorDateTime DESC

SELECT
    CASE WHEN SSISInstanceID IS NULL THEN 'Total' ELSE SSISInstanceID END SSISInstanceID
    , SUM(OldStatus4) AS OldStatus4
    , SUM(Status0) AS Status0
    , SUM(Status1) AS Status1
    , SUM(Status2) AS Status2
    , SUM(Status3) AS Status3
    , SUM(Status4) AS Status4
    , SUM(OldStatus4 + Status0 + Status1 + Status2 + Status3 + Status4) AS InstanceTotal
FROM
    (
        SELECT
            CONVERT(VARCHAR, SSISInstanceID) AS SSISInstanceID
            , COUNT(CASE WHEN Status = 4 AND CONVERT(DATE, LoadReportDBEndDate) < CONVERT(DATE, GETDATE()) THEN Status ELSE NULL END) AS OldStatus4
            , COUNT(CASE WHEN Status = 0 THEN Status ELSE NULL END) AS Status0
            , COUNT(CASE WHEN Status = 1 THEN Status ELSE NULL END) AS Status1
            , COUNT(CASE WHEN Status = 2 THEN Status ELSE NULL END) AS Status2
            , COUNT(CASE WHEN Status = 3 THEN Status ELSE NULL END) AS Status3
--, COUNT ( CASE WHEN Status = 4 THEN Status ELSE NULL END ) AS Status4
            , COUNT(CASE WHEN Status = 4 AND DATEPART(DAY, LoadReportDBEndDate) = DATEPART(DAY, GETDATE()) THEN Status ELSE NULL END) AS Status4
        FROM dbo.ClientConnection
        GROUP BY SSISInstanceID
    ) AS StatusMatrix
GROUP BY SSISInstanceID

Java

public class Interpreter {
    ...
    public static final Set predefinedAnonSubtemplateAttributes =
        new HashSet() {
            {
                add("i");
                add("i0");
            }
        };
...
    public int exec(STWriter out, InstanceScope scope) {
        final ST self = scope.st;
        if ( trace ) System.out.println("exec("+self.getName()+")");
        try {
            setDefaultArguments(out, scope);
            return _exec(out, scope);
        }
        catch (Exception e) {
            StringWriter sw = new StringWriter();
            PrintWriter pw = new PrintWriter(sw);
            e.printStackTrace(pw);
            pw.flush();
            errMgr.runTimeError(this, scope, ErrorType.INTERNAL_ERROR, "internal error: "+sw.toString());
            return 0;
        }
    }
...
    protected int _exec(STWriter out, InstanceScope scope) {
        final ST self = scope.st;
        int start = out.index(); // track char we're about to write
        int prevOpcode = 0;
        int n = 0; // how many char we write out
        int nargs;
        int nameIndex;
        int addr;
        String name;
        Object o, left, right;
        ST st;
        Object[] options;
        byte[] code = self.impl.instrs; // which code block are we executing
        int ip = 0;
        while ( ip

ANTLR

referenceType
    : classOrInterfaceType
    | typeVariable
    | arrayType
    ;

classOrInterfaceType
    : ( classType_lfno_classOrInterfaceType
      | interfaceType_lfno_classOrInterfaceType
      )
      ( classType_lf_classOrInterfaceType
      | interfaceType_lf_classOrInterfaceType
      )*
    ;

Build complete jar

To make a complete jar with all of the dependencies, do this from the repo main directory:

$ mvn clean compile install

This leaves you with the artifact target/codebuff-1.4.19.jar (or whatever the current version number is) and installs the jar into the usual local Maven cache.

Formatting files

To use the formatter, run the class org.antlr.codebuff.Tool. Command-line usage:

-g grammar-name. The grammar must be run through ANTLR and be compiled (and in the CLASSPATH). For example, for Java8.g4, use -g Java8, not the filename. For separated grammar files, like ANTLRv4Parser.g4 and ANTLRv4Lexer.g4, use -g ANTLRv4. If the grammar is in a package, use the fully-qualified name, like -g org.antlr.codebuff.ANTLRv4.
-rule start-rule. Start rule of the grammar where parsing of a full file starts, such as compilationUnit in Java.g4.
-corpus root-dir-of-samples
[-files file-extension]. E.g., use java, g4, c, ...
[-indent num-spaces]. This defaults to 4 spaces indentation.
[-comment line-comment-name]. As a failsafe, CodeBuff allows you to specify the token name for single-line comments, such as LINE_COMMENT, within the grammar so that it can ensure there is a line break after a single-line comment.
[-o output-file]. Filename with optional path to where output should go.
file-to-format. Filename (with optional path) must be last.

Output goes to standard out unless you use -o.

$ java -jar target/codebuff-1.4.19.jar \
    -g org.antlr.codebuff.ANTLRv4 \
    -rule grammarSpec \
    -corpus corpus/antlr4/training \
    -files g4 \
    -indent 4 \
    -comment LINE_COMMENT \
    T.g4

$ java -jar target/codebuff-1.4.19.jar \
    -g org.antlr.codebuff.Java \
    -rule compilationUnit \
    -corpus corpus/java/training/stringtemplate4 \
    -files java \
    -comment LINE_COMMENT \
    T.java

These examples work for the grammars specified because they are already inside the complete jar. For parsers compiled outside of the jar, you might need to do something like:

$ java -cp target/codebuff-1.4.19.jar:$CLASSPATH \
    org.antlr.codebuff.Tool \
    -g org.antlr.codebuff.ANTLRv4 \
    -rule grammarSpec -corpus corpus/antlr4/training \
    -files g4 -indent 4 -comment LINE_COMMENT T.g4
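For scripted use, the argument list described above can be assembled programmatically. The following is a minimal Python sketch, not part of CodeBuff: the helper name build_codebuff_command and its parameter names are hypothetical, and it only uses the flags listed in this section.

```python
import shlex


def build_codebuff_command(jar, grammar, start_rule, corpus_dir, target,
                           file_ext=None, indent=None, comment_token=None,
                           output=None):
    """Assemble an argument list for the CodeBuff tool (hypothetical helper)."""
    cmd = ["java", "-jar", jar,
           "-g", grammar,
           "-rule", start_rule,
           "-corpus", corpus_dir]
    # Optional flags are only emitted when a value is supplied.
    if file_ext is not None:
        cmd += ["-files", file_ext]
    if indent is not None:
        cmd += ["-indent", str(indent)]
    if comment_token is not None:
        cmd += ["-comment", comment_token]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(target)  # the file to format must come last
    return cmd


cmd = build_codebuff_command(
    jar="target/codebuff-1.4.19.jar",
    grammar="org.antlr.codebuff.ANTLRv4",
    start_rule="grammarSpec",
    corpus_dir="corpus/antlr4/training",
    target="T.g4",
    file_ext="g4",
    indent=4,
    comment_token="LINE_COMMENT",
)
print(shlex.join(cmd))
```

Once the jar has been built, the list can be handed to subprocess.run(cmd, check=True); building a list rather than a single string avoids shell-quoting problems with paths that contain spaces.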

Grammar requirements

All whitespace should go to the parser on a hidden channel. For example, here is a rule that does that:

WS : [ \t\r\n\f]+ -> channel(HIDDEN) ;

Comments should go to a hidden channel as well:

BLOCK_COMMENT : '/*' .*? ('*/' | EOF) -> channel(HIDDEN) ;
LINE_COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;

You can have line comments match newlines if you want.
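Before training on a new grammar, it can help to check which lexer rules send their tokens to the hidden channel and which discard them with skip. Since whitespace is supposed to reach the parser on a hidden channel, a skip rule suggests the grammar needs adjusting. The Python sketch below is a rough illustration only: the function name lexer_commands is hypothetical, and the regular expression is deliberately naive (it does not handle a ';' inside a string literal, for example).

```python
import re

# Matches an upper-case (lexer) rule name followed by a skip or
# channel(HIDDEN) lexer command. Naive: ';' inside literals is not handled.
COMMAND = re.compile(
    r"^\s*([A-Z]\w*)\s*:[^;]*?->\s*(skip|channel\s*\(\s*HIDDEN\s*\))",
    re.MULTILINE,
)


def lexer_commands(grammar_text):
    """Map lexer rule name -> 'hidden' or 'skip' for rules with such a command."""
    result = {}
    for name, command in COMMAND.findall(grammar_text):
        result[name] = "skip" if command == "skip" else "hidden"
    return result


# The three rules quoted in this section.
grammar = r"""
WS : [ \t\r\n\f]+ -> channel(HIDDEN) ;
BLOCK_COMMENT : '/*' .*? ('*/' | EOF) -> channel(HIDDEN) ;
LINE_COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;
"""

for rule, command in lexer_commands(grammar).items():
    print(rule, command)
```

Any rule reported as skip is a candidate for rewriting to -> channel(HIDDEN).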

Speed tests

The paper cites some speed tests for training and formatting time for:

- guava corpus and java grammar
- guava corpus and java8 grammar
- antlr corpus and antlr parser grammar, antlr lexer grammar

First, here is my machine configuration:

Memory speed seems to make a big difference given how much we have to trawl through memory. The tests shown below were done with 1867 MHz DDR3 RAM. We set an initial 4G of RAM and a 1M stack size. First build everything:

$ mvn clean compile install

Then you can run the speed tests as shown in the following subsections.

ANTLR corpus

$ java -Xmx4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -antlr corpus/antlr4/training/Java8.g4
Loaded 12 files in 172ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 353ms formatting = 340ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 188ms formatting = 161ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 145ms formatting = 153ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 130ms formatting = 129ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 123ms formatting = 113ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 114ms formatting = 116ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 93ms formatting = 90ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 80ms formatting = 90ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 88ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 72ms formatting = 71ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 69ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 73ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 76ms formatting = 63ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 69ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 68ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 66ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70ms
antlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 72ms
median of [5:19] training 72ms
median of [5:19] formatting 70ms
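The reported medians can be recomputed from the trial lines. The Python sketch below assumes that "median of [5:19]" means the median over trials 5 through 19 (zero-based, inclusive), so the first five runs are discarded as warm-up. That reading is an assumption, but it reproduces the two medians printed for the ANTLR corpus run. The function names and the shortened file path are illustrative only.

```python
import re
from statistics import median

# One trial line: "... training of <file> = 353ms formatting = 340ms"
TRIAL = re.compile(r"training of \S+ = (\d+)ms formatting = (\d+)ms")


def parse_trials(log_text):
    """Return two lists of ints: training times and formatting times, in ms."""
    pairs = TRIAL.findall(log_text)
    return [int(t) for t, _ in pairs], [int(f) for _, f in pairs]


def warm_median(times, warmup=5):
    """Median after dropping the first `warmup` trials (assumed meaning of [5:19])."""
    return median(times[warmup:])


# Timings copied from the ANTLR corpus run, rebuilt into log lines
# (with a shortened, relative file path).
training_ms = [353, 188, 145, 130, 123, 114, 93, 80, 73, 72,
               71, 71, 76, 70, 70, 73, 70, 71, 70, 73]
formatting_ms = [340, 161, 153, 129, 113, 116, 90, 90, 88, 71,
                 69, 73, 63, 70, 69, 70, 68, 66, 70, 72]
log = "\n".join(
    f"antlr training of corpus/antlr4/training/Java8.g4 = {t}ms formatting = {f}ms"
    for t, f in zip(training_ms, formatting_ms)
)

training, formatting = parse_trials(log)
print(warm_median(training), warm_median(formatting))  # 72 70
```

Dropping the early trials matters here: the first run takes 353 ms to train, roughly five times the steady-state figure.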

Guava corpus, Java grammar

$ java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -java_guava corpus/java/training/guava/cache/LocalCache.java
Loaded 511 files in 1949ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1984ms formatting = 2669ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1747ms formatting = 3166ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1784ms formatting = 2811ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1507ms formatting = 1742ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2832ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1582ms formatting = 2663ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2807ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1561ms formatting = 2815ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1521ms formatting = 2136ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1545ms formatting = 2811ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2800ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2581ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2838ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2789ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1497ms formatting = 2621ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2714ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2816ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1512ms formatting = 2733ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1515ms formatting = 2587ms
java_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1508ms formatting = 2430ms
median of [5:19] training 1506ms
median of [5:19] formatting 2733ms

Guava corpus, Java8 grammar

Load time here is very slow (about 2.5 minutes) because the Java8 grammar is meant to reflect the language spec and has not been optimized for performance. Once the corpus is loaded, training and formatting times are about the same as for the Java grammar.

$ java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar \
    org.antlr.codebuff.validation.Speed \
    -java8_guava corpus/java/training/guava/cache/LocalCache.java
Loaded 511 files in 159947ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 2238ms formatting = 23312ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1913ms formatting = 2368ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2277ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2267ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1868ms formatting = 2348ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1890ms formatting = 2263ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1866ms formatting = 2328ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2247ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2243ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2204ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1863ms formatting = 2244ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1850ms formatting = 2212ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1861ms formatting = 2215ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1877ms formatting = 2257ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1843ms formatting = 2249ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1842ms formatting = 2205ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1869ms formatting = 2343ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1864ms formatting = 2225ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1851ms formatting = 2260ms
java8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2200ms
median of [5:19] training 1863ms
median of [5:19] formatting 2244ms

Generating graphs from paper

In the "Towards a Universal Code Formatter through Machine Learning" paper, we have three graphs to support our conclusions. This section shows how to reproduce them. (Note that these jobs take many minutes to run; maybe up to 30 minutes for one of them on a fast box.)

The Java code generates Python code that uses matplotlib. Running the Python code produces a PDF of the graph, which also pops up in a window.

Box plot with median error rates

To generate this plot, run:

$ mvn clean compile install
$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.LeaveOneOutValidator
...
wrote python code to python/src/leave_one_out.py
$ cd python/src
$ python leave_one_out.py &
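The validator's name refers to leave-one-out validation: each corpus file is formatted by a model trained on all the other files, so a file never helps format itself. The Python sketch below shows only the splitting step, under that assumption. The file names are hypothetical, and the training and error-measurement steps that LeaveOneOutValidator performs are not reproduced here.

```python
def leave_one_out(files):
    """Yield (held_out, training_set) pairs, one per corpus file."""
    for i, held_out in enumerate(files):
        # Train on every file except the one being formatted.
        yield held_out, files[:i] + files[i + 1:]


corpus = ["A.java", "B.java", "C.java"]  # hypothetical file names
for test_file, training in leave_one_out(corpus):
    print(test_file, "<-", training)
```

A corpus of n files therefore yields n train/format rounds, which is one reason these jobs take many minutes to run.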

Plot showing effect of corpus size on error rate

To generate this plot, run:

$ mvn clean compile install
$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.SubsetValidator
...
wrote python code to python/src/subset_validator.py
$ cd python/src
$ python subset_validator.py &

Plot showing effect of varying model parameter k

To generate this plot, run:

$ mvn clean compile install
$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.TestK
...
wrote python code to python/src/vary_k.py
$ cd python/src
$ python vary_k.py &
