CodeBuff smart formatter一个基于AI的通用代码格式化程序

网友投稿 1036 2022-10-26

CodeBuff smart formatter一个基于AI的通用代码格式化程序

CodeBuff smart formatter一个基于AI的通用代码格式化程序

CodeBuff smart formatter

Abstract

Code formatting is not particularly exciting but many researchers would consider it either unsolved or not well-solved. The two well-established solutions are:

Build a custom program that formats code for specific a language with ad hoc techniques, typically subject to parameters such as "always put a space between operators". Define a set of formal rules that map input patterns to layout instructions such as "line these expressions up vertically".

Either techniques are painful and finicky.

This repository is a step towards what we hope will be a universal code formatter that uses machine learning to look for patterns in a corpus and to format code using those patterns.

Whoa! It appears to work. Academic paper, Towards a Universal Code Formatter through Machine Learning accepted to SLE2016. Sample output is in the paper or next section.

Sample output

All input is completed squeezed of whitespace/newlines so only the output really matters when examining CodeBuff output. You can check out the output dir for leave-one-out formatting of the various corpora. But, here are some sample formatting results.

SQL

SELECT *FROM DMartLoggingWHERE DATEPART(day, ErrorDateTime) = DATEPART(day, GetDate()) AND DATEPART(month, ErrorDateTime) = DATEPART(month, GetDate()) AND DATEPART(year, ErrorDateTime) = DATEPART(year, GetDate())ORDER BY ErrorDateTime DESC

SELECT CASE WHEN SSISInstanceID IS NULL THEN 'Total' ELSE SSISInstanceID END SSISInstanceID , SUM(OldStatus4) AS OldStatus4 , SUM(Status0) AS Status0 , SUM(Status1) AS Status1 , SUM(Status2) AS Status2 , SUM(Status3) AS Status3 , SUM(Status4) AS Status4 , SUM(OldStatus4 + Status0 + Status1 + Status2 + Status3 + Status4) AS InstanceTotalFROM ( SELECT CONVERT(VARCHAR, SSISInstanceID) AS SSISInstanceID , COUNT(CASE WHEN Status = 4 AND CONVERT(DATE, LoadReportDBEndDate) < CONVERT(DATE, GETDATE()) THEN Status ELSE NULL END) AS OldStatus4 , COUNT(CASE WHEN Status = 0 THEN Status ELSE NULL END) AS Status0 , COUNT(CASE WHEN Status = 1 THEN Status ELSE NULL END) AS Status1 , COUNT(CASE WHEN Status = 2 THEN Status ELSE NULL END) AS Status2 , COUNT(CASE WHEN Status = 3 THEN Status ELSE NULL END) AS Status3--, COUNT ( CASE WHEN Status = 4 THEN Status ELSE NULL END ) AS Status4 , COUNT(CASE WHEN Status = 4 AND DATEPART(DAY, LoadReportDBEndDate) = DATEPART(DAY, GETDATE()) THEN Status ELSE NULL END) AS Status4 FROM dbo.ClientConnection GROUP BY SSISInstanceID ) AS StatusMatrixGROUP BY SSISInstanceID

Java

public class Interpreter { ... public static final Set predefinedAnonSubtemplateAttributes = new HashSet() { { add("i"); add("i0"); } };... public int exec(STWriter out, InstanceScope scope) { final ST self = scope.st; if ( trace ) System.out.println("exec("+self.getName()+")"); try { setDefaultArguments(out, scope); return _exec(out, scope); } catch (Exception e) { StringWriter sw = new StringWriter(); PrintWriter pw = new PrintWriter(sw); e.printStackTrace(pw); pw.flush(); errMgr.runTimeError(this, scope, ErrorType.INTERNAL_ERROR, "internal error: "+sw.toString()); return 0; } }... protected int _exec(STWriter out, InstanceScope scope) { final ST self = scope.st; int start = out.index(); // track char we're about to write int prevOpcode = 0; int n = 0; // how many char we write out int nargs; int nameIndex; int addr; String name; Object o, left, right; ST st; Object[] options; byte[] code = self.impl.instrs; // which code block are we executing int ip = 0; while ( ip

ANTLR

referenceType : classOrInterfaceType | typeVariable | arrayType ;classOrInterfaceType : ( classType_lfno_classOrInterfaceType | interfaceType_lfno_classOrInterfaceType ) ( classType_lf_classOrInterfaceType | interfaceType_lf_classOrInterfaceType )* ;

classModifier : annotation | 'public' | 'protected' | 'private' | 'abstract' | 'static' | 'final' | 'strictfp' ;

typeSpecifier : ( 'void' | 'char' | 'short' | 'int' | 'long' | 'float' | 'double' | 'signed' | 'unsigned' | '_Bool' | '_Complex' | '__m128' | '__m128d' | '__m128i' ) | '__extension__' '(' ('__m128' | '__m128d' | '__m128i') ')' | atomicTypeSpecifier | structOrUnionSpecifier | enumSpecifier | typedefName | '__typeof__' '(' constantExpression ')' // GCC extension ;

Build complete jar

To make a complete jar with all of the dependencies, do this from the repo main directory:

$ mvn clean compile install

This will leave you with artifact target/codebuff-1.4.19.jar or whatever the version number is and put the jar into the usual maven local cache.

Formatting files

To use the formatter, you need to use class org.antlr.codebuff.Tool. Commandline usage:

-g grammar-name. The grammar must be run through ANTLR and be compiled (and in the CLASSPATH). For example, for Java8.g4, use -g Java8, not the filename. For separated grammar files, like ANTLRv4Parser.g4 and ANTLRv4Lexer.g4, use -g ANTLRv4. If the grammar is in a package, use fully-qualified like -g org.antlr.codebuff.ANTLRv4.-rule start-rule. Start rule of the grammar where parsing of a full file starts, such as compilationUnit in Java.g4.-corpus root-dir-of-samples[-files file-extension]. E.g., use java, g4, c, ...[-indent num-spaces]. This defaults to 4 spaces indentation.[-comment line-comment-name]. As a failsafe, CodeBuff allows you to specify the token name for single-line comments, such as LINE_COMMENT, within the grammar so that it can ensure there is a line break after a single line,.[-o output-file]. Filename with optional path to where output should go.file-to-format. Filename (with optional path) must be last.

Output goes to standard out unless you use -o.

$ java -jar target/codebuff-1.4.19.jar \ -g org.antlr.codebuff.ANTLRv4 \ -rule grammarSpec \ -corpus corpus/antlr4/training \ -files g4 \ -indent 4 \ -comment LINE_COMMENT \ T.g4

$ java -jar target/codebuff-1.4.19.jar \ -g org.antlr.codebuff.Java \ -rule compilationUnit \ -corpus corpus/java/training/stringtemplate4 \ -files java \ -comment LINE_COMMENT \ T.java

These examples work for the grammars specified because they are already inside the complete jar. For parsers compiled outside of the jar, you might need to do something like:

java java -cp target/codebuff-1.4.19.jar:$CLASSPATH \ org.antlr.codebuff.Tool \ -g org.antlr.codebuff.ANTLRv4 \ -rule grammarSpec -corpus corpus/antlr4/training \ -files g4 -indent 4 -comment LINE_COMMENT T.g4

Grammar requirements

All whitespace should go to the parser on a hidden channel. For example, here is a rule that does that:

WS : [ \t\r\n\f]+ -> channel(HIDDEN) ;

Comments should also:

BLOCK_COMMENT : '/*' .*? ('*/' | EOF) -> channel(HIDDEN) ;LINE_COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;

You can have line comments match newlines if you want.

Speed tests

The paper cites some speed tests for training and formatting time for

guava corpus and java grammarguava corpus and java8 grammarantlr corpus and antlr parser grammar, antlr lexer grammar

First, here is my machine configuration:

Memory speed seems to make a big difference given how much we have to trawl through memory---The tests shown below were done with 1867 MHz DDR3 RAM. We set an initial 4G RAM, 1M stack size. First build everything:

$ mvn clean compile install

Then you can run the speed tests as shown in following subsections.

ANTLR corpus

$ java -Xmx4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -antlr corpus/antlr4/training/Java8.g4Loaded 12 files in 172msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 353ms formatting = 340msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 188ms formatting = 161msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 145ms formatting = 153msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 130ms formatting = 129msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 123ms formatting = 113msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 114ms formatting = 116msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 93ms formatting = 90msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 80ms formatting = 90msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 88msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 72ms formatting = 71msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 69msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 73msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 76ms formatting = 63msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 69msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 70msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 68msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 71ms formatting = 66msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 70ms formatting = 70msantlr training of /Users/parrt/antlr/code/codebuff/corpus/antlr4/training/Java8.g4 = 73ms formatting = 72msmedian of [5:19] training 72msmedian of [5:19] formatting 70ms

Guava corpus, Java grammar

$ java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.Speed -java_guava corpus/java/training/guava/cache/LocalCache.javaLoaded 511 files in 1949msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1984ms formatting = 2669msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1747ms formatting = 3166msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1784ms formatting = 2811msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1507ms formatting = 1742msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2832msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1582ms formatting = 2663msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1499ms formatting = 2807msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1561ms formatting = 2815msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1521ms formatting = 2136msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1545ms formatting = 2811msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2800msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2581msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2838msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1494ms formatting = 2789msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1497ms formatting = 2621msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1501ms formatting = 2714msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1506ms formatting = 2816msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1512ms formatting = 2733msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1515ms formatting = 2587msjava_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1508ms formatting = 2430msmedian of [5:19] training 1506msmedian of [5:19] formatting 2733ms

Guava corpus, Java8 grammar

Load time here is very slow (2.5min) because the Java8 grammar is meant to reflect the language spec. It has not been optimized for performance. Once the corpus is loaded, training and formatting times are about the same as for Java grammar.

$ java -Xms4G -Xss1M -cp target/codebuff-1.4.19.jar \ org.antlr.codebuff.validation.Speed \ -java8_guava corpus/java/training/guava/cache/LocalCache.javaLoaded 511 files in 159947msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 2238ms formatting = 23312msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1913ms formatting = 2368msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2277msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2267msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1868ms formatting = 2348msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1890ms formatting = 2263msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1866ms formatting = 2328msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1855ms formatting = 2247msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1856ms formatting = 2243msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2204msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1863ms formatting = 2244msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1850ms formatting = 2212msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1861ms formatting = 2215msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1877ms formatting = 2257msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1843ms formatting = 2249msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1842ms formatting = 2205msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1869ms formatting = 2343msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1864ms formatting = 2225msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1851ms formatting = 2260msjava8_guava training of /Users/parrt/antlr/code/codebuff/corpus/java/training/guava/cache/LocalCache.java = 1871ms formatting = 2200msmedian of [5:19] training 1863msmedian of [5:19] formatting 2244ms

Generating graphs from paper

In the Towards a Universal Code Formatter Through Machine Learning paper, we have three graphs to support our conclusions. This sections shows how to reproduce them. (Note that these jobs take many minutes to run; maybe up to 30 minutes for one of them on a fast box.)

The Java code generates python code that uses matplotlib. The result of running the python is a PDF of the graph (that also pops up in a window).

Box plot with median error rates

To generate:

do this:

$ mvn clean compile install$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.LeaveOneOutValidator...wrote python code to python/src/leave_one_out.py$ cd python/src$ python leave_one_out.py &

Plot showing effect of corpus size on error rate

To generate:

do this:

$ mvn clean compile install$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.SubsetValidator...wrote python code to python/src/subset_validator.py$ cd python/src$ python subset_validator.py &

Plot showing effect of varying model parameter k

To generate:

do this:

$ mvn clean compile install$ java -Xms8G -Xss1M -cp target/codebuff-1.4.19.jar org.antlr.codebuff.validation.TestK...wrote python code to python/src/vary_k.py$ cd python/src$ python vary_k.py &

版权声明:本文内容由网络用户投稿,版权归原作者所有,本站不拥有其著作权,亦不承担相应法律责任。如果您发现本站中有涉嫌抄袭或描述失实的内容,请联系我们jiasou666@gmail.com 处理,核实后本网站将在24小时内删除侵权内容。

上一篇:kafka并发写大消息异常TimeoutException排查记录
下一篇:为Laravel的Eloquent ORM添加应用程序级别的级联删除
相关文章

 发表评论

暂时没有评论,来抢沙发吧~