Vol.6 No.2
J. o f C o m p u t .
Sci.
& Technol.
1991
Source-to-Source Conversion Based on F o r m a l Definition Zhang Xing'er(~z~JL ) and Zhu Xiaojun (~1~) Department of Computer Science, Nanjing University, Nanjing 210008 Li Jianxin (~NN) Jiangsu Research Institute of Computing Technology, Nanjing 210000 Dong Jianning (~N~) Institute of Computer, zijin Information Industry Corp., Nanjing 210000 Received December 20, 1988; revised January 4, 1990. Abstract
This paper proposes the idea of source - to - source conversion between two heterogeneous high-level programming languages. The conversion is based on formal definition and oriented to multi-pairs of innguages. The issues in conversion from PASCAL to C are also discussed. 1.
Introduction
In recent years, much effort has been paid to the source-to-source conversion between two heterogeneous high-level languages, such as PASCAL to A D A , ADA .to PASCAL TM , FORTRAN to ADA TM , PASCAL.to M O D U L A - 2 TM and PASCAL t o C I41 . Language conversion can support software reuse and prototyping. However, to date almost all projects are about two specific programming languages, so it is necessary to develop the general purpose conversion tools for multi-pairs of languages. By using the syntax rules and the conversion rules defined formally, a conversion system with the highest flexibility and adaptability can b e obtained. This paPer proposes the idea of source-to-source conversion between two heterogeneous high-level programming languages, which is based on the formal definition and oriented to multi-pairs of languages. I t is the first to apply the improved canonical abstract syntax tree to language conversion. The practice to date shows the virtues of the idea: simplicity, easy to guarantee the correctness and the high adaptability. When the rules are modified, there will be no influence on the architecture of the conversion system. 2.
2.1.
Source-to-Source
Language
Comparison
Conversion
Based
on
Formal
Definition
a n d the F e a s i b i l i t y a b o u t C o n v e r s i o n
Assume that A and B are two heterogeneous high-level programming languages. The purpose of source-to-source conversion is to convert a program PA in .4 into an equivalent program Ps in B. Let [P] represent the functionality of a program P . DefinitiOn 1 . I f P~ is syntactically correct about .4 and so is PB about B , besides, [ PA ] =[ Ps] , then PA and P8 are called equivalent, denoted as PA =-- P s . Definition 2 . I f a system T can convert any PA which is syntactically correct about .4 into a correspondingly equivalent PB, and for any erroneous PA, every error in PA has its mirror in Pc, then T is called a conversion system converting A into B strictly.
No.2
Source-to-Source Conversion Based on Formal Definition
179
In general, it is difficult to be converted strictly. The conversion of an erroneous PA results in an erroneous PB, but PB may not show the same errors as in PA, or it even reveals no error at all. Definition 3 . I f a system T can convert any correct P~ into a correspondingly equivalent PB, but for an erroneous P~ , there may be no mirror in PB for the error in PA, then T is called a conversion system (not strict~ ) converting A into B . In practice, it is necessary for a conversion system T to find out all the syntactic errors in P,4 , where T may be strict or n o t . It is self-evident that by equivalence between two programs , it means that both of them are syntactically correct with regard to the corresponding language. On the premise of syntactic correctness, they have the same functionality, i.e. semantic equivalence. Each specific syntactic structure always corresponds to a definite set of machine commands according to the semantics defined implicitly. The meaning of a source program is determined fully by the syntactic structures which compose the p r o g r a m . That means, to convert PA into the equivalent PB can be implemented at a source level if in B a correspondence relationship can be built for any syntactic structure in A , i . e . any syntactic structure in A can be expressed in terms of the syntactic structures in B. In general, there are three cases about the correspondence relationship between the syntactic structures of the two heterogeneous languages: directly compatible, indirectly compatible and semantically compatible. We illustrate them by some examples in converting PASCAL to C . By directly compatible it means that in B there is a direct correspondence for a syntactic structure in A. They have similar notions and structures. E.g. the repeat_statement in PASCAL can correspond directly to the d o - s t a t e m e n t in C . By indirectly compatible it means that in B there is no direct correspondence for a syntactic structure in A . But the structure may be expressed in terms of the other syntactic structures in A which a r e directly compatible. So it can be corresponding to a combination of syntactic structures in B . E .g. the s e t _ t y p e in PASCAL may be indirectly compatible with one dimension of integer array in C.in which every element takes only the value 0 or 1. By semantically compatible it means that a syntactic structure in A is neither directly compatible nor indirectly compatible. We call it semantically compatible. An example is the with_ statement in P A S C A L . In general, the correspondence for a semantically compatible syntactic structure has to be determined according to its semantics. 2.2.
T h e I d e o l o g y f o r Source- to- S o u r c e C o n v e r s i o n
For implementing the s o u r c e - t o - s o u r c e conversion between two heterogeneous languages, the traditional technique is to use subroutines. But there are several-defects for i t . I ) When the difference between the program structures of two languages is great, it may be necessary to set up buffers for saving temporally the pieces of the source PA which have been handled but the corresponding structures for them in B cannot be generated y e t . The number of buffers may be more than one and the sizes of buffers may be quite big. 2) The program seems to be torn to pieces and got bogged down in implementation details. So it is easy to miss some cases which have to be considered. When errors are revealed o r some syntax rules are to be modified, it is difficult to modify, and new errors may be brought in . 3 ) It is difficult to adapt to the modification of the syntax. Especially when the system must be developed from scratch for a new pair of languages. 4 ) It is difficult to find out the errors in a source language program and retrieve them. Thus both [1] and [2] introduced the internal tree representation as the transition from source (PA) to o~ect (PB) 9 The internal tree representation is either the parsing tree or a variety of parsing trees defined by the designers themselves. The number of nodes i n an intermediate tree is too large. And they'are all for two specific languages without considering h o w to handle the situation when the languages are modified o r / and expanded partly, especially when they
180
J. of C o m p u t . Sci. & Technol.
Vol.6
are a new pair. The idea of source-to-source conversion based on the formal definition is to be directed by both the syntax rules and the conversion rules defined formally, and based on the canonical abstract syntax tree (AST) [51 .Precisely, first, based on the syntax rules for A , a corresponding (improved) canonical AST is generated from PA , then based on the conversion rules from A to B , the corresponding PB is generated from the canonical A S T . In view of keeping only the notational-free essential information in an A S T , the AST is a good common notation for the same kind of syntactic components in different languages. It can be used as an excellent intermediate notation for source-to-source conversion. Introducing the canonical AST TM makes the notion of AST standardized. When the generating l a w s are used, the number of nodes in an AST is not only much less than that in the corresponding parsing tree, but also much less than that in a normal A S T .
2.3. Architecture of the Conversion System The conversion system based on the formal definition consists of four parts, i . e . the rule processor, the AST generator, the converter from tree to program in B , and the prettyprinter. The diagram is shown in F i g . 1 . 1) Rule processor: Its functionality is to input both the syntax rules of the source language A and the conversion rules from source (.4) to object ( B ) , process them into the internal representation, and generate the LR parsing table for A . Obviously, for a particular pair of languages, the rule processor works only once, i . e . it withdraws as soon as the parsing table is generated and the internal representation is obtained from the rules. 2 ) AST generator: It is a classical LR parser which inputs the program in A to be converted and performs both lexical and syntactic analyses. But the recent implementation has several advantages over the normal LR parser. a ) During parsing, the (improved) canonical A S T , rather than the parsing tree, is generated. b ) To keep all the essential information in an A S T , only the tables associated with conversion are generated. c ) The interactive syntactic error retrieve facility is provided when there are some syntactic errors in a source language program .Deleting,modifying or inserting a token interactively are allowed. Program in A I
Rules LJ Rule Processor
l Conversion
Parsing Table . AST Generator Syntax Rules
Rules
[ I
Prettyprmter
Program in B Program in B
Fig. 1 3 ) Converter from tree to program in B : Its functionality is to generate the corresponding program in B from an AST according to the conversion rules. It works as follows. Started with the root node in an A S T , a conversion rule.corresponding to the node name ( n o n t e r m i n a l ) is found o u t . Then its symbols are sequentially handled. The symbols may fall into the following "four classes: operand terminal ( identifier Or c o n s t a n t ) , terminal in B , nonterminal in A, or system subroutine's n a m e . When it is operand terminal, the corresponding identifier or constant is output. When it is terminal in B, it is output immediately.
No.2
Source-to-Source Conversion Based on Formal Definition
181
When it is nonterminal in A , the information including the recent conversion rule and the symbol position is pushed into a stack. The conversion rule on the subnode corresponding to the nonterminal becomes the one to be handled recently. So similarly the symbols in the recent conversion rule are sequentially handled. When the end of the rule is reached, the conversion rule on top of the stack becomes the rule to be handled recently , beginning with the symbol next to the one which had been handled just n o w . Go on in this way until in the stack there is no rule to be handled. The system subroutines are introduced mainly for the semantically compatible syntactic structures. So when it is the system subroutine's n a m e , the corresponding subroutine is called. When finished, the control returns to handle the next symbol in the conversion rule. Initial preparation
]
L
I-o .... roo, ....I ood,r fro~ Push rule number and position
,~
into stack
1
~ End
Get conversion rule from the top of stack as recentone. Pop stack End of the rule ?-
l(q . . . . . . . . .
I
l . . . . ina,
al
....... ,~ o.toutthe = position intostack corresponding
l
in B
subroutine "s I ....
I c=.,o,
qnodeinfo*ma*ionl
I
Takethecorresponding rue as recent one
re :*or,o~176176 I
lad........... positionI
i
1 Fig .2
The flowchart about the control program of the converter is shown in Fig. 2. The advantages are as follows: a ) The handling is directed by the conversion rules, so it is both canonical and simple. b ) The program is simple and its volume is small. c ) The flowchart is suitable for source-to-source conversion between any pair of heterogeneous languages. It is a general-purpose one being oriented to multi-pairs of languages. Modificationis not needed even if the languages (source o r / and object) are modified. 4 ) Prettyprinter : The input is a program in B , produced from an AST, without format generally ,and the output is an indentation form. To avoid redundant recognition, we can output the symbols in B alternately with prettyprinting as shown in Fig. l . In fact, the prettyprinting format may be defined formally in the conversion rules.
2.4. T h e Formal Definition of the Rules The BNF description is used to formally define the syntax of the source language A , and the similar form for the conversion rules. Let VNL represent the set of nonterminals in a language L, VTL the set of terminals in L . Let sym be the set of the names of system subroutines for processing semantically compatible syntactical structures. The syntax rules and
182
J. of C o m p u t . Sci. & Technol.
Vol.6
the conversion rules have the general form as follows: Ui:
:=XilXi2
9 . .
Xin i
Yil Yi2 " " " Yimi where U i s V N A, X i k e V N A u V T A u { e }, Y i k e V N AU V T B u s y m u { e } . Here , e represents the empty string. Each system subroutine's name is prefixed by a character @ . The following are some rules used in the conversion system from PASCAL to C . ( f o r - s t a t e m e n t ) : : =F.OR i : = ( i n i t _ v a l u e ) TO ( final_value ) DO ( statement ) =r for ( i = ( init_.value ) ; i = ( final_ value ) ; i + + ) ( statement ) Directed by the conversion rule which begins with the symbol "=~ " , the control program carries out the source -to-source conversion for the for_ statement without any conversion subroutines. So are for the most of statements. ( assignment_ statement ) : : = ( left ) : = ( expression ) => @ strsetfunc ( left ) @ comasrp ( expression ) @ comasrp ; Directed by the conversion rule, the control program carries out the source-to:source conversion for the assignment_statement. Now it is necessary only to develop two system subroutines: @ strsetfunc and @ comasrp. The functionality of the former is to recognize if the "left" is array n a m e , set n a m e , function name or others. @comasrp outputs symbols " = " , " , " or " ) " according to what the "left ~ is . If the part after " =~ " in a conversion rule is identical to the right-hand side of the corresponding syntax rule, the conversion rule may be omitted (together with the symbol " => " ) . The idea proposed in this paper has been applied to the conversion system SDPCC from PASCAL to C .
3. The Issues in Conversion from PASCAL 3.1.
to C
The Program Structure
The rules for convel~ting program structures are as follows. ( program ) : : =PROGRAM (program name ) ( parameter part ) ( non_subroutine declaration part > (subroutine declaration part) ( program body ) =:~ (parameter part ) < non_ subroutine declaration part ) < subroutine declaration part ) main ( ) ( program body ) PASCAL is a programming language based on the nested b l o c k _ s t r u c t u r e , but C is the one based on the non-nested function (definition). When converting a program in PASCAL into a program in C , the problem on scope must be solved. One is the scope of identifiers, and the other is the scope of labels. 1 ) Scope of Identifier In a program in P A S C A L , the local quantities defined in the outside subroutine can be inherited automatically to within the inside subroutines. However, for C , each function definition is an independent unit. The outside quantities accessed originally in the inside now will be undefined in the unit corresponding to the inside. The right way to transfer information between two function definitions in C is parameter passing. It is necessary to expand the number of parameters of a function which is corresponding to the inside, so that t h r o u g h the parameter passing the quantities of the outside have the definitions within the function definition. The expansion is implemented by subroutines @ param and @ extparam. 2) Scope o f Label Transferring out to an outside subroutine's body or the p r o g r a m ' s body from a subroutine's body by a goto_ statement is allowed in PASCAL. But in C , the label defined
No.2
Source-to-Source Conversion Based on Formal Definition
183
originally in the outside will be undefined in the function definition corresponding to the inside originally, so let the g o t o - s t a t e m e n t within the inside be treated as a r e t u r n _ s t a t e m e n t , making the control transfer to the function definition corresponding to the outside. Now it is necessary to assign the level which the label belongs to and the value of the label (digits) to the global variables _LEVEL (level number) and _ L A B E L (label) respectively just before return. Thus, the C correspondent of the PASCAL p r o g r a m ' s body is as follows'. . . .
goto _FINISH ; _SWITCH: switch (_LABEL) { case labell : _LABEL=0; goto _labell ; case label n : _LABEL=0; goto _labeln ;
} }
-FINISH : ;
Note that the identifiers beginning with hyphen " _ " are introduced by the system. The C correspondent of PASCAL subroutine's body is similar to that of the PASCAL program's b o d y . The difference is to replace the " goto _ F I N I S H ; " with a r e t u r n _ s t a t e m e n t . O f course, the last " - F I N I S H : ; " is unnecessary now. As for the goto_statement there are rules as follows. ( goto_statement )" : : =GOTO ( label ) =~ @ exit goto ( l a b e l ) ; ( label)::=N =~_N The functionality of the system subroutine @ exit is to judge if the goto_ statement is an abnormal exit from a procedure in P A S C A L . If not , the goto_statement is converted int o " goto _ N ; " ; otherwise, the corresponding handling is m a d e . Here the detailed discussion is not given, please refer to [6] .
3.2. The Conversion for Semantically Compatible Syntactic Structures The conversion for the semantically compatible syntactic structures can be implemented by the system subroutines. The SDPCC introduces 40 system subroutines altogether which are closely related to both languages to be processed and the considerations of the develope r ( the rule w r i t e r ) . But , implementation is both simple and straightforward in general. Since the handling generally .is localized, both the modifiability and the correctness are easy to be guaranteed. For example, the subscripted variable explains the introduction of system subroutines. Suppose there.is an array variable dec!aration in a program in PASCAL as follows. VAR A: ARRAY[ - 5 . . the converted declaration in C is int n [ 1 0 - ( - 5 ) + 1 ]
10, 1..20] OF integer
[20-1+1]
the subscripted variable a [ i , j ] will be converted into a[i-(-5)][j-1] i . e . each subscript expression will be reduced by the low bound of the corresponding dimension. That m e a n s , the definition node about the array variable A must be determined to get the definition node of each dimension. So the system subroutine @ arrdef is introduced to perform the task stated above. The corresponding rules are as follows. ( variable): := ( variable )[ (subscript list )] = ) @ arrdef( variable ) [ ( subscript list ) @ index ] According to the dimension definition node determined by @ arrdef, the system subroutine @index gets the low bound and outputs " - l o w bound ~ The rules for subscript list are
184
J. of C o m p u t . Sci. & Technol,
Vol.6
as follows. ( subscript list ) : := ( subscript list ) , ( expression ) = ) ( subscript list ) @ index ] [ ( expression )
4. Conclusion The paper proposes the idea for source-to-source conversion based on the formal definition. And it is the first to apply the improved canonical abstract syntax tree to the conversion. Directed by the idea , the conversion system SDPCC from PASCAL to C has been developed. Judged by the practice so far, the formal method with AST possesses the following advantages. a ) It is easy to modify both syntax rules and conversion rules defined formally. The modification contributes no influence to the system's architecture. It is necessary only to introduce o r / and modify some of the system subroutines, so that the system possesses high adaptability and modifiability. And it is easy to develop the conversion system itself. b ) The architecture of the system itself is simple and the program volume is small. The control algorithm is suitable for source-to-source conversion between any pair of heterogeneous languages. So it can be used as a general:purpose one. c ) The number of nodes in a canonical AST is much less than that in the corresponding parsing tree, so the memory required is reduced greatly and the conversion is speeded u p . It makes the practical application of the syntax tree possible. d ) The syntax rules are defined formally. So it is possible to use the LR parsing technique, making the parsing speed much higher and making an excellent interactive (syntactic) error retrieve facility possible. e) To. write r u l e s and to develop some of system subroutines only are obviously much better than to develop the conversion system from scratch by using another technique else. The work load is reduced greatly. And one can pay much attention to some of the system subroutines which are difficult to be handled. So it is easy to guarantee the correctness. The practice shows that the canonical AST is an efficient tool for source- t o - source conversion between two heterogeneous languages. Introducing it makes the application areas of parsing trees expanded. We will consider applying the idea proposed in the paper to t h e source- t o - source conversion for another pair of heterogeneous languages.
References [1]
P. F. Albrecht et a l . , Source-to-source translation: Ada to Pascal and Pascal to Ada. ACM S[GPLAN Noticeg, 15:12(1980), 183~ 193.
[2] J. K. Slape and P'. J. L. Wallis, Conversion of Fortran to Ada using an intermediate tree representation. The Computer Journal, 26:4 (1983), 344-- 353 9 [3] Zheng Gouliang et al., The Automatic Conversion Tool From Pascal To M o d u l a - 2 . Collectanea for computer software tools, technique and environment, 1985. [4] Zhou Changle, A translating tool of PASCAL into C : The PTOC soRware. Computer Research and Development, 24:3 (1987). [5] Zhang Xing'er, The canonical abstract syntax and direct generation of abstract syntax tree. Chinese J. Computers, 13:: 12(1990). [6] Zhang Xing'eret al., Implementation of the abnormal exit from procedure. Mieroeleetronies and Computer, 7: 6(1990), 7 - - 10.