Talk:Top-down parsing

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
WikiProject Computer science (Rated C-class, Low-importance)
WikiProject iconThis article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.
WikiProject Linguistics (Rated C-class, Low-importance)
WikiProject iconThis article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Low  This article has been rated as Low-importance on the project's importance scale.


The article is about LL parsers, but doesn't explain clearly what a top-down parser is. IMHO, it should be rewritten.

What is a top-down chart? IMHO, this article should answer this question. -- 01:24, 3 February 2006 (UTC)

what means "w.r.t."? —Preceding unsigned comment added by (talk) 12:02, 18 June 2008 (UTC)

No no. Not all top-down parsers are LL parsers. Old advanced Top-down parser programming languages of metacompilers could produce left or right handed trees.

Bottom-up production grammar rewrite rule termonolgy just doesn't fit. Thus my use of the term left handed or right handed tree describing which branch is extended as new branches are added. The parse is recursively top-down while the tree building may be top-down or bottom up so to speak. Left recursion is avoided by a zero or more loop construct usually bulding a left handed tree while right recursion builds a right handed tree. But an alternative for example is to build a list of terms with term that are to be substracted having a SUBTRACT or NAGATE sub-tree.Steamerandy (talk) 07:17, 10 September 2018 (UTC)

No input is specified in the "Programming language application" section. How are the the productions being matched? —Preceding unsigned comment added by (talk) 03:48, 26 November 2008 (UTC)

"While LR can be said to be better for languages it is also backwards to specify" - This sentance doesn't make sense. — Preceding unsigned comment added by (talk) 00:07, 22 May 2014 (UTC)

And now for something completely different[edit]

There may be errors in the following. Not the bast at typing when thinking. One common problem is teh for the. Seams my hands have a race condition and the left beats the right. Will get back to this as time allows.

Top down parsers may have nothing to do with formal grammars basicly expressing a top down analysis approach. A formula based parsing program expressed in an early META "language" is an example. If we have a formal grammar of a language it must be rewritten in an analytical, "reductive analysis", form. One may design a language in reductive form from scratch. It is easier to do an example then talk about. The parser is defined using formulated rules or just formula(s). A formula is a TEST(boolean) function that may perform an action. A test returns its boolean success(TRUE) or failure(FALSE) result status. A formula is written using tests (i.e. returning success or failure) that operate on the input stream and is itself a test(boolean) function. A string "..." is a test. "=" would try and match the input stream against the string "=". Formulas generate code directly. They may call other formula. The function they may call are special function that also return success or failure. Functions coded in other languages have parameters distinguishing them from grammar formula. Formula in the metaprogram are a type of function that analyzes language structures defined by them. The SYNTAX (Parser Programming Language) programming model has three stacks and a token buffer. The three stacks are: call stack, parse stack and node stack. Grammar formula are functions that use the call stack to save return addresses on calling other formula or other function in other languages. Parse and node stacks are used in building tree structures that are passed to procedural generator function that operate on trees and lists structures. The token buffer holds characters matched by a token .. formula. Token formula, on success, default to automatically entering the token string into the dictionary, a symbol table stack. A function call may intercedes and creates a different type of object. An interceeding function might create an integer object or other self defining object type. MAKESTRING[] creates a string object bypassing, cataloging it as a symbol. The generator language is a LISP 2 based (object based) programming language. The parser transforms the input character stream into LISP 2 based generator language objects that are then processed by functions having a lisp 2 like grammar with input unparsing parameter rules. This example is about the parser programming language. A top down parser not using a formal grammar(Although a formal language is defined as being defined by a formal grammar. So this conflicts with the defination of formal language):

Reductive, analytical, formula parsing begins execution with a driving formula. A program defined by formulas in this manner starts witg a top driving formula:

We start at the top and write a formula defining what a program is in the language we intend to compile.

program = $declaration; //A program is a
                        //sequenced of

The above driving rule is simply declaring a program to be a series of declaration. The $ operator is specifying that zero or more declaration are to be parsed. If we needed to require at least one declaration the rule would be written:

program = declaration $declaration; //A program is
                                    //a sequence of
                                    //one or more

A more complicated driving rule for the COBOL language might look like:

program =  IDENTIFICATION_Division

The rules above are an Left to right & down description of the entire program. They directly generate code. Rule (1) above in assembly is straight forward. A rule is a function that returns a condition of success or failure. The $, zero or more operator, is a programed loop that applies to the expression following it. The expression code is generated inclosed in a loop construct. Do <something> while successful. So that means we need a label for the start of the loop, generate code for the <something> and generate code to jmp back to the labeled start of the loop on success. On failure of the something the $ loop operator is successful. So in the program rule we have code to return success.

C++ procedure are used in the examples here. The actual code is IA86 assembly language encapsulated in a "__declspec(naked) void" function.
//   program = $declaration;
__declspec(naked) void program(){  _asm {
l1: // $ (zero or more sequance) loop tag l1
   	 call	declaration //call declaration
// repeat declaration until NE status returned
         je	l1       //Jump, on EQ status,
                         //(success) to l1
// On NE failure status the zero or more loop
//  terminates. Proceeding on to:
         cmp	al,al    //force success EQ to
         ret             //always return success

The implementation of the rule in this case is vary simple. The __declspec(naked) creates a c++ function having no preamble or exit code. The _asm { ... } is declaring the contained code to be assembly language. Above the entire function is assembly code. There is no additional code being generated by the C++ compiler in the above example.

When the rule "program" is called it is entered at the first assembly code line of the program function:

l1:	call declaration

The "call" is an assembly instruction that calls a function. The l1: is a label used in addressing that line of code. A grammar rule returns success or failure in the processor status register zero flag. The __declspec(naked) allows programming outside the C++ context. We are not using C calling conventions writing in assembly. The void program() declaration is only providing a label for our rule. It also allows returning of success or failure in the processor flag register. There is no entry or exit code generated by the compiler that could effect the processor status. So on return from declaration,

        je      l1

tests the processor flag and if a equal condition is true it jumps to the l1 tag calling declaration repeatedly as long as declaration is successful. The $declaration is zero or more declaration. If declaration fails the status flag will indicate a NE condition. and the je l1 will instead proceed to the next instruction cmp al,al (comparing the contents of register al to itself) seting the equal condition as the $ zero or more operator always succeeds. The rule returns an eq condition to it's caller. The alternate rule, program = declaration $declaration; requiring at least one declaration with comments:

__declspec(naked) void program(){  _asm {
// program is declared to be a naKed void function
// so there is only the following assembly code.
        call declaration // call declaration
        je   l1          // On success goto l1
        ret              // else return failure
l1: // $ Zero or more generated loop tag.
        call declaration // call declaration,
                         // a naked void function
                         // that parses a
                         // declaration.           
	je	l1       // declaration is called
                        // repeatedly until ne condition in the processor status
	cmp	al,al                     // comparing al to it self always sets the EQ condition in the status register.
	ret                               // returns success an EQ condition in the status register.
                                          //  comparing esp to 0 is a sure way of seting the failure condition.
}}                                        //  rules have no local variables values values returned.

The next step would be defining declaration. The example here very simple to illustrate the top analysis of the input. The language has three types of rules: Character class rules define characters that are used in higher level rules. They generally do not generate code directly. The most efficient implementation is character classes are bits in a character class map table indexed by the character code. The class map table is tested usually by a single machine instruction for a character membership in a class. Token making rules define the tokens of the language. white spacing is automatic when recognizing a token. The skipclass character class may be defined or the default used.

  program = $declaration;              // A program is a sequence of declaration.

// define character classes, dgt, let, and alphanum, used in defining tokens

dgt       : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
let       : 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' |
            'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' |
            'u' | 'v' | 'w' | 'x' | 'y' | 'z' |
            'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
            'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' |
            'U' | 'V' | 'W' | 'X' | 'Y' | 'Z';
alphanum  : let | dgt;

// define token of the language

number    .. dgt $dgt MAKENUM();

// define the language syntax

id        .. let $alphanum;

declaration = "quit" EXIT() | (assignment || output || ERROR("invalid input") clear_input());
assignment  = id '=':store expression ';' !2 eval(*1);
output      = expression ':' eval(*1) print(*1);

expression = term $(('+':ADD|'-':SUB) term!2);
term       = factor $(('*':MPY|'/':DIV) factor!2);
factor     = '(' expr ')' | number | id ;

The above an analytical top down grammar of a simple language for an interactive interpreter. The above is a program for a top down parser written in an analytical parsing language. A single '|' is an alternant operator. 'A' or 'B' written as 'A' | 'B'. Backtracking is expressed using the "||" alternate operator. A token rule automaticly backtracks to the state upon token entry and return failure. In '=' rules backtracking is controlled by the "||" alternant operator. literal strings are also tokens and automaticly backtrack to the state upon entry to their matching routine. In the above the "||" is only used in the top driving rule declaration. Where is "quit" is matched exit is called and terminates the program. If "quit" is not matched the alternant (assignment || output || ERROR("invalid input") clear_input()) is tried. First assignment is called. where say an id is matched. If an '=' is not recognized assignment fails returning to declaration where the state is reset to the before assignment was called and the next alternate output is called. output is going to look for an expression. An expression can be as simple as an id or an arithmetic expression. If during matching an expression a failure occurs the state would be reset to when output was called in declaration and the next backtracking alternant tried. Where ERROR("invalid input") is called and the input cleared by a call to clear_input()). If expression succeeds eval is called. eval is a generator function having a different syntax. eval may be programed to fail is an id has not been assigned a value. Or it may just evaluate what numerics it can and return a partial evaluated tree and print programed to display an expression. In the SYNTAX above the ":" operator makes a tree node and the "!2" makes a 2 element tree combining the node and the top 2 parse stack entries. The language three stacks. A call stack, parse stack, and a node stack. As entries entries are matched they are placed on the parse stack. Matched literal strings "..." are not normally put on the parse stack. A '+' operator may be used to place a matched literal on the parse stack. Strings may be matched using the ' or " character. ' may be used only to match a single character string. 'a' is a valid 'ed string 'xyz' is not. A " character must be used to match a multi character string. Only ' strings are allowed in character class rules.

The above grammar program is a top down analytical definition of a language. It can parse many context sensitive constructs. Backtracking is controlled by the grammar programming. In the above backtracking is used to resolve the ambiguity of outputting an id and an assignment statements. The order of evaluation is important as the assignment is tried first as an expression in the output rule may be an id. And last the backtracking alternant is used to recover from an error. Of note is that eval can fail if an id has not been assigned a value. The language the above is written is a metacompiler metaprogramming language. There is a symbol table and symbol may have attributes. A value attribute may be assigned the evaluation of an expression. The pressensance of the value attribute can be tested by the eval function.

This doesn't exactly fit the description given in the article here on Top-down parsing.

--Steamerandy (talk) 18:00, 15 November 2014 (UTC)