slideList is[slidei, slide1,slide1a, slide2,slide2a, slide2b, slide2c, slide2d, slide2e, slide2f, slideii, slide3,slide4,slide5,slide6,slide7,slide8,slide9,slide10,slide11,slide11a,slide12,slide13,slide14,slide15,slide16,slide17,slide18,slide19,slide20,slide21,slide22,slide23]; slidei is Slide("
Traditional parsers are often implemented using parser generator tools, such as the UNIX utilities Flex and Bison. The full details of the parsing techniques used by Flex and Bison are beyond the scope of this tutorial, but the general principles will be illustrated by making an EM construal. Before tracing the stages in the development of the construal, we first demonstrate it in operation in a simple application: transforming an arithmetic expression in '+' and '*' into postfix (\"Reverse Polish\") notation.
To follow the demonstration, it is helpful to first dispose the panels on the screen so that you can conveniently access this presentation and the EDEN Input Window together with the HTML5 Canvases default, view_3, view_4 and view_5.
"); slide1 is Slide("The JS-EDEN parsing model is an EM construal of a traditional parser. Making the construal exposes the whole process of parsing in detail. In its fully developed form the construal can also be used to parse given strings according to a specified grammar.
Because the model serves as a parser written in JS-EDEN, it enables users to write a model in a new notation without using any external compiler. It also benefits students of parsing theory who can gain understanding of the process of parsing through exploring the model.
Before tracing the construction of the parsing model in detail, we first illustrate key features of the complete model.
"); slide1a is Slide("When Flex and Bison are used to generate a parser for a grammar, a crucial step is the construction of a finite state machine (FSM) that is based on the grammar.
The graph on the default HTML5 Canvas is a visualisation of the FSM that is generated from the following simple expression grammar:
E → E+T E → T T → T*F T → F F → (E) F → idIn this grammar, the terminal symbol id refers either to a specific numerical value or to an identifier which has a numerical value associated with it. The non-terminal symbol E stands for an arbitrary arithmetic expression in + and * in which the operands are numbers or identifiers with an associated numerical value.
Note that this grammar (which is taken from teaching resources published at the University of Alaska Fairbanks at the url http://www.cs.uaf.edu/~cs631/aho/y.output) is rather ill-suited to automated parsing. It is however quite a good example for the purposes of EM construal.
"); slide2 is Slide("The FSM depicted on the default HTML5 Canvas is derived from a textual representation that can be produced by Flex and Bison. To open an HTML view to display
The y.output file spells out the way in which the 13 states of the FSM are linked by transitions associated with the terminal and non-terminal symbols in the grammar. These edges are labelled by their corresponding symbols, and are also colour coded accordingly. The initial state of the FSM is state 0 and the final state is state 7. The FSM makes transitions from state to state as the parse proceeds and in the process builds up a path within the FSM. The edges of this path are displayed in black.
"); slide2a is Slide("We shall illustrate parsing using the construal with a simple example: the conversion of the expression a*(b+3*(b+c)) into reverse Polish notation. To follow the parsing process step-by-step, dispose the panels on the display so that the finite state machine and the Canvas panels view_3, view_4 and vew_5 are visible. To initialise the parse, then execute:
The parser reads the input string from left to right. There are two basic operations - shift and reduce. A shift operation transfers the next symbol from the input string to the stack that is depicted in view_4. A reduce operation replaces the top segment of the stack by a single non-terminal symbol in accordance with the grammar rules set out in view_3. These operations can be performed manually by pressing the Next Step and Apply Rule buttons, which correspond to shift and reduce respectively.
"); slide2b is Slide("With the Input String in place, you begin the parse by pressing the Next Step button. This reads the first symbol from the input, which is the identifier 'a'. You will see that this identifier appears on the stack in view_4, together with its associated semantic value, which for the purpose of this translation exercise is the string \"a\". You will also notice that a path in the FSM is initiated - it comprises a single edge, labelled by the token ID, that represents a transition from the initial state 0 to the state 1.
The idea behind the parsing method is that a derivation tree for the input string is developed incrementally by identifying the top segment of the stack with the RHS of a grammar rule, and replacing this by the non-terminal symbol on the LHS ('a reduction'). The semantic values associated with the top segment of the stack are simultaneously processed ('syntax-directed translation'). If the parse is to succeed then the entire input string must be processed in such a way that on termination the stack consists of a single symbol - the non-terminal E, which represents a well-formed arithmetic expression.
"); slide2c is Slide ("When the first input has been read, there are several possible reductions that can be performed. For instance, we could apply rule 6 to reduce ID to F, then apply rule 4 to reduce F to T, then apply rule 2 to T to get E. Given that the next symbol in the input is '*', reference to the Grammar Rules in view_3 shows that there is only one way in which the top segment of the stack can go on to form the RHS of a grammar rule once the '*' has been read. This can only happen through a reduction using rule 3 for which the RHS takes the form 'T*F'. On that basis the initial reduction of the ID must generate a T by invoking rule 6 followed by rule 4, but not applying rule 2.
Following the above reasoning, it follows that the first steps of the manual parse are: Apply Rule 6, Apply Rule 4 then press Next Step. As you can see from inspection of the input, the reduction using rule 3 alluded to above can't in fact take place until the final symbol in the string is read. The next point in the parse at which a reduction is possible is when the leftmost occurrence of the identifier b is read. The procedure to complete the parsing from this point is left as an exercise to the reader.
"); slide2d is Slide("There is at all times a direct correspondence between symbols on the stack and the path so far traversed in the FSM. Since the input string, and the stack, can have unbounded length the path necessarily can include cycles - as you will have observed in parsing the specified input.
You can interpret the shift and reduce operations in the FSM. A shift corresponds to a transition from one state to another in direct response to reading a terminal symbol. A reduction is associated with snipping a segment off the end of the path and replacing it by a single edge labelled by a non-terminal symbol. (This replacement process is associated with the 'go to' constructs in the description of the FSM in y.output.) The current state of the FSM is always the node at the end of the path.
What makes the grammar badly designed for automatic parsing is what makes it a more interesting subject in manual parsing - the fact that there are often choices to be made between possible shift and reduce actions. These are referred to as 'shift-reduce' and reduce-reduce' conflicts.
"); slide2e is Slide("When you have successfully carried out the parse manually, you will recognise a simple set of rules that you can follow in order to automate the process. The parsing model is set up in such a way that by setting up a boolean flag 'isautomatic' you can invoke a procedure called 'automatic' that you specify yourself.
A suitable 'automatic' procedure for our example grammar - already loaded in the model, but available here for modification in case you wish to experiment - is specified on the next slide.
When you have inspected the automatic procedure you can invoke automatic exection and watch the parsing process being replayed.
To toggle between manual and automatic execution, you execute the following code:
In this way, you can in fact intervene in the automated parsing process if you wish.
"); slide2f is Slide ("Having seen how the JS-EDEN parsing model works, you are now in a position to examine its construction in more detail. Following standard EM practice, the construction will be traced from beginning to end. For this purpose, it will be convenient to restart JS-EDEN and initialise by cutting and pasting the following code into the Input Window:
include(\"models/parsingEMPE/newrun.js-e\");
This will remove the definitions that specify the parsing model and restore you to the same point in this presentation.
Though the reconstruction of the parsing construal is quite detailed and can be carried out simply by following through the presentation, there are some additional definitions etc that are not explicitly exposed. To examine these, it would be necessary to refer to the source of the model itself within the online directory of models associated with the JS-EDEN master.
"); slide3 is Slide("To record the structure of the FSM we use a list called 'transitions' with 13 elements because the finite state machine has 13 states. Since the states in the FSM are indexed from 0, it is convenient to record the transitions associated with state i in the i-th entry of the list for i in the range 1 to 12 and use the 13th entry to record transitions in state 0. When there is no meaningful transition defined for a given state-symbol pair we introduce a transition to a dummy state with index '100'.
For example, from the grammar, we see that in state 4, reading a *, we should go to state 9. This piece of information will be stored in the list element transitions[4] - itself a list - as a triple [4,\"*\",9]. Likewise the transition from state 0 to state 5 in response to an F is recorded in transitions[13] as a triple [13, \"F\",5].
Constructing the list of transitions manually is tiresome work. To avoid this, we could use a UNIX script file for automatic translation of the y.output file. An example of such a file (to be run in sed environment) can be found here:
mkeden fileTo set up the parsing model, execute the code below:
Now we have a plain canvas. In order to understand how the traditional parser works, the next thing to do is to make a visualisation of the finite state machine on the canvas.
We are going to use the graph tool in JS-Eden. Before starting adding something to the canvas, we need to set values for two observables: numberOfNodes (number of states) and numberOfSyms (number of symbols featuring in the grammar).
The red node on the canvas is a graph representation of node 0 (state 0).
Of course, we want to have more:
We will add a label to each node to distinguish the nodes.
Following this procedure, we can draw all the 13 nodes on the canvas:
Consequently, in accordance with the actual transitions from the grammar, the rest of the edges can be shown on the canvas:
In addition to displaying nodes and edges on the canvas, you can also move them by clicking.
A procedure is required to move each node on the canvas:
Now s1 is movable. To move other nodes is routine. Just put the name of the node which you'd like to move
in place of s1 in the above code.
However the whole process is rather slow and tedious! If you wish, you can try to use
the 'definition factory' (also available in JS-Eden in the emile variant - see Lab 10 from CS405 in 2012) to save time. In this session, we won't go
further with the use of definition factory, you can explore it yourself.
The code to make all the nodes movable is specified here:
Previous slides explain how to store the grammar information in a JS-Eden list. In addition, we also want this information to be accessible to the learner in a window that they can refer to when trying to parse a string manually.
parsingrules is a list which stores the grammar information. It differs from transitions,
in that each of its entries is one line in the grammar in a human-readable form and does not store any information
about the finite state machine.
In the window named \"CANVAS HTML5 view_3\", you will see the grammar information displayed.
In order to make the graph easier to interpret, it is also convenient to associate a default colour with each edge. This colour is to be determined by the label of the edge.
"); slide9 is Slide("The graph of the finite state machine is intended to represent the state of parser at each stage. To highlight the edges being traversed on the current path, we are going to make the traversed edges black whilst others retain their default color.
In order to do this, we need a function which returns how many times each edge has been traversed and observables to represent the colours of the edges. We then define edge colours by dependency.
Suppose we have an observable cols0s1, which is the color of the edge s0s1 - what kind of dependency do we want?
We want the color of the edge dependent on the number of times it has been traversed. Obviously, we need another observable of edge
s0s1 for the number of times it has been traversed.
ctxy is a function, which takes three parameters: the start point and end point of an edge and the currpath (observable for current path being traversed),
and returns the number of times that this edge has been traversed.
So if we make the colour of the edge depend on the number of times it has been traversed, the colours of the edges together will clearly show the path. For edge s0s1:
To define the colours of all the edges, we first set up an observable to record how many times it is being traversed in the current path:
The use of the graph prototype saves much work in setting up the FSM, but there are
still ways in which it has to be improved for readability.
Having set up the FSM visualisation as above, it is convenient to layout the graph and modify the display of the nodes to make it easier to read. To this end, you can include the following definitions:
These definitions make adjustments to the value of observables that have been generated using the graph prototype. The observables concerned specify the explicit coordinates of points (e.g. ps0), the background colour of each of the nodes (e.g. cols0, cols1), and the position of the labels of selected nodes (e.g. labs10).
Note how this process of modifying the product of a prototyping activity in accordance with Empirical Modelling principles highlights the important distinction between tools that are developed with prototype-based rather than class-based object-orientation in mind. This is an important respect in which JavaScript is well-suited to EM.
"); slide12 is Slide("The key observables in the parsing process construal are:
currpath - stores the content on the state stack as a list.
The observables whose values are subject to be changed directly by an agent action, rather than being defined by a dependency relation, are InputText and k.
This network of dependency is derived from the operation of 'real' traditional parsers:
- The input and how far the parser has read from the beginning defines the look ahead token;
- The shift action pops a token and a state onto the stack;
- Based on its current state, the model decides whether it needs a look ahead token to carry out the next action,
or whether it should reduce.
The observable InputTextt has the same initial value as InputText and changes when reduction occurs. An index j is used to record how far the parser has read in the
InputTextt. Like InputText, InputTextt is independent of other observables and the observables currstack, currpath, currstate,etc depend on it.
The dependencies between key observables are:
Lexical Analysis
The lexical analysis part of parsing models in JS-EDEN is different from the lexical analysis of the traditional parser. This is done by using the function subpatt, which is an example of embedding JavaScript in EDEN:
This function recognizes regular expressions and the set of tokens is defined by regular expressions.
If an input string is given, the model should find IDs and exprs, and mark them as identifiers.
Using the subspatt function, we substitute values in the input by lexified tokens. And we want to keep a record of the original input values, and store them into a list.
The record of the input values, can be used in the last step of parsing - syntax directed translation. So we make a function named makeinputvals. This function will be different for different grammars. It makes uses of the function findpatt (which finds and returns a regular string):
and analyzes the input to many cases. strvals stores the values of the input string to a list. Below is the definition for strvals.
Can you make a makeinputvals function for the simple expression grammar? For test purposes you can include the function via the Input Window. A suitable specification is given on the next slide.
"); slide16 is Slide("When the value of k increases 1, the parsing model reads the next token and shifts. In some certain points of the parsing, the parsing model should reduce according to the grammar rules. This means that it should change some thing on the top of the stack. So to do reduction, we make functions which takes InputTextt as parameter, and returns the changed value of InputTextt.
In grammar, we have 6 reduction rules:
$accept: E $end\";The substitution associated with a reduction using rule 6 is as follows:
We also need a function to decide which of the substitute function should be called:
In parallel with the observables InputTextt and currstack, we have observables inputvals and stackvals to record the values of the input. When the parsing model reduces by rule 1, the inputvals is not changed. But when it reduces by rule 2, the inputvals changes as it combines the eight elements to one entry.
Finally we have set up the environment for parsing. We can try to give some input string and the parser will do the lexical analysis and produce an input text.
The above is an input string accepted by the parser. You can also try some input strings which are not accepted:
and