<!DOCTYPE html>
<html>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<title>Exam &mdash; adast.xyz</title>
<link rel="icon" type="image/x-icon" href="/images/favicon.ico">
<link rel="stylesheet" href="styles/style.css">

<p class="header"><a href="/">adast.xyz</a></p>
<h1 id="exam">Exam</h1>

<ul>
<li>15 minute video presentation</li>
<li>1 hour to record</li>
<li>Chosen subject appears on digital exam</li>
</ul>

<h1 id="language-processors">Language processors</h1>

<p>The most important problem in computing: increasing programmer productivity
to speed up the release of software.</p>

<p>This can be improved with:</p>

<ul>
<li>Software process (e.g. Agile): about a 20% increase</li>
<li>Good tools: about 10%</li>
<li>Better programming language design: an immense increase</li>
</ul>

<p>Example:
    Quicksort takes roughly twice as much code in C as in Haskell.</p>

<p>The shorter version can also be more readable.</p>

<p>You can gain deeper understanding of programming through learning about compilers.</p>

<h2 id="programming-languages">Programming languages</h2>

<ul>
<li>Set of rules to tell a computer what operations to perform</li>
<li>For communicating an algorithm</li>
<li>Linguistic framework</li>
<li>Symbols, words, rules of grammar, rules of semantics</li>
</ul>

<h3 id="why-are-there-so-many-programming-languages">Why are there so many programming languages?</h3>

<p>Programming languages have evolved over time because different languages can be designed for different types of programs.</p>

<h3 id="types-of-programming-languages">Types of programming languages</h3>

<ul>
<li>Low level to high level: 1st&#8211;2nd gen</li>
<li>Paradigms

<ul>
<li>Imperative &#47; object oriented: 3rd gen</li>
<li>Logic &#47; declarative (database languages): 4th gen</li>
<li>Functional &#47; logical: 5th gen</li>
<li>Reactive programming: possibly a 6th gen?</li>
</ul></li>
</ul>

<p>We are in the &#8220;multi-paradigm era&#8221;; Anders Hejlsberg says the languages of the future will be a mix of all existing paradigms.</p>

<h3 id="what-is-a-good-language">What is a good language?</h3>

<ul>
<li>Formerly: run-time performance</li>
<li>Now: lifecycle (ease of designing, coding, debugging, maintenance, reusability)</li>
<li>Fads</li>
</ul>

<p>Programming language characteristics can be broken down into readability, writability and reliability.</p>

<p>Statically and dynamically typed languages end up with the same amount of time spent debugging. The only difference is where that time is spent.</p>

<p>Programming languages are languages with no ambiguity or vagueness. Everything is exact. A sentence means one thing or it means nothing.</p>

<p>Programming languages require specification. Example: C# docs. Syntax needs to be specified, scope rules, type rules, semantics (or runtime semantics).</p>

<p>Types of specification:
- Formal: Precise, mathematical
- Informal: description in English
- Usually both</p>

<p>Language specification consists of:
- Syntax, usually formal
- Contextual constraints
    - Scope rules (English &#47; formal)
    - Type rules
    - Semantics</p>

<h3 id="does-syntax-matter">Does syntax matter?</h3>

<p>Syntax is the visible part of the language. The paradigm is the next most visible. Most invisible part is the language semantics. But clear semantics usually means simple and efficient implementations.</p>

<p>Yes. Syntax matters. </p>

<h2 id="language-processors-1">Language Processors</h2>

<ul>
<li>Text Editors &#47; IDEs</li>
<li>Translators 

<ul>
<li>Compiler, assembler, disassembler, etc</li>
</ul></li>
<li>Interpreters</li>
</ul>

<h3 id="interpreters">Interpreters</h3>

<p>Take source code and input and produces output. For example, web browser.</p>

<h3 id="compilation">Compilation</h3>

<p>A 2-step process: the source program is input to the compiler, and a new target program is output from it. In reality this is much more complicated: lexical, syntax and semantic analysis, then code generation (possibly to an IL). Sometimes you may generate high-level code instead, for example translating the source into C code.</p>

<p>Phases:
- Syntax analysis -&#62; Error reports | AST
- Contextual analysis -&#62; Error reports | DAST
- Code generation  (Semantics) -&#62; Object code</p>

<p>Symbol tables can be accessed from several phases.</p>

<h3 id="compiler-design">Compiler design</h3>

<p>Compiler design is strongly affected by, and strongly affects, programming language design. Advantages of a good compiler: the language is easier to learn, read and understand; fewer bugs; more reliability; better diagnostic messages and tools.</p>

<p>There are tools to help write compilers (compiler generators, also known as compiler compilers):</p>

<ul>
<li>Scanner generators

<ul>
<li>JLex</li>
</ul></li>
<li>Parser generators

<ul>
<li>JavaCUP, Yacc</li>
</ul></li>
<li>Front-end generator 

<ul>
<li>SableCC, JavaCC, ANTLR</li>
</ul></li>
<li>Code-generation tools</li>
</ul>

<h1 id="programming-language-evolution-and-tombstone-diagrams">Programming language evolution and tombstone diagrams</h1>

<p>Most new languages are invented out of frustration. When a new language is created it tends to be because it would be an improvement over some existing language. For example, Fortran was invented to improve programmer productivity.</p>

<h2 id="programming-language-design">Programming language design</h2>

<ol>
<li>Create ideas

<ul>
<li>Do a problem analysis (what is the purpose?, who is this for?)</li>
<li>Take inspiration from existing languages</li>
<li>Create some example programs</li>
</ul></li>
<li>Describe&#47;define (SS)</li>
<li>Implement (SPO)</li>
<li>Evaluate</li>
<li>Repeat until satisfied</li>
</ol>

<p>The programming language criteria table can be used to generate ideas for, or evaluate, a language&#8217;s design. Reusability and relocatability of code have become increasingly important criteria over time.</p>

<h2 id="programming-language-history">Programming language history</h2>

<p>40&#8211;50s: Assembly languages made programming more readable and more usable. They came with an assembler that could translate assembly to machine operations.</p>

<p>50s: Fortran was invented, the first high-level language. Now programs could be developed machine-independently.</p>

<p>60s: Programming paradigms started taking over. Structured programming. Pascal: no GOTO, types. OOP.</p>

<p>70s: C invented for high-level programming with low-level power. Ada invented as a means to avoid using hundreds of different programming languages. Functional programming starts off.</p>

<p>80s: OOP really takes over. C++ invented (thought of as C with classes). More discussion about functional programming.</p>

<p>00s: Distributed systems. C# + .Net CLR (common language runtime system).</p>

<p>10s: Multi-paradigm (OO + FP)</p>

<h2 id="tombstone-diagrams---gravstens-diagramer">Tombstone diagrams - gravstens diagramer</h2>

<p>A notation consisting of puzzle pieces that can be used to reason about language processors and programs.</p>

<ul>
<li>Program P implemented in language L (tombstone)</li>
<li>Translator implemented in language L (square tombstone)</li>
<li>Machine implemented in hardware M (shield)</li>
<li>Language interpreter in L (square box)</li>
</ul>

<p>These can be combined to give a general diagram of some source to target language scenario.</p>

<ul>
<li>Cross compilation: a compiler that runs on one machine, that emits code for another machine.</li>
<li>Two-stage compilation: a composition of two translators. Output of first translator is provided as input to the second. Mainly IL but can be any language.</li>
<li>Interpreters: provide better platform independence. Useful for testing and debugging. Typically slower than compiled code.</li>
</ul>

<p>Bootstrapping can be used as a way to generate more efficient compilers.</p>

<h1 id="the-ac-language-and-compiler">The ac language and compiler</h1>

<h2 id="phases-of-a-compiler">Phases of a compiler</h2>

<p>Phases:
- Syntax analysis -&#62; Error reports | AST
- Contextual analysis -&#62; Error reports | DAST
- Code generation  (Semantics) -&#62; Object code</p>

<p>The phases can be seen as the transformation steps for transforming some source code into object code.</p>

<ul>
<li>Syntax analysis &#60;-&#62; Syntax

<ul>
<li>Lexical analysis &#60;-&#62; Regular Expressions</li>
<li>Parsing &#60;-&#62; CFG</li>
</ul></li>
<li>Contextual analysis &#60;-&#62; Contextual constraints

<ul>
<li>Scope checking &#60;-&#62; Scope rules (static semantics)</li>
<li>Type checking &#60;-&#62; Type rules (static semantics)</li>
</ul></li>
<li>Code generation &#60;-&#62; Semantics (dynamic semantics)</li>
</ul>

<p>Modern compilers all make use of symbol tables. A symbol table can generally be used in any phase of compilation, but it is now considered best practice to access it only from the type checker &#47; contextual analysis.</p>

<h2 id="phases-of-a-simple-compiler">Phases of a simple compiler</h2>

<ul>
<li>Syntax analysis

<ul>
<li>Scanner&#47;Lexer -&#62; tokens</li>
<li>Parser: tokens -&#62; AST</li>
</ul></li>
<li>Contextual analysis

<ul>
<li>Symbol table: AST -&#62; symbol table</li>
<li>Semantic analysis -&#62; DAST</li>
</ul></li>
<li>Code generation

<ul>
<li>Translation</li>
</ul></li>
</ul>

<p>The scanner&#47;lexer phase takes the input source code and produces a stream of tokens. Tokens have a type and a value.</p>

<p>The parsing phase determines whether the stream of tokens conforms to the language&#8217;s grammar specification. This phase either produces an AST or reports errors.</p>

<p>Contextual analysis checks the static semantics. A symbol table is used to process declarations and to check for type consistency.</p>

<p>A symbol table is generally just a key-value list.</p>
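<p>A minimal sketch of such a symbol table in Java, assuming a monolithic scope where each name maps to the type it was declared with (the class and method names are illustrative, not from the lecture):</p>

```java
import java.util.HashMap;
import java.util.Map;

// A minimal symbol table for a monolithic scope: one flat name -> type map.
class SymbolTable {
    private final Map<String, String> entries = new HashMap<>();

    // Records a declaration; returns false if the name was already declared.
    boolean declare(String name, String type) {
        return entries.putIfAbsent(name, type) == null;
    }

    // Returns the declared type, or null if the name is undeclared.
    String lookup(String name) {
        return entries.get(name);
    }
}
```

<p>For the ac language, declaring <code>b</code> as a float and <code>a</code> as an int would succeed, while a second declaration of <code>a</code> would be rejected because the scope is monolithic.</p>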

<p>Type checking: check each node and ensure its child nodes are consistent with their types, for example when adding an int to a float. The language must specify what happens in such a case.</p>

<p>Code generation: formulation of target-machine instructions that represent the semantics of the source program.</p>

<h2 id="the-ac-language-adding-calculator">The ac language (adding calculator)</h2>

<p>A simple language for arithmetic. Monolithic scope, meaning a variable can only be declared once. Target of translation: dc (desk calculator)</p>

<p>Example ac program:</p>

<pre><code>f b             &#47;&#47; float b
i a             &#47;&#47; int a
a = 5   
b = a + 3.2
p b             &#47;&#47; print b
</code></pre>

<h3 id="formal-definition-of-ac">Formal definition of ac</h3>

<ul>
<li>Syntax specification

<ul>
<li>Context free grammar (CFG)</li>
</ul></li>
<li>Token specification

<ul>
<li>Regular expressions (RE)</li>
</ul></li>
</ul>

<h2 id="context-free-grammar-cfg">Context free grammar (CFG)</h2>

<p>A set of production rules for a language. There are two types of symbols: terminals and non-terminals. Terminals cannot be rewritten (id, num, etc). Non-terminals are the left-hand sides of rules (prog, scope, etc).</p>

<p>A syntax tree &#47; parse tree can be used to show the production rules of some source code in a tree structure.</p>

<h2 id="regular-expressions-re">Regular expressions (RE)</h2>

<p>For specifying tokens. For example: id, num, &#8220;+&#8221;, &#8220;-&#8221;, etc.</p>

<h1 id="language-specifications">Language specifications</h1>

<p>A communication device between people who need a common understanding of the programming language (PL): for example, the language designer, implementor and user.</p>

<p>What should be specified? Syntax, contextual constraints or static semantics (scope&#47;type rules), semantics (runtime semantics).</p>

<p>How is a language specified? 
- Formal: Precise, mathematical
- Informal: description in English
- Usually both</p>

<p>A language specification needs to address:
- Syntax
    - Token grammar: Regular expressions
    - Context free grammar: BNF EBNF
- Contextual constraints
    - Scope rules (static semantics)
        - Often informal
    - Type rules (static semantics)
        - Informal or formal
- Semantics (dynamic semantics)
    - Informal or formal</p>

<h2 id="syntax-terminology">Syntax terminology</h2>

<ul>
<li>Sentence: string of characters over some alphabet</li>
<li>Language: set of sentences</li>
<li>Lexeme: lowest-level syntactic unit of a language</li>
<li>Token: a category of lexemes (e.g. identifier)</li>
</ul>

<h2 id="formal-definition-of-languages">Formal definition of languages</h2>

<ul>
<li>Generator: a device that generates sentences of a language.</li>
<li>Recognizer: reads input strings over the alphabet and decides whether the input strings belong to the language.</li>
</ul>

<h2 id="syntax-specification">Syntax specification</h2>

<p>Syntax is specified using a CFG:
    - a finite set of terminals (tokens)
    - a finite set of non-terminals
    - a start symbol
    - a finite set of production rules</p>

<p>Usually CFG are written in BNF notation. Example:</p>

<pre><code>N ::= a
N ::= a | B | ...
</code></pre>

<p>Note: non-terminals are often enclosed with angle brackets. Example:</p>

<pre><code>&#60;N&#62; ::= a | &#60;B&#62; | ...
&#60;B&#62; ::= ...
</code></pre>

<p>In BNF, syntactic lists are described using recursion.</p>
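<p>For example, a list of one or more identifiers can be written recursively (an illustrative rule, not from the notes):</p>

```
<idlist> ::= id
<idlist> ::= id , <idlist>
```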

<h2 id="derivations">Derivations</h2>

<p>A derivation takes some source code and steps through the CFG with it, repeatedly rewriting non-terminals by deeper rules until only terminals remain. This can also be represented as a parse tree.</p>

<p>A leftmost derivation always rewrites the leftmost possible non-terminal first; a rightmost derivation rewrites the rightmost.</p>

<h2 id="ambiguity-in-grammars">Ambiguity in grammars</h2>

<p>A grammar is ambiguous if a string (sentence) can be derived by more than one parse tree. Usually ambiguity is undesirable, and CFGs are rewritten to disallow it. This rewriting process can be done systematically.</p>

<p>A &#8216;dangling-else&#8217; problem can be a sign of ambiguity in a language.</p>

<ul>
<li>Sometimes obvious, sometimes difficult to spot.</li>
<li>Ambiguity in grammars is an undecidable property.</li>
</ul>

<h2 id="ebnf">EBNF</h2>

<p>A non-standardised extension to BNF notation.</p>

<ul>
<li>Optional parts are placed in square brackets [] (sometimes marked with a question mark ?)</li>
<li>Alternatives can be expressed in-line using parentheses ()</li>
<li>Repetitions (0 or more) are placed inside curly braces {} (sometimes an asterisk after parentheses, ()* )</li>
</ul>

<p>EBNF can also be rewritten to standard BNF fairly easily.</p>

<h1 id="context-free-grammars">Context free grammars</h1>

<h2 id="properties-of-grammars-non-terminals">Properties of grammars (non-terminals)</h2>

<ul>
<li>Left-recursive: <code>N ::= N ...</code></li>
<li>Right-recursive: <code>N ::= ... N</code></li>
<li>Nullable: <code>expression ::=  λ</code></li>
<li>Useless: if a non-terminal can never produce a string of terminal symbols.</li>
</ul>

<h2 id="grammar-transformations">Grammar transformations</h2>

<ul>
<li>Left factorization: <code>X Y | X Z</code> is transformed to <code>X (Y | Z)</code></li>
<li>Elimination of left-recursion: <code>N ::= X | N Y</code> becomes <code>N ::= X Y*</code></li>
<li>Substitution of non-terminals: for clearer CFGs </li>
</ul>
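<p>As a small illustration of the first two transformations on a typical expression grammar (the rules are illustrative):</p>

```
E ::= E + T | T          (left-recursive)
E ::= T (+ T)*           (after elimination of left-recursion)

S ::= a B | a C          (common prefix)
S ::= a (B | C)          (after left factorization)
```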

<h2 id="from-tokens-to-parse-tree">From tokens to parse tree</h2>

<p>Parsing is finding the structure in a stream of tokens. This is done by the parser.</p>

<p>Two well known ways to parse:
- top-down (Leftmost derivation &#47; LL), constructed in pre-order.
- bottom-up (Rightmost derivation in reverse &#47; LR), constructed in post-order.</p>

<h2 id="first-sets">First sets</h2>

<p>The set of all terminal symbols that can appear first in a string derived from a sentential form S. The empty string λ is never included.</p>

<h2 id="follow-sets">Follow sets</h2>

<p>The set of terminals that can follow a non-terminal A, where the EOF symbol is $.</p>
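<p>A small illustrative grammar with its first and follow sets, computed by hand:</p>

```
S ::= A b
A ::= a A | λ

first(A)  = { a }        (λ itself is never included)
first(S)  = { a, b }     (b, because A is nullable)
follow(A) = { b }
follow(S) = { $ }
```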

<h2 id="ll1-grammars">LL(1) grammars</h2>

<ul>
<li>Cannot be left-recursive.</li>
<li>Cannot contain ambiguity.</li>
<li>Some languages have no LL(1) grammar.</li>
<li>λ-free grammars are very easy as you only need to check for the first set.</li>
</ul>

<h2 id="lr-grammars">LR grammars</h2>

<ul>
<li>Can be parsed using an LR algorithm.</li>
<li>Harder to implement parsers for LR than LL.</li>
<li>Can recognize a bigger class of grammars than LL.</li>
<li>LR parsing methods are the most commonly used for automatic tools today.</li>
</ul>

<h1 id="lexical-analysis">Lexical analysis</h1>

<h2 id="developing-a-scanner">Developing a scanner</h2>

<p>The scanner&#47;lexer is what reads through source code character by character and divides them into lexemes and tokens. This includes whitespace which can have meaning for languages like Python. </p>

<p>A Java implementation of a scanner will normally return instances of a Token class, containing a &#8216;kind&#8217; and a &#8216;spelling&#8217;. Some kinds of tokens, like keywords and operators, have only one possible spelling, so in those cases the spelling can be left blank.</p>

<p>How do we know how to implement a scanner? Read the language specification. It typically specifies the tokens&#47;lexemes as regular expressions, along with the reserved words.</p>

<h3 id="lexical-elements">Lexical elements</h3>

<ul>
<li>Character set (ASCII vs Unicode)</li>
<li>Identifiers</li>
<li>Operators</li>
<li>Keywords</li>
<li>Elementary data</li>
<li>Noise words</li>
<li>Blank space</li>
<li>Layout</li>
<li>Comments</li>
<li>Delimiters</li>
</ul>

<p>Lexeme structure can be more detailed, for example allowing escape characters in strings or allowing a null string. Another example is 0.1 vs 00.1 or .1, 10. vs 1..10, and so on.</p>

<p>The audience of your PL matters for lexical design: for example, if the target audience is C programmers, they will prefer curly braces over begin&#47;end keywords. It is a good idea to keep your design simple, though.</p>

<h3 id="regular-grammarsexpressions">Regular grammars&#47;expressions</h3>

<p>It is generally better to use regular expressions instead of production rules where possible. Regular expressions are typically easy to implement: REs can be implemented by finite state machines (FSMs), and these automata can then be easily implemented in code, which makes the job of implementing a scanner&#47;lexer a little easier.</p>

<h3 id="implement-a-scanner-based-on-re-by-hand">Implement a scanner based on RE by hand</h3>

<ol>
<li><p>Express the &#8216;lexical&#8217; grammar as RE. Sometimes it is easier to start with (E)BNF and do the necessary transformations.</p></li>
<li><p>For each variant use a switch on the first character by peeking the input stream.</p></li>
<li><p>For each repetition (..)* use a while loop that keeps going as long as peeking at the input yields an expected character.</p></li>
<li><p>If there are multiple REs then use a switch to determine which rule is being recognized before following steps 2-3.   </p></li>
</ol>
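<p>The steps above can be sketched in Java for two simple token rules, <code>intlit ::= digit (digit)*</code> and <code>id ::= letter (letter)*</code> (both rules and all names here are illustrative):</p>

```java
// Hand-written scanner for two illustrative rules:
//   intlit ::= digit (digit)*      id ::= letter (letter)*
class MiniScanner {
    private final String input;
    private int pos = 0;

    MiniScanner(String input) { this.input = input; }

    // Peek at the next character without consuming it; '\0' marks end of input.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // Returns the next token as "KIND:spelling", or null at end of input.
    String nextToken() {
        while (Character.isWhitespace(peek())) pos++;       // skip blanks
        if (peek() == '\0') return null;
        StringBuilder spelling = new StringBuilder();
        if (Character.isDigit(peek())) {                    // switch on the first character
            while (Character.isDigit(peek())) spelling.append(input.charAt(pos++));
            return "INTLIT:" + spelling;
        } else if (Character.isLetter(peek())) {
            while (Character.isLetter(peek())) spelling.append(input.charAt(pos++));
            return "ID:" + spelling;
        }
        throw new IllegalArgumentException("unexpected character: " + peek());
    }
}
```

<p>Each alternative switches on the first character, and each repetition becomes a while loop, exactly as in steps 2&#8211;4.</p>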

<ul>
<li>Hand-made scanners are easy for simple token grammars but can be hard and error-prone for more complicated ones.</li>
<li>The task can be automated.</li>
<li>Most compilers are developed using a generated scanner.</li>
</ul>

<h2 id="generating-scanners">Generating scanners</h2>

<p>REs and the different types of FAs (finite automata) are equivalent in expressive power.
There is an algorithm for transforming an NDFA with epsilon transitions into a DFA. DFAs are easily implementable by computers.</p>

<p>A scanner generator is a program that takes tokens specified as REs and generates a scanner based on FAs in some target language. JLex&#47;Lex&#47;CSharpLex are tools for generating a scanner that recognizes tokens.</p>

<h2 id="performance-considerations">Performance considerations</h2>

<p>Performance of scanners is important for production compilers, and the size of the scanner sometimes matters too. Note: modern scanners use explicit control flow rather than tables, because big tables may perform worse on modern architectures.</p>

<h1 id="top-down-parsing">Top down parsing</h1>

<p>The job of syntax analysis is to read the source text and determine its phrase structure.
This can be used to construct the AST. The input to the scanner is the stream of characters; it outputs a stream of tokens to the parser, or reports errors. The parser in turn produces the AST, or again reports errors.</p>

<p>A top-down parsing example for &#8220;Micro English&#8221;. Top-down parsing always uses leftmost derivation.</p>

<pre><code>Sentence
Subject Verb Object .
The Noun Verb Object .
The cat Verb Object .
The cat sees Object .
The cat sees a Noun .
The cat sees a rat .
</code></pre>

<p>This is equivalent to the parse tree that would be produced for the sentence, so top-down parsing is a straightforward parsing algorithm. The parse structure corresponds to a set of procedures.</p>

<h2 id="recursive-decent-parsing">Recursive descent parsing</h2>

<p>Each non-terminal has a parse method. There also needs to be a method that asks for the next token; this method either consumes the expected token or reports an error on an unexpected token.</p>

<pre><code>// Sentence ::= Subject Verb Object .
private void parseSentence() {
    parseSubject();
    parseVerb();
    parseObject();
    accept(&#39;.&#39;);
}
</code></pre>

<p>This is a little more involved when there are alternatives. This can be implemented using if-elseif or a switch structure that checks for the alternatives.</p>
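<p>A self-contained Java sketch for an illustrative grammar with alternatives, <code>S ::= a S b | c</code>, where the parse method branches on the current token:</p>

```java
// Recursive-descent recognizer for the illustrative grammar S ::= a S b | c.
class MiniParser {
    private final String tokens;
    private int pos = 0;

    MiniParser(String tokens) { this.tokens = tokens; }

    // The current lookahead token; '$' marks end of input.
    private char current() {
        return pos < tokens.length() ? tokens.charAt(pos) : '$';
    }

    private void accept(char expected) {
        if (current() != expected)
            throw new IllegalArgumentException("expected " + expected + ", found " + current());
        pos++;
    }

    // S ::= a S b | c  -- the if-else branches on the lookahead token
    private void parseS() {
        if (current() == 'a') {
            accept('a'); parseS(); accept('b');
        } else if (current() == 'c') {
            accept('c');
        } else {
            throw new IllegalArgumentException("expected a or c, found " + current());
        }
    }

    // Returns true if the whole input is a sentence of the grammar.
    boolean parse() {
        parseS();
        return pos == tokens.length();
    }
}
```

<p>Because the first sets of the two alternatives ({a} and {c}) are disjoint, one token of lookahead is enough to pick a branch.</p>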

<h2 id="algo-to-convert-ebnf-into-a-rd-parser">Algo to convert EBNF into a RD parser</h2>

<p>For each non-terminal in the grammar, create a parse method. Parsing a terminal just uses the &#8216;accept&#8217; method. Parsing a non-terminal is a method call to that non-terminal&#8217;s parse method, in order. For repetitions (0 or more), use a while loop: as long as the current token is in the first set, keep parsing. Use a switch or if-else for alternatives.</p>

<p>For systematic development of an RD parser, see 17:36 in video 7. This process is so mechanical that there are tools for doing it, known as parser generators. One tool, JavaCC, is based on LL(k) grammars, so you can specify how many symbols it should look at at once. This allows for fine control when parsing.</p>

<p>If we want the parser to return an AST, each of the parse methods should return its little bit of the AST instead of void (think of our visitor methods).</p>

<h2 id="the-visitor-pattern-briefly">The visitor pattern (briefly)</h2>

<p>For OOP, the visitor pattern enables the definition of a new operation on an object structure without changing the classes of the objects.</p>
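<p>A minimal Java sketch of the idea, with an illustrative two-node AST and an evaluation operation defined entirely outside the node classes:</p>

```java
// Visitor pattern: new operations (like Eval) are added without touching the node classes.
interface Visitor<T> {
    T visitNum(Num n);
    T visitAdd(Add a);
}

abstract class Node {
    abstract <T> T accept(Visitor<T> v);
}

class Num extends Node {
    final int value;
    Num(int value) { this.value = value; }
    <T> T accept(Visitor<T> v) { return v.visitNum(this); }
}

class Add extends Node {
    final Node left, right;
    Add(Node left, Node right) { this.left = left; this.right = right; }
    <T> T accept(Visitor<T> v) { return v.visitAdd(this); }
}

// One operation on the AST: evaluate the expression bottom-up.
class Eval implements Visitor<Integer> {
    public Integer visitNum(Num n) { return n.value; }
    public Integer visitAdd(Add a) { return a.left.accept(this) + a.right.accept(this); }
}
```

<p>Adding a second operation (say, unparsing) means writing one more visitor class, while the Num and Add classes stay unchanged.</p>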

<h2 id="ll1-grammars-1">LL(1) grammars</h2>

<p>A grammar which can be parsed with a top-down parser with a lookahead (in the input stream of tokens) of one token.</p>

<p>Lecture 7 @ 32:00 for formal definition of LL(1) grammar.</p>

<h3 id="table-driven-ll1-parsers">Table-driven LL(1) parsers</h3>

<p>A table is generated that simulates the way the regular parser&#8217;s stack works. This presumably doesn&#8217;t consume as much memory as a traditional parser&#8217;s stack. The table is constructed from the non-terminals and the lookahead terminals, and tells which rule of the grammar to use (49:00 Lecture 7). ANTLR is an example of a Java parser generator that allows LL(*) grammars and is table-driven.</p>

<h1 id="bottom-up-parsing">Bottom up parsing</h1>

<p>LR languages are more powerful than LL.
- LR(0), SLR(1), LALR(1), LR(k) grammars.
- Can handle left recursion.
- Usually more convenient because there is less need to rewrite the grammar.
- Most commonly used for parsing methods in automatic tools today.</p>

<ul>
<li>LR(0): The simplest algorithm.</li>
<li>SLR(1): An improved LR(0).</li>
<li>LR(1): LR(0) with extra lookahead token.</li>
<li>LR(k): for k&#62;0 tokens </li>
<li>LALR: &#8220;watered down&#8221; version of LR(1)</li>
</ul>

<p>Read through every construction and recognize it at the end. 13:00 Lecture 8.</p>

<p>There are tools that can automate generating the bottom up parser. 50:00 Lecture 8.</p>

<h2 id="terminology-1500-lecture-8">Terminology 15:00 Lecture 8</h2>

<ul>
<li>A reduction transforms <code>uwv to uAv</code> if <code>A -&#62; w</code> is a production.</li>
<li>Right sentential form.</li>
<li>Handle (informally: a production that can be used in reverse without getting stuck).</li>
</ul>

<h2 id="the-shift-reduce-process-1800-lecture-8">The shift reduce process 18:00 Lecture 8</h2>

<p>This is how bottom-up parsers work: they have a stack and a table of operations. The process can also be represented as a knitting game.</p>

<p>All bottom up parsers have a similar algorithm: 22:00 Lecture 8
- A loop with these parts:
    - Try to find the leftmost parse tree node which has not yet been constructed but all of whose children have been constructed.
        - This sequence of children is called a handle.
        - This sequence of children is built by shifting elements onto a stack.
    - Construct a new parse tree node.
        - This is called reducing.</p>
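<p>An illustrative shift-reduce trace for the grammar <code>E ::= E + n | n</code> on the input <code>n + n</code>:</p>

```
Stack      Input      Action
           n + n $    shift n
n          + n $      reduce E ::= n
E          + n $      shift +
E +        n $        shift n
E + n      $          reduce E ::= E + n
E          $          accept
```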

<p>The only difference in the bottom up algorithms is how they find a handle. 
If there is an empty state in the lookup table, we know we have an error.</p>

<p>How is the parse table built? The table is coded as a DFA (28:00 Lecture 8).</p>

<h3 id="lr0-conflicts">LR(0) conflicts</h3>

<p>This algorithm doesn&#8217;t always work. An LR(0) conflict is a situation (DFA state) in which there is more than one possible action for the algorithm. There are two kinds of conflicts:
- Shift-reduce: can&#8217;t decide between shift or reduce.
- Reduce-reduce: can&#8217;t decide between two or more reductions for different grammar rules.</p>

<h3 id="the-shift-reduce-problem">The shift-reduce problem</h3>

<p>This is when you can either shift or reduce. This is a sign that the grammar is not LR(0).
There are tools that can be used to input a grammar and output what kind of grammar it is.</p>

<h2 id="lr0-vs-slr1">LR(0) vs SLR(1)</h2>

<p>SLR(1) looks at the symbol after the handle (the follow set). This means fewer reduce actions and therefore removes a lot of potential conflicts.</p>

<h2 id="lr1">LR(1)</h2>

<p>While SLR just uses the follow set for each production, LR(1) computes a lookahead set for each handle, which removes even more conflicts. LR(1) parse tables are very big. Expensive, but almost all common languages are LR(1).</p>

<h2 id="lalr1">LALR(1)</h2>

<p>A variant of LR(1) that gives smaller parse tables, achieved by combining some states in the parse table. In practice, most practical languages are LALR(1).</p>

<h2 id="parser-conflict-resolution">Parser conflict resolution</h2>

<p>Parsing conflicts are commonly caused by ambiguous grammars. In practice, these ambiguities are solved using a resolution rule; however, this may or may not do what you want (e.g. dangling-else).</p>

<p>Parsing conflicts must be fixed. You cannot rely on the resolution rule.</p>

<h1 id="abstract-syntax-trees-ast">Abstract syntax trees (AST)</h1>

<p>Like a parse tree but with some details omitted.</p>

<p>Semantic actions and values 5:00 Lecture 9.
Synthesized and inherited attributes.</p>

<p>AST must be concise and sufficiently flexible. AST is typically constructed bottom up. Lists of siblings are typically generated by recursive rules.</p>

<p>It should be possible to unparse an AST (reconstruct the source from an AST).</p>

<h2 id="single-pass-vs-multi-pass-compilers">Single-pass vs multi-pass compilers</h2>

<p>Single-pass makes a single pass over source text, parsing, analyzing, and generating code all at once. This means that the contextual analyzer and code generator are called from the parser implementation.</p>

<p>Multi-pass makes several passes over the program. This is the modern approach. The output of a preceding phase is stored in a data structure and used by subsequent phases. This is more readable and maintainable, as the code is broken down into independent modules.</p>

<h2 id="visitor-pattern">Visitor pattern</h2>

<p>35:00 Lecture 9. </p>

<h2 id="contextual-analysis">Contextual analysis</h2>

<p>Identification and type checking are combined into a depth-first traversal of the AST.</p>
<p class="footer"><a href="/">adast.xyz</a></p>
</html>