# FLC - lesson 03 ##### Luca Oddone Breveglieri ###### 8 october 2015 ## Regular Expressions __Regular Languages__ are the simplest family of formal languages. Can be defined in different ways: - algebrically - through __generating grammars__ - through __recognizer machines__ __Regular Expression__ or *regexp* is a string $r$ defined over the elements of a terminal alphabet $\Sigma$, using one of the following *metasymbols* $\;\cdot\;\cup\;\varnothing\;(\;)$ #### regexp cases: $r,s,t$ are regexp 1. $r=\varnothing$ 2. $r=a, \;\;a\in \Sigma$ 3. $r=s\cup t\;\; \text{or}\;\; r=s|t$ 4. $r=s.t\;\; \text{or}\;\; r=st$ 5. $r=s^*$ #### Precedence of operators 1. star $*$ 2. concatenation $\;\cdot$ 3. union $\cup$ or | #### Regular languages A language is __regular__ if it is generated by a *regular expression* regexp|language ---|--- $\varepsilon \;\text{or}\;\varnothing$|$\{\varepsilon\}$ $a\in\Sigma$|$\{a\}$ $s \cup t \; \text{or} \; s \mid t$ | $L_s \cup L_t$ $s.t \; \text{or} \; st$ | $L_s.L_t$ $s^*$|${L_s}^*$ ##### examples >The regexp $e$ that generates the language of integers, possibly signed $e={(+\cup - \cup\varepsilon)dd}^*\;\;\;L_e=\{+,-,\varepsilon\}\{d\}\{d\}^*$ >The language over the binary alphabet $\{a,b\}$, such that the number of $a$ in every string is odd and every string contains at least one $b$ $A_p=b^*({ab}^*{ab}^*)^*\;\;A_d=b^*{ab}^*({ab}^*{ab}^*)^*$ $L=A_pbA_d | A_dbA_p$ ### Sets of languages __Family of regular languages__ or $REG$: Is the collection of all *regular languages*. __Family of finite languages__ or $FIN$: Is the collection of all *languages of finite cardinality*. Every __finite language__ is *regular* as it is the union of a finite number of finite strings. But family of __regular languages__ contains also languages of *infinite cardinality*. $$FIN\subset REG$$ ### Subexpression of a regexp 1. Consider a *fully parenthesized* regexp $$ e = (a\cup (bb))^*(c^+ \cup (a \cup (bb)))$$ 2. Number every alphabetic symbol that occurs in the regexp $$ e_N = (a_1\cup (b_2b_3))^*(c_4^+ \cup (a_5 \cup (b_6b_7)))$$ 3. Isolate the subexpressions and put them into evidence $$ (a_1\cup (b_2b_3))^*\;\;\;\;c_4^+ \cup (a_5 \cup (b_6b_7))$$ $$ a_1\cup (b_2b_3)\;\;\;\;\;\;c_4^+\;\;\;\;a_5 \cup (b_6b_7)$$ $$ a_1\;\;\;\;b_2\;\;b_3\;\;\;\;\;\;c_4^+\;\;\;\;a_5 \;\;(b_6b_7)$$ $$ a_1\;\;\;\;b_2\;\;\;b_3\;\;\;\;\;\;c_4^+\;\;\;\;a_5 \;\;\;b_6\;\;\;b_7$$ ### Different parenthesization And hence different interpretations of a regexp $$f=(a\cup bb)^*(c^+\cup a\cup bb)$$ $$f=(a\cup bb)^*(c^+\cup (a\cup bb))$$ $$f=(a\cup bb)^*((c^+\cup a)\cup bb)$$ Choosing one of the *interpretations*, I obtain a new regexp defining a smaller language that the original one. ### Derivation Between two regexp $e'$ and $e''$: $$e'\Rightarrow e''$$ If the two regexp $e'$ and $e''$ can be factored as $$e'=\alpha\beta\gamma \;\;\; e''=\alpha\delta\gamma$$ With - $\beta$ subexpression of $e'$ - $\delta$ subexpression of $e''$ - $\delta$ is a *choice* of $\beta$ ##### examples - one step derivation $$a^*\cup b^+\Rightarrow a^*\;\;\;a^*\cup b^+\Rightarrow b^+$$ - multiple derivation $$a^*\cup b^+\Rightarrow a^*\Rightarrow\varepsilon\;\;\;a^*\cup b^+\overset{2}{\Rightarrow} \varepsilon$$ ### Equivalence Two regexps are __equivalent__ if they define the same language The language defined by a *derived regexp* is contained in the language defined by the *deriving regexp* ### Ambiguity A string may be obtained by two derivations differing not only in the order of choices $$(a\cup b)^*a(a\cup b)^*$$ $$(a\cup b)^*a(a\cup b)^* \Rightarrow (a\cup b)a(a\cup b)^*\Rightarrow aa(a\cup b)^*\Rightarrow aa\varepsilon\Rightarrow aa$$ $$(a\cup b)^*a(a\cup b)^* \Rightarrow \varepsilon a(a\cup b)^*\Rightarrow \varepsilon a(a\cup b)\Rightarrow \varepsilon aa\Rightarrow aa$$ A regexp is __ambiguous__ if the corresponding marked (with numbers on letters) regexp $f_\#$ generates two marked strings $x$ and $y$ such that, removing the indices, we obtain the same string (sufficient condition, not necessary) ##### examples $$(aa|ba)^*a|b(aa|ba)^*\;\; \text{is ambiguous}$$ in fact the marking $$(a_1a_2|b_3a_4)^*a_5|b_6(a_7a_8|b_9a_{10})^*$$ Generates the two strings $$b_3a_4a_5\;\;\;b_6a_7a_8$$ That project ambiguously to the same phrase: $\;\;\;baa$ ### Other operators __Basic operators:__ union, concatenation, star __Derived operators:__ - __power__ $e^h = \underset{h\;times}{ee...e}$ - __cross__ $e^+ = {ee}^*$ - __Repetition__ from k to n>k times $[a]_k^n=a^k\cup a^{k+1}\cup ...\cup a^n$ - __Optionality__ $[a]=(\varepsilon\cup a)$ - __Interval__ of an ordered set $(0...9)(a...z)(A...Z)$ By admitting the use of set-theoretic operations __Intersection__, __Complement__ and __Difference__ we obtain the so called __Extended regular expressions__. ### Closure of the $REG$ family to operations > A family of languages is said to be closed with respect to an operator $\theta$ if the result language belongs to the same family as the operand languages. The $REG$ family is closed with respect to: - concatenation - union - star therefore it is closed also with respect to derived operators - cross - repetition - optionality Regular languages are also closed with respect to __mirroring__ This means that combining two regular languages with one of the said operators, we obtain another regular language. ##### The family $REG$ of the regular languages is the smallest family of languages such that both the following properties hold: - $REG$ contains all finite languages - $REG$ is closet w.r.t conatenation, union, and star. The $LIB$ family of the context-free languages is closed w.r.t. concatenation, union and star but is not the smallest of such family, in fact: $$REG \subset LIB$$ ### Linguistic Abstraction Linguistic abstraction transforms the phrases of a real language to a simple form, called __abstract representation__ Compiler design makers reference to the *abstract language* to process, not to the *effective language* Artificial languages (e.g. programming languages) contain few abstract structures, one of these is __lists__ which can be easily modeled by regexps ## Effective and abstracts lists A __list__ is a sequence of *elements* with number not fixed in advance An __element__ can be a __terminal__ (alphabetic symbol) or a __compound object__ (for instance, a list itself) ##### example: List with separators and start-marker of end-marker $$ie(se)^*f\;\;\;i[e(se)^*]f$$ i: inizio e: elemento s: separatore f:fine ### Substitution Operation that replaces the terminal characters of the source language with the phrases of the destination languages ##### example $$L\subset \Sigma^*\;\;\;L_b\subset\Delta^*$$ $$x\in L \;\;\;x=a_1a_2...a_n\;\;\text{and for some}\;\;\;a_i=b$$ Substituting the language $L_b$ to the letter b in the string x produces a language over the alphabet $(\Sigma\setminus \{b\})\cup \Delta$ $$\{y|y=y_1y_2...y_n\wedge (ifa_i\neq b\;\text{then}\;y_i=a_i\;\text{else}\;y_i\in L_b)\}$$ ### Nested lists (or precedence lists) Lists may contain atomic objects as elements, but also other lists of lower level ##### example >livello 1: begin istr1;istr2;...;istrn end livello 2: STAMPA(var1,var2,...,varn)