FLC - lesson 03

Luca Oddone Breveglieri

8 october 2015

Regular Expressions

Regular Languages are the simplest family of formal languages.
Can be defined in different ways:

algebrically
through generating grammars
through recognizer machines

Regular Expression or regexp
is a string $r$ defined over the elements of a terminal alphabet $\Sigma$, using one of the following metasymbols $\;\cdot\;\cup\;\varnothing\;(\;)$

regexp cases:

$r,s,t$ are regexp

$r=\varnothing$
$r=a, \;\;a\in \Sigma$
$r=s\cup t\;\; \text{or}\;\; r=s|t$
$r=s.t\;\; \text{or}\;\; r=st$
$r=s^$

Precedence of operators

star $*$
concatenation $\;\cdot$
union $\cup$ or |

Regular languages

A language is regular if it is generated by a regular expression

regexp	language
$\varepsilon \;\text{or}\;\varnothing$	${\varepsilon}$
$a\in\Sigma$	${a}$
$s \cup t \; \text{or} \; s \mid t$	$L_s \cup L_t$
$s.t \; \text{or} \; st$	$L_s.L_t$
$s^*$	${L_s}^*$

examples

The regexp $e$ that generates the language of integers, possibly signed
$e={(+\cup - \cup\varepsilon)dd}^\;\;\;L_e={+,-,\varepsilon}{d}{d}^$
The language over the binary alphabet ${a,b}$, such that the number of $a$ in every string is odd and every string contains at least one $b$
$A_p=b^({ab}^{ab}^)^\;\;A_d=b^{ab}^({ab}^{ab}^)^$
$L=A_pbA_d | A_dbA_p$

Sets of languages

Family of regular languages or $REG$:
Is the collection of all regular languages.

Family of finite languages or $FIN$:
Is the collection of all languages of finite cardinality.

Every finite language is regular as it is the union of a finite number of finite strings.

But family of regular languages contains also languages of infinite cardinality.
$$FIN\subset REG$$

Subexpression of a regexp

Consider a fully parenthesized regexp $$ e = (a\cup (bb))^*(c^+ \cup (a \cup (bb)))$$
Number every alphabetic symbol that occurs in the regexp $$ e_N = (a_1\cup (b_2b_3))^*(c_4^+ \cup (a_5 \cup (b_6b_7)))$$
Isolate the subexpressions and put them into evidence $$ (a_1\cup (b_2b_3))^\;\;\;\;c_4^+ \cup (a_5 \cup (b_6b_7))$$ $$ a_1\cup (b_2b_3)\;\;\;\;\;\;c_4^+\;\;\;\;a_5 \cup (b_6b_7)$$ $$ a_1\;\;\;\;b_2b_3\;\;\;\;\;\;c_4^\;\;\;\;a_5 \;\;(b_6b_7)$$ $$ a_1\;\;\;\;b_2\;\;\;b_3\;\;\;\;\;\;c_4^\;\;\;\;a_5 \;\;\;b_6\;\;\;b_7$$

Different parenthesization

And hence different interpretations of a regexp $$f=(a\cup bb)^(c^+\cup a\cup bb)$$ $$f=(a\cup bb)^(c^+\cup (a\cup bb))$$ $$f=(a\cup bb)^((c^+\cup a)\cup bb)$$  Choosing one of the interpretations,
I obtain a new regexp defining a smaller language that the original one.

Derivation

Between two regexp $e'$ and $e''$: $$e'\Rightarrow e''$$ If the two regexp $e'$ and $e''$ can be factored as $$e'=\alpha\beta\gamma \;\;\; e''=\alpha\delta\gamma$$ With

$\beta$ subexpression of $e'$
$\gamma$ subexpression of $e''$
$\gamma$ is a choice of $\beta$

examples

one step derivation $$a^\cup b^+\Rightarrow a^\;\;\;a^*\cup b^+\Rightarrow b^+$$
multiple derivation $$a^\cup b^+\Rightarrow a^\Rightarrow\varepsilon\;\;\;a^*\cup b^+\overset{2}{\Rightarrow} \varepsilon$$

Equivalence

Two regexps are equivalent if they define the same language

The language defined by a derived regexp is contained in the language defined by the deriving regexp

Ambiguity

A string may be obtained by two derivations differing not only in the order of choices $$(a\cup b)^a(a\cup b)^$$ $$(a\cup b)^a(a\cup b)^ \Rightarrow (a\cup b)a(a\cup b)^\Rightarrow aa(a\cup b)^\Rightarrow aa\varepsilon\Rightarrow aa$$ $$(a\cup b)^a(a\cup b)^ \Rightarrow \varepsilon a(a\cup b)^\Rightarrow \varepsilon a(a\cup b)\Rightarrow \varepsilon aa\Rightarrow aa$$

A regexp is ambiguous if the corresponding marked (with numbers on letters) regexp $f_#$ generates two marked strings $x$ and $y$ such that, removing the indices, we obtain the same string (sufficient condition, not necessary)

examples

$$(aa|ba)^a|b(aa|ba)^\;\; \text{is ambiguous}$$ in fact the marking $$(a_1a_2|b_3a_4)^a_5|b_6(a_7a_8|b9a{10})^$$ Generates the two strings $$b_3a_4a_5\;\;\;b_6a_7a_8$$ That project ambiguously to the same phrase: $\;\;\;baa$

Other operators

Basic operators: union, concatenation, star

Derived operators:

power $e^h = \underset{h\;times}{ee...e}$
cross $e^+ = {ee}^$
Repetition from k to n>k times $[a]_k^n=a^k\cup a^{k+1}\cup ...\cup a^n$
Optionality $[a]=(\varepsilon\cup a)$
Interval of an ordered set $(0...9)(a...z)(A...Z)$

By admitting the use of set-theoretic operations Intersection, Complement and Difference we obtain the so called Extended regular expressions.

Closure of the $REG$ family to operations

A family of languages is said to be closed with respect to an operator $\theta$ if the result language belongs to the same family as the operand languages.

The $REG$ family is closed with respect to:

concatenation
union
star

therefore it is closed also with respect to derived operators

cross
repetition
optionality

Regular languages are also closed with respect to mirroring

This means that combining two regular languages with one of the said operators, we obtain another regular language.

The family $REG$ of the regular languages is the smallest family of languages such that both the following properties hold:

$REG$ contains all finite languages
$REG$ is closet w.r.t conatenation, union, and star.

The $LIB$ family of the context-free languages is closed w.r.t. concatenation, union and star but is not the smallest of such family, in fact: $$REG \subset LIB$$

Linguistic Abstraction

Linguistic abstraction transforms the phrases of a real language to a simple form, called abstract representation

Compiler design makers reference to the abstract language to process, not to the effective language

Artificial languages (e.g. programming languages) contain few abstract structures, one of these is lists which can be easily modeled by regexps

Effective and abstracts lists

A list is a sequence of elements with number not fixed in advance
An element can be a terminal (alphabetic symbol) or a compound object (for instance, a list itself)

example:

List with separators and start-marker of end-marker $$ie(se)^f\;\;\;i[e(se)^]f$$

Substitution

Operation that replaces the terminal characters of the source language with the phrases of the destination languages

example

$$L\subset \Sigma^\;\;\;L_b\subset\Delta^$$ $$x\in L \;\;\;x=a_1a_2...a_n\;\;\text{and for some}\;\;\;a_i=b$$ Substituting the language $L_b$ to the letter b in the string x produces a language over the alphabet $(\Sigma\setminus {b})\cup \Delta$ $${y|y=y_1y_2...y_n\wedge (ifa_i\neq b\;\text{then}\;y_i=a_i\;\text{else}\;y_i\in L_b)}$$

Nested lists (or precedence lists)

Lists may contain atomic objects as elements, but also other lists of lower level

example

livello 1: begin istr1;istr2;...;istrn end
livello 2: STAMPA(var1,var2,...,varn)

lesson_03.md 7.1 KB História Raw

FLC - lesson 03

Luca Oddone Breveglieri

8 october 2015

Regular Expressions

regexp cases:

Precedence of operators

Regular languages

examples

Sets of languages

Subexpression of a regexp

Different parenthesization

Derivation

examples

Equivalence

Ambiguity

examples

Other operators

Closure of the $REG$ family to operations

The family $REG$ of the regular languages is the smallest family of languages such that both the following properties hold:

Linguistic Abstraction

Effective and abstracts lists

example:

Substitution

example

Nested lists (or precedence lists)

example

lesson_03.md 7.1 KB

História Raw