Regex

A short tutorial using the Julia programming language

Regular expressions allow us to define a search pattern to look for in a corpus.

For example, we might want to find out whether a sequence of characters occurs in some text, such as whether the sequence “dog” occurs in “I love dogs!”. In Julia, we can check this with the occursin function:

occursin("dog", "I love dogs!")
true
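
As a small aside, when the first argument is a plain string rather than a regex, occursin performs a literal, case-sensitive substring search. The sketch below illustrates this; the variable names are only for illustration.

```julia
# A plain string argument is searched for literally and case-sensitively.
lowercase_hit = occursin("dog", "I love dogs!")    # the substring is present
capitalized_hit = occursin("Dog", "I love dogs!")  # differs only in case

println(lowercase_hit)    # true
println(capitalized_hit)  # false
```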

Sometimes we want to match a pattern case-insensitively. In the following example, the corpus might spell the name GitHub the official way or as github, Github, or GITHUB.

pattern = r"(?i)github(?-i)";
all(text -> occursin(pattern, "The website I spend most of my time is $text"),
    ["GitHub", "Github", "GITHUB", "github"])
true

This pattern uses the ignore-case indicator (?i). It is equivalent to listing both the uppercase and the lowercase form of every letter.

pattern = r"[Gg][Ii][Tt][Hh][Uu][Bb]"
all(text -> occursin(pattern, "The website I spend most of my time is $text"),
    ["GitHub", "Github", "GITHUB", "github"])
true
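
Julia also lets flags be written after the closing quote of a regex literal, so the i flag is another way to ask for case-insensitive matching. The sketch below reuses the four spellings from the examples; the names variants, flagged, and inline are illustrative. The trailing (?-i) in the earlier pattern only switches the option back off, so it is omitted here because nothing follows it.

```julia
variants = ["GitHub", "Github", "GITHUB", "github"]

flagged = r"github"i     # flag written after the closing quote
inline = r"(?i)github"   # inline option, without the trailing (?-i)

all_flagged = all(text -> occursin(flagged, text), variants)
all_inline = all(text -> occursin(inline, text), variants)

println(all_flagged)  # true
println(all_inline)   # true
```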

If instead we wish to allow only certain variations, we can list them as alternatives in the pattern:

pattern = r"(GitHub|github|GITHUB)"
all(text -> occursin(pattern, "The website I spend most of my time is $text"),
    ["GitHub", "Github", "GITHUB", "github"])
false

In this case, the result is false because the pattern deliberately leaves out the spelling Github.
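
To see which spellings the restricted pattern rejects, one option is to filter the list with the negated test. This is a minimal sketch; variants and rejected are illustrative names.

```julia
pattern = r"(GitHub|github|GITHUB)"
variants = ["GitHub", "Github", "GITHUB", "github"]

# Keep only the spellings that the pattern does NOT match.
rejected = filter(text -> !occursin(pattern, text), variants)

println(rejected)  # ["Github"]
```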

We can also perform look-aheads and look-behinds.

The following pattern matches anything that comes after “My name is ” and is followed by a “.”.

pattern = r"(?<=My name is ).*(?=\.)"
foreach(text -> println("\"$text\": $(occursin(pattern, text))"),
        ["Bayoán", "My name is Bayoán!", "My name is Bayoán."])
"Bayoán": false
"My name is Bayoán!": false
"My name is Bayoán.": true

We can also extract the matched text with the match function, which returns nothing when the pattern is not found.

pattern = r"(?<=My name is ).*(?=\.)"
foreach(text -> println(match(pattern, text)),
        ["Bayoán", "My name is Bayoán!", "My name is Bayoán."])
nothing
nothing
RegexMatch("Bayoán")
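
Look-arounds are one way to isolate part of the text; a capture group is a common alternative. The sketch below is not part of the original example: it matches the whole sentence and reads the name from the first group, either by position or through a hypothetical group called name.

```julia
text = "My name is Bayoán."

# Numbered group: the parentheses capture everything before the final period.
m = match(r"My name is (.*)\.", text)
by_position = m[1]

# Named group: the same capture, addressed by a label instead of a number.
named = match(r"My name is (?<name>.*)\.", text)
by_name = named[:name]

println(by_position)  # Bayoán
println(by_name)      # Bayoán
```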

To access the matched text, use the match field of the returned RegexMatch:

match(r"(?<=My name is ).*(?=\.)", "My name is Bayoán.").match
"Bayoán"

We can also find and replace with the replace function:

replace("My name is Bayoán.", r"(?<=My name is ).*(?=\.)" => "Nosferican")
"My name is Nosferican."

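When the replacement should reuse part of the match, Julia's substitution strings (the s"..." literal) can refer to capture groups by number. A minimal sketch, using an illustrative greeting as the replacement text:

```julia
# \1 in the substitution string stands for the text captured by the first group.
greeting = replace("My name is Bayoán.", r"My name is (.*)\." => s"Hello, \1!")

println(greeting)  # Hello, Bayoán!
```
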
Certain patterns are readily available, such as the anchors ^ (start of the text) and $ (end of the text):

pattern = r"^Starts with"
foreach(text -> println("\"$text\": $(occursin(pattern, text))"),
        ["Starts with this.", "Does not Starts with this.", " Starts with this."])
"Starts with this.": true
"Does not Starts with this.": false
" Starts with this.": false

pattern = r"Ends with$"
foreach(text -> println("\"$text\": $(occursin(pattern, text))"),
        ["Ends with this.", "This text Ends with this.", "This text Ends with"])
"Ends with this.": false
"This text Ends with this.": false
"This text Ends with": true

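The two anchors can also be combined so that the pattern has to account for the entire text rather than just one end of it. A brief sketch, reusing the phrase from the example above; the variable names are illustrative:

```julia
pattern = r"^Ends with$"  # anchored at both the start and the end

exact = occursin(pattern, "Ends with")             # the text is exactly the phrase
longer = occursin(pattern, "This text Ends with")  # extra text before the phrase

println(exact)   # true
println(longer)  # false
```
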
For more fun with regex, see the documentation.

J. Bayoán Santiago Calderón
Research Economist

🇵🇷 Economist by training. Data scientist / software developer by accident.