More on strings

Basics

a = "That's"
b = "fine"
Operator  Description                           Input         Output
+         Concatenation                         a + b         That'sfine
*         Repetition                            2 * b         finefine
[]        Index                                 a[1]          h
[:]       Slice                                 a[1:4]        hat
in        Membership                            'a' in a      True
not in    Non-membership                        'a' not in b  True
r/R       Raw string, suppresses escape chars   r'\n'         \n
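
The operators in the table can be tried out directly. The following is a minimal sketch using the same a and b as above; the other variable names are only illustrative.

```python
a = "That's"
b = "fine"

concatenated = a + b      # "That'sfine"
repeated = 2 * b          # "finefine"
second_char = a[1]        # "h", because indices start at 0
middle = a[1:4]           # "hat", the end index 4 is excluded
has_a = 'a' in a          # True
no_a_in_b = 'a' not in b  # True
raw = r'\n'               # two characters: a backslash and the letter n

print(concatenated, repeated, second_char, middle)
print(has_a, no_a_in_b, raw, len(raw))
```

Note that len(raw) is 2, while the ordinary literal '\n' is a single newline character.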

String methods

Upper and lower case methods

hi = "how ARE you"
Method         Description                               Input            Output
.capitalize()  first letter upper case, the rest lower   hi.capitalize()  How are you
.lower()       lower case                                hi.lower()       how are you
.upper()       upper case                                hi.upper()       HOW ARE YOU
.title()       every word capitalized                    hi.title()       How Are You
In [1]:
hi = "how ARE you"
In [2]:
hi.title()
Out[2]:
'How Are You'
In [3]:
hi.upper()
Out[3]:
'HOW ARE YOU'

Boolean string methods

Method        Boolean value
.isalnum()    alphanumeric characters (no symbols)?
.isalpha()    alphabetic characters (no symbols)?
.islower()    lower case?
.isnumeric()  numeric characters?
.isspace()    whitespace characters?
.istitle()    is in title case?
.isupper()    upper case?
In [4]:
alnum = "23 apples"
num = "-1234"
title = "Big Apple"
print(alnum.isalnum(), num.isnumeric(), title.istitle()) 
False False True
In [5]:
alnum = "23apples"
num = "1234"
white = "   \t   \n\n   "
print(alnum.isalnum(), num.isnumeric(), white.isspace()) 
True True True

String modification methods – join(), strip(), replace(), split()

hi = "how ARE you"
seq = ['h', 'o', 'w']  #OR seq = 'h', 'o', 'w'
wh = "\t this \n "     # white spaces: space, tab, new line,...
Method                   Description                             Input                 Output
.join()                  concatenates with separator string      " < ".join(seq)       h < o < w
.lstrip()                removes leading whitespace              wh.lstrip()           "this \n "
.rstrip()                removes trailing whitespace             wh.rstrip()           "\t this"
.strip()                 performs lstrip() and rstrip()          wh.strip()            "this"
.replace(old, new[, m])  replaces old with new at most m times   hi.replace("o", "O")  hOw ARE yOu
.split(s[, m])           splits at s max m times, returns list   hi.split()            ["how", "ARE", "you"]
In [6]:
s = " < "
seq = "a", "b", "c"   # a sequence of strings (tuple, list)
print(s.join(seq))
a < b < c
In [7]:
hi = "how ARE you today"
hi.replace("o", "O", 2)
Out[7]:
'hOw ARE yOu today'
In [8]:
hi.split() # if no argument is given, separate at white spaces
Out[8]:
['how', 'ARE', 'you', 'today']
In [9]:
hi.split("o", 2)   # separate at the first two occurrences
Out[9]:
['h', 'w ARE y', 'u today']
In [10]:
s = '.,.dots or commas.,.,...'
print(s.strip('.,'))
print(s.rstrip('.,'))
print(s.lstrip('.,'))
dots or commas
.,.dots or commas
dots or commas.,.,...

Formatting

Old style formatting: the % operator

For positional formatting, the % operator is easy to use:

name = "Lucy"
"Hi %s!" % name

The possibilities are summarized in the next table:

%char    short for                     example                output
%s       string                        "Hi %s" % "Joe"        "Hi Joe"
%d       decimal integer: 0123456789   "%d is prime" % 7      "7 is prime"
%o       octal: 01234567               "8 = octal %o" % 8     "8 = octal 10"
%x       hex: 0123456789abcdef         "10 = hex %x" % 10     "10 = hex a"
%X       hex: 0123456789ABCDEF         "10 = hex %X" % 10     "10 = hex A"
%f, %F   floating point                "1.2 = %f" % 1.2       "1.2 = 1.200000"
%e, %E   exponential                   "12 = %e" % 12         "12 = 1.200000e+01"
%g, %G   general                       "1.2e2 = %g" % 1.2e2   "1.2e2 = 120"
In [11]:
x = -12.345
y = 12.3e10
print("f: %f, e: %e, g: %g" % (x, x, x) )
print("f: %f, e: %e, g: %g" % (y, y, y) )
f: -12.345000, e: -1.234500e+01, g: -12.345
f: 123000000000.000000, e: 1.230000e+11, g: 1.23e+11
In [12]:
"Hi %s! You have got %d points!" % ("Joe", 5)
Out[12]:
'Hi Joe! You have got 5 points!'
In [13]:
name = "Lucy"
points = 10
"Hi %s! You have got %d points!" % (name, points)
Out[13]:
'Hi Lucy! You have got 10 points!'

The format method

The format string is the object the method is called on, and the arguments are the values to substitute.

The numbers in the brackets mark the parameters.

In [14]:
'{0}-{1}-{2} {0}, {1}, {2}, {0}{0}'.format('X', 'Y', 'Z')
Out[14]:
'X-Y-Z X, Y, Z, XX'

The format marker "{ }" can have optional formatting instructions: {number:optional}

optional  Meaning
d         decimal
b         binary
o         octal
x, X      hex, capital HEX
f, F      float
e, E      exponential form: something times 10 to some power
<         left justified
>         right justified
^         centered
c^        centered but with a character 'c' as padding
In [15]:
print("01234 01234 01234 0123456789")
print('{0:5} {1:5d} {2:>5} {3:*^10}'.format('0123', 1234, '|', 'center'))
01234 01234 01234 0123456789
0123   1234     | **center**
In [16]:
"int {0:d}, hex {0:x} {0:X}, oct {0:o}, bin {0:b}".format(42)
Out[16]:
'int 42, hex 2a 2A, oct 52, bin 101010'
In [17]:
"{0}, {0:e}, {0:f}, {0:8.4f}, {0:8.1f}".format(-12.345)
Out[17]:
'-12.345, -1.234500e+01, -12.345000, -12.3450,    -12.3'

You can also name the parameters, which is more convenient than using indices.

In [18]:
'The center is: ({x}, {y})'.format(x=3, y=5)
Out[18]:
'The center is: (3, 5)'
In [19]:
x1 = 3; y1 = 4
print('The center is: ({x}, {y})'.format(x=x1, y=y1))
The center is: (3, 4)
In [20]:
tabular = [["1st row", -2, -31], ["Second", 3, 1], ["Third", -32, 11]]
table_string = ""
for row in tabular:
    table_string += "{0:_<11}".format(row[0])
    for i in range(1, len(row)):
        table_string += "{0:5d}".format(row[i])
    table_string += "\n"
print(table_string)
1st row____   -2  -31
Second_____    3    1
Third______  -32   11

Formatted strings – f-strings (new from Python 3.6)

An f-string is a string literal prefixed with 'f' or 'F'. It may contain replacement fields, which are expressions delimited by curly braces {}. Ordinary string literals always have a constant value, whereas f-strings are expressions evaluated at run time!

In [21]:
name = "Lucy"
points = 100
f"Hi {name}! You have {points} points!"
Out[21]:
'Hi Lucy! You have 100 points!'
In [22]:
a, b = 3, 5
f"2({a} + {b}) = {2*(a+b)}"
Out[22]:
'2(3 + 5) = 16'
In [23]:
width = 8
precision = 4
value = -123.4567
f"result: {value:{width}.{precision}}"  # nested fields
Out[23]:
'result:   -123.5'
In [24]:
num = 300
f'{num}, {num:x}, {num:o}, {num:b}, {num:10}, {num:10X}'
Out[24]:
'300, 12c, 454, 100101100,        300,        12C'
In [25]:
num, numb, numo, numx = 12, 0b10010, 0o371, 0xabc
f"{num}, {numb}, {numo}, {numx}"
Out[25]:
'12, 18, 249, 2748'

Be careful with the quotation marks!

In [26]:
d = {"one": 1, "two": 2}
f"{d['one']} is one" # f"{d["one"]} is one" is a SyntaxError before Python 3.12
Out[26]:
'1 is one'

Alignment methods*

In [ ]:
s = "where"
In [ ]:
print('0123456789'*3)
print(s.center(30))
print(s.rjust(30))
print(s.ljust(30))

The argument of each method is the final width of the resulting string.
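
As a small sketch of what these methods return: each result has exactly the requested length, padded with spaces. The optional second argument (a fill character) is not mentioned above; it is part of the signature of str.center(), str.rjust() and str.ljust(). The variable names are only illustrative.

```python
s = "where"
width = 30

centered = s.center(width)   # padding on both sides
right = s.rjust(width)       # padding on the left
left = s.ljust(width)        # padding on the right
starred = s.center(11, "*")  # optional fill character instead of spaces

print('0123456789' * 3)
print(centered)
print(right)
print(left)
print(starred)
print(len(centered), len(right), len(left))
```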

You can print a table nicely:

In [ ]:
tabular_string = ""
for row in tabular:
    tabular_string += row[0].ljust(11)
    for i in range(1, len(row)):
        tabular_string += str(row[i]).rjust(5)
    tabular_string += "\n" 
print(tabular_string)

Regular expressions

A regular expression (regex, regexp) is a sequence of characters that defines a search pattern. Regular expressions are used in many programming languages and text editors. The aim is to recognize strings with given properties (such as an email address, a date, a Roman numeral, an IP address, ...).

Metacharacters

The following characters have a special meaning: . ^ $ * + ? { } [ ] ( ) \ |

Character  Description                                           Example          Fits to
[]         set of characters                                     "[abcd]"         a, b,...
[a-z]      a range of characters                                 "[0-9a-fA-F]"    B, 5,...
[^chars]   not the listed chars                                  "[^qx]"          a, b, c,...
\          escapes special characters, starts special sequences  "\s"             space, tab,...
.          any character (except newline)                        "Wh..."          Where, Whose,...
^          beginning of the string                               "^Once"          Once.....
$          end of the string                                     "finished.$"     .....finished.
?          zero or one occurrence                                "colou?r"        color, colour
*          zero or more occurrences (greedy)                     "woo*w"          wow, woooooow
+          one or more occurrences (greedy)                      "wo+w"           wow, woow
*?         zero or more occurrences (lazy)                       "w.*?w"          shortest match: ww, wow
+?         one or more occurrences (lazy)                        "w.+?w"          shortest match: wow
{n}        exactly n occurrences                                 "al{2}e{2}"      allee
{n,}       at least n occurrences                                "oh{3,}"         ohhh, ohhhhhh
{,n}       at most n occurrences                                 "woo{,3}w"       wow, woow, wooow, woooow
{n,m}      at least n, at most m occurrences                     "wo{1,3}w"       wow, woow, wooow
|          either or                                             "H(a|ae|ä)ndel"  Handel, Haendel, Händel
()         capture and group
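
A few rows of the table can be checked with the re module of Python. This is only a sketch: re.fullmatch() succeeds when the whole text fits the pattern, and the test strings are chosen for illustration.

```python
import re

# (pattern, text, expected result of a full match)
checks = [
    (r"[0-9a-fA-F]", "B", True),
    (r"[^qx]", "q", False),
    (r"Wh...", "Where", True),
    (r"colou?r", "color", True),
    (r"woo*w", "wow", True),
    (r"wo+w", "ww", False),
    (r"al{2}e{2}", "allee", True),
    (r"oh{3,}", "ohh", False),
    (r"wo{1,3}w", "woooow", False),
    (r"H(a|ae|ä)ndel", "Haendel", True),
]
results = [re.fullmatch(p, t) is not None for p, t, _ in checks]
for (p, t, _), ok in zip(checks, results):
    print(f"{p!r:18} {t!r:10} {ok}")

# Greedy takes as much as possible, lazy as little as possible.
text = "wow, what a waterfall"
greedy = re.search(r"w.*w", text).group()
lazy = re.search(r"w.*?w", text).group()
print(repr(greedy), repr(lazy))
```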

Special sequences

Character  Description                                 Examples
\b         beginning or end of a word                  r"\bis"  r"st\b"
\B         NOT the beginning or the end of a word      r"\Bis"  r"st\B"
\d         digit (0-9)                                 r"\d\d-\d\d"
\D         NOT a digit                                 r"\d\d-\D"
\s         whitespace character                        r"for\sever"
\S         NOT a whitespace character                  r"\S"
\w         any word character (a-z, A-Z, 0-9, and _)   r"\s\w\w\w\s"
\W         NOT a word character                        r"\Wword\W"
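
The special sequences can be tried on a short sentence. The following sketch uses re.findall() from the re module of Python; the sample text is made up for illustration.

```python
import re

text = "This list is 12-34, not 56-ab_c!"

word_start_is = re.findall(r"\bis", text)     # "is" at the beginning of a word
inner_is = re.findall(r"\Bis", text)          # "is" NOT at the beginning of a word
digit_pairs = re.findall(r"\d\d-\d\d", text)  # two digits, a hyphen, two digits
words = re.findall(r"\w+", text)              # runs of word characters

print(word_start_is)
print(inner_is)
print(digit_pairs)
print(words)
```

Only the standalone word "is" starts at a word boundary; the "is" inside "This" and "list" is found by the \B version.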

Examples

  1. Two-digit numbers divisible by 4: [02468][048]|[13579][26]
  2. String between ' or " characters: (['"]).*?\1
  3. Any floating-point number: ^[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?$
  4. Roman numerals with capital letters: M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})
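
The four examples can be tried with the re module of Python. This is a sketch: patterns 1, 3 and 4 are taken from the list, while for the quoted string it uses a lazy quantifier followed by the backreference \1 (a choice of this sketch, since a backreference has no special meaning inside a [...] set). The test strings are made up.

```python
import re

div_by_4 = re.compile(r"[02468][048]|[13579][26]")
quoted = re.compile(r"""(['"]).*?\1""")
floating = re.compile(r"^[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?$")
roman = re.compile(r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

# 1. every two-digit string "00".."99" accepted by the pattern
multiples = [n for n in range(100) if div_by_4.fullmatch(f"{n:02d}")]

# 2. group() of each match object is the whole quoted piece
text = """He said "it's fine" and 'ok'."""
quoted_parts = [m.group() for m in quoted.finditer(text)]

# 3. the pattern is anchored with ^ and $, so search() is enough
floats = [t for t in ["3.14", "-.5e-3", "1e10", "1.", "abc"] if floating.search(t)]

# 4. fullmatch(), because this pattern has no anchors of its own
romans = [t for t in ["MCMXCIV", "XLII", "IIII", "VX"] if roman.fullmatch(t)]

print(multiples)
print(quoted_parts)
print(floats)
print(romans)
```

Note that the Roman numeral pattern also accepts the empty string, because every part of it is optional.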

Tasks:

  1. HTML color code of 6 hexadecimal numbers like A34DC8
  2. Date in the form yyyy-mm-dd
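
One possible approach to the two tasks, as a sketch only (try them yourself first). The date pattern is an assumption about what "valid" means here: it restricts the month to 01-12 and the day to 01-31 but does not check the length of each month.

```python
import re

# Task 1: exactly six hexadecimal digits
html_color = re.compile(r"[0-9a-fA-F]{6}")

# Task 2: yyyy-mm-dd with month 01-12 and day 01-31
iso_date = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

color_candidates = ["A34DC8", "a34dc8", "A34DC", "G34DC8"]
date_candidates = ["2024-02-29", "2024-13-01", "2024-00-10", "1999-12-31", "99-12-31"]

colors_ok = [c for c in color_candidates if html_color.fullmatch(c)]
dates_ok = [d for d in date_candidates if iso_date.fullmatch(d)]

print(colors_ok)
print(dates_ok)
```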

Regular expressions (RegEx) in python

The regex functions live in the module re, which is not loaded by default, so you have to import it. Put this line at the beginning of your code.

In [27]:
import re

The functions in the module re:

Function           Description
findall(p, s)      returns a list containing all matches of p in s
search(p, s)       returns a "match object" for the first match of p in s, or None if there is none
split(p, s)        splits s at each match of p and returns a list
sub(p, n, s[, m])  replaces all (or at most m) matches of p in s with the new string n
finditer(p, s)     returns an iterator over the match objects of p in s

A match object records where the pattern was found in the string and what exactly matched. This information can be read with the .span() and .group() methods.
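
A short sketch of re.sub() with and without a limit, and of reading match objects. The replacement string "__" and the variable names are chosen only for illustration; count is the keyword name of the optional limit in re.sub().

```python
import re

s = "confirmation"
p = ".i"

replaced_all = re.sub(p, "__", s)           # every match is replaced
replaced_one = re.sub(p, "__", s, count=1)  # only the first match is replaced

# position and text of every match
found = [(m.span(), m.group()) for m in re.finditer(p, s)]

# search() returns None when the pattern does not occur
no_match = re.search("xyz", s)

print(replaced_all)
print(replaced_one)
print(found)
print(no_match)
```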

In [28]:
s = "confirmation"
p = ".i"
print(re.findall(p, s))
['fi', 'ti']
In [29]:
print(re.search(p, s))
<_sre.SRE_Match object; span=(3, 5), match='fi'>
In [30]:
x = re.search(p, s)
print(x.span())
print(x.group())
(3, 5)
fi
In [31]:
print(re.split(p, s))
['con', 'rma', 'on']

Since backslash and other special characters can be used in a RegEx pattern, you have to be careful with them.

It is best to write the pattern as a so-called raw string. In this format one backslash really means one backslash, so you don't have to escape it.

If you put an r in front of the string, then it is in a raw format.
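
The difference matters for sequences such as \b. In an ordinary Python string "\b" is the backspace character, while in a raw string it stays a backslash followed by b, which the regex engine reads as a word boundary. A minimal sketch, with a made-up sample text:

```python
import re

text = "this is it"

plain = "\bis"      # 3 characters: backspace, i, s
raw = r"\bis"       # 4 characters: backslash, b, i, s
escaped = "\\bis"   # same as the raw string, written with a doubled backslash

plain_hits = re.findall(plain, text)  # looks for a backspace character
raw_hits = re.findall(raw, text)      # looks for "is" at a word boundary

print(len(plain), len(raw), raw == escaped)
print(plain_hits, raw_hits)
```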

In [32]:
x = re.finditer(p, s)
for y in x:
    print(y.group())
fi
ti

Exercise: Triple every quotation mark in a string, whichever kind it is (' or ").

In [33]:
st = """This "word" is an 'other' one."""
p = """(['"])"""
print(st)
print(re.sub(p, r"\1\1\1", st))
This "word" is an 'other' one.
This """word""" is an '''other''' one.
In [34]:
s = "This 'string' has two 'quoted' words"
p1 = "'.*'"    # greedy    
p2 = "'.*?'"   # lazy
print(re.findall(p1, s), "  --> greedy")
print(re.findall(p2, s), "  --> lazy")
["'string' has two 'quoted'"]   --> greedy
["'string'", "'quoted'"]   --> lazy