4. Strings

4.1. Script [str_01]: string notation
The script [str_01] is as follows:
# strings
# three possible notations
string1 = "one"
string2 = 'two'
string3 = """Helen is going to the
market to buy vegetables"""
# display
print(f"string1=[{string1}], string2=[{string2}], string3=[{string3}]")
Comments
- line 3: a string delimited by double quotes (");
- line 4: a string delimited by single quotes (');
- line 5: a string delimited by triple quotes ("""). In this case, the string can span multiple lines;
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_01.py
string1=[one], string2=[two], string3=[Helen is going to the
market to buy vegetables]
Process finished with exit code 0
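As a complement, here is a minimal sketch (the variable names are illustrative and do not come from [str_01]) of why several notations are convenient: each delimiter lets the other quote character appear without escaping, and a triple-quoted string stores its line break as an ordinary '\n' character.

```python
# illustrative names, not part of [str_01]
quote1 = "it's here"            # a single quote inside double quotes
quote2 = 'she said "yes"'       # double quotes inside single quotes
multi = """first line
second line"""                  # the line break is part of the string

# the same content written on one line with an explicit \n escape
same = "first line\nsecond line"

print(quote1)
print(quote2)
print(multi == same)
print(multi.splitlines())
```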
4.2. Script [str_02]: Methods of the <str> class
The script [str_02] presents some of the methods of the <str> class, which is the string class:
# functions on strings
# string in lowercase
print(f"'ABCD'.lower()={'ABCD'.lower()}")
# uppercase string
print(f"'abcd'.upper()={'abcd'.upper()}")
# character at index 2 (indexes start at 0)
print(f"'horse'[2]={'horse'[2]}")
# substring with characters 5 and 6 ('caractères accentués' is French for 'accented characters')
print(f"'caractères accentués'[5:7]={'caractères accentués'[5:7]}")
# substring starting from character 4 inclusive
print(f"'caractères accentués'[4:]={'caractères accentués'[4:]}")
# substring up to but not including character 5
print(f"'caractères accentués'[:5]={'caractères accentués'[:5]}")
# length of the string
print(f"len('123')={len('123')}")
# remove leading and trailing whitespace from the string
print(f"' abcd '.strip()=[{' abcd '.strip()}]")
# removing whitespace following the string
print(f"' abcd '.rstrip()=[{' abcd '.rstrip()}]")
# Remove whitespace preceding the string
print(f"' abcd '.lstrip()=[{' abcd '.lstrip()}]")
# the term "whitespace" actually covers different characters
str = ' \r\nabcd \t\f'
print(f"str.strip()=[{str.strip()}]")
# Replacing one substring with another
print(f"'abcd'.replace('a','x')={'abcd'.replace('a', 'x')}")
print(f"'abcd'.replace('ab','xy')={'abcd'.replace('ab', 'xy')}")
# searching for a substring: returns the position or -1 if the substring is not found (the search is case-sensitive)
print(f"'abcd'.find('bc')={'abcd'.find('bc')}")
print(f"'abcd'.find('Bc')={'abcd'.find('Bc')}")
# start of a string
print(f"'abcd'.startswith('ab')={'abcd'.startswith('ab')}")
print(f"'abcd'.startswith('x')={'abcd'.startswith('x')}")
# end of a string
print(f"'abcd'.endswith('cd')={'abcd'.endswith('cd')}")
print(f"'abcd'.endswith('x')={'abcd'.endswith('x')}")
# converting a list of strings to a single string
print(f"'[X]'.join(['abcd', '123', 'èéà'])={'[X]'.join(['abcd', '123', 'èéà'])}")
print(f"''.join(['abcd', '123', 'èéà'])={''.join(['abcd', '123', 'èéà'])}")
# Converting a string to a list of strings
print(f"'abcd 123 cdXY'.split('cd')={'abcd 123 cdXY'.split('cd')}")
# extracting words from a string
print(f"'abcd 123 cdXY'.split(None)={'abcd 123 cdXY'.split(None)}")
The comments combined with the results obtained are sufficient for understanding the script. The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_02.py
'ABCD'.lower()=abcd
'abcd'.upper()=ABCD
'horse'[2]=r
'caractères accentués'[5:7]=tè
'caractères accentués'[4:]=ctères accentués
'caractères accentués'[:5]=carac
len('123')=3
' abcd '.strip()=[abcd]
' abcd '.rstrip()=[ abcd]
' abcd '.lstrip()=[abcd ]
str.strip()=[abcd]
'abcd'.replace('a','x')=xbcd
'abcd'.replace('ab','xy')=xycd
'abcd'.find('bc')=1
'abcd'.find('Bc')=-1
'abcd'.startswith('ab')=True
'abcd'.startswith('x')=False
'abcd'.endswith('cd')=True
'abcd'.endswith('x')=False
'[X]'.join(['abcd', '123', 'èéà'])=abcd[X]123[X]èéà
''.join(['abcd', '123', 'èéà'])=abcd123èéà
'abcd 123 cdXY'.split('cd')=['ab', ' 123 ', 'XY']
'abcd 123 cdXY'.split(None)=['abcd', '123', 'cdXY']
Process finished with exit code 0
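Two points of the script are easy to misread: the bounds of a slice, and the value returned by find() when nothing is found. The sketch below illustrates both on a sample word chosen for this example (it is not part of [str_02]); it also uses index() and the 'in' operator, which the script does not cover.

```python
# illustrative sample, not part of [str_02]
word = "caractères"

# s[i:j] takes the characters from index i (included) to index j (excluded)
middle = word[5:7]
# cutting twice at the same index always rebuilds the whole string
rebuilt = word[:5] + word[5:]
# negative indexes count from the end of the string
last_two = word[-2:]

# find() returns -1 when the substring is absent, index() raises ValueError
position = word.find("xyz")
try:
    word.index("xyz")
    raised = False
except ValueError:
    raised = True

# the 'in' operator is enough when only presence matters
present = "act" in word

print(middle, rebuilt == word, last_two, position, raised, present)
```

Because -1 is also a valid index (the last character), testing the result of find() before using it as an index avoids silent mistakes.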
4.3. Script [str_03]: String Encoding (1)
The script [str_03] introduces concepts related to string encoding:
# character encoding
# string type
# French sample sentence: its accented characters (é, è) are what makes the encodings differ
str = "hélène va au marché acheter des légumes"
print(f"str=[{str}, type={type(str)}]")
# UTF-8 encoding
print("--- utf-8")
bytes1 = str.encode('utf-8')
print(f"bytes1={bytes1}, type={type(bytes1)}")
bytes2 = bytes(str, 'utf-8')
print(f"bytes2={bytes2}, type={type(bytes2)}")
# iso-8859-1 encoding
print("--- iso-8859-1")
bytes1 = str.encode('iso-8859-1')
print(f"bytes1={bytes1}, type={type(bytes1)}")
bytes2 = bytes(str, 'iso-8859-1')
print(f"bytes2={bytes2}, type={type(bytes2)}")
# encoding latin1=iso-8859-1
print("--- latin1")
bytes1 = str.encode('latin1')
print(f"bytes1={bytes1}, type={type(bytes1)}")
bytes2 = bytes(str, 'latin1')
print(f"bytes2={bytes2}, type={type(bytes2)}")
Encoding a string of type <str> produces a binary string in which each character is represented by one or more bytes. There are many encodings. The script above shows the two most common ones in the West: "utf-8" and "iso-8859-1", also known as "latin1".
The principle of encoding/decoding is described in |https://realpython.com/python-encodings-guide/|: encoding turns a <str> into a <bytes> object, and decoding turns a <bytes> object back into a <str>.
Comments
- [str] is the initial character string to be encoded. Instances of type <str> are Unicode strings (see |https://docs.python.org/3/howto/unicode.html| and |https://realpython.com/python-encodings-guide/|);
- the "utf-8" block shows two ways to encode a string in UTF-8:
  - str.encode('utf-8');
  - bytes(str, 'utf-8');
- the "iso-8859-1" block does the same thing with the 'iso-8859-1' encoding;
- the "latin1" block shows that 'latin1' is another name for the 'iso-8859-1' encoding;
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_03.py
str=[hélène va au marché acheter des légumes, type=<class 'str'>]
--- utf-8
bytes1=b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes', type=<class 'bytes'>
bytes2=b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes', type=<class 'bytes'>
--- iso-8859-1
bytes1=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
bytes2=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
--- latin1
bytes1=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
bytes2=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
Process finished with exit code 0
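The script relies on two equivalences: the notations str.encode(...) and bytes(str, ...) give the same bytes, and 'latin1' names the same encoding as 'iso-8859-1'. A minimal sketch to check both, assuming a shorter sample string than the script's and using codecs.lookup(), which the script itself does not call:

```python
import codecs

text = "hélène"  # illustrative sample, shorter than the script's sentence

# both notations produce the same bytes object
via_method = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")
print(via_method == via_constructor)

# 'latin1' and 'iso-8859-1' resolve to the same codec
same_codec = codecs.lookup("latin1").name == codecs.lookup("iso-8859-1").name
print(same_codec)
print(text.encode("latin1") == text.encode("iso-8859-1"))
```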
Comments
- line 4: in UTF-8, each of the two accented characters is encoded on two bytes:
  - é: [\xc3\xa9], which is the binary sequence 11000011 10101001;
  - è: [\xc3\xa8], which is the binary sequence 11000011 10101000;
- line 7: in ISO-8859-1, the same two characters are each encoded on a single byte:
  - é: [\xe9], which is the binary sequence 11101001;
  - è: [\xe8], which is the binary sequence 11101000;
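The binary sequences listed above can be recomputed by formatting each byte on 8 bits. This is only a sketch: the helper name bits() is hypothetical and was introduced for this example.

```python
def bits(data: bytes) -> str:
    """Return the bytes of data as space-separated 8-bit binary groups."""
    return " ".join(format(byte, "08b") for byte in data)


# show the two accented characters in both encodings
for char in "éè":
    for encoding in ("utf-8", "iso-8859-1"):
        encoded = char.encode(encoding)
        print(f"{char} {encoding}: {encoded!r} -> {bits(encoded)} ({len(encoded)} byte(s))")
```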
4.4. Script [str_04]: String Encoding (2)
Script [str_04] introduces two other types of encoding: 'base64' and 'quoted-printable'. These two encodings do not encode Unicode character strings but rather binary objects. For example, when you attach a Word document to an email, it will undergo one of these two encodings depending on the email client used. This will be the case for most attached files.
The script is as follows:
# encoding / decoding
import codecs
# string
print("---- Unicode string")
# French sample sentence with accented characters (é, è)
str1 = "hélène va au marché acheter des légumes"
print(f"str1=[{str1}], type(str1)={type(str1)}")
# UTF-8 encoding
print("---- Unicode string -> UTF-8 binary")
bytes1 = bytes(str1, "utf-8")
print(f"bytes1=[{bytes1}], type(bytes1)={type(bytes1)}")
# UTF-8 decoding
print("---- UTF-8 binary -> Unicode string")
str2 = bytes1.decode("utf-8")
print(f"str2=[{str2}], type(str2)={type(str2)}")
print(f"str2==str1={str2 == str1}")
# ISO-8859-1 encoding
print("---- Unicode string -> ISO-8859-1 binary")
bytes2 = bytes(str1, "iso-8859-1")
print(f"bytes2=[{bytes2}], type(bytes2)={type(bytes2)}")
# decoding iso-8859-1
print("---- ISO-8859-1 binary -> Unicode string")
str3 = bytes2.decode("iso-8859-1")
print(f"str3=[{str3}], type(str3)={type(str3)}")
print(f"str3==str1={str3 == str1}")
# decoding error - bytes1 is in utf-8 - we decode it as iso-8859-1
print("--- UTF-8 binary (decoded as ISO-8859-1) --> Unicode string")
str4 = bytes1.decode("iso-8859-1")
print(f"str4=[{str4}], type(str4)={type(str4)}")
# UTF-8 encoding of a Unicode string
print("---- Unicode string -> UTF-8 binary")
bytes3 = codecs.encode(str1, "utf-8")
print(f"bytes3=[{bytes3}], type(bytes3)={type(bytes3)}")
# encoding a UTF-8 binary string to base64
print("---- UTF-8 binary -> Base64 binary")
bytes4 = codecs.encode(bytes1, "base64")
print(f"bytes4=[{bytes4}], type(bytes4)={type(bytes4)}")
# back to the original Unicode string
print("---- Base64 binary -> UTF-8 binary -> Unicode string")
str6 = codecs.decode(bytes4, "base64").decode("utf-8")
print(f"str6=[{str6}], type(str6)={type(str6)}")
# encoding a binary string to quoted-printable
print("---- UTF-8 binary -> quoted-printable binary")
str7 = codecs.encode(bytes1, "quoted-printable")
print(f"str7=[{str7}], type(str7)={type(str7)}")
# back to the original Unicode string
print("---- quoted-printable binary -> UTF-8 binary -> Unicode string")
str8 = codecs.decode(str7, "quoted-printable").decode("utf-8")
print(f"str8=[{str8}], type(str8)={type(str8)}")
Comments
- the [codecs] module provides the 'base64' and 'quoted-printable' encodings. It can handle many others;
- [str1] is the Unicode string that will undergo the various encodings;
- [bytes1]: UTF-8 encoding. This produces a binary string;
- [str2]: UTF-8 decoding to return to the original Unicode string;
- [bytes2], [str3]: we repeat the same process with the 'iso-8859-1' encoding;
- [str4]: a decoding error is shown: [bytes1] is a binary string encoded in 'utf-8', and we decode it as 'iso-8859-1';
- [bytes3]: another way to encode a string in UTF-8, using the [codecs] module;
- [bytes4]: a 'utf-8' binary string is encoded in 'base64';
- [str6]: shows how to convert the 'base64' binary string back to the original Unicode string;
- [str7], [str8]: we repeat this process using 'quoted-printable' encoding instead of 'base64';
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_04.py
---- Unicode string
str1=[hélène va au marché acheter des légumes], type(str1)=<class 'str'>
---- Unicode string -> UTF-8 binary
bytes1=[b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes'], type(bytes1)=<class 'bytes'>
---- UTF-8 binary -> Unicode string
str2=[hélène va au marché acheter des légumes], type(str2)=<class 'str'>
str2==str1=True
---- Unicode string -> ISO-8859-1 binary
bytes2=[b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes'], type(bytes2)=<class 'bytes'>
---- ISO-8859-1 binary -> Unicode string
str3=[hélène va au marché acheter des légumes], type(str3)=<class 'str'>
str3==str1=True
--- UTF-8 binary (decoded as ISO-8859-1) --> Unicode string
str4=[hÃ©lÃ¨ne va au marchÃ© acheter des lÃ©gumes], type(str4)=<class 'str'>
---- Unicode string -> UTF-8 binary
bytes3=[b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes'], type(bytes3)=<class 'bytes'>
---- UTF-8 binary -> Base64 binary
bytes4=[b'aMOpbMOobmUgdmEgYXUgbWFyY2jDqSBhY2hldGVyIGRlcyBsw6lndW1lcw==\n'], type(bytes4)=<class 'bytes'>
---- Base64 binary -> UTF-8 binary -> Unicode string
str6=[hélène va au marché acheter des légumes], type(str6)=<class 'str'>
---- UTF-8 binary -> quoted-printable binary
str7=[b'h=C3=A9l=C3=A8ne=20va=20au=20march=C3=A9=20acheter=20des=20l=C3=A9gumes'], type(str7)=<class 'bytes'>
---- quoted-printable binary -> UTF-8 binary -> Unicode string
str8=[hélène va au marché acheter des légumes], type(str8)=<class 'str'>
Process finished with exit code 0
Comments
- lines 14-15: a UTF-8 binary string is decoded into a Unicode string using the wrong decoder, 'iso-8859-1'. As a result, some of the generated Unicode characters are incorrect, in this case the accented characters;
- lines 18-19: 'base64' encoding uses 64 ASCII characters (encoded on 7 bits) to represent any binary data. Every 3 bytes of input become 4 characters of output, so the encoded data is about one third larger than the original;
- lines 22-23: 'quoted-printable' encoding also uses only ASCII characters (encoded on 7 bits) to represent any binary data. Bytes outside printable ASCII are written as an '=' sign followed by their two-digit hexadecimal value, such as =C3=A9 for 'é' in UTF-8;
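The size increase caused by 'base64' can be measured directly. The sketch below uses the standard [base64] module rather than [codecs] (an assumption of this example, not what [str_04] does); unlike the 'base64' codec, base64.b64encode() does not append a trailing line break.

```python
import base64

data = "hélène va au marché acheter des légumes".encode("utf-8")

# b64encode() emits 4 ASCII characters for every group of 3 input bytes
encoded = base64.b64encode(data)
print(len(data), len(encoded))

# expected length: 4 characters per started group of 3 bytes ('=' pads the last group)
expected = 4 * ((len(data) + 2) // 3)
print(len(encoded) == expected)

# the result contains only ASCII characters
print(encoded.isascii())

# decoding restores the original bytes
print(base64.b64decode(encoded) == data)
```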
It is important to remember that when you receive binary data that represents text, from the internet for example, you must know which encodings it has undergone in order to recover the original text.
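As an illustration, here is a sketch with a hypothetical payload (text encoded in UTF-8, then in base64). The encodings are undone in reverse order, and guessing the wrong character encoding either corrupts the text silently or raises an exception, depending on the direction of the mistake.

```python
import codecs

# hypothetical payload: text encoded in UTF-8, then in base64
original = "hélène va au marché"
payload = codecs.encode(original.encode("utf-8"), "base64")

# undo the encodings in reverse order: base64 first, then UTF-8
recovered = codecs.decode(payload, "base64").decode("utf-8")
print(recovered == original)

# UTF-8 bytes read as ISO-8859-1: no exception, but the text is wrong
wrong = codecs.decode(payload, "base64").decode("iso-8859-1")
print(wrong == original)

# ISO-8859-1 bytes read as UTF-8: the decoder rejects them
latin_bytes = original.encode("iso-8859-1")
try:
    latin_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```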