4. Strings

4.1. Script [str_01]: string notation
The script [str_01] is as follows:
Comments
- line 3: a string delimited by double quotes (");
- line 4: a string delimited by single quotes (');
- line 5: a string enclosed in triple quotes ("""). Such a string can span multiple lines.
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_01.py
chaine1=[un], chaine2=[deux], chaine3=[hélène va au
marché acheter des légumes]
Process finished with exit code 0
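As an illustration of the three notations, here is a minimal sketch. It is not the original [str_01] listing: the variable names and values are taken from the output above, the layout is an assumption, and the line numbers cited in the comments do not apply to it.

```python
# Variable names and values follow the labels shown in the results above.
chaine1 = "un"      # double quotes
chaine2 = 'deux'    # single quotes

# Triple quotes: the literal may span several lines and the line break is kept.
chaine3 = """hélène va au
marché acheter des légumes"""

print(f"chaine1=[{chaine1}], chaine2=[{chaine2}], chaine3=[{chaine3}]")
```

Single and double quotes produce the same type of object; the choice mostly depends on which quote character appears inside the string.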
4.2. Script [str_02]: Methods of the <str> class
The script [str_02] presents some of the methods of the <str> class, which is the string class:
The comments in the script, read alongside the results, are enough to understand it. The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_02.py
'ABCD'.lower()=abcd
'abcd'.upper()=ABCD
'cheval[2]=e
'caractères accentués'[5:7]=tè
'caractères accentués'[4:]=ctères accentués
'caractères accentués'[:5]=carac
len('123')=3
' abcd '.strip()=[abcd]
' abcd '.rstrip()=[ abcd]
' abcd '.lstrip()=[abcd ]
str.strip()=[abcd]
'abcd'.replace('a','x')=xbcd
'abcd'.replace('ab','xy')=xycd
'abcd'.find('bc')=1
'abcd'.find('bc')=-1
'abcd'.startswith('ab')=True
'abcd'.startswith('x')=False
'abcd'.endswith('cd')=True
'abcd'.endswith('x')=False
'[X]'.join(['abcd', '123', 'èéà'])=abcd[X]123[X]èéà
''.join(['abcd', '123', 'èéà'])=abcd123èéà
'abcd 123 cdXY'.split('cd')=['ab', ' 123 ', 'XY']
'abcd 123 cdXY'.split(None)=['abcd', '123', 'cdXY']
Process finished with exit code 0
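A few of these methods can be tried in isolation. The sketch below is not the original [str_02] listing; it reuses some of the values shown in the results, and every variable name in it is an assumption made for illustration.

```python
text = "caractères accentués"

# Slicing: the start index is included, the end index is excluded.
middle = text[5:7]   # characters at indexes 5 and 6
tail = text[4:]      # from index 4 to the end
head = text[:5]      # the first 5 characters

# strip() trims whitespace at both ends, lstrip() on the left, rstrip() on the right.
padded = "  abcd  "
stripped = (padded.strip(), padded.lstrip(), padded.rstrip())

# find() returns the index of the first match, or -1 when there is none.
found = "abcd".find("bc")
missing = "abcd".find("x")

# split() undoes join() as long as no item contains the separator.
joined = "[X]".join(["abcd", "123", "èéà"])
parts = joined.split("[X]")

# split(None) splits on any run of whitespace.
words = "abcd 123 cdXY".split(None)

print(middle, tail, head, stripped, found, missing, joined, parts, words, sep="\n")
```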
4.3. Script [str_03]: String Encoding (1)
The script [str_03] introduces concepts related to string encoding:
Encoding a string of type <str> produces a byte string (type <bytes>) in which each character is represented by one or more bytes. Many encodings exist. The script shows the two most common in the West: "utf-8" and "iso-8859-1", also known as "latin1".
The principle of encoding/decoding is illustrated below (ref. |https://realpython.com/python-encodings-guide/ |):

Comments
- lines 4-5: the initial character string to be encoded. Instances of type <str> are Unicode strings (see |https://docs.python.org/3/howto/unicode.html| and |https://realpython.com/python-encodings-guide/|);
- lines 6-11: two ways to encode a string in UTF-8:
  - line 8: str.encode('utf-8');
  - line 10: bytes(str, 'utf-8');
- lines 12-17: we do the same thing with the 'iso-8859-1' encoding;
- lines 18-23: 'latin1' is another name for the 'iso-8859-1' encoding.
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_03.py
str=[hélène va au marché acheter des légumes, type=<class 'str'>
--- utf-8
bytes1=b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes', type=<class 'bytes'>
bytes2=b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes', type=<class 'bytes'>
--- iso-8859-1
bytes1=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
bytes2=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
--- latin1
bytes1=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
bytes2=b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes', type=<class 'bytes'>
Process finished with exit code 0
Comments
- line 4 of the results: with UTF-8, each accented character is encoded on two bytes:
  - é: [\xc3\xa9], which is the binary sequence 11000011 10101001;
  - è: [\xc3\xa8], which is the binary sequence 11000011 10101000;
- line 7 of the results: with ISO-8859-1, each of these two characters is encoded on a single byte:
  - é: [\xe9], which is the binary sequence 11101001;
  - è: [\xe8], which is the binary sequence 11101000.
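The byte values and binary sequences quoted above can be checked with a short sketch. It is not the original [str_03] listing; the helper to_bits and the variable names are assumptions introduced for illustration.

```python
text = "hélène"

# Two equivalent ways to encode a <str> into <bytes>.
utf8_bytes = text.encode("utf-8")
utf8_bytes_alt = bytes(text, "utf-8")

# 'latin1' is an alias of 'iso-8859-1'.
latin1_bytes = text.encode("iso-8859-1")
latin1_bytes_alt = text.encode("latin1")


def to_bits(data: bytes) -> str:
    """Return the bytes as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in data)


print(utf8_bytes, len(utf8_bytes))       # the accented characters take two bytes each
print(latin1_bytes, len(latin1_bytes))   # one byte per character
print(to_bits("é".encode("utf-8")))
print(to_bits("é".encode("iso-8859-1")))
```

The same six-character string occupies 8 bytes in UTF-8 and 6 bytes in ISO-8859-1, which is why the decoder must be told which encoding produced the bytes.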
4.4. Script [str_04]: Character string encoding (2)
Script [str_04] introduces two other types of encoding: 'base64' and 'quoted-printable'. These two encodings apply to binary data rather than to Unicode strings: they turn arbitrary bytes into ASCII text. For example, when you attach a Word document to an email, the file undergoes one of these two encodings, depending on the email client used. The same is true of most attached files.
The script is as follows:
Comments
- line 2: the [codecs] module allows for 'base64' and 'quoted-printable' encodings. It can handle many others;
- lines 4–7: the Unicode string that will undergo various encodings;
- lines 9-12: UTF-8 encoding. This produces a binary string;
- lines 14-18: UTF-8 decoding to return to the original Unicode string;
- lines 20–29: we repeat the same process with the 'iso-8859-1' encoding;
- lines 31-34: a decoding error is shown:
  - line 33: bytes1 is a binary string encoded in 'utf-8'. We decode it as if it were 'iso-8859-1';
- lines 36–39: another way to encode a string in UTF-8 using the [codecs] module;
- lines 41-44: a 'utf-8' binary string is encoded in 'base64';
- lines 46–49: show how to convert the 'base64' binary string back to the original Unicode string;
- lines 51–59: we repeat this process using 'quoted-printable' encoding instead of 'base64';
The results are as follows:
C:\Data\st-2020\dev\python\cours-2020\python3-flask-2020\venv\Scripts\python.exe C:/Data/st-2020/dev/python/cours-2020/python3-flask-2020/strings/str_04.py
---- chaîne unicode
str1=[hélène va au marché acheter des légumes], type(str1)=<class 'str'>
---- chaîne unicode -> binaire utf-8
bytes1=[b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes'], type(bytes1)=<class 'bytes'>
---- binaire utf-8 -> chaîne unicode
str2=[hélène va au marché acheter des légumes], type(str2)=<class 'str'>
str2==str1=True
---- chaîne unicode -> binaire iso-8859-1
bytes2=[b'h\xe9l\xe8ne va au march\xe9 acheter des l\xe9gumes'], type(bytes2)=<class 'bytes'>
---- binaire iso-8859-1 -> chaîne unicode
str3=[hélène va au marché acheter des légumes], type(str3)=<class 'str'>
str3==str1=True
--- binaire utf-8 (décodage iso-8859-1) --> chaîne unicode
str4=[hÃ©lÃ¨ne va au marchÃ© acheter des lÃ©gumes], type(str4)=<class 'str'>
---- chaîne unicode -> binaire utf-8
bytes3=[b'h\xc3\xa9l\xc3\xa8ne va au march\xc3\xa9 acheter des l\xc3\xa9gumes'], type(bytes3)=<class 'bytes'>
---- binaire utf-8 -> binaire base64
bytes4=[b'aMOpbMOobmUgdmEgYXUgbWFyY2jDqSBhY2hldGVyIGRlcyBsw6lndW1lcw==\n'], type(bytes4)=<class 'bytes'>
---- binaire base64 -> binaire utf-8 -> chaîne unicode
str6=[hélène va au marché acheter des légumes], type(str6)=<class 'str'>
---- binaire utf-8 -> binaire quoted-printable
str7=[b'h=C3=A9l=C3=A8ne=20va=20au=20march=C3=A9=20acheter=20des=20l=C3=A9gumes'], type(str7)=<class 'bytes'>
---- binaire quoted-printable -> binaire utf-8 -> chaîne unicode
str8=[hélène va au marché acheter des légumes], type(str8)=<class 'str'>
Process finished with exit code 0
Comments
- lines 14-15: a UTF-8 binary string is decoded into a Unicode string with the wrong decoder, 'iso-8859-1'. Each byte is then read as a separate character, so the accented characters, which UTF-8 encodes on two bytes, come out as two incorrect characters each;
- lines 18-19: 'base64' encoding uses an alphabet of 64 ASCII characters (encoded on 7 bits) to represent any binary data. Each group of 3 bytes becomes 4 characters, so the encoded data is about one third larger than the original;
- lines 22-23: 'quoted-printable' encoding also uses only ASCII characters (encoded on 7 bits). Bytes outside its permitted set are written as '=' followed by two hexadecimal digits.
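The effect of a wrong decoder and the size cost of the two ASCII encodings can be observed with the sketch below. It is not the original [str_04] listing: the variable names are assumptions, and 'quopri' is used as the codec name for quoted-printable in the [codecs] module.

```python
import codecs

text = "hélène va au marché acheter des légumes"
utf8_bytes = text.encode("utf-8")

# Wrong decoder: each UTF-8 byte is read as one ISO-8859-1 character.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# ISO-8859-1 maps every byte value to a character, so this particular
# mistake can be undone by re-encoding and decoding with the right codec.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == text)

# Both encodings below turn arbitrary bytes into ASCII-only bytes.
b64_bytes = codecs.encode(utf8_bytes, "base64")
qp_bytes = codecs.encode(utf8_bytes, "quopri")
print(len(utf8_bytes), len(b64_bytes), len(qp_bytes))
```

No exception is raised by the wrong decoding, which is what makes this kind of error easy to miss: the result is a valid string, only not the intended one.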
It is important to remember that when you receive binary data that represents text (from the internet, for example), you must know every encoding it has undergone in order to recover the original text.
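To make this concrete, here is a minimal sketch in which the text has gone through two encodings in a row. The function recover_text and its parameter names are hypothetical, introduced only for this illustration; it relies on codecs.decode from the standard library.

```python
import codecs


def recover_text(data: bytes, transfer_encoding: str, charset: str) -> str:
    """Hypothetical helper: undo a bytes-to-ASCII encoding, then decode the text."""
    raw = codecs.decode(data, transfer_encoding)   # for example 'base64' or 'quopri'
    return raw.decode(charset)


original = "hélène va au marché"

# The sender encodes the text in ISO-8859-1, then wraps the bytes in base64.
wire = codecs.encode(original.encode("iso-8859-1"), "base64")

# Knowing both encodings, the receiver gets the original text back.
recovered = recover_text(wire, "base64", "iso-8859-1")
print(recovered)

# Assuming the wrong character encoding fails here, because these
# ISO-8859-1 bytes do not form valid UTF-8 sequences.
try:
    recover_text(wire, "base64", "utf-8")
    failed = False
except UnicodeDecodeError as error:
    failed = True
    print("wrong charset:", error)
```

The encodings are undone in the reverse order of their application: the ASCII wrapper first, the character encoding last.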