CIS 1051 - Temple Rome Spring 2023

Intro to Problem Solving and Programming in Python

Files

Prof. Andrea Gallegati

( tuj81353@temple.edu )

Persistence

Programs are transient when they:

  • run for a short time
  • produce some output
  • then, their data disappears.

If we run the program again, it starts with a clean slate.

By contrast, programs are persistent when they:

  • run for a long time (even all the time)
  • store some of their data

If they shut down and restart, they pick up where they left off. Examples of persistent programs are:

  • operating systems (aka OS): run pretty much whenever a computer is on
  • web servers: run all the time, waiting for incoming requests from the network

The simplest way for programs to maintain their data is by reading/writing text files.

An alternative is to store the state of the program in a database.

Reading and Writing

A text file is a sequence of characters stored on a permanent medium.

To write a file, open it with mode 'w' as the second parameter:

In [28]:
fout = open('output.txt', 'w')

... be careful!

If the file already exists, opening it in write mode clears out the old data and starts fresh.

If the file doesn’t exist, a new one is created.

open returns a file object with methods for working with the file.

The write method puts data into the file:

In [11]:
line1 = "This here's the wattle,\n"
fout.write(line1)
Out[11]:
24

and returns the number of written characters.

The file object keeps track of where it is, so when we call write again it adds the new data at the end of the file.

In [12]:
line2 = "the emblem of our land...\n"
fout.write(line2)
Out[12]:
26

When done, close the file.

In [27]:
fout.close()

otherwise, it gets closed when the program ends.
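The open/write/close steps above can also be sketched with a with statement, which closes the file automatically. This is a minimal sketch, not part of the original slides: the with statement and the temporary folder (used only to avoid touching real files) are assumptions added for illustration.

```python
import os
import tempfile

# Hypothetical scratch location, so the sketch does not overwrite real files.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The with statement closes the file automatically when the block ends.
with open(path, 'w') as fout:
    fout.write("This here's the wattle,\n")
    fout.write("the emblem of our land...\n")

# Opening with the default mode ('r') lets us read the text back.
with open(path) as fin:
    lines = fin.readlines()

print(lines)
```

Reading the file back shows that the data outlived the file object that wrote it.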

Format Operator %

The argument of write has to be a string.

To put other values in a file, we have to convert them to strings (with str).

In [16]:
x = 52
fout.write(str(x))
Out[16]:
2

An alternative is to use the format operator, %.

  • applied to integers, % is the modulus operator.
  • when the first operand is a string, % is the format operator.

The format string (first operand) contains format sequences that specify how to format the value (second operand):

In [17]:
camels = 42
'%d' % camels
Out[17]:
'42'

The format sequence '%d' means that the second operand should be formatted as a decimal integer.

The result is always a string. Thus, '42' is not to be confused with the integer value 42.

Format sequences can appear anywhere in the string, to embed values in a sentence:

In [18]:
'I have spotted %d camels.' % camels
Out[18]:
'I have spotted 42 camels.'

With more than one format sequence, the second argument has to be a tuple to match each one (in order).

In [19]:
'In %d years I have spotted %g %s.' % (3, 0.1, 'camels')
Out[19]:
'In 3 years I have spotted 0.1 camels.'

here we use

  • '%d' to format an integer
  • '%g' to format a floating-point number
  • '%s' to format a string

Both

  • the number of elements in the tuple
  • the types of elements in the tuple

have to match the number and types of format sequences in the string.

In [21]:
'%d %d %d' % (1, 2)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-76a835ca51c7> in <module>()
----> 1 '%d %d %d' % (1, 2)

TypeError: not enough arguments for format string

here there aren’t enough elements

In [22]:
'%d' % 'dollars'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-22-192ac7d0084e> in <module>()
----> 1 '%d' % 'dollars'

TypeError: %d format: a number is required, not str

while here the element is the wrong type.

A more powerful alternative is the string format method:

In [23]:
"The sum of 1 + 2 is {0}".format(1+2)
Out[23]:
'The sum of 1 + 2 is 3'

The format method performs string formatting operations on the string it is called on.

The string can contain

  • literal text
  • replacement fields {}

Each {} contains

  • either the index of a positional argument
  • or the name of a keyword argument

It returns a copy of the string replacing each {} with the corresponding argument (string value).
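The two kinds of replacement fields listed above can be sketched as follows. The sentences reuse the camels example; the keyword names count and animal are hypothetical, chosen only for illustration.

```python
# Positional arguments are selected by their index.
by_index = 'In {0} years I have spotted {1} {2}.'.format(3, 0.1, 'camels')

# Keyword arguments are selected by their name.
by_name = 'I have spotted {count} {animal}.'.format(count=42, animal='camels')

print(by_index)
print(by_name)
```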

Filenames and Paths

Files are organized into directories (aka “folders”).

Every running program has a “current directory”: the default directory for most of its operations.

For example, when we open a file for reading, Python looks for it in the current directory.

The os module (for “operating system”) provides functions for working with files and directories.

os.getcwd (short for “get current working directory”) returns the name of the current directory:

In [29]:
import os
cwd = os.getcwd()
cwd
Out[29]:
'/data/CIS1051-python/lectures/notebooks'

A string like this, identifying files/directories, is a path.

Simple filenames are relative paths: they are interpreted relative to the current directory.

If the path begins with / it is an absolute path (it does not depend on the current directory).

To find the absolute path to a file, we can use os.path.abspath:

In [30]:
os.path.abspath('README.md')
Out[30]:
'/data/CIS1051-python/lectures/notebooks/README.md'

os.path provides other functions for working with filenames/paths.

In [31]:
os.path.exists('README.md')
Out[31]:
True

this checks whether a file or a directory exists.

In [32]:
os.path.isdir('README.md')
Out[32]:
False

If it exists, this checks whether it’s a directory

In [33]:
os.path.isdir('/data/CIS1051-python/lectures/notebooks')
Out[33]:
True

or a file

In [35]:
os.path.isfile('README.md')
Out[35]:
True

In [45]:
os.listdir("./doc")
Out[45]:
['document.txt', 'funny_document.txt', 'words.txt']

this returns a list of files/directories in the given directory.

In [59]:
def walk(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)

        if os.path.isfile(path):
            print(path)
        else:
            walk(path)
            
walk(cwd + "/../../lab-sessions/snake/challenge/lab_5")
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_00/fruit.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_00/game.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_00/main.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_00/snake.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_00/wall.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_01/fruit.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_01/game.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_01/main.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_01/snake.py
/data/CIS1051-python/lectures/notebooks/../../lab-sessions/snake/challenge/lab_5/level_01/wall.py

this example “walks” through a directory: it prints the names of the files and calls itself recursively on each subdirectory.

os.path.join takes

  • a directory
  • a filename

and joins them into a complete path.

The os module already provides a similar (but more versatile) function, os.walk.
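As a rough sketch of os.walk, the loop below collects every file under a directory without writing the recursion by hand. The small directory tree is hypothetical and is created in a temporary folder only so that the example is self-contained.

```python
import os
import tempfile

# Build a small hypothetical directory tree to walk through.
root = tempfile.mkdtemp()
subdir = os.path.join(root, 'level_00')
os.mkdir(subdir)
for path in (os.path.join(root, 'README.md'),
             os.path.join(subdir, 'game.py'),
             os.path.join(subdir, 'main.py')):
    open(path, 'w').close()  # create an empty file

# os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
found = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        found.append(os.path.join(dirpath, filename))

for path in sorted(found):
    print(path)
```

Unlike the hand-written walk, os.walk also reports the subdirectory names of each directory it visits.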

Catching Exceptions

When trying to read/write files, a lot of things can go wrong.

fin = open('bad_file')
FileNotFoundError: [Errno 2] No such file or directory: 'bad_file'

If we try to open a file that doesn’t exist, we get a FileNotFoundError.

fout = open('/etc/passwd', 'w')
PermissionError: [Errno 13] Permission denied: '/etc/passwd'

If we don’t have the necessary permissions to access a file, we get a PermissionError.

fin = open('/home')
IsADirectoryError: [Errno 21] Is a directory: '/home'

And if we try to open a directory for reading, we get an IsADirectoryError!

Checking for all these errors in advance would take a lot of

  • time
  • code

[Errno 21] indicates there are at least 21 things that can go wrong!

It is better to go ahead and try, and to deal with problems if they happen.

This is exactly what the try statement does.

The syntax is similar to an if...else statement.

In [68]:
try:    
    fin = open('bad_file')
except:
    print('Something went wrong.')
Something went wrong.

Python starts by executing the try clause:

  • If all goes well, it skips the except clause and proceeds.
  • If an exception occurs, it jumps out and runs the except clause.

Handling an exception with a try statement is called catching an exception.

Here, the except clause is not that helpful (just a print).

In general, this gives us a chance to:

  • fix the problem
  • try again
  • end the program gracefully
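A bare except catches every exception, including unrelated ones. As a sketch of a more helpful handler, the except clauses below name the three errors shown earlier. The function read_first_line and the missing file path are hypothetical, introduced only for this illustration.

```python
import os
import tempfile


def read_first_line(path):
    """Return the first line of a file, or None if it cannot be opened."""
    try:
        with open(path) as fin:
            return fin.readline()
    except FileNotFoundError:
        print('No such file:', path)
    except (PermissionError, IsADirectoryError) as err:
        print('Cannot read', path, '-', err)
    return None  # reached only when one of the errors above was caught


# A path inside a fresh temporary folder, so the file surely does not exist.
missing = os.path.join(tempfile.mkdtemp(), 'bad_file')
result = read_first_line(missing)
print(result)
```

Here the program ends gracefully: the caller gets None and can decide whether to try another file.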

Databases

A database (aka DB) is a file organized for storing data.

Many databases are organized like a dictionary: they map from keys to values.

However, a DB persists after the program ends (on permanent storage).

The dbm module provides an interface for:

  • creating
  • updating

database files.

Let's create a DB containing captions for image files.

Opening a DB is similar to opening other files:

In [81]:
import dbm
db = dbm.open('captions', 'c')

The mode 'c' means that the database should be created if it doesn’t exist yet.

The result is a database object that can be used (for most operations) like a dictionary.

In [82]:
db['cleese.png'] = 'Photo of John Cleese.'

When we create a new item, dbm updates the database file.

In [73]:
db['cleese.png']
Out[73]:
b'Photo of John Cleese.'

When we access one item, dbm reads the file.

The result is a bytes object (begins with b), similar to a string in many ways.

In [75]:
db['cleese.png'] = 'Photo of John Cleese doing a silly walk.'
db['cleese.png']
Out[75]:
b'Photo of John Cleese doing a silly walk.'

When we make another assignment to an existing key, dbm replaces the old value.

Some dictionary methods don’t work with database objects, but iteration works:

In [76]:
for key in db:
    print(key, db[key])
b'cleese.png' b'Photo of John Cleese doing a silly walk.'

Close the DB when done (similarly to files).

In [83]:
db.close()
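The whole cycle (create, close, reopen, read) can be sketched as below, to show that the data really persists after closing. The temporary folder, the with statement and the decode calls are additions for illustration, not part of the original example.

```python
import dbm
import os
import tempfile

# A temporary folder keeps this sketch away from real database files.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'captions')

    # Create the database and store one caption; with closes it for us.
    with dbm.open(path, 'c') as db:
        db['cleese.png'] = 'Photo of John Cleese.'

    # Reopen it read-only: the item is still there.
    with dbm.open(path, 'r') as db:
        # Keys and values come back as bytes, so we decode them to str.
        caption = db['cleese.png'].decode('utf-8')
        keys = [key.decode('utf-8') for key in db.keys()]

print(keys, caption)
```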

Pickling

A limitation of dbm is that:

  • keys
  • values

have to be strings or bytes.

With any other type, we get an error.

The pickle module helps: it translates almost any type of object into a byte string (suitable for storage in a DB), and then translates these byte strings back into objects.

  • pickle.dumps (short for “dump string”) serializes a Python object into a binary string representation.
  • pickle.loads (short for “load string”) deserializes the binary string representation back into the original Python object.

In [85]:
import pickle
t = [1, 2, 3]
pickle.dumps(t)
Out[85]:
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'

The format isn’t obvious to human readers: it is meant to be easy for pickle to interpret.

In [86]:
t1 = [1, 2, 3]
s = pickle.dumps(t1)
t2 = pickle.loads(s)
t2
Out[86]:
[1, 2, 3]

This new object has the same value as the old, but it is not (in general) the same object!

In other words, pickling and then unpickling has the same effect as copying the object.
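The difference between “same value” and “same object” can be checked directly with the == and is operators. A minimal sketch:

```python
import pickle

t1 = [1, 2, 3]
t2 = pickle.loads(pickle.dumps(t1))

print(t1 == t2)  # True: the two lists have the same value
print(t1 is t2)  # False: they are two distinct objects
```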

Thanks to pickle, we can store non-string objects in a dbm database. This combination has already been encapsulated in the shelve module!

A Shelf is a persistent, dictionary-like object.

Contrary to dbm databases, here the values (not the keys!) can be arbitrary Python objects, while the keys are ordinary strings.
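A shelf can be sketched as follows: string keys, arbitrary picklable values. The file name inventory and the stored values are hypothetical, and the temporary folder only keeps the example self-contained.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'inventory')

    # Keys are ordinary strings; values are pickled behind the scenes.
    with shelve.open(path) as shelf:
        shelf['camels'] = [1, 2, 3]
        shelf['caption'] = {'file': 'cleese.png', 'tags': ['silly', 'walk']}

    # After reopening, the values come back as Python objects, not bytes.
    with shelve.open(path) as shelf:
        camels = shelf['camels']
        caption = shelf['caption']

print(camels)
print(caption)
```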

In general, the process of:

  • serialization is to convert a data structure/object into a format that can be stored/transmitted.
  • deserialization is to reconstruct the original data structure/object from its serialized format.

In short, serialization and deserialization convert an object into a stream of bytes and back.

The pickle format is not the only one. Other formats, like:

  • json
  • yaml (supported through third-party packages, not the standard library)

provide alternatives.

However, pickle is sometimes preferred for its ability to maintain the state of complex objects/data structures.
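To make the comparison concrete, here is a small sketch that serializes the same data with json and with pickle. The record and the tuple are hypothetical sample data.

```python
import json
import pickle

record = {'name': 'cleese.png', 'sizes': [1, 2, 3]}

as_json = json.dumps(record)      # a human-readable str
as_pickle = pickle.dumps(record)  # an opaque bytes object

print(as_json)
print(type(as_pickle))

# A tuple survives pickling, while JSON turns it into a list.
point = (1, 2)
from_pickle = pickle.loads(pickle.dumps(point))
from_json = json.loads(json.dumps(point))

print(from_pickle)
print(from_json)
```

JSON is readable and portable across languages, but it only knows a few basic types; pickle keeps Python-specific types such as tuples.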

Pipes

Most operating systems provide a CLI (aka a shell), with commands to navigate the file system and launch applications.

Any of these commands can be executed from Python, using a pipe object (which represents a running program).

In [107]:
cmd = 'ls -l ../../lab-sessions/snake/challenge/lab_5/level_01'
fp = os.popen(cmd)

this executes the Unix command ls -l to display the given directory content.

  • The argument is a string with the shell command.
  • The return value is a pipe object that behaves like an open file.

In [108]:
res = fp.read()
print(res)
total 20
-rwxrwxrwx 1 0 root  330 Apr  4 06:03 fruit.py
-rwxrwxrwx 1 0 root 3440 Apr  4 06:03 game.py
-rwxrwxrwx 1 0 root 7105 Apr  4 06:03 main.py
-rwxrwxrwx 1 0 root  958 Apr  4 06:03 snake.py
-rwxrwxrwx 1 0 root  830 Apr  4 06:03 wall.py

We can get the ls process output:

  • line by line, with the readline method
  • or get the whole thing, with the read method

Close the pipe like a file, when done:

In [109]:
stat = fp.close()
print(stat)
None

Its return value is the final status of the ls process.

None means no errors.

Note: os.popen was marked as deprecated in Python 2. In Python 3 it is implemented on top of the subprocess module, which is the recommended tool for running other programs.

In [124]:
import subprocess

cmd = ['ls', '-l', '../../lab-sessions/snake/challenge/lab_5/level_01']

# create a subprocess - capture output and error using a pipe
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate() # read these streams

print(out.decode('utf-8'))
total 20
-rwxrwxrwx 1 0 root  330 Apr  4 06:03 fruit.py
-rwxrwxrwx 1 0 root 3440 Apr  4 06:03 game.py
-rwxrwxrwx 1 0 root 7105 Apr  4 06:03 main.py
-rwxrwxrwx 1 0 root  958 Apr  4 06:03 snake.py
-rwxrwxrwx 1 0 root  830 Apr  4 06:03 wall.py

For simple cases, the subprocess module is more complicated than necessary.
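For what it's worth, recent Python versions (3.7 or later) offer subprocess.run, which shortens the common “run a command and capture its output” case. A minimal sketch: sys.executable (the running Python interpreter) replaces ls here only so that the example runs on any system.

```python
import subprocess
import sys

# The command is a list: the program first, then its arguments.
cmd = [sys.executable, '-c', 'print("hello from a subprocess")']

# capture_output collects stdout/stderr; text=True decodes them to str.
result = subprocess.run(cmd, capture_output=True, text=True)

print(result.returncode)  # 0 means the command ended without errors
print(result.stdout)
```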

On Unix systems, the md5sum command computes a “checksum” based on the contents of a file.

In [129]:
filename = 'output.txt'
cmd = 'md5sum ' + filename
fp = os.popen(cmd)
res = fp.read()
print(res)
stat = fp.close()
d41d8cd98f00b204e9800998ecf8427e  output.txt

It is extremely unlikely that two different contents yield the same checksum by accident.

This is an efficient way to check whether two files have the same contents.
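The same check can be sketched without a shell command, using the hashlib module from the standard library. The helper md5_of_file and the three sample files are hypothetical, written to a temporary folder for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the MD5 checksum of a file's contents as a hex string."""
    digest = hashlib.md5()
    with open(path, 'rb') as fin:
        # Read in chunks, so that large files need not fit in memory.
        for chunk in iter(lambda: fin.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Three hypothetical files: the first two have identical contents.
folder = tempfile.mkdtemp()
paths = []
for name, text in [('a.txt', 'wattle'), ('b.txt', 'wattle'), ('c.txt', 'emblem')]:
    path = os.path.join(folder, name)
    with open(path, 'w') as fout:
        fout.write(text)
    paths.append(path)

sums = [md5_of_file(p) for p in paths]
print(sums[0] == sums[1])  # True: same contents, same checksum
print(sums[0] == sums[2])  # False: different contents
```

Comparing two short checksums is much cheaper than comparing two large files byte by byte.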