Data Structures Week 1

Comp 271-400 Week 1

Crown 103, 4:15-8:15

Welcome

Readings:

    Bailey chapter 1, on objects generally.
    Bailey chapter 2, on assertions. We will come back to this later; you can skim it for now.
    Bailey chapter 3, on a Vector class

    Morin chapter 1, sections 1.1 and 1.2
        One slight peculiarity of Morin is that he refers to the array-based List implementation of chapter 2 as an ArrayStack.

Primary text: Bailey, online, and maybe Morin, also online.

What the heck is a data structure? Any structure that involves data! Generally, data that is in addition to that needed to implement the structure itself.

On page 1 of Morin there is a list of some applications of data structures:

opening a file and finding the right data blocks on a disk
Looking up someone in your phone's contacts list
Logging in to a social network
Looking up something with Google (or Bing)
Calling 911

All of these examples in some sense involve search, and retrieval of the correct data.

A data structure can be a general-purpose container, like a list, or an application-specific structure. Note, however, that it is usually a good idea to try to make application-specific structures into something more general, if only to separate conceptually the app-specific part from the general part.

Examples we will look at this semester

Array-based list
linked list
ordered tree
unordered expression tree
dictionary / tree
dictionary / hash table
set / hash table

A couple we won't do much with:

directed graph
undirected graph

What else will we do?

We will look at algorithms, specifically for searching and sorting. The "obvious" algorithms can be much improved on. We will quantify this by looking at the approximate running time of algorithms.

We will look at a few advanced programming ideas, though these will be limited.

We will look at recursion, and how to handle complex recursive data structures.

As an example of recursion, and as an example of how programming works "under the hood", we will look at a simple compiler for a simple language, miniJava.

We will look at how objects and polymorphism can simplify programming.

Finally, we may look at Java's dynamic memory management (with garbage collection), and compare it to, say, the C++ approach.

Interfaces

In object-oriented languages, we implement data structures such as in the list above with classes. The class interface specifies the publicly available operations, in the form of methods (or, especially in C++, member functions) that the class provides. The interface is all the client programmer needs to know; the implementation can be changed at will. It is not uncommon, for example, for a programmer to change a list implementation from array to linked, or vice-versa.

Morin discusses interfaces in Section 1.2. Examples: queue, list, set, sorted set

Classes let us:

Keep the client-programmer interface abstract
Separate interface from implementation
Allow the implementation to be improved later
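
As a small sketch of what separating interface from implementation buys us, the code below is written against Java's standard List interface; only the line that constructs each list names a concrete class. The class name ListSwapDemo and the method firstAndLast() are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ListSwapDemo {
    // Client code: it sees only the List interface, never the implementation.
    // pre: names is not empty
    static String firstAndLast(List<String> names) {
        return names.get(0) + " ... " + names.get(names.size() - 1);
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>();    // array-based implementation
        List<String> b = new LinkedList<>();   // linked implementation
        for (String s : new String[] {"ann", "bob", "cy"}) {
            a.add(s);
            b.add(s);
        }
        // Same client code, same answer, different internals.
        System.out.println(firstAndLast(a));   // prints ann ... cy
        System.out.println(firstAndLast(b));   // prints ann ... cy
    }
}
```

Switching a program from one implementation to the other means changing only the "new" expression; firstAndLast() is untouched.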

Long before object-oriented programming was a thing, there were abstract data types: data structures that presented to the user-programmer only the minimum interface needed; access to the underlying implementation was restricted. For example, a stack implemented using an array would not grant access to individual array components; the programmer could use only the interface operations push() and pop().
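
Here is a minimal sketch of the array-based stack just described, restricted to int values. The names IntStackSketch, _items and _size, the doubling strategy, and the extra isEmpty() method are assumptions made for this example, not taken from either textbook.

```java
import java.util.Arrays;

class IntStackSketch {
    private int[] _items = new int[4];   // hidden: clients never see the array
    private int _size = 0;               // number of values currently stored

    // post: x is on top of the stack
    public void push(int x) {
        if (_size == _items.length) {    // out of room: double the array
            _items = Arrays.copyOf(_items, 2 * _items.length);
        }
        _items[_size++] = x;
    }

    // pre: the stack is not empty
    // post: the top value is removed and returned
    public int pop() {
        if (_size == 0) throw new IllegalStateException("pop from empty stack");
        return _items[--_size];
    }

    public boolean isEmpty() { return _size == 0; }
}
```

Because _items is private, a client has no way to read or overwrite, say, the bottom element; push() and pop() are the only way in.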

General-purpose container classes typically involve type parameters, called generic classes in Java. The parameter type is enclosed in angled brackets, a tradition that comes from the C++ notion of a template.

ArrayList<Integer> theCounts;   // a type parameter must be a class type, so Integer rather than int
ArrayList<Message> mailbox;

The central idea of a class is that we can define the public interface however we want, and keep the internals to ourselves. Thus, outside users have only the interface we provide. If there is a length field, or a pointer to the first cell, then if we keep the field private we need not fear that some client programmer will modify it and thus make a mess of it.

Class rule #1: all class data fields should be private. This way you, the author of the class, have full control over its internal structure. Bailey prefers protected, but there is considerable argument against that.

Note that protected is often considered to be a bad idea, despite Bailey's enthusiasm; we will use private instead!

Who are we keeping data private (or protected) from? The idea is that you are the class implementor, and some other programmer (possibly you at a later date) is the client programmer. See Bailey §1.9.

Class methods can be divided into accessors, which are guaranteed never to modify internal class fields, and mutators, which may (though perhaps only in selected cases). An accessor that grants access to a specific class field, for example getXcoord() or getName(), is called a field accessor; corresponding field mutators might be setXcoord() and setName().

A class need not have any mutators; such a class is said to be immutable. This means that, once the object is created, it cannot be changed. If one wants a modified object, one has to create a new one.

String

Example 1: Bailey, p 7: two ways of implementing a string. Possible interface:

charAt(int), length()

Note that how we implement the class has no bearing on the interface!
Other examples of objects:

Point (x and y coordinates)
Ratio (representing an exact fraction)
Student record (name, address, registrations, etc)
BankAccount record
Association
Stack
Rectangle

The Point, Ratio, BankAccount, Association and Rectangle objects each consist of two member values; Point, Ratio and BankAccount consist of two integer member values. But there is quite a difference!

Point

Here is a pretty minimal class. We have a constructor and two (field) accessors. There is no mutator, and so the class is immutable. If you want to move a point, you have to create a new one.

Note the convention that fields begin with _; many find some such convention helpful. (To be fair, others find this kind of convention irksome.)

class Point {
    private int _xcoord;
    private int _ycoord;
    public Point(int x, int y) {
          _xcoord = x;
          _ycoord = y;
    }
    public int getX() {return _xcoord;}
    public int getY() {return _ycoord;}
}

Note the methods getX() and getY(), which are field accessors. Field mutators, to allow "moving" a point, would make sense, but are not necessary as it's just as easy to create a new point at the new location.
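
To make "create a new point at the new location" concrete, here is a sketch of the same class with one extra method. The class name PointSketch, the method translate(), and the use of final on the fields are assumptions added for this example.

```java
class PointSketch {
    private final int _xcoord;   // final: the compiler enforces immutability
    private final int _ycoord;

    public PointSketch(int x, int y) {
        _xcoord = x;
        _ycoord = y;
    }
    public int getX() { return _xcoord; }
    public int getY() { return _ycoord; }

    // Hypothetical helper: "moves" the point by building a new one.
    // The original point is left unchanged.
    public PointSketch translate(int dx, int dy) {
        return new PointSketch(_xcoord + dx, _ycoord + dy);
    }
}
```

After PointSketch p = new PointSketch(3,4) and PointSketch q = p.translate(1,-1), q is at (4,3) while p is still at (3,4).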

Ratio example, Bailey p 8

We are creating a class to represent rational numbers. Note especially the gcd() and reduce() methods (we return to gcd() below). Also note that the Ratio class is again immutable. To change a value, we assign an entirely new value.

The Ratio class has field accessors getNumerator() and getDenominator(), but these don't quite work the way you might think. What is returned by getNumerator() for new Ratio(4,6)? For new Ratio(4,-6)?
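
One way to explore the question is with a small stand-in class. The sketch below is not Bailey's code; it assumes that the constructor reduces the fraction to lowest terms and keeps the denominator positive. The names RatioSketch, _numerator and _denominator are made up here.

```java
class RatioSketch {
    private final int _numerator;
    private final int _denominator;

    // pre: bottom != 0
    public RatioSketch(int top, int bottom) {
        if (bottom == 0) throw new IllegalArgumentException("zero denominator");
        int g = gcd(Math.abs(top), Math.abs(bottom));
        if (bottom < 0) g = -g;          // assumption: the sign lives in the numerator
        _numerator = top / g;
        _denominator = bottom / g;
    }

    // pre: a >= 0, b > 0
    private static int gcd(int a, int b) {
        if (a == 0) return b;
        return gcd(b % a, a);
    }

    public int getNumerator()   { return _numerator; }
    public int getDenominator() { return _denominator; }
}
```

Under these assumptions, new RatioSketch(4,6) reports numerator 2 and denominator 3, and new RatioSketch(4,-6) reports numerator -2 and denominator 3; in neither case does the accessor return the 4 that was passed in.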

toString()

Note the toString() method of the Ratio class. You can call this explicitly whenever needed, as r.toString(). However, note that toString() works (for us) more generally; it is our first example of inheritance. (There is nothing special about toString(); any method can, in the right circumstances, have its workings affected by inheritance.) The master parent class Object defines toString(); any subclass can override that definition, as is being done here. In System.out.println(), when something needs to be printed then toString() is called implicitly; the rules of inheritance ensure that the most specific version of toString is the one to be invoked.

See objects.html#object_toString.
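
The following sketch shows the implicit call at work. Fraction is a hypothetical stand-in for Ratio (with no reduction), and Plain is a class that does not override toString() at all.

```java
class Fraction {   // hypothetical stand-in for Ratio
    private final int _top;
    private final int _bottom;
    Fraction(int top, int bottom) { _top = top; _bottom = bottom; }

    @Override
    public String toString() { return _top + "/" + _bottom; }
}

class Plain { }    // no toString(): inherits the version defined in Object

class ToStringDemo {
    public static void main(String[] args) {
        Object o = new Fraction(19, 37);   // declared type Object, actual type Fraction
        System.out.println(o);             // prints 19/37
        String s = "r = " + o;             // string concatenation also calls toString()
        System.out.println(s);             // prints r = 19/37
        System.out.println(new Plain());   // class name, '@', and a hash code
        // String t = o;                   // would not compile: no automatic conversion
    }
}
```

Even though o is declared as an Object, it is Fraction's toString() that runs; that is the "most specific version" rule in action.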

Class demo:

change Ratio.toString() so it prints differently (in demos/ratio.java). Perhaps parentheses should be included: (19/37)
verify that conversion is not automatic; we can't do String s = r1. System.out.println() must be explicitly calling toString!

Loops

Suppose we have ArrayList<String> L, that has data in it. How can we print out the entries?

1. while loop

int i=0;    // Java
while (i< L.size()) {
    System.out.println(L.get(i));
    i++;
}

For a while loop, the loop variable 'i' must be declared before the while.

2. for loop ("classic for")

for (int i = 0; i < L.size(); i++) { // Java
    System.out.println(L.get(i));
}

Note that I've chosen to declare i within the loop here. You can do that or else declare the loop variable as in the while loop example above.

3. for-each loop

for (String s : L)
    System.out.println(s);

Note that we don't have get(i) here; the for-each loop uses the String variable s as the "loop variable". Note that s must be declared within the loop, as shown. Java takes care of assigning to s each element of L, in turn.

4. Iterator loop

Iterator<String> it = L.iterator();
while (it.hasNext())
    System.out.println(it.next());

This is an iterator. Iterators were sort of a predecessor to the for-each loop. Both Iterators and for-each work for any Collection, not just ArrayList. Why would you use an Iterator, rather than the for-each loop? There are times when the loop structure isn't so simple; consider a single loop that takes elements from two lists, one element from each list on every pass. You can't do that with a for-each loop, because the for-each loop would go through just one of the lists.

An iterator is a precise way of keeping track of the "current position" in a list. The actual object representing the iterator has two pieces: a reference to the original list, and also a current position.
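
The two-list case mentioned above can be sketched with a pair of iterators. The class name PairUp, the method pairUp(), and the choice to stop as soon as either list runs out are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PairUp {
    // Walks two lists in step, taking one element from each on every pass.
    // Stops when either list runs out.
    static List<String> pairUp(List<String> first, List<String> second) {
        List<String> result = new ArrayList<>();
        Iterator<String> it1 = first.iterator();
        Iterator<String> it2 = second.iterator();
        while (it1.hasNext() && it2.hasNext()) {
            result.add(it1.next() + "=" + it2.next());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        List<String> vals = Arrays.asList("1", "2");
        System.out.println(pairUp(keys, vals));   // prints [a=1, b=2]
    }
}
```

Each iterator carries its own current position, so the two lists advance independently inside one loop.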

Loop Patterns

It is also possible to approach loops from the perspective of how to write different loops to accomplish different things. For a good summary of loop patterns, see www.cs.uni.edu/~wallingf/patterns/loops.html. Here are a few examples:

1. Process All Items: Here the for-each, classic for or while loops all work. The latter two give you the position value ("i").

You would use a process-all-items loop to find the sum of the values in a list. How would you find the maximum? Here you might use a conditional process-all-items approach:

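int i = 0;
int max = Integer.MIN_VALUE;    // starting at 0 would fail for a list of all-negative values
while (i < A.size()) {
    if (A.get(i) > max) max = A.get(i);    // A is an ArrayList<Integer>
    i++;
}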

2. Loop-and-a-half, or Process-items-until-an-event: Suppose we want to process items until something happens. Suppose we're reading data values, and a special sentinel value (-1) is returned to mark the end of the data. We can do this as:

val = getvalue();
while (val != sentinel) {
    processvalue(val);
    val = getvalue();
}

A popular alternative is the break loop:

while (true) {
    val = getvalue();
    if (val == sentinel) break;
    processvalue(val);
}

This approach irritates some language purists, but I am not one of them.
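
Here is a runnable sketch of the break-loop version. It reads its values from a String with java.util.Scanner, standing in for getvalue(), and "processes" each value by adding it to a running sum; the names SentinelSum and sumUntilSentinel() are made up for this example.

```java
import java.util.Scanner;

class SentinelSum {
    static final int SENTINEL = -1;

    // Adds up the integers in 'input' that appear before the first -1.
    static int sumUntilSentinel(String input) {
        Scanner in = new Scanner(input);
        int sum = 0;
        while (in.hasNextInt()) {          // also stop if the data simply runs out
            int val = in.nextInt();
            if (val == SENTINEL) break;    // the "event"
            sum += val;                    // the "process" step
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumUntilSentinel("3 5 7 -1 9"));   // prints 15
    }
}
```

The 9 after the sentinel is never read, and the read-a-value step appears only once, which is the main attraction of the break form.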

3. Searching a list: this is like the preceding, except that we're searching a list for a value v for which valtest(v) is true. However, there might not be any such value in the list, so we have to ensure that we don't run off the end of the list:

int i = 0;
while (i < A.size() && !valtest(A.get(i))) { // not valtest
    i++;
}

This loop looks suspicious! The loop body is too plain! But it works. If the loop terminates with i<A.size(), then A.get(i) is the value for which valtest() succeeded; if it terminates with i==A.size() then no value was found. The alternative is the break loop:

int i = 0;
while (true) {
    if (i >= A.size() || valtest(A.get(i))) break; // ran off the end, or found it
    i++;
}

Note the sense of the condition is reversed.
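
As a runnable sketch of the search loop, the method below takes the test as a parameter. Using java.util.function.Predicate to stand in for valtest(), and returning -1 for "not found", are choices made for this example; the names FirstMatch and indexOfFirst() are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class FirstMatch {
    // Returns the position of the first element passing 'valtest',
    // or -1 if no element does.
    static int indexOfFirst(List<Integer> A, Predicate<Integer> valtest) {
        int i = 0;
        while (i < A.size() && !valtest.test(A.get(i))) {
            i++;
        }
        return (i < A.size()) ? i : -1;
    }

    public static void main(String[] args) {
        List<Integer> A = Arrays.asList(3, 8, 12, 5);
        System.out.println(indexOfFirst(A, v -> v > 10));    // prints 2
        System.out.println(indexOfFirst(A, v -> v > 100));   // prints -1
    }
}
```

The order of the two tests in the while condition matters: i < A.size() is checked first, so A.get(i) is never called with an out-of-range index.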

Back to the Ratio class

The gcd() method on Bailey page 9 is recursive: it calls itself. How does this work?

There are a few separate issues. First, we note that gcd(a,b) = gcd(a,b%a), always; any divisor of a and b is a divisor of b%a (which has the form b-ka), and any divisor of a and b%a is a divisor of b.

The second issue, though, is how it can even be legal for a function to call itself. Internally, the runtime system handles this by creating a separate set of local variables for each call to gcd(). This is done on the so-called runtime stack. This means that different calls to gcd(), with different parameter values, don't interact or interfere.

Finally, there's the question of whether gcd() ever returns. One way to prove this is to argue that the first parameter to gcd() keeps getting smaller. We stop when it reaches 0, as it must. The atomic case in the recursion (often called the base case) is the case that involves no further recursive calls; in the gcd() example it is the case when a==0.
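
To watch the separate calls stack up, here is a sketch of a recursive gcd that prints each call, indented by its depth. It is written in the spirit of the discussion above rather than copied from Bailey; the extra depth parameter and the name GcdTrace are assumptions added for tracing. It uses the identity in the form gcd(a,b) = gcd(b%a, a), so that the first parameter is the one that shrinks.

```java
class GcdTrace {
    // pre: a >= 0, b >= 0, not both zero
    static int gcd(int a, int b, int depth) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");
        System.out.println(indent + "gcd(" + a + "," + b + ")");

        if (a == 0) return b;               // atomic case: no further calls
        return gcd(b % a, a, depth + 1);    // new first argument b % a is less than a
    }

    public static void main(String[] args) {
        System.out.println("result: " + gcd(12, 18, 0));
    }
}
```

Running main() prints gcd(12,18), then gcd(6,12), then gcd(0,6), each indented one level further, followed by result: 6. Each line corresponds to its own set of a and b on the runtime stack.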

How could we create an iterative (looping) version? Here's one possibility.

// pre: a>=0, b>=0, a>0 or b>0
int gcd(int a, int b) {
    while (a>0 && b>0) {
        if (a>=b) a = a % b;    // replace the larger value by its remainder
        else b = b % a;
    }
    if (a==0) return b; else return a;
}

More classes with two fields

Ratio

Both fields are integers. There are accessors for numerator and denominator, but no mutators. Also, the numerator and denominator stored may not equal the numerator and denominator supplied by the constructor.

Point

Again, both fields are integers. There are accessors for both x and y, but, again, no mutators. However, the x and y supplied by the constructor are the x and y actually used.

Student and BankAccount examples

Bailey includes a BankAccount example on page 11. There are two fields: an account_number and a balance (Bailey has the account_number be a String, though it could be an integer as well).

A related example might be a Student class (which has more than two fields!), with fields for name, address, and other personal details. Each of these classes comes equipped with a nearly full range of field accessors that return individual fields, and field mutators that update them.

Note, however, that in the BankAccount class there is no mutator for the account field; we do not anticipate changing that. Also, there is no field mutator for balance; we can deposit() and withdraw() money, but we can't just set the balance to whatever we want. In a real banking system, this helps the programmers verify that when money is deposited, it has to come from somewhere else.

Notice also the pre- and post-conditions for the methods. These are a good idea, though they take some getting used to and "trivial" preconditions can be inscrutable.

Also note the .equals() method.
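
The points above can be pulled together in a sketch. This is not Bailey's code: the names, the use of double for the balance, the choice to throw IllegalArgumentException when a precondition fails, and the decision that two accounts are equal when their account identifiers match are all assumptions of this example.

```java
class BankAccountSketch {
    private final String _account;   // no mutator: never changes after construction
    private double _balance;         // changed only through deposit() and withdraw()

    // pre: account is non-null, balance >= 0
    public BankAccountSketch(String account, double balance) {
        if (account == null || balance < 0)
            throw new IllegalArgumentException("bad account or balance");
        _account = account;
        _balance = balance;
    }

    public String getAccount() { return _account; }
    public double getBalance() { return _balance; }

    // pre: amount >= 0
    // post: balance is increased by amount
    public void deposit(double amount) {
        if (amount < 0) throw new IllegalArgumentException("negative deposit");
        _balance += amount;
    }

    // pre: 0 <= amount <= balance
    // post: balance is decreased by amount
    public void withdraw(double amount) {
        if (amount < 0 || amount > _balance)
            throw new IllegalArgumentException("bad withdrawal");
        _balance -= amount;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof BankAccountSketch
            && _account.equals(((BankAccountSketch) other)._account);
    }

    @Override
    public int hashCode() { return _account.hashCode(); }
}
```

There is deliberately no setBalance(): every change to the balance goes through a method that can check its precondition.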

The Student and BankAccount examples can be deceptive, as they tend to focus primarily on fields. In this sense they are more like database records than Java classes. Java/C#/C++ classes tend to focus on methods. (The class Point also is dominated by its fields.) Compare with the class Ratio, with nontrivial methods reduce() and gcd(). (The BankAccount class might potentially have some "nontrivial" methods added to move funds around that verified the money was there, but this simple example doesn't do that.)

Association

The Association class in §1.5 (class on p 16, example on p 15) is simply a "pair", <key,value>, where we provide accessors for both fields but a mutator only for the value. That is, we do not allow ourselves to change the key (however, we can create a new Association object with a new key). We also provide an equals() method.

Note that there are two constructors for this class. Furthermore, the single-parameter constructor calls the two-param constructor; if we restructure the underlying implementation only the two-param constructor needs to be rewritten.
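
A sketch of the constructor chaining just described follows. The use of generic type parameters K and V, the field names, and the decision to base equals() on the key alone are assumptions of this example rather than a transcription of Bailey's class.

```java
class AssociationSketch<K, V> {
    private final K _key;    // no mutator for the key
    private V _value;

    // pre: key is non-null
    public AssociationSketch(K key, V value) {
        if (key == null) throw new IllegalArgumentException("key must not be null");
        _key = key;
        _value = value;
    }

    // The one-parameter constructor just calls the two-parameter one.
    public AssociationSketch(K key) {
        this(key, null);
    }

    public K getKey()   { return _key; }
    public V getValue() { return _value; }
    public void setValue(V value) { _value = value; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof AssociationSketch)) return false;
        AssociationSketch<?, ?> that = (AssociationSketch<?, ?>) other;
        return _key.equals(that._key);    // assumption: equal keys mean equal associations
    }

    @Override
    public int hashCode() { return _key.hashCode(); }
}
```

If the internal representation ever changes, only the two-parameter constructor has to be revisited; the this(key, null) call keeps working as is.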

How does Association differ from Point?

Rectangle

Bailey's Rectangle class (page 22) contains two Point objects. There are mutators to set the left x-coordinate, and the width (and presumably the lower y-coordinate and the height). But none of these is directly a field mutator for the two internal Point objects with which the rectangle is constructed.

Classes based on shapes form one of the most common examples of an object hierarchy with polymorphism and inheritance. That is, there is a base class Shape, and then child classes, say, Rectangle and Triangle. We can create shapes as:

Shape s1 = new Rectangle(...);
Shape s2 = new Triangle(...);

and then draw the shape with

s1.draw();

What makes this work?
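
A sketch that can be run to experiment with this is below. To keep it free of any graphics library, draw() here returns a text description instead of drawing; that, and the constructor parameters, are assumptions of this example.

```java
abstract class Shape {
    abstract String draw();   // each child class supplies its own version
}

class Rectangle extends Shape {
    private final int _width;
    private final int _height;
    Rectangle(int width, int height) { _width = width; _height = height; }

    @Override
    String draw() { return "rectangle " + _width + "x" + _height; }
}

class Triangle extends Shape {
    private final int _base;
    private final int _height;
    Triangle(int base, int height) { _base = base; _height = height; }

    @Override
    String draw() { return "triangle, base " + _base + ", height " + _height; }
}

class ShapeDemo {
    public static void main(String[] args) {
        Shape s1 = new Rectangle(3, 2);
        Shape s2 = new Triangle(4, 5);
        // Both variables are declared as Shape; the version of draw() that runs
        // is chosen at run time from the actual class of the object.
        System.out.println(s1.draw());   // prints rectangle 3x2
        System.out.println(s2.draw());   // prints triangle, base 4, height 5
    }
}
```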

But the bulk of Bailey's Rectangle example does not involve inheritance. The primary goal of this section is how we might come up with a good interface for class Rectangle.

Note the drawOn() precondition that the window is a valid one. Note also the relatively rich set of operations. Note also that for left() and width() the accessor and the mutator have the same name! Java distinguishes between the two by the presence of the parameter. Some people find this approach helpful; others find it too confusing by half. The most common naming strategy is probably left() for the accessor and setLeft() for the mutator.
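
The same-name convention can be seen in a few lines. BoxSketch is a hypothetical, cut-down stand-in for Rectangle that stores only a left edge and a width as plain ints.

```java
class BoxSketch {
    private int _left;
    private int _width;

    BoxSketch(int left, int width) {
        _left = left;
        _width = width;
    }

    public int left() { return _left; }             // accessor: no parameter
    public void left(int x) { _left = x; }          // mutator: same name, one parameter

    public int width() { return _width; }           // accessor
    public void width(int w) { _width = w; }        // mutator

    public int right() { return _left + _width; }   // derived value, not a field
}
```

A call such as b.left() reads the value and b.left(10) sets it; Java picks the method by matching the argument list.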

Normally, drawing a Rectangle is a primitive operation in the graphics library.

Interfaces

See section 1.8 starting at p 23. Bailey starts with interface Structure, which at first appears to implement a basic list. Note, however, that there is no mechanism to access the ith element of the Structure; it is not a list because you cannot retrieve elements in list order.

We could then have class List extend Structure, and also class Set extend Structure.
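
In Java terms this might look like the sketch below. SimpleStructure is a hypothetical interface, much smaller than Bailey's Structure, and the list is expressed as a sub-interface that adds positional access; all names here are made up for the example.

```java
import java.util.ArrayList;

interface SimpleStructure {        // hypothetical, cut-down container interface
    int size();
    boolean contains(Object value);
    void add(Object value);
    // note: nothing here retrieves "the ith element"
}

interface SimpleList extends SimpleStructure {
    Object get(int i);             // a list adds access by position
}

class ArraySimpleList implements SimpleList {
    private final ArrayList<Object> _data = new ArrayList<>();

    public int size() { return _data.size(); }
    public boolean contains(Object value) { return _data.contains(value); }
    public void add(Object value) { _data.add(value); }
    public Object get(int i) { return _data.get(i); }
}
```

Code that needs only size(), contains() and add() can take a SimpleStructure parameter and will then work equally well with a list or with some future set implementation.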

Vector

See lists.html#vector.

Introduction to C++

Here are a few notes on this: Intro to C++

What about installing it?

Macs sometimes have xcode. Or you can get it at https://developer.apple.com/xcode/ (or maybe the Apple App Store).

For Windows, you can install MS Visual Studio, or MinGW. The link to the MSDNAA site for Visual Studio keeps changing; right now it seems to be called Microsoft Imagine and is at docs.cs.luc.edu/syshandbook/academic-alliances-programs.html.

Be sure to click register the first time you connect. Your account identifier is your Loyola email address, with the "@luc.edu".

Hangman

The Hangman example (with embedded class WordList) starts at page 18. What is different about the WordList class? How do words get accessed?

This is in §1.6; part of the goal here is the example in §1.8 of an Interface. On p. 20 is the basic interface of WordList as a standalone class. On pp 22-23, an interface Structure is defined and WordList is then declared to implement that interface:
public class WordList implements Structure
That's a Java/C# feature; C++ doesn't quite have "implements".

A Java/C# class can extend just one parent class, but can implement multiple interfaces. In particular, a WordList could extend, say, StrList, and also implement Structure.
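
The one-parent, many-interfaces rule can be sketched as follows. StrListSketch, Countable and Printable are hypothetical names invented for this example; they are not Bailey's StrList or Structure.

```java
import java.util.ArrayList;

interface Countable { int size(); }       // hypothetical interface
interface Printable { String show(); }    // hypothetical interface

class StrListSketch {                     // hypothetical parent class
    private final ArrayList<String> _words = new ArrayList<>();
    public void add(String w) { _words.add(w); }
    public int count() { return _words.size(); }
    public String wordAt(int i) { return _words.get(i); }
}

// One parent class, two interfaces.
class WordListSketch extends StrListSketch implements Countable, Printable {
    public int size() { return count(); }

    public String show() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(wordAt(i));
        }
        return sb.toString();
    }
}
```

A WordListSketch object can be passed wherever a StrListSketch, a Countable or a Printable is expected.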

C++ does in fact allow classes to extend from multiple parents; this is called multiple inheritance. The general case is not nearly as useful as one might think; most (almost all?) reasonable examples of multiple inheritance involve cases where all but one of the inheritances is really an "implement".

Big-O notation and Bailey Chapter 5: Analysis

See lists.html#bigO

Binary Search

See sorting.html#binsearch