Outline
* Announcements
* Test Overview
* Linked Lists
* Algorithmic Analysis
* Practice Problems
* Break
* Sorting
* Another Practice Problem
There is a midterm this Thursday. As announced in lecture, you may bring a one-page "cheat sheet".
There will be a lab on Wednesday, but no discussion on Thursday.
The test will likely focus on data structure and algorithm design and analysis. This may involve designing a structure or algorithm that has certain properties or meets certain requirements (on run time, memory usage, etc), analyzing a structure or algorithm (either one that you designed or one that was given to you), and/or considering the possible tradeoffs (between memory usage and computational cost, for example) to be made when working with various algorithms and data structures.
George has stated that he doesn't want the tests to be tricky or obscure. As part of this, he is allowing you a single page of notes, on which you can write down any details you find difficult to remember. The goal is to have you worry less about memorizing things and more about getting good at designing and working with algorithms and data structures.
What will the test cover? The short answer is: everything we have covered from the first day of class through the lecture and lab on the day before the test, plus all of the associated reading, projects, etc.
The idea is that all of this material should serve as a basis, as a set of tools, that you can use to work with algorithms and structures. The actual test may take on various forms. The bottom line, however, is that if you are familiar with all of the material, and get good at working with it (by practicing!), you should be able to take on anything.
The list you saw in lecture and the book, the one you wrote for hw3/proj2, the one I am about to present, and the one George may present tomorrow are all examples of different implementations of the same thing, based on different designs. That is, each version is driven by a different design goal, which is often associated with a language feature that makes it possible. My version relies on access modifiers, while George's uses inner classes.
In my version, all of the list-related classes are grouped into the same package. Access modifiers (both on classes and on fields within those classes) are used to ensure package-level encapsulation; users of the list package can only see the external interface I have designed, while the classes inside the list package can communicate low-level details through package-protected methods.
Below is the code for the circular, doubly linked list I wrote for my implementation of project 2. Please don't copy this code (doing so would be plagiarism, copyright infringement, and cheating).
package cs61b.awesome;

public class AwesomeList
{
    // All private
    private AwesomeListNode header;
    private int size;

    // Externally visible
    public AwesomeList()
    {
        header = new AwesomeListNode(null, AwesomeListNode.HEADER_NODE_TYPE);
        header.nextNode = header;
        header.prevNode = header;
        size = 0;
    }

    // Package protected
    void insertNode(AwesomeListNode currentNode, Object newContent)
    {
        AwesomeListNode newNode =
            new AwesomeListNode(newContent, AwesomeListNode.NORMAL_NODE_TYPE);
        AwesomeListNode currentNextNode = currentNode.nextNode;
        currentNode.nextNode = newNode;
        newNode.nextNode = currentNextNode;
        currentNextNode.prevNode = newNode;
        newNode.prevNode = currentNode;
        size++;
    }

    // Package protected
    Object removeNode(AwesomeListNode currentNode)
    {
        if (!(currentNode.isHeader()))
        {
            AwesomeListNode currentNextNode = currentNode.nextNode;
            AwesomeListNode currentPrevNode = currentNode.prevNode;
            currentPrevNode.nextNode = currentNextNode;
            currentNextNode.prevNode = currentPrevNode;
            // For good measure
            currentNode.nextNode = null;
            currentNode.prevNode = null;
            size--;
            return (currentNode.getContent());
        }
        else
        {
            return null;
        }
    }

    // Externally visible
    public int getSize()
    {
        return size;
    }

    // Externally visible
    public AwesomeListIterator iterator()
    {
        return (new AwesomeListIterator(this, header));
    }
}
Notice the package-protected insertNode() and removeNode() methods. These methods deal with all of the details of correctly inserting and removing nodes, and should be called by any externally visible method that needs to do so. With these methods in place, it would be easy enough to add methods to the AwesomeList class to insert and remove nodes at particular indices, or to insert and remove at the front or back of the list, as sketched below.
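For example, here is a rough sketch of a few such convenience methods. They would live inside AwesomeList and simply delegate to insertNode() and removeNode(). (The names insertFront(), insertBack(), and removeFront() are hypothetical, not part of the project spec.)

    // Externally visible (hypothetical convenience methods)
    public void insertFront(Object newContent)
    {
        // Inserting after the header places the node at the front
        insertNode(header, newContent);
    }

    public void insertBack(Object newContent)
    {
        // header.prevNode is the last node, so inserting after it appends
        insertNode(header.prevNode, newContent);
    }

    public Object removeFront()
    {
        // Returns null on an empty list, since removeNode() refuses
        // to remove the header
        return removeNode(header.nextNode);
    }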
package cs61b.awesome;

// Notice default protection here
class AwesomeListNode
{
    public static final int HEADER_NODE_TYPE = 1;
    public static final int NORMAL_NODE_TYPE = 2;

    // These are all public, but only within the package
    public Object content;
    public AwesomeListNode nextNode;
    public AwesomeListNode prevNode;
    private int nodeType;

    public AwesomeListNode(Object content, int nodeType)
    {
        this.content = content;
        this.nodeType = nodeType;
        nextNode = null;
        prevNode = null;
    }

    public Object getContent()
    {
        return content;
    }

    int getNodeType()
    {
        return nodeType;
    }

    boolean isHeader()
    {
        return nodeType == HEADER_NODE_TYPE;
    }
}
I used the Java ListIterator as a guide in designing the external interface of my iterator.
package cs61b.awesome;

public class AwesomeListIterator
{
    // All private
    private AwesomeListNode currentNode;
    private AwesomeList associatedList;

    // Externally visible
    public AwesomeListIterator(AwesomeList associatedList,
                               AwesomeListNode startNode)
    {
        this.associatedList = associatedList;
        currentNode = startNode;
    }

    // Externally visible
    public boolean hasNext()
    {
        return !currentNode.nextNode.isHeader();
    }

    // Externally visible
    public Object next()
    {
        // May return null if we are at the header
        currentNode = currentNode.nextNode;
        return (currentNode.getContent());
    }

    // Externally visible
    public boolean hasPrev()
    {
        return !currentNode.prevNode.isHeader();
    }

    // Externally visible
    public Object prev()
    {
        currentNode = currentNode.prevNode;
        return (currentNode.getContent());
    }

    // Externally visible
    public void insert(Object newContent)
    {
        associatedList.insertNode(currentNode, newContent);
    }

    // Externally visible
    public Object remove()
    {
        if (currentNode.isHeader())
        {
            return null; // nothing to remove at the header position
        }
        AwesomeListNode nodeToRemove = currentNode;
        // Back up first, so the iterator remains usable after the removal
        // (removeNode() nulls out the removed node's links)
        currentNode = currentNode.prevNode;
        return (associatedList.removeNode(nodeToRemove));
    }
}
Work on these questions individually and then in groups:
n^n
Question:
Design an algorithm that runs in O(n^n) time.
Answer:
Below is my solution:
public void foo(int n, int currentDepth)
{
    if (currentDepth == n)
    {
        int dummy = 1 + 1; // This is our "single operation". Just for good measure.
    }
    else
    {
        for (int i = 0; i < n; i++)
        {
            foo(n, currentDepth + 1);
        }
    }
}
Some students came up with the slightly incorrect version below. It turns out to have a running time of O(n!) rather than O(n^n) (draw out the call tree if you are unclear why this is).
public void foo(int n)
{
    if (n > 0)
    {
        for (int i = 0; i < n; i++)
        {
            foo(n - 1);
        }
    }
}
Other students came up with the clever and ultra-simple solution below. I still prefer my version, as it takes on a form that might actually occur in a usable program.
public void foo(int n)
{
    for (int i = 0; i < Math.pow(n, n); i++)
    {
        int dummy = 1 + 1; // our "single operation"
    }
}
Cycle Detection
Question:
Given a linked list, how can you detect whether there is a cycle in that list using only two additional pointers/iterators? Assume the list is singly linked and has no wrapper class. Focus mainly on designing a technique to find cycles; the details of detecting the end of the list (if there is no cycle) are fairly straightforward. Also assume that the cycle (if present) may be of any size. You may use as much time as needed, but your algorithm must eventually halt, and you may only use two pointers (no additional counters, etc).
Answer:
Begin by initializing the first pointer to the first element in the list and the second pointer to the second element. Then, as long as the two pointers are not equal (if they ever become equal, a cycle has been detected) and the end of the list has not been reached (if it is, there is no cycle), move the first pointer forward by one node and the second pointer forward by two nodes.
The intuition here is that, with each iteration, the distance between the two pointers grows by one. If there is a cycle, the second pointer will eventually "loop around" and catch up to the first. We can also be sure that the second pointer never skips over the first, because the gap between them changes by only one node per iteration.
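Here is a minimal sketch of this technique in Java, assuming a bare-bones, hypothetical Node class (just content and next fields) rather than the AwesomeList classes above:

// Hypothetical minimal node class, for illustration only
class Node
{
    Object content;
    Node next;
}

// Returns true if the list starting at head contains a cycle
public boolean hasCycle(Node head)
{
    if (head == null || head.next == null)
    {
        return false;
    }
    Node slow = head;      // moves one node per iteration
    Node fast = head.next; // moves two nodes per iteration
    while (slow != fast)
    {
        // Hitting null means we fell off the end: no cycle
        if (fast == null || fast.next == null)
        {
            return false;
        }
        slow = slow.next;
        fast = fast.next.next;
    }
    return true; // the pointers met, so they are inside a cycle
}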
As we look at each of the five sorting algorithms we have covered in class, let's (1) analyze their run times on various classes of input (sorted, almost sorted, all the same, random, and reverse sorted) and (2) look at their interesting aspects and important features.
Also check out the Homework 4 Solutions (especially problems 8.4, 8.5, and 8.6) for more information and analysis of the sorting algorithms.
Insertion Sort
Insertion sort is conceptually simple and easy to implement, and its analysis is fairly straightforward. Assuming an intelligent implementation, Insertion sort will run in O(n) on sorted, almost-sorted, and all-the-same input. By "intelligent implementation" I mean that, as elements are added to the current sorted sub array (which starts as just the first element in the array and eventually becomes the entire array), they are moved only if necessary to preserve sorted order. If no move is necessary, an intelligent implementation continues immediately to the next iteration, adding the next new element to the sorted sub array.
As we know from class and the book (and our own analysis), Insertion sort will run in O(n^2) on random and reverse-sorted input.
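Here is a sketch of such an intelligent implementation on an int array. The inner loop exits immediately whenever the new element already preserves sorted order, which is exactly what makes the sorted, almost-sorted, and all-the-same cases linear:

public void insertionSort(int[] array)
{
    for (int i = 1; i < array.length; i++)
    {
        int current = array[i];
        int j = i - 1;
        // Shift larger elements right; this loop exits immediately
        // if current is already in place (the O(n) cases)
        while (j >= 0 && array[j] > current)
        {
            array[j + 1] = array[j];
            j--;
        }
        array[j + 1] = current;
    }
}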
|           | Sorted | Almost Sorted | All The Same | Random | Reverse Sorted |
| Insertion | O(n)   | O(n)          | O(n)         | O(n^2) | O(n^2)         |
Selection Sort
Selection sort, an overly simplistic sort with the worst running time of those we have looked at, always runs in O(n^2). In fact, it always performs exactly the same number of operations for a given input size.
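A sketch, for reference; notice that the nested loops below execute the same number of times no matter what the input looks like:

public void selectionSort(int[] array)
{
    for (int i = 0; i < array.length - 1; i++)
    {
        // Find the smallest element in the unsorted portion...
        int minIndex = i;
        for (int j = i + 1; j < array.length; j++)
        {
            if (array[j] < array[minIndex])
            {
                minIndex = j;
            }
        }
        // ...and swap it into place
        int temp = array[i];
        array[i] = array[minIndex];
        array[minIndex] = temp;
    }
}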
|           | Sorted | Almost Sorted | All The Same | Random | Reverse Sorted |
| Insertion | O(n)   | O(n)          | O(n)         | O(n^2) | O(n^2)         |
| Selection | O(n^2) | O(n^2)        | O(n^2)       | O(n^2) | O(n^2)         |
Shell Sort
When Shell sort runs, each element of the array being sorted belongs to two groups. The first group is the sub array to which the element belongs; these sub arrays are the things actually being sorted by the Insertion sort at the heart of Shell sort. For some gapSize, there will be gapSize sub arrays. Conceptually, each element is also a member of a "gap group": a group of gapSize adjacent elements from the array.
gapSize = 3, n = 9
Index:       0 1 2   3 4 5   6 7 8
Sub Array:   A B C | A B C | A B C
Element:     8 3 0 | 1 8 3 | 2 7 6
Gap Group:     1   |   2   |   3
After a single iteration of Shell sort, with gapSize = 3, here's what the array looks like:
gapSize = 3, n = 9
Index:       0 1 2   3 4 5   6 7 8
Sub Array:   A B C | A B C | A B C
Element:     1 3 0 | 2 7 3 | 8 8 6
Gap Group:     1   |   2   |   3
Notice that each sub array is now sorted. This is a basic property of Shell sort. Notice also that, in general, the elements in the lower numbered gap groups are less than the elements in the higher numbered gap groups, even after just a single iteration of the algorithm. This is an example of the "bulk phase" of Shell sort, during which the elements of the array are moved around quickly but imprecisely. The goal during this phase, which takes place with the larger gapSizes, is to quickly get the smaller elements into the lower gap groups and the larger elements into the higher gap groups. Look at how quickly the value 8, originally at index 0, was moved to index 6. With Insertion sort, this move alone would have required 6 swaps; here it took only 2. Certainly, index 6 is not the final location of the value 8, but it is close.

This is where the second phase of Shell sort, the "precision phase", becomes important. In this phase, which takes place with the smaller gapSizes (all the way down to gapSize = 1), elements are moved short distances to make sure the resulting array is precisely sorted. A key aspect of Shell sort is that these later "precision" iterations can take advantage of the work done by the earlier "bulk" iterations. In this way, the running time of the internal Insertion sort during the final iterations of Shell sort is reduced to near linear time.
For descriptions of how the following run times were obtained for the sorted, all-the-same, and reverse sorted cases, see the Homework 4 Solutions. We will take the almost sorted case to have run time similar to the sorted case. Due to the difficulty of analyzing Shell sort, the run time for the random case has been obtained experimentally. Note that the run time included in the table for the random case assumes the most efficient gapSize sequence, which goes from n / 2 to 1, dividing by 2.2 at each iteration. There are other gapSize sequences that have different associated run times.
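For concreteness, here is a sketch of Shell sort using the gapSize sequence just described (start at n / 2, divide by 2.2 at each iteration, and always finish with a pass at gapSize = 1):

public void shellSort(int[] array)
{
    int gapSize = array.length / 2;
    while (gapSize >= 1)
    {
        // One Insertion sort pass over each of the gapSize sub arrays
        for (int i = gapSize; i < array.length; i++)
        {
            int current = array[i];
            int j = i - gapSize;
            while (j >= 0 && array[j] > current)
            {
                array[j + gapSize] = array[j];
                j -= gapSize;
            }
            array[j + gapSize] = current;
        }
        // Divide by 2.2, but make sure the final pass uses gapSize = 1
        gapSize = (gapSize == 1) ? 0 : Math.max(1, (int) (gapSize / 2.2));
    }
}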
|           | Sorted     | Almost Sorted | All The Same | Random     | Reverse Sorted |
| Insertion | O(n)       | O(n)          | O(n)         | O(n^2)     | O(n^2)         |
| Selection | O(n^2)     | O(n^2)        | O(n^2)       | O(n^2)     | O(n^2)         |
| Shell     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(n^(7/6)) | O(n^2)         |
A fundamental requirement of Shell sort is that the last iteration must execute with gapSize = 1, so that the internal Insertion sort can do a complete sweep and make sure every element is in its precise place. But what if we only run Shell sort with gapSize = 1? That is, what if the entire gapSize sequence is just "1"? In this case, Shell sort decomposes into the internal Insertion sort, and runs in whatever time Insertion sort takes on the given input.
Merge Sort
The analysis of Merge sort is straightforward: Merge sort always runs in O(nlog(n)). In fact, like Selection sort, it always requires exactly the same number of operations. Unlike Selection sort, Merge sort is efficient. It might therefore be useful in an environment where efficient sorting is desired, but the run time variance of Shell or Quick sort is not acceptable.
Below is a tree describing the run time of Merge Sort:
           -----------              Cost:
           |    n    |          n     |
           -----------                |
             |     |                  |
            \|/   \|/                 |
       -------     -------            |
       | n/2 |     | n/2 |      n     |   log_2(n)
       -------     -------            |   levels
        |   |       |   |             |
       \|/ \|/     \|/ \|/  .....     |
     -----  -----                     |
     |n/4|  |n/4|   .....       n    \|/
     -----  -----
At each level, the cost is n; in order to merge the results from the lower levels, we must look at each of the n elements at least once. There will be a total of log(n) levels, so the total cost is O(nlog(n)).
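For reference, here is a minimal sketch of the standard top-down Merge sort described by the tree above, sorting the half-open range array[low..high) and using a temporary array for the O(n) merge at each level:

public void mergeSort(int[] array, int low, int high)
{
    if (high - low <= 1)
    {
        return; // zero or one element: already sorted
    }
    int mid = low + (high - low) / 2;
    mergeSort(array, low, mid);
    mergeSort(array, mid, high);
    // Merge the two sorted halves, looking at each element once
    int[] merged = new int[high - low];
    int left = low;
    int right = mid;
    for (int k = 0; k < merged.length; k++)
    {
        if (right >= high || (left < mid && array[left] <= array[right]))
        {
            merged[k] = array[left++];
        }
        else
        {
            merged[k] = array[right++];
        }
    }
    System.arraycopy(merged, 0, array, low, merged.length);
}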
A question was asked last time about increasing the base of the logarithm by increasing the number of recursive calls made at each level. The idea was that each level would still cost n, but we could reduce the total number of levels.
            -----------               Cost:
         ---|    n    |---        n     |
         |  -----------  |              |
         |   |       |   |              |   log_4(n)
        \|/ \|/     \|/ \|/             |   levels
      -----  -----  -----  -----        |
      |n/4|  |n/4|  |n/4|  |n/4|  n    \|/
      -----  -----  -----  -----
       ...    ...    ...    ...
Suppose we choose b as our base. There will then be log_b(n) levels in the tree. The problem is that each level now costs bn. When b was 2, the constant factor was negligible. But now that it is variable, and may take on large values, we cannot neglect it. This is not just a matter of mathematical strictness; comparing 4 or 8 or 16 numbers during merging takes 2 or 4 or 8 times as long, respectively, as comparing 2.
So, with variable base b, there will be log_b(n) levels, each costing bn, for a total cost of bnlog_b(n). If the goal was to reduce the depth of the tree, why not evaluate this version of Merge sort in the extreme case, where b = n and the depth of the tree is 1? Here, the run time will be n * n * log_n(n) = n^2 * 1 = n^2. The conclusion is that, as b gets closer to n, the run time of this version of Merge sort approaches n^2. Interestingly, in this extreme case, Merge sort actually decomposes into Selection sort; each child contains 1 element, and each iteration of the merging algorithm searches for the next smallest element to place in the final, sorted array.
|           | Sorted     | Almost Sorted | All The Same | Random     | Reverse Sorted |
| Insertion | O(n)       | O(n)          | O(n)         | O(n^2)     | O(n^2)         |
| Selection | O(n^2)     | O(n^2)        | O(n^2)       | O(n^2)     | O(n^2)         |
| Shell     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(n^(7/6)) | O(n^2)         |
| Merge     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(nlog(n)) | O(nlog(n))     |
Quick Sort
The performance of Quick sort depends entirely on how evenly each pivot partitions the other elements. The best pivots will always split the elements in the current sub portion of the array in half. In the worst case, each pivot causes all the other elements to go into just one partition.
How do we choose a good pivot? Always using the element at index n/2 (that is, the element halfway through the array) is simple and works well in the sorted, almost sorted, all the same, random, and reverse sorted cases. The median-of-three method, which does some "mini statistical sampling" by looking at the first, middle, and last elements and choosing the median of those as the pivot, works equally well. For further analysis, check out the Homework 4 Solutions (especially problems 8.4, 8.5, and 8.6).
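Here is a sketch of just the pivot-selection step (not a full Quick sort). It returns the index of whichever of the first, middle, and last elements holds the median value:

// low and high are inclusive indices into the current sub array
public int medianOfThree(int[] array, int low, int high)
{
    int mid = low + (high - low) / 2;
    int a = array[low];
    int b = array[mid];
    int c = array[high];
    if ((a <= b && b <= c) || (c <= b && b <= a))
    {
        return mid; // b is the median
    }
    if ((b <= a && a <= c) || (c <= a && a <= b))
    {
        return low; // a is the median
    }
    return high; // c is the median
}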
Pivot selection is totally analogous to playing the game 20 questions. The wrong way to play 20 questions is to ask "Is it a kangaroo? Is it a computer? Is it yogurt? ...". Each such question eliminates only a single possibility, and is almost certainly answered "no". This is exactly like choosing a pivot that causes all of the other elements to fall into just one of the partitions. The best way to play 20 questions is to ask questions that split the remaining set of possibilities exactly in half. This is just like picking pivots that split the elements in the current sub array in half.
|           | Sorted     | Almost Sorted | All The Same | Random     | Reverse Sorted |
| Insertion | O(n)       | O(n)          | O(n)         | O(n^2)     | O(n^2)         |
| Selection | O(n^2)     | O(n^2)        | O(n^2)       | O(n^2)     | O(n^2)         |
| Shell     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(n^(7/6)) | O(n^2)         |
| Merge     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(nlog(n)) | O(nlog(n))     |
| Quick     | O(nlog(n)) | O(nlog(n))    | O(nlog(n))   | O(nlog(n)) | O(nlog(n))     |
Big Ideas
Decompositions - I find the cases where one sort decomposes into another particularly interesting. Specifically, Shell sort with a gapSize sequence that only contains gapSize = 1 becomes Insertion sort, Merge sort with a base of n becomes Selection sort, and Quick sort with the worst pivot choices becomes Insertion sort.
Tunables - Also interesting is the effect "tuning" has on the run times of the sorts. Specifically, the technique used to choose the pivot for Quick sort can vary the algorithm's run time from O(n^2) (with bad pivots) to O(nlog(n)) (with good pivots, as chosen by the n/2 or median-of-three techniques). Similarly, different gapSize sequences can cause Shell sort's run time to vary from O(n^(7/6)) to O(n^2).
Duplicate String Detection
Question:
You are given n strings, each of up to length n. Your job is to find duplicates. Don't worry too much about exactly what to do when you find a duplicate (printing out something like "String at index X and string at index Y are the same" is sufficient for now).
1. Begin by determining Big Omega for the problem.
2. Then come up with one or more solutions and analyze their running times. Be detailed in your analyses; consider lower order terms. Also, don't take any operations for granted; determining the length of a single string of length n is an O(n) operation.
Answer:
1. The Big Omega for this problem is n^2. At the very least, we need to look at each character of each string at least once, and there are n strings of up to n characters each. As it turns out, there is an n^2 solution.
2. There are a number of possible algorithms. They are described and analyzed below:
The straightforward algorithm compares each string to each other string. This involves comparing n strings to n other strings, with the comparison costing O(n) (the characters need to be compared one by one). The total cost here is n^3 = O(n^3).
We might try to improve performance by sorting the strings (or pointers to the strings) by their lengths and then only comparing strings of the same length. Calculating the length of n strings, each of up to length n, costs n^2. Sorting these lengths costs nlog(n). In the best case, each of the n strings will be a different length and "comparing them" (by scanning through the array of lengths and seeing that none are the same) will cost n. However, in the worst case, where all of the strings are the same length, calculating and sorting their lengths has gained us nothing. Here we must fall back to comparing each string to each other string, which costs n^3. So, in the best case, the cost will be n^2 + nlog(n) + n = O(n^2). In the worst case, the cost will be n^2 + nlog(n) + n^3 = O(n^3).
We might now try, in a similar fashion to the previous solution, to sort the strings based on their first character or first couple of characters. The actual implementation of this solution might encode the first or first couple of characters into a number and then sort these numbers. While analysis of this solution shows that it has O(n^3) worst case run time (when the first couple of characters are the same in all the strings), the basic idea is a good one. Why not extend it and encode the entire string into a number? This is exactly what a hash function can do.
A note: The correctness of this solution relies on the existence of a hash function that will not (or will only very rarely) produce the same hash code for two different inputs. Hash functions with this property exist and are called cryptographic hash functions.
We begin by computing and storing the hash code for each of the strings. Computing the hash code for a single string requires constant-time operations on each of the string's n characters. There are n strings total, so the cost of this phase is n^2.
At this point we have a couple of choices on how to proceed. The most straightforward solution simply compares each hash code to each other hash code. This will cost n^2, so the overall cost will be n^2 + n^2 = O(n^2).
A more efficient solution first sorts the hash codes (which costs nlog(n)). Adjacent hash codes are then compared. If they are the same, then the duplicates are eliminated. This last phase, no matter the input, will cost n. Therefore, this version of the algorithm costs n^2 + nlog(n) + n = O(n^2).
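Here is a sketch of that last version, using Java's built-in String.hashCode() as a stand-in for the cryptographic hash function mentioned above. (String.hashCode() does produce collisions, so a real implementation would confirm each match with a direct string comparison.)

import java.util.Arrays;
import java.util.Comparator;

public void findDuplicates(String[] strings)
{
    int n = strings.length;
    // Phase 1: hash every string. Each hash looks at all (up to) n
    // characters of one string, so this phase costs n^2.
    final int[] hashCodes = new int[n];
    for (int i = 0; i < n; i++)
    {
        hashCodes[i] = strings[i].hashCode();
    }
    // Phase 2: sort the original indices by hash code. Costs nlog(n).
    Integer[] indices = new Integer[n];
    for (int i = 0; i < n; i++)
    {
        indices[i] = i;
    }
    Arrays.sort(indices, new Comparator<Integer>()
    {
        public int compare(Integer x, Integer y)
        {
            return Integer.compare(hashCodes[x], hashCodes[y]);
        }
    });
    // Phase 3: compare adjacent hash codes. Costs n.
    for (int i = 1; i < n; i++)
    {
        if (hashCodes[indices[i - 1]] == hashCodes[indices[i]])
        {
            System.out.println("String at index " + indices[i - 1]
                + " and string at index " + indices[i] + " are the same");
        }
    }
}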
The key feature of any of the O(n^2) solutions involving hash codes is the decoupling of the hash code computation phase from the comparison phase. This is a good example of "representation": we have taken something we don't know how to deal with (an array of strings) and converted it (by hashing) into something we do know how to deal with (an array of numbers). We then applied familiar techniques for working with the known thing (sorting, duplicate elimination) to solve the problem with the unknown thing.