Searching and Sorting

Bailey, Chapter 6, p 119

Morin, Chapter 11, p 225 (O(n log n) sorts only)

Binary Search

Suppose you have an array A of size N that is sorted, so i<j => A[i]≤A[j]. Then finding an element can be done in time O(log N).

Generally speaking, O(log N) means "growing only very slowly with N". Casually, O(log N) can be seen as "almost constant".

N	log₂(N)
100	7
1000	10
100,000	17
1,000,000	20
1,000,000,000	30

It doesn't really matter what base we use; a change of base just introduces a constant of proportionality. It is often convenient for visualization, however, to use base 2.

To search for value X in O(log N) time, we keep dividing the array in half:

lo=0; hi=N-1;
while (lo < hi) {
   mid = (lo+hi)/2;
   if (X<A[mid]) hi = mid-1;        // search A[lo]...A[mid-1]
   else if (X>A[mid]) lo=mid+1;      // search A[mid+1]...A[hi]
   else lo = hi = mid;              // found
}
// X is present if and only if lo==hi and A[lo]==X

Suppose this is the array, and we are searching for X=11.

0	1	2	3	4	5	6	7
2	3	5	8	13	21	34	55

Initially we have lo=0 and hi=7, so mid=3. X>A[3], so we set lo=4.

Now lo=4 and hi=7, so mid=5. We have X<A[5], so hi=4, at which point the loop stops with lo=hi=4. Since A[4]=13≠11, X is not in the array.

Now let us search for X=25:

lo=0, hi=7: mid=3, X>A[mid] so lo=mid+1
lo=4, hi=7: mid=5, X>A[mid] so lo=mid+1
lo=6, hi=7: mid=6, X<A[mid] so hi=mid-1
lo=6, hi=5

Note in this case the loop terminates with lo>hi.

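It is important to understand why the number of times through the loop here is at most about log₂(N): each pass cuts the range lo..hi to at most half its previous size, and a range of size N can be halved only about log₂(N) times before it shrinks to a single element.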

Sometimes it is helpful to introduce a loop invariant here: the statement that either X is to be found in the range A[lo]..A[hi], or else X is not present in A. With this in mind, we can eliminate the second comparison above (X>A[mid]).

This is one of relatively few elementary examples of a loop that is hard to write correctly without an invariant.

Another thing to keep in mind here is that lo≤mid<hi. However, lo==mid will occur if hi=lo+1. So in the following loop, we arrange for the search alternatives to be lo..mid and mid+1..hi. If we instead arrange for the search alternatives to be lo..mid-1 and mid..hi, then the loop can fail to terminate! That is, we can have hi=lo+1, so mid=lo, and so searching mid..hi is the same as searching lo..hi, and so we keep searching lo..hi forever.

lo=0; hi=N-1;
while (lo < hi) {
   mid = (lo+hi)/2;               // ranges to be searched should be lo..mid and mid+1..hi, both of which are SMALLER than lo..hi
   if (X>A[mid]) lo = mid+1;    // search A[mid+1]...A[hi]
   else hi = mid;               // search A[lo]...A[mid]
}
// now lo==hi (if N>0); X is present if and only if A[lo]==X
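The loop can be wrapped into a complete method. The following is a minimal sketch, assuming an int array sorted in ascending order; the names BinarySearchDemo and Find are hypothetical and are not taken from the course files. It returns the index of X, or -1 if X is absent.

```csharp
public static class BinarySearchDemo {
    // Returns an index i with A[i] == X, or -1 if X is not in A.
    // Assumes A is sorted in ascending order.
    public static int Find(int[] A, int X) {
        if (A.Length == 0) return -1;
        int lo = 0, hi = A.Length - 1;
        // invariant: if X is in A at all, it is in A[lo]..A[hi]
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;   // same value as (lo+hi)/2, but cannot overflow
            if (X > A[mid]) lo = mid + 1;   // search A[mid+1]..A[hi]
            else hi = mid;                  // search A[lo]..A[mid]
        }
        // here lo == hi, so exactly one candidate remains
        return A[lo] == X ? lo : -1;
    }
}
```

With the eight-element array used earlier, Find(A, 13) should give 4, while Find(A, 11) and Find(A, 25) should both give -1.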

Sorting

Bubble sort: Bailey p 120
Each time we pass through, we swap adjacent elements if they are out of order. Note that the same value may be carried forward in several swaps. Bubble sort needs up to N-1 passes of up to N-1 comparisons each, so it is O(N²); there is no good reason to believe it is reasonably efficient.
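As an illustration of the passes described here, this is a minimal bubble sort sketch for an int array. It is not Bailey's code; the names BubbleSortDemo and Sort are hypothetical.

```csharp
public static class BubbleSortDemo {
    // Sorts A in place in ascending order.
    public static void Sort(int[] A) {
        for (int end = A.Length - 1; end > 0; end--) {
            // one pass: after it, the largest of A[0]..A[end] has been carried to A[end]
            for (int i = 0; i < end; i++) {
                if (A[i] > A[i + 1]) {
                    // adjacent pair is out of order, so swap it
                    int tmp = A[i];
                    A[i] = A[i + 1];
                    A[i + 1] = tmp;
                }
            }
        }
    }
}
```

The inner loop runs N-1, N-2, ..., 1 times, which is N(N-1)/2 comparisons in total regardless of the input.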

Selection sort, Bailey p 122
    Find the biggest, move it to the Nth position (data[N-1])
    Find the second biggest (that is, the biggest of positions 0 up to N-2) and move it to the (N-1)th position (data[N-2]).
    Continue until only data[0] remains.

Here is code for this for StrList; it works from the other end, repeatedly swapping the smallest remaining element to the front. Note the need to use String.Compare(s1,s2) rather than <, as C# does not define < for strings.

    public void ssort() {
        for (int i = 0; i<currsize-1; i++) {
            // find smallest of elements[i]..elements[currsize-1] and swap to position i
            int index_min = i;
            string curr_min_val = elements[i];
            for (int j = i+1; j<currsize; j++) {
                if (String.Compare(elements[j],curr_min_val) < 0) {
                    curr_min_val = elements[j];
                    index_min = j;
                }
            }
            swap(i,index_min);
        }
    }

For the TList<T> version, we need two things. First, the TList<T> class must require that T implement the IComparable interface:

class TList<T> where T : IComparable { ... }

Second, we then need to use CompareTo() in ssort():

    public void ssort() {
        for (int i = 0; i<currsize-1; i++) {
            // find smallest of elements[i]..elements[currsize-1] and swap to position i
            int index_min = i;
            T curr_min_val = elements[i];
            for (int j = i+1; j<currsize; j++) {
                if (elements[j].CompareTo(curr_min_val)<0) {
                    curr_min_val = elements[j];
                    index_min = j;
                }
            }
            swap(i,index_min);
        }
    }

Demo: ssort.cs, which uses IntList.cs and comparisons using the < operator.

    public static void Main(string[] args) {
        if (args.Length > 0) {
            LISTSIZE = Convert.ToInt32(args[0]);
            Console.WriteLine("List size is {0}", LISTSIZE);
        }
        nums = new IntList(LISTSIZE);
        nums.RandomFill();
        nums.ssort();
    }

In the ssort() method we use Stopwatch to record the time:

    // selection sort
    public void ssort() {
	Stopwatch s = new Stopwatch();	// needs "using System.Diagnostics;"
	s.Start();
	for (int i = 0; i<currsize-1; i++) {
	    // find smallest of elements[i]..elements[currsize-1] and swap to position i
	    int index_min = i;
	    int curr_min_val = elements[i];
	    for (int j = i+1; j<currsize; j++) {
		if (elements[j] < curr_min_val) {
		    curr_min_val = elements[j];
		    index_min = j;
		}
	    }
	    swap(i,index_min);
	}
	s.Stop();
	Console.WriteLine("sorting took {0} milliseconds", s.ElapsedMilliseconds);
    }

Does the time appear to be quadratic?
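One way to approach this question without timer noise is to count comparisons instead of milliseconds. The sketch below is an assumption-laden stand-in for the demo: it sorts a plain int[] rather than an IntList, and the names SelectionSortCount and SortAndCount are hypothetical.

```csharp
public static class SelectionSortCount {
    // Selection-sorts data in place and returns the number of element comparisons made.
    public static long SortAndCount(int[] data) {
        long comparisons = 0;
        for (int i = 0; i < data.Length - 1; i++) {
            // find smallest of data[i]..data[N-1] and swap it to position i
            int indexMin = i;
            for (int j = i + 1; j < data.Length; j++) {
                comparisons++;
                if (data[j] < data[indexMin]) indexMin = j;
            }
            int tmp = data[i];
            data[i] = data[indexMin];
            data[indexMin] = tmp;
        }
        return comparisons;
    }
}
```

The count is always (N-1)+(N-2)+...+1 = N(N-1)/2, so N=1000 gives 499,500 comparisons and N=2000 gives 1,999,000. Doubling N roughly quadruples the work, which is what quadratic growth should look like in the timings as well.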

Insertion sort, Bailey p 125
    for (i=1; i<N; i++)
        insert data[i] into the sorted portion data[0]..data[i-1].
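The insertion step can be sketched as follows for an int array. This is a minimal illustration, not Bailey's code; the names InsertionSortDemo and Sort are hypothetical.

```csharp
public static class InsertionSortDemo {
    // Sorts data in place in ascending order.
    public static void Sort(int[] data) {
        for (int i = 1; i < data.Length; i++) {
            int value = data[i];    // the element to insert into data[0]..data[i-1]
            int j = i - 1;
            // shift larger elements of the sorted portion one slot to the right
            while (j >= 0 && data[j] > value) {
                data[j + 1] = data[j];
                j--;
            }
            data[j + 1] = value;    // drop the element into the gap
        }
    }
}
```

On data that is already sorted the inner while loop never runs, so the cost is about N comparisons; on reversed data it is about N²/2.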

Merge sort

Basic idea: split, and merge.

Here it is interpreted in dance: https://www.youtube.com/watch?v=XaqR3G_NVoo

To get an estimate of the running time, we count comparisons: there are log(n) stages, and all the merges at each stage take O(n) together. Total: O(n log n)

One strategy: merge into a temp array T, then copy back into A.
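A minimal sketch of this strategy, assuming int data and a single temp array T allocated once up front. The class and method names are hypothetical, and this is not the code in msort.cs or sorters.cs.

```csharp
public static class MergeSortDemo {
    // Sorts A in place (using one temp array of the same size).
    public static void Sort(int[] A) {
        int[] T = new int[A.Length];
        Sort(A, T, 0, A.Length - 1);
    }

    private static void Sort(int[] A, int[] T, int left, int right) {
        if (left >= right) return;              // zero or one element: already sorted
        int mid = left + (right - left) / 2;
        Sort(A, T, left, mid);                  // split ...
        Sort(A, T, mid + 1, right);
        Merge(A, T, left, mid, right);          // ... and merge
    }

    // Merges sorted runs A[left..mid] and A[mid+1..right] into T, then copies back into A.
    private static void Merge(int[] A, int[] T, int left, int mid, int right) {
        int i = left, j = mid + 1, k = left;
        while (i <= mid && j <= right) {
            if (A[j] < A[i]) T[k++] = A[j++];
            else T[k++] = A[i++];               // ties are taken from the left run
        }
        while (i <= mid) T[k++] = A[i++];       // leftovers from the left run
        while (j <= right) T[k++] = A[j++];     // leftovers from the right run
        for (k = left; k <= right; k++) A[k] = T[k];
    }
}
```

Each call to Merge makes at most right-left comparisons, which is where the O(n) per stage in the estimate comes from.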

Book's strategy: a little different; requires copying only half the array. How much does the copying slow things down? One way to investigate this would be to copy the entire array twice, to see the slowdown.

Another strategy for reducing the copying of T is the "back-and-forth" method: merging from A into T at one stage, and then from T into A at the next. This is easier for the nonrecursive version.
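The back-and-forth idea can be sketched in a nonrecursive sort like this. It is an illustration under assumptions (int data, hypothetical names IterativeMergeSortDemo, Sort and Merge), not the IterativeMergeSort from the course files.

```csharp
using System;

public static class IterativeMergeSortDemo {
    // Sorts A in ascending order, alternating the roles of A and a temp array.
    public static void Sort(int[] A) {
        int n = A.Length;
        int[] src = A, dst = new int[n];
        for (int width = 1; width < n; width *= 2) {
            // merge adjacent runs of length width from src into dst
            for (int left = 0; left < n; left += 2 * width) {
                int mid = Math.Min(left + width, n);
                int right = Math.Min(left + 2 * width, n);
                Merge(src, dst, left, mid, right);
            }
            // swap roles: the next stage merges in the other direction
            int[] tmp = src; src = dst; dst = tmp;
        }
        // after an odd number of stages the sorted data is in the temp array
        if (!object.ReferenceEquals(src, A)) Array.Copy(src, A, n);
    }

    // Merges src[left..mid-1] and src[mid..right-1] into dst[left..right-1].
    private static void Merge(int[] src, int[] dst, int left, int mid, int right) {
        int i = left, j = mid, k = left;
        while (i < mid && j < right) {
            if (src[j] < src[i]) dst[k++] = src[j++];
            else dst[k++] = src[i++];
        }
        while (i < mid) dst[k++] = src[i++];
        while (j < right) dst[k++] = src[j++];
    }
}
```

The only full-array copy is the final one, and it is needed only when the number of stages is odd.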

Count comparisons: log(n) steps, and all the merges at each step take O(n) together (more precisely, we have to merge lists whose total length is n). Total: O(n log n)

IterativeMergeSort: how it works. Issues with merge.

The optimization to avoid copying

1. Change to RecursiveMergeSort to make it a little easier to use the merge code directly in IterativeMergeSort

2. Compare
    T[k]=A[i];
    k++;
    i++;
with
    T[k++] = A[i++];

demos: sorting/msort.cs (must be compiled with sorters.cs)

Iterative versus Recursive mergesort

Quicksort (Bailey p 131)

The first sentence of Bailey section 6.5 is remarkably opaque:

Since the process of sorting numbers consists of moving each value to its ultimate location in the sorted array, we might make some progress toward a solution if we could move a single value to its ultimate location.

The idea that speed depends on getting values to their ultimate location early on is demonstrably false; see mergesort.

Nonetheless, the Quicksort algorithm is about as fast as sorting gets. Quicksort is like recursive mergesort in that we divide the data into two pieces and sort each piece. Unlike mergesort, the pieces may not be the same size. But even more unlike mergesort, the elements in the first piece are all less than the elements in the second piece, so once the pieces are sorted there is nothing more to do. There is no "merge" step.

The basic idea behind quicksort is straightforward:

On a single pass of a section of the array (usually up from the left end and down from the right, until they meet), divide the array section into a "low subsection" and a "high subsection". Whenever you find a big element in the low subsection and a small element in the high subsection, swap them.
Recursively sort those two subsections.

Here is the "simplest" partition strategy. It has a flaw. We take a number called the pivotvalue, and divide A[left]...A[right] into two sections, A[left]..A[mid-1] and A[mid]..A[right], so that the first section contains values less than the pivotvalue and the second section contains values greater than or equal to the pivotvalue. The method returns mid. Here is the code:

    private static int simple_partition(int [] A, int pivotvalue, int left, int right) {
        while (true)
        {
            while (left < right && pivotvalue <= A[right]) right--;
	    // now left == right or A[right] < pivotvalue

            while (left < right && A[left] < pivotvalue) left++;
 	    // now left == right or pivotvalue <= A[left]
            if (left < right) swap(A,left,right);
	    else if (A[right] < pivotvalue) return right+1;
            else return right;	// left == right and pivotvalue <= A[right]
        }
    }

This and other code can be found in qsort.cs.

One pass through the loop decrements right until it finds a value < pivotvalue, and increments left until it finds a value >= pivotvalue. The values are then swapped. When left and right finally meet, the method returns an index mid such that if i<mid then A[i] < pivotvalue, and if i>=mid then A[i] >= pivotvalue. That is, A[left]..A[right] is divided into two sections A[left]..A[mid-1] and A[mid]..A[right].

There is some slight trickiness when left and right meet. If they meet at the end of the first inner while loop, we might have pivotvalue <= A[right] or might not; this is reflected in the test at the end.

Quicksort now looks like this:

    private static void quickSortRecursive1(int [] A, int left, int right)
    // pre: left <= right
    // post: A[left..right] in ascending order
    {
        int pivotindex;   // start of the high section, as returned by simple_partition()
        if (left >= right) return;
        int pval = (A[left]+A[right])/2;
        pivotindex = simple_partition(A,pval,left,right);    /* 1 - partition */
        quickSortRecursive1(A,left,pivotindex-1); /* 2 - sort small */
        quickSortRecursive1(A,pivotindex,right);/* 3 - sort large */
        /* done! */
    }

What is the flaw?

If simple_partition() returns left, then the second recursive call is quickSortRecursive1(A, left, right); that is, we have infinite-depth recursion. This will happen if, for example, pivotvalue equals the minimum value in the array segment. If all the values from A[left] to A[right] are equal, this will happen for any reasonable choice of pivotvalue.

Note that if simple_partition() should return right, the two calls are qSR(A,left,right-1) and qSR(A,right,right); the recursive subcalls are strictly shorter.

The usual way of fixing this is to pick a specific value known to be in the array as the pivotvalue, and also to make sure that, at the end, if the return value is mid, then A[mid] = pivotvalue. This means the recursive calls are qSR(A,left,mid-1) and qSR(A,mid+1,right); these are "safe" even if mid==left or mid==right.

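The most common choice of pivotvalue is A[left]. However, if we want to choose A[i] as the pivotvalue for left<i≤right, we can first swap A[i] with A[left], as in the commented-out lines below that pick the index at random. Here is Bailey's partition: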

    private static int bailey_partition(int [] A, int left, int right) {
    // pre: left <= right
    // post: A[left] placed in the correct (returned) location

	// Random r = new Random();		// needs "using System;"
	// int index = r.Next(left, right+1);
	// swap(A, index, left);
        int pivotvalue = A[left];

        while (true)
        {
            // move right "pointer" toward left
            while (left < right && pivotvalue < A[right]) right--;
            if (left < right) swap(A,left++,right);
            else return left;	// left == right == pivot
	    // now pivotvalue = A[right]
            // move left pointer toward right
            while (left < right && A[left] < pivotvalue) left++;
            if (left < right) swap(A,left,right--);
		// after, A[left] == pivotvalue again
            else return right;	// left == right == pivot
        }
    }

Horvick, part 1 p 111 (final page of part 1) has an even simpler strategy. The pivotIndex value is chosen randomly between left and right, inclusive, by the caller. (Horvick's code is for a generic type T; I've replaced that with int. To compare generic values one uses .CompareTo(); I've replaced that with <.)

private int horvick_partition(int [] items, int left, int right, int pivotIndex) {
	int pivotValue = items[pivotIndex];
	Swap(items, pivotIndex, right);
	int storeIndex = left;
	for (int i = left; i < right; i++) {
	    if (items[i] < pivotValue) {
		Swap(items, i, storeIndex);
		storeIndex += 1;
	    }
	}
	Swap(items, storeIndex, right);
	return storeIndex;
}

This partitions the array in a single upwards pass; however, there is a lot more swapping.
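A sketch of a complete quicksort built around this single-pass partition, with the pivot index chosen at random by the caller as described. The partition is repeated so the block is self-contained; the class name QuickSortDemo and the method names are hypothetical, and this is not the code in qsort.cs.

```csharp
using System;

public static class QuickSortDemo {
    private static readonly Random rand = new Random();

    // Sorts items in place in ascending order.
    public static void Sort(int[] items) {
        Sort(items, 0, items.Length - 1);
    }

    private static void Sort(int[] items, int left, int right) {
        if (left >= right) return;
        int pivotIndex = rand.Next(left, right + 1);    // upper bound of Next is exclusive
        int mid = Partition(items, left, right, pivotIndex);
        // items[mid] is in its final place, so both recursive ranges are strictly smaller
        Sort(items, left, mid - 1);
        Sort(items, mid + 1, right);
    }

    // Single upward pass: values < pivot are swapped down to the front.
    private static int Partition(int[] items, int left, int right, int pivotIndex) {
        int pivotValue = items[pivotIndex];
        Swap(items, pivotIndex, right);                 // park the pivot at the right end
        int storeIndex = left;
        for (int i = left; i < right; i++) {
            if (items[i] < pivotValue) {
                Swap(items, i, storeIndex);
                storeIndex++;
            }
        }
        Swap(items, storeIndex, right);                 // move the pivot to its final place
        return storeIndex;
    }

    private static void Swap(int[] items, int i, int j) {
        int tmp = items[i];
        items[i] = items[j];
        items[j] = tmp;
    }
}
```

Because the returned position is excluded from both recursive calls, the infinite-recursion flaw of the simple partition cannot occur here, even when all values are equal.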

Consider the following array of data:

Suppose we use 11 as the pivot value in Bailey's method. We must first swap 11 so it is A[left]; after that swap the array is 11 3 8 18 7 14 5 13. Then:

    Decrement right until it points to 5
    Swap 11 and 5: 5 3 8 18 7 14 11 13
    Increment left until it points to 18
    Swap 18 and 11: 5 3 8 11 7 14 18 13
    Decrement right until it points to 7
    Swap 11 and 7: 5 3 8 7 11 14 18 13

Now the array is divided into elements < 11, 11 itself, and elements > 11.

Suppose we wanted to use 10 as pivotValue. Because 10 is not present in the array, we can't actually use Bailey's method.

Now consider this array:

0	1	2	3	4	5	6	7
11	3	11	18	7	14	11	13

We start by decrementing right to point to the rightmost 11, at A[6]. We swap A[6] with A[0], which leaves the array unchanged as both are 11, and then increment left to 1.
Increment left until it points to A[2]; swap A[2] and A[6] (again both are 11), decrement right to 5.
Decrement right to A[4]; swap A[2] and A[4] to get 11 3 7 18 11 14 11 13; increment left to 3.
A[3] > pivotValue so swap A[3] and A[4], decrement right to 3:

11 3 7 11 18 14 11 13

left = right = 3, and A[3] = 11

Note that both sides have 11's. Nonetheless, sorting the sides (11,3,7) and (18,14,11,13) will leave the entire array sorted:

(3,7,11) 11 (11,13,14,18)

The quicksort algorithm in Morin avoids these complications by partitioning the array into three zones: values < pivot, values == pivot, and values > pivot. Here's the code. If we want a separate partition() method, it has to return two values, hence my class IntPair.

    private static IntPair morin_partition(int[] A, int left, int right) {
        int pivot = A[left];	// Morin actually chose A[left + rand.Next(right-left+1)]
        int lo = left-1, j = left, hi = right+1;
        // A[left..lo] < pivot, A[lo+1..j-1] = pivot, A[hi..right] > pivot
        while (j < hi) {
            if (A[j] < pivot) {	
                 // move to beginning of array
                 lo+=1;
                 swap(A, j, lo);
                 j+=1;
            } else if (A[j] > pivot) {
                 hi-=1;
                 swap(A, j, hi); // move to end of array
            } else {
                 j++;
                 // keep in the middle
            }
        }
        return new IntPair(lo,hi);
    }

Try this on an array with some duplicates. How about {11,3,11,18,7,14,11,13}?
11    3    11    18    7    14    11    13
3    11    11    18    7    14    11    13
3    11    11    13    7    14    11    18
3    11    11    11    7    14    13    18
3    7    11    11   11    14    13    18

There is also, as written, a swap of 14 with itself.

The Morin 3-category technique allows pivots not in the array, as long as the pivot is not less than all elements of the array or greater than all elements of the array. Why?

One drawback of the Morin partition is that it is a little harder to do manually, as j, lo and hi are all moving.
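A sketch of a full sort built on the three-zone partition. The IntPair here is a hypothetical minimal stand-in, since the notes do not show that class, and the pivot is simply A[left] rather than a random element.

```csharp
public static class ThreeWayQuickSortDemo {
    // Hypothetical stand-in for the IntPair class: last index of the
    // "< pivot" zone and first index of the "> pivot" zone.
    private struct IntPair {
        public readonly int Lo, Hi;
        public IntPair(int lo, int hi) { Lo = lo; Hi = hi; }
    }

    // Sorts A in place in ascending order.
    public static void Sort(int[] A) {
        Sort(A, 0, A.Length - 1);
    }

    private static void Sort(int[] A, int left, int right) {
        if (left >= right) return;
        IntPair p = Partition(A, left, right);
        Sort(A, left, p.Lo);     // values < pivot
        Sort(A, p.Hi, right);    // values > pivot; the == pivot zone is already in place
    }

    private static IntPair Partition(int[] A, int left, int right) {
        int pivot = A[left];
        int lo = left - 1, j = left, hi = right + 1;
        // A[left..lo] < pivot, A[lo+1..j-1] == pivot, A[hi..right] > pivot
        while (j < hi) {
            if (A[j] < pivot) { lo++; Swap(A, j, lo); j++; }
            else if (A[j] > pivot) { hi--; Swap(A, j, hi); }
            else j++;
        }
        return new IntPair(lo, hi);
    }

    private static void Swap(int[] A, int i, int j) {
        int tmp = A[i];
        A[i] = A[j];
        A[j] = tmp;
    }
}
```

Since the pivot is taken from the array, the middle zone holds at least one element, so both recursive ranges are strictly shorter than left..right.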

Radix Sort (Bailey p 134)

Basic algorithm: bucket sorting. Divide a big dataset into "buckets", and then sort the buckets. MergeSort is a form of doing this when the buckets represent the two halves of the data.

What if we did the bucketing in low-to-high order? (least-significant to most-significant digit). That is, if the numbers are all < 1000, we sort first on the ones digit, then on the tens digit, and then on the hundreds digit.

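Is this O(n)? (answer: no; with d digits per value it takes d bucketing passes, so the cost is O(d·n), and d grows with the size of the largest value)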
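The low-to-high bucketing can be sketched as follows. This is a minimal illustration, assuming nonnegative int values and decimal digits; the names RadixSortDemo and Sort are hypothetical and this is not Bailey's code.

```csharp
using System.Collections.Generic;

public static class RadixSortDemo {
    // Sorts nonnegative ints by decimal digit, least significant digit first.
    public static void Sort(int[] A) {
        int max = 0;
        foreach (int v in A) if (v > max) max = v;

        List<int>[] buckets = new List<int>[10];
        for (int d = 0; d < 10; d++) buckets[d] = new List<int>();

        // one pass per digit of the largest value; long avoids overflow of place
        for (long place = 1; place <= max; place *= 10) {
            // distribute into buckets by the current digit, keeping arrival order
            foreach (int v in A) buckets[(int)(v / place % 10)].Add(v);
            // collect buckets 0..9 back into A
            int k = 0;
            foreach (List<int> b in buckets) {
                foreach (int v in b) A[k++] = v;
                b.Clear();
            }
        }
    }
}
```

This works only because each pass keeps values with the same digit in their previous relative order, so the ordering established by the lower digits survives the later passes.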

Sorting objects: TList.cs

osort.cs and TList.cs in my sorting folder. Note that I had to make a small correction to TList.cs

Demo: print out comparisons, by setting debug=true.

Finding the Median

Given an array A[0]..A[N-1], how can we find the median? One way is to sort the data, set mid=N/2 (integer division), and then return the value A[mid] if N is odd, or (A[mid-1]+A[mid])/2 if N is even. The run-time cost for sorting the array is O(N log N).
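A minimal sketch of the sorting approach, assuming int data and using the library Array.Sort on a copy so the caller's array is left alone. Here mid is N/2 with integer division, so the odd case returns the element at mid and the even case averages the elements at mid-1 and mid. The names SortedMedianDemo and Median are hypothetical.

```csharp
using System;

public static class SortedMedianDemo {
    // Returns the median of A; for even N, the average of the two middle values.
    public static double Median(int[] A) {
        if (A.Length == 0) throw new ArgumentException("empty array");
        int[] copy = (int[])A.Clone();      // do not reorder the caller's data
        Array.Sort(copy);
        int mid = copy.Length / 2;          // integer division
        if (copy.Length % 2 == 1) return copy[mid];
        return (copy[mid - 1] + (double)copy[mid]) / 2;
    }
}
```

For example, {5, 1, 3} has median 3, and {4, 1, 3, 2} has median 2.5.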

Can we do better?

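The standard strategy, which is O(N) on average, is based on the Quicksort partition: run the partition(A,0,N-1) method, and get back an index i. At that point we know A[i] is in the right place. Let mid be the position of the median. If i==mid, then A[i] is the median. If i<mid, the median must be in A[i+1]..A[N-1]; if i>mid, the median must be in A[0]..A[i-1].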

One catch is that finding the median of A[0]..A[i-1] is not particularly helpful. We need to expand the recursive case a little: to find the Kth element of a range, for 0≤K<right-left+1, we use
int findKth(A,left,right,K) // find Kth smallest element of A[left]...A[right], starting at K=0

The algorithm then becomes:

        int pivot = bailey_partition(A,left,right);    // now A[pivot] is (pivot-left)th smallest
        if (K == pivot-left)    return A[pivot];
        else if (K < pivot-left) {    // Kth-smallest must be among A[left]..A[pivot-1]
            return findKth(A,left,pivot-1,K);
        } else {                      // Kth-smallest must be among A[pivot+1]..A[right]
            return findKth(A, pivot+1, right, K-(pivot-left)-1);
        }

See median.cs.
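For experimenting with the -1's, here is a self-contained sketch of findKth. To keep it short it swaps in a single-pass partition around A[right] in place of bailey_partition; that substitution, and the names FindKthDemo, FindKth and Partition, are assumptions and not the contents of median.cs.

```csharp
using System;

public static class FindKthDemo {
    // Returns the Kth smallest element of A, where K = 0 is the minimum. Reorders A.
    public static int FindKth(int[] A, int K) {
        if (K < 0 || K >= A.Length) throw new ArgumentOutOfRangeException(nameof(K));
        return FindKth(A, 0, A.Length - 1, K);
    }

    // Finds the Kth smallest element of A[left]..A[right], for 0 <= K <= right-left.
    private static int FindKth(int[] A, int left, int right, int K) {
        int pivot = Partition(A, left, right);   // now A[pivot] is (pivot-left)th smallest
        if (K == pivot - left) return A[pivot];
        if (K < pivot - left) return FindKth(A, left, pivot - 1, K);
        // skip the pivot-left smaller elements and the pivot itself
        return FindKth(A, pivot + 1, right, K - (pivot - left) - 1);
    }

    // Single-pass partition around A[right]; returns the pivot's final index.
    private static int Partition(int[] A, int left, int right) {
        int pivotValue = A[right];
        int store = left;
        for (int i = left; i < right; i++) {
            if (A[i] < pivotValue) { Swap(A, i, store); store++; }
        }
        Swap(A, store, right);
        return store;
    }

    private static void Swap(int[] A, int i, int j) {
        int tmp = A[i];
        A[i] = A[j];
        A[j] = tmp;
    }
}
```

For {11, 3, 11, 18, 7, 14, 11, 13}, whose sorted order is 3 7 11 11 11 13 14 18, K=0 should give 3, K=5 should give 13 and K=7 should give 18. With a fixed pivot position the worst case is quadratic; a random pivot makes that unlikely.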

Demo: some cases to make sure the -1's, etc are sensible.