**MOTIVATION**

We use the term list to mean a collection of records, each record having one or more fields. The fields used to distinguish among the records are known as keys. Since the same list may be used for several different applications, the key fields for record identification depend on the particular application. For instance, we may regard a telephone directory as a list in which each record contains three fields: name, address, and phone number. The key is usually the person's name. However, we may wish to locate the record corresponding to a given number, in which case the phone number field would be the key. In yet another application we may desire the phone number at a particular address, so the address field could also be the key.

Once we have a collection of records, there are at least two ways in which to store them: sequentially or non-sequentially. For the time being let us assume we have a sequential list f, and we wish to retrieve a record with a certain key value k. If f has n records with f[i].key the key value for record i, 1 ≤ i ≤ n, then we may carry out the retrieval by examining the key values f[n].key, f[n − 1].key, …, f[1].key, in that order, until the correct record is located. Such a search is known as a sequential search, since the records are examined sequentially. Function SeqSearch carries out such a sequential search using the following data type:

We use functions getKey () and setKey () to get and set data member key. Note that the declaration of function getKey () is followed by the const keyword. This makes getKey () a const member function, which means that it must not modify any of the data members of class Element. A member function that is to be invoked by a const class object must be declared as a const member function. Since we will occasionally be using function getKey () on objects of type const Element, we declare it as a const member function.

Finally, note that even though the expression e.key is invalid in a C++ program, because it attempts to access a private data member, we use it in our descriptions of the algorithms for simplicity.

Note that the introduction of the dummy record 0 with f[0].key = k simplifies the search by eliminating the need for an end-of-list test in the while loop. Although this might appear to be a minor improvement, it actually reduces the run time considerably for large n. If no record in the list has key value k, then i = 0, and the above function requires n + 1 comparisons. The number of key comparisons made in the case of a successful search depends on the position of the key in the list. If all keys are distinct and the key f[i].key is being searched for, then n − i + 1 key comparisons are made. The average number of comparisons for a successful search is, therefore,

(1/n) Σ_{1 ≤ i ≤ n} (n − i + 1) = (n + 1)/2

It is possible to do much better when looking up phone numbers. The fact that the entries in the list (i.e., the telephone directory) are in lexicographic order (on the name key) enables one to look up a number while examining only a very few entries in the list. Binary search is one of the better-known methods for searching an ordered, sequential list. A binary search takes only O(log n) time to search a list with n records. This is considerably better than the O(n) time required by a sequential search. We note that when a sequential search is performed on an ordered list, the conditional of the while loop of SeqSearch can be changed to f[i].getKey() > k. This improves the performance for the case of unsuccessful searches.

Getting back to our example of the telephone directory, we notice that neither a sequential nor a binary search strategy corresponds to the search method actually employed by humans. If we are looking for a name that begins with the letter W, we start the search toward the end of the directory rather than at the middle. A search method based on this interpolation scheme would begin by comparing k with f[i].key, where i = ((k − f[l].key) / (f[u].key − f[l].key)) * n, and f[l].key and f[u].key are the values of the smallest and largest keys in the list. An interpolation search can be used only when the list is ordered. The behavior of such a search depends on the distribution of the keys in the list.

We have seen that as far as the searching problem is concerned, something is to be gained by maintaining the list in an ordered manner if the list is to be searched repeatedly. Let us now look at another example in which the use of ordered lists greatly reduces the computational effort. The problem we are now concerned with is that of comparing two lists of records containing data that are essentially the same but have been obtained from two different sources. Such a problem could arise, for instance, in the case of the United States Internal Revenue Service (IRS), which might receive millions of forms from various employers stating how much they paid their employees and then another set of forms from individual employees stating how much they received. So we have two lists of records, and we wish to verify that there is no discrepancy between the two sets of information. Since the forms arrive at the IRS in a random order, we may assume a random arrangement of the records in the lists. The keys here are the social security numbers of the employees.

Let the two lists be F1 and F2 with keys F1[i].key and F2[i].key. We make the following assumptions about the required verification:

(1) If there is no record in the employee list corresponding to a key F1[i].key in the employer list, a message is to be sent to the employee.

(2) If the reverse is true, then a message is to be sent to the employer.

(3) If there is a discrepancy between two records with the same key, a message to this effect is to be output.

Function Verify1 solves the verification problem by directly comparing the two unsorted lists. Its complexity is O(mn). On the other hand, if we first sort the two lists and then do the comparison, we can carry out the verification task in time O(t_sort(n) + t_sort(m) + n + m), where t_sort(n) is the time needed to sort a list of n records. As we shall see, it is possible to sort n records in O(n log n) time, so the computing time becomes O(max{n log n, m log m}). Function Verify2 achieves this time.

We have seen two important uses of sorting: (1) as an aid in searching and (2) as a means for matching entries in lists. Sorting also finds application in the solution of many other more complex problems, e.g., from operations research and job scheduling. In fact, it is estimated that over 25 percent of all computing time is spent on sorting, with some installations spending more than 50 percent of their computing time sorting lists. Consequently, the problem of sorting has great relevance in the study of computing. Unfortunately, no one method is the best for all initial orderings of the list being sorted. We shall therefore study several methods, indicating when one is superior to the others.

First let us formally state the problem we are about to consider. We are given a list of records (R1, R2, …, Rn). Each record Ri has key value Ki. In addition, we assume an ordering relation (<) on the keys so that for any two key values x and y, x = y or x < y or y < x. The ordering relation (<) is assumed to be transitive (i.e., for any three values x, y, and z, x < y and y < z implies x < z). The sorting problem then is that of finding a permutation, σ, such that Kσ(i) ≤ Kσ(i+1), 1 ≤ i ≤ n − 1. The desired ordering is (Rσ(1), Rσ(2), …, Rσ(n)).

Note that when the list has several key values that are identical, the permutation, σ, is not unique. We shall distinguish one permutation, σs, from the others that also order the list. Let σs be the permutation with the following properties:

(1) Sorted: Kσs(i) ≤ Kσs(i+1), 1 ≤ i ≤ n − 1.

(2) Stable: if i < j and Ki = Kj in the input list, then Ri precedes Rj in the sorted list.

A sorting method that generates the permutation σs is stable.

We characterize sorting methods into two broad categories: (1) internal methods (i.e., methods to be used when the list to be sorted is small enough so that the entire sort can be carried out in main memory) and (2) external methods (i.e., methods to be used on larger lists). The following internal sorting methods will be developed: insertion sort, quick sort, merge sort, heap sort, and radix sort. This development will be followed by a discussion of external sorting.

**INSERTION SORT**

The basic step in this method is to insert a record R into a sequence of ordered records, R1, R2, …, Ri (K1 ≤ K2 ≤ … ≤ Ki), in such a way that the resulting sequence of size i + 1 is also ordered. Function insert (Program 7.4) accomplishes this insertion. It assumes the existence of an artificial record R0 with key K0 = MININT (i.e., all keys are ≥ K0).

The use of R0 enables us to simplify the while loop, avoiding a test for end of list (i.e., i < 1). In insertion sort, we begin with the ordered sequence R0, R1 and successively insert the records R2, R3, …, Rn. Since each insertion leaves the resultant sequence ordered, the list with n records can be ordered by making n − 1 insertions. The details are given in function InsertionSort.

Analysis of Insertion Sort: In the worst case insert(e, list, i) makes i + 1 comparisons before making the insertion. Hence its complexity is O(i). InsertionSort invokes insert for i = j − 1 = 1, 2, …, n − 1. So, the worst-case time is

O(Σ_{1 ≤ i ≤ n−1} (i + 1)) = O(n²)

We can also obtain an estimate of the computing time of this method based on the relative disorder in the input list. Record Ri is left out of order (LOO) iff Ki < max{Kj | 1 ≤ j < i}.

The insertion step has to be carried out only for those records that are LOO. If k is the number of LOO records, the computing time is O((k + 1)n). The worst-case time is still O(n²). We can also show that the average time is O(n²).

**Example**

Assume that n = 5 and the input key sequence is 5, 4, 3, 2, 1. After each insertion we have:

j = −: **5**, 4, 3, 2, 1
j = 2: **4, 5**, 3, 2, 1
j = 3: **3, 4, 5**, 2, 1
j = 4: **2, 3, 4, 5**, 1
j = 5: **1, 2, 3, 4, 5**

For convenience, only the key field of each record is displayed, and the sorted part of the list is shown in bold. Since the input list is in reverse order, as each new record Rj is inserted into the ordered sublist R1, …, Rj−1, the entire sublist is shifted right by one position. Thus, this input sequence exhibits the worst-case behavior of insertion sort.

**Example**

Assume that n = 5 and the input key sequence is 2, 3, 4, 5, 1. After each iteration we have:

j = −: **2**, 3, 4, 5, 1
j = 2: **2, 3**, 4, 5, 1
j = 3: **2, 3, 4**, 5, 1
j = 4: **2, 3, 4, 5**, 1
j = 5: **1, 2, 3, 4, 5**

In this example, only R5 is LOO, and the time taken for each j = 2, 3, and 4 is O(1), whereas for j = 5 it is O(n).

It should be fairly obvious that InsertionSort is stable. The fact that the computing time is O((k + 1)n) = O(kn) makes this method very desirable for sorting sequences in which only a very few records are LOO (i.e., k ≪ n). The simplicity of this scheme also makes it about the fastest sorting method for small n (n ≤ 20, for example).

**Variations**

1. **Binary insertion sort:** We can reduce the number of comparisons made in an insertion sort by replacing the sequential searching technique used in insert with binary search. The number of record moves remains unchanged.

2. **List insertion sort:** The elements of the list are represented as a linked list rather than as an array. The number of record moves becomes zero because only the link fields require adjustment. However, we must retain the sequential search used in insert.

**EXERCISES**

1. Write the status of the list F = (12, 2, 16, 30, 8, 28, 4, 10, 20, 6, 18) at the end of each phase of InsertionSort.

2. Write a C++ function that implements binary insertion sort. What is the worst-case number of comparisons made by your sort function? What is the worst-case number of record moves made? How do these compare with the corresponding numbers for function InsertionSort?

3. Write a C++ function that implements linked insertion sort. What is the worst-case number of comparisons made by your sort function? What is the worst-case number of record moves made? How do these compare with the corresponding numbers for function InsertionSort?