CHAPTER 11

Other SETL native and utility packages

10.1. Introduction

One of the most essential features of very high level languages is the ability to interface easily and efficiently with lower level languages. This is important for two reasons. (I) Such interfaces make it possible to reuse the vast treasure of public domain code and libraries available on the Net, (i) particularly critical routines can be brought to high efficiency by coding them in lower level languages.

When SETL2 was revised in 1990, this need was recognized by Kirk Snyder, who added an interim Call-out and Call-back capability to the language. The present system provides a considerably more mature and powerful version of this important capability. Using it, the SETL system has been supplied with a collection of native packages providing useful libraries of high-efficiency functions in various areas. Besides the extensive Tk package described in the preceding chapter, these now include operations for string manipulation and matching, analysis of strings using regular expressions, image processing, numerical computation, computation with modular integers, and others. Still other native packages make important internal SETL system functions like parsing, compilation, and library manipulation available for external use within the SETL language. This includes the interface to Unix decribed below.

An interface to the Python language is also provided, making it possible to access all of Python's facilities, including its Web access operations.

The high-efficiency native packages are supplemented by a family of utility packages written in SETL itself. These provide a variety of string, graph, I/O, and integer procedures.

In this report we will first describe this collection of libraries in a manner helpful to their would-be user. Subsequently we descend to the 'C' level, to give the layer of detail needed by would-be implementors of additional native packages. This will make it necessary to describe parts of the internal structure of the SETL2 system and the way in which native packages are implemented.

10.1.1 native package Specifiers

Every native package must have a SETL specification, which must be compiled into a SETL library to make the native package available to the system.

An example of a SETL specification for a native package is:

   native package string_utility_pak;

        procedure breakup(obj,chars);            
        	-- break string into tuple at any occurrence of a c in chars. 
                        -- works recursively on tuples of strings, etc.

        procedure join(tuple,chars);             
        	-- joins tuple of strings into string by inserting 'chars' between

        procedure single_out(obj,chars);         
        	-- break string into tuple by separating occurrence of a c in chars
        	-- into singletons. works recursively on tuples of strings, etc.

        procedure segregate(obj,chars);
        	-- break string into tuple by separating runs of c's in chars
        	-- from nonruns.  works recursively on tuples of strings, etc.

        procedure keep_chars(string,chars);      
        	-- suppress chars of string that are not in 'chars'

        procedure suppress_chars(string,chars); 	
        	-- suppress chars of string that are 'chars'

        procedure case_change(string,mode);      
        	-- capitalize if mode is "lu"; decapitalize if mode is "ul"; 

			-- and more ...
   end string_utility_pak;

Although SETL performs no type checking at compilation time (and no type declarations appear in a native package specification), the specification does document the functions provided by the compiled package.

As the above example illustrates, a native package specification is identical to the specification part of a standard SETL package, with the exception that the keyword native appears in the package header. Any variable name can be used as a placeholder for a parameter in the native package Specification, but it is important the number of arguments match the number expected by the package implementation.

10.1.1 Using a native package

Once a native package is included by some SETL code that uses it, the functions provided by the will native package will become available to the user exactly as if they were built-in functions.

For example, a user of the breakup operation supplied by the native package whose specifier is seen above, would write:

   program test_breakup;
   use string_utility_pak;

      print(breakup({"a,b,c,d",["d,e","f"]},","));

   end;
to use it. When the above program is compiled, the system will verify statically that the breakup function is called correctly, i.e. with two parameters, but of course will not check the types of the arguments statically, since SETL treats these dynamically.

When program execution begins, the SETL runtime system will realize that the program seen above requires a native package, in this case 'string_utility_pak', and will search the appropriate directories for the presence of the DLL containing the code. If no such DLL can be found, the program will abort with an error message announcing that certain Native functions cannot be resolved. Otherwise, the DLL is loaded, and the system checks that all the functions declared in the native package specification are defined in the DLL.

10.2. A Survey of SETL's principal native packages, and some associated packages written in SETL

A main part of the business of this chapter is to review the principle native packages that have so far been developed for the SETL system. We begin by repeating the table of native packages given in Chapter 7, expanding it somewhat to include several closely associated packages or otherwise important packages written in SETL.

string_utility_pakProvides various commonly used string manipulation primitives, in high-performance form. These include string split and join operations, character suppression operations, and upper/lowercasing. A library of high-performance operations on strings represented as flat byte-arrays is also provided.
file_pakProvides various file and directory manipulation operations, including creation, deletion, listing, and navigation.
rx_pakHigh-performance regular-expressions package derived from the Gnu library.
stringm_pakHigh-performance exact and approximate string matching operations, including Knuth-Morris-Pratt and Boyer-Moore matching algorithms, Karp-Rabin probabilistic match, Aho-Corasik match of list of patterns to a string, suffix-tree match, edit-distance based approximate string matching and alignment, and detection of longest common substring of a pair of strings.
tk_pakInterface to the Tk widget set. This is the basis for SETL's interactive Graphical Interface, described in Chapter 10.
parser_pakInterface to SETL's parse and compile functions. 'Parse' analyses a SETL source file and returns its parse tree as a nested collection of tuples, making the standard SETL error messages available if there are parse errors. A 'compile' function, which takes a string as representing a SETL source file as input and compiles it into the currently active library is also provided.
ide_pakThis package gives access to the string contents of the currently active SETL edit window and makes it possible to create the interactive source-code analysis and manipulation utilities described below. Its functions also allow the currently selected area of an edit window to be detected and manipulated. A second group of functions in this package allow the library list currently in effect to be read and modified, packages to be reloaded after compilation, and programs to be rerun.
setldb_pakExtensive collection of SETL database operations, described in Chapter 12.
libnr_pakLarge library of high-performance numerical routines, derived from the collection described in the well-known treatise 'Numerical Recipes in C.'
libntl_pakPackage of integer, modular arithmetic, and modular polynomials package derived for Victor Shoup's well-known 'NTL' library.
unix_pakMakes the facilities of Unix available to SETL programs on systems allowing this.
python_pakInterface to the Python language and all of its many libraries. In particular,this gives access to Python's internet routines, including its URL and Socket routines.
grlib_pakExtensive library of image-manipulation functions. Multiplane images are handled in both byte and floating formats.
zip_pakA Lempel-Ziv based package of compression and Zip-File access utilities.
sound_pakProcedures for manipulation and analysis of sound in AIFF and related formats.
midi_pakProcedures for manipulation of music in MIDI format.
movie_pakProcedures for manipulation and display of video image sequences.

10.2.1 STRING_UTILITY_PAK (String analysis and manipulation utilities)

STRING_UTILITY_PAK provides various commonly used string manipulation primitives, in high-performance form. These include string split and join operations, character suppression operations, and upper/lowercasing. A library of high-performance operations on strings represented as flat byte-arrays is also provided. The specifier for this package is as follows.

native package string_utility_pak;
                
                -- ************ general string utilities  ************
                
    procedure breakup(obj,chars);        
        -- break string into tuple at any occurrence of a c in chars. 
                                                
        -- works recursively on tuples of strings, etc.
    procedure join(tuple,chars);         
        -- joins tuple of strings into string by inserting 'chars' between
    procedure single_out(obj,chars);     
        -- break string into tuple by separating occurrence of a c in chars
                                                
        -- into singletons.  works recursively on tuples of strings, etc.
    procedure segregate(obj,chars);      
        -- break string into tuple by separating runs of c's in chars
                                                
        -- from nonruns.  works recursively on tuples of strings, etc.
    procedure keep_chars(string,chars);  
        -- suppress chars of string that are not in 'chars'
    procedure suppress_chars(string,chars);
          -- suppress chars of string that are 'chars'
    procedure case_change(string,mode);  
        -- capitalize if mode is "lu"; decapitalize if mode is "ul"; 

    procedure hexify(string); 
                    	-- converts string to its hex representation
    
                
        
        -- ************ operations on flat strings  ************
    
    procedure flat_create(length);       
        -- creates a flat-string object, as a flat C-type buffer of specified length
    procedure flat_len(obj);             
        -- returns the length (in characters) of a flat-string object
    procedure flat_to_setl(obj);         
        -- converts a flat-string object to the SETL string with the same characters
    procedure flat_from_setl(string);    
        -- converts a SETL string to the flat-string object with the same characters
    
    procedure flat_clone(obj);           
        -- returns copy of a flat-string object (as flat-string object)
    procedure flat_add(obj1,obj2);       
        -- returns concatenation of two flat-string objects (as flat-string object)
    procedure flat_reverse(obj);         
        -- returns reverse of a flat-string object (as flat-string object)
    procedure flat_slice(obj,i,j);       
        -- returns indicated slice of a flat-string object
        -- (as flat-string object)
    procedure flat_slice_end(obj,i);     
        -- returns indicated tail slice of a flat-string object
        -- (as flat-string object)
    procedure flat_set_slice(obj,i,obj2);
        -- sets indicated slice of a flat-string object
        -- (from another flat-string object, in place)
     
    procedure flat_translate(obj,map);   
        -- translates characters of flat-string object, returns result  
    procedure flat_reverse_translate(obj,map);
       -- translates and reverses characters of flat-string object, returns result
    
    procedure flat_set_char(obj,i,c);    
        -- sets indicated character of a flat-string object
        -- (from SETL char, in place)
    procedure flat_get_char(obj,i);      
        -- gets indicated character of a flat-string object (as SETL char)
    
    procedure flat_translate_all(obj,off,minlen,codestr,rar); 
        -- 
    procedure flat_file_get(filename,start_char,end_char);    
        -- reads flat string from specified section of a file
    procedure flat_file_put(filename,start_char,obj);         
        -- write flat string to specified location in a file
    
    procedure flat_match_scores(obj1,obj2,off,repeats);       
        -- 
    
    procedure flat_toto_prepare(flats); 
        -- 
    procedure flat_toto_print(obj);     
        -- 
    procedure flat_toto_clear(obj);     
        -- 
    procedure flat_toto_match(obj,stg,maxnumber);             
        -- 
    

end string_utility_pak;

10.2.1.2 STRING_UTILITY_PAK Examples

General string utilities. The first parameter of 'breakup' can be a string, tuple or set of strings, tuple of tuples of strings, etc. Its second parameter must be a string of characters. The same is true of 'single_out' and 'segregate'. If the first parameter of 'breakup' is simply a string 'stg', 'breakup' breaks 'stg' into a tuple by cutting it at any occurrence of any c in its second parameter, 'chars'. (It two characters of 'chars' follow in immediate succession, two cuts are made, and an empty string between them results.) The characters belonging to 'chars' are removed.

'single_out' behaves in almost exactly the same way, but retains the characters belonging to 'chars', which appear as single characters in the result tuple. 'single_out' behaves similarly, but makes only one cut for each run of characters belonging to 'chars'; these runs are left in the result tuple, in which they alternate with run of characters not belonging to 'chars'.

The small example

	program test;                        -- string_utility test
	    use string_utility_pak;
	
	    print(breakup("a,b.c,.d",",."));
	    print(single_out("a,b.c,.d",",."));
	    print(segregate("a,b.c,.d",",."));
	
	end test;

illustrates these rules. Its output is

	["a", "b", "c", "", "d"]
	["a", ",", "b", ".", "c", ",", ".", "d"]
	["a", ",", "b", ".", "c", ",.", "d"]

The first parameter of 'breakup', 'single_out', or 'segregate' can also be a tuple or set of strings, tuple of tuples of strings, etc., in which case 'breakup', 'single_out', or 'segregate' is applied recursively to the elements or components of the set or tuple until strings to which it can be applied directly are reached. This is illustrated by the example

	program test;                        -- string_utility test 2
	    use string_utility_pak;
	
	    print(breakup({"a,b.c,.d","A,B.C,.D"},",."));
	    print(single_out(["a,b.c,.d","A,B.C,.D"],",."));
	    print(segregate({"a,b.c,.d","A,B.C,.D"},",."));
	
	end test;

which produces the output

	{["A", "B", "C", "", "D"], ["a", "b", "c", "", "d"]}
	[["a", ",", "b", ".", "c", ",", ".", "d"], 
		["A", ",", "B", ".", "C", ",", ".", "D"]]
	{["A", ",", "B", ".", "C", ",.", "D"], 
		["a", ",", "b", ".", "c", ",.", "d"]}

The recursive action of these three operations makes it easy to deal with arrays of objects represented by punctuated strings with multiple layers of punctuation marks. For example, if a 3-dimensional tensor of integers is represented in the string form

	stg :=  "1,2,3;11,12,13;21,22,23:101,102,103;111,112,113;"
		 + "121,122,123:201,202,203;211,212,213;221,222,223";

the following lines of code will read it and produce the output seen:

    print(unstr_rec(breakup(breakup(breakup(stg,":"),";"),",")));
    
    procedure unstr_rec(obj); 
        return if is_set(obj) then {unstr_rec(x):x in obj}
            elseif is_tuple(obj) then [unstr_rec(x):x in obj] else unstr(obj)
	  end if;
    end unstr_rec;
	[[[1, 2, 3], [11, 12, 13], [21, 22, 23]], 
		[[101, 102, 103], [111, 112, 113], [121, 122, 123]], 
			[[201, 202, 203], [211, 212, 213], [221, 222, 223]]]

The 'join' operation is a kind of inverse of 'breakup'. join(tuple,chars); takes a tuple of strings as its first argument and joins this into a string by inserting the characters of 'chars' between its components. An example is

		join(breakup("a,bb,ccc",","),"@@@")

which produces the output

		a@@@bb@@@ccc

Another example: to replace all the runs of (possibly multiple) whitespace characters in a string by single spaces, one can simply use

		join([x: x in segregate(stg," \t") | x(1) notin " \t"]," ")

'case_change' is a simple capitalization/decapitalization utility which operates in the two modes "lu" and "ul". It capitalizes its first string argument if its mode parameter (second parameter) is "lu"; decapitalizes if its mode is parameter is "ul",The following line of code

		print(case_change("a.b.c","lu")," ",case_change("A.B.C","ul"));

which produces the output

		A.B.C a.b.c

shows this.

'keep_chars' and 'suppress_chars' are simple character-filtering utilities. 'keep_chars' suppresses all characters of its first string argument that are not found in its second argument 'chars'; 'keep_chars' suppresses all characters of its first string argument that are found in its second argument 'chars'. This is shown by the example

	print(keep_chars("a.b.c.A.B.C","abc")," ",
		suppress_chars("a.b.c.A.B.C","."));

which produces the output

		abc abcABC
Operations on flat strings. The standard SETL string objects are designed to handle string modifications, including concatenation and insertion operations, rapidly for strings of moderate length but operations on them tend to bog down as the strings involved grow longer.For example, the double loop
	for k in [1..50] loop
	    stg := 1000 * " ";  
	    for j in [1..#stg] loop stg(j..j) :=  "ab"; end loop;
	end loop;

performs its 50,000 insertions in strings of length averaging 1200 in 4 seconds (450 MHz Mac processor), but the code

	for k in [1..5] loop
	    stg := 10000 * " ";  
	    for j in [1..#stg] loop stg(j..j) :=  "ab"; end loop;
	end loop;

requires 43 seconds to perform the same number of insertions in strings of length averaging 12000. The reason is that the internal SETL string representation is as a linked list of short string sections. For work with longer strings, and provided that the operations involved do not modify the length of the strings involved, a C-like 'flat' string representation, which allows high-efficiency machine-level addressing of individual, will be much more efficient. A library of operations on such objects, which we shall call 'flat strings', is provided by string_utility_pak. The mutually inverse operations

		flat_from_setl(stg) and flat_to_setl(obj)
convert standard SETL strings to and from 'flat string' objects. This is shown in the following example,
	print(flat_to_setl(obj := 
		flat_from_setl("Hello World"))," ",obj," ",flat_len(obj));

which produces the output

		Hello World  11

This example also shows the use of the 'flat_len' operation, which returns the length (in characters) of its flat-string argument.

	flat_set_char(obj,i,c)		and  flat_get_char(obj,i)

provide access to individual characters of flat strings, the first of these setting a designated character and the second retrieving a single character. Examples are

	obj := obj2 := flat_from_setl("Hello World"); 
	print(flat_get_char(obj,1));
	flat_set_char(obj,1,"J");
	print(flat_to_setl(obj)," ",flat_to_setl(obj2));

producing the result

		H
		Jello World Jello World 

Note that this operation (characteristically for native objects, but uncharacteristically for SETL) has 'pointer' semantics; it changes the object on which it acts 'in place'. The same is true for the 'flat_set_slice' operation, as seen in the example

	obj := obj2 := flat_from_setl("Hello World");
	print(flat_to_setl(flat_slice(obj,2,5))," "
		flat_to_setl(flat_slice_end(obj,2)));
	flat_xs := flat_from_setl("XXXX");
	flat_set_slice(obj,2,flat_xs); 
	print(flat_to_setl(obj)," ",flat_to_setl(obj2));

This produces the output

		ello ello World
		HXXXX World HXXXX World

The operation flat_slice(obj,i,j) returns the indicated slice of a flat-string object (as a flat-string object). The operation flat_slice_end(obj,i) returns indicated tail slice of a flat-string object, from the indicated character to the end, (as a flat-string object). The operation flat_set_slice(obj,i,obj2) sets the indicated slice of a flat-string object, starting at the indicated position, and for as many characters as ore in obj2 , in place (s the preceding example shows).

flat_clone(obj) creates a copy of the indicated flat string. flat_add(obj1,obj2) returns the concatenation of two flat-string objects (as a flat-string object). These appear in the following small example, whose output is 'Hello World!Hello World!'.

	Hello World!Hello World!
	program test;                        -- flat-string cloning and adding
	    use string_utility_pak;
	    
	    flat_stg := flat_from_SETL("Hello World!");
	        
	    print(flat_to_SETL(flat_add(flat_stg,flat_clone(flat_stg))));
	                -- print translated 
	        
	end test;

flat_translate(obj) and flat_reverse_translate(obj)

program test;               -- flat_translate and flat_reverse_translate
    use string_utility_pak;
    
    flat_stg := flat_from_SETL("Hello World");
    
    map_stg := "" +/ [char(j): j in [0..255]];
                -- all 256 characters map to themselves
    map_stg(abs("H") + 1) := "J";
                    -- change two of the character mappings
    map_stg(abs("o") + 1) := "A";
    map :=  flat_from_SETL(map_stg);        -- convert to flats string
    
    print(flat_to_SETL(flat_translate(flat_stg,map)));
                -- print translated 
    print(flat_to_SETL(flat_reverse_translate(flat_stg,map)));
        -- also translated and reversed
        
end test;
The output produced is
			JellA WArld
			dlrAW AlleJ

10.2.2 RX_PAK (Regular Expressions Package)

RX_PAK, which is directly derived from the Free Software Foundation's GNU regular expression package, provides just three operations, but these are very general and powerful. The package specifier is
native package rx;            -- derived from the GNU regular expression package

procedure regcomp(rw rx,pattern,flags);
    -- sets rx to the compiled form of the regular expression 'pattern'
    -- and returns 0 or an error code.
    -- flags is an integer whose bits allow for special options,including 
    -- extended =  (treat as extended expression, see GNU documentation)
    -- case =  (ignore case); nosub = (don't return matching parts info)
    -- line =  (match list of substrings delimited by newlines separately)
    -- for additional options, see the detailed  documentation on the 
    -- Free Software Foundation website 

procedure regexec(rx,s,flags,rw pos);
    -- runs compiled rx against string, returning 0 or an error code
      -- if the match succeeds, pos is returned as a tuple [] defining
      -- the way in which the whole expression
      -- and its subparts matched 
      -- flags is an integer whose two low order bits allow for
      -- the following special options:
      -- NOT_BEGLINE (value = 1): don't regard start of string as line start 
      -- NOT_ENDLINE (value = 2): don't regard end of string as line end 
 
procedure regerror(err,rx);
  -- returns detailed error string if rx failed and
  -- returned 'err' as its error code

end rx;
In the next few paragraphs, we give a brief syntactic and semantic summary of regular expressions and their use. Regular expressions (which are the chocolate truffles of programming: small but amazingly rich) are patterns which can be matched efficiently to strings. Each such pattern is itself represented by a string, which must conform to the simple, very condensed syntax described in the next few paragraphs. This syntax treats a small number of characters, namely

.     *     [     ]     :     /     ^     $

in a special way, as syntactic/semantic markers, designating such things as subpattern repetition, and alternation. (Each of these characters can then be used 'as itself' within other strings by being preceded with the backslash character '\'). As a 'gotcha' for the unwary, a few other important characters must be preceded with a backslash to function as syntactic/semantic markers; if unescaped they represent themselves. These are the characters

\(     \)         \{     \}     \|     \+     \?    \digit

Yet a further nuisance arises from the fact that to be written within SETL quotes the backslash character '\' must be doubled to '\\'.

The regular expression matching process works left-to-right through the 'target' string being matched, and tries to find the leftmost-longest substring of the target which matches the pattern. More specifically, the normal matching process begins first at the first character of the target string, but then if no match can be found, tries again starting at the second, third, .. characters of the target, until a match can be found; the match found is then the longest substring, starting at the given position, which matches the pattern. For simple matches,the match process generates and return a pair [i,j], where i is the starting position of the match, and j is the position of the first subsequent character which cannot be matched, i.e. 1 past the position of the last matched character. (When matches involving patterns and sub-patterns are performed, a more complex tuple is returned, as explained below.) if no match can be found, an empty tuple is returned.

The 'flags' parameters appearing in 'regcomp' and in 'regexec' allow regular expression compilation and matching to proceed in a variety of modes, e.g. case-sensitive and case-insensitive matching. Details concerning these flags and their use are given at the end of the present section.

For the rest one must understand what strings are matched by each possible pattern string. The rules determining this are as follows.

  1. Any non-special string matches itself. For example, the code
     program test;                 -- regular expressions example 1
       use rx;
    	    
       regcomp(regx,"Bc",0);              -- compiles "Bc" into 'regx'
       regexec(regx,"abcdaBcd",0,pos);
    	              -- matches compiled "Bc" to "abcdaBcd"
    	    
       print(pos);
    	
     end test;
    

    produces the output

    		[[6, 8]]

    which indicates that the matched section of the target runs from its 6th to its 7th character.

  2. the character . matches any single character. Thus the code
     program test;               -- regular expressions example 2
       use rx;
    	    
       regcomp(regx,"b..a",0);          -- compiles "b..a" into 'regx'
       regexec(regx,"abcdaBcd",0,pos); 
    	               -- matches compiled "b..a" to "abcdaBcd"
    	    
       print(pos);
    	
     end test;

    matches the substring 'bcda' of the target and so produces the output

    		[[2, 6]]

  3. bracketed constructs like [a-z0-9$&] consisting of a concatenation of character ranges and single characters match any character that is either explicitly listed or belongs to one of the ranges listed. For example,
     program test;                        -- regular expressions example 3
       use rx;
    	    
       regcomp(regx,"ab[a-df-z!][a-df-z!]d",0);
            -- compiles "ab[a-df-z!][a-df-z!]d" into 'regx'
       regexec(regx,"Aab!cdzBcd",0,pos);
            -- matches compiled "ab[a-df-z!][a-df-z!]d" to "Aab!cdzBcd"
       
       print(pos);
    	
    	end test;

    produces the output

    		[[2, 7]]

    since it matches the substring 'ab!cd' of the target.

  4. [^chars_spec] matches the set of characters complementary to the characters explicitly listed in'chars_spec' or belonging to one of the ranges listed. For example,
    	program test;                        -- regular expressions example 4
       use rx;
       
       regcomp(regx,"[^a-d][a-df-z!]d",0);
                -- compiles "[^a-d][a-df-z!]d" into 'regx'
       regexec(regx,"aab!cdzBcd",0,pos);
                  -- matches compiled "[^a-d][a-df-z!]d" to "aab!cdzBcd""
       
       print(pos);
    	
    	end test;

    produces the output

    		[[4, 7]]

    since it matches the substring '!cd' of the target.

  5. [[:builtin:]] matches the set of characters designated by one of a list of keywords designating important character sets. The sets provided are designated by the keywords alnum (all alphanumeric characters), alpha (all alphabetic characters), lower (all lowercase alphabetic characters), upper (all uppercase alphabetic characters), digit (all digits), xdigit (all hexadecimal digits), punct (all punctuation marks), blank (a blank character), space (all whitespace), cntrl (control characters), graph (all printable characters except space). For example,
     program test;                        -- regular expressions example 5
       use rx;
       
       regcomp(regx,"[[:punct:]]c[[:digit:][:punct:]][[:digit:][:punct:]]",0);         
       regexec(regx,"aab!c9,Bcd",0,pos);           
       
       print(pos);
    	
     end test;

    since it matches the substring '!c9,' of the target,and so produces [[4, 8]].

  6. To this point we have listed constructs matching single characters and character runs of prespecified length. Now we begin to list regular expression forms designating repetitions, and so capable of matching character runs of arbitrary length. These are all postfixes. The first of these is '\+'. When postfixed to an arbitrary regular subexpression, this designates an arbitrary positive number of repetitions (at least one) of the subexpression. An example is
     program test;                        -- regular expressions example 6
        use rx;
        
        regcomp(regx,"[[:digit:][:punct:]]\\+",0);          
        regexec(regx,"aab!9,99!!888Bcd",0,pos);           
        
        print(pos);
    
     end test;

    which matches the substring '!9,99!!888' of the target, and so produces [[4, 14]].

  7. when postfixed to a subitem of a regular expression, the construct \{m,n\}, where m and n are both integers, calls for at least m and at most n repetitions of the preceding item. This is shown by the example
     program test;       -- regular expressions example 7
       use rx;
       
       regcomp(regx,"[[:digit:][:punct:]]\\{2,9\\}",0);  
       	-- compiles "Bc" into 'regx'
       regexec(regx,"aab!9,99!!888Bcd",0,pos);
                   -- matches compiled "Bc" to "abcdaBcd"
       
       print(pos);
    	
     end test;

    which produces the output [[4, 13]] since it matches the 9 characters '!9,99!!88' of the target.

  8. The suffix '\?' is exactly equivalent to '{0,1}', e.g.
     program test;      -- regular  expressions  matching example 8
       use rx;
       
       print(regcomp(regx,"\\(fe\\)\\?d",0));     
       regexec(regx,"fed",0,pos); print(pos);
       regexec(regx,"d",0,pos); print(pos);
    	
     end test;

    is seen to match successfully in both cases since the 'optional' group '\(fe\)\?' matches both the 'fe' in'fed' and the empty string in 'd'.

  9. The suffix '*' is like '\+', but allows 0 repetitions of the preceding item. and so can match a null string.

  10. The next two regular expression constructs serve to group subexpressions and allow one of several such subexpressions to be matched. The first of these is simply \(subexp\), which groups a subexpression. The more general form \(subexp1|subexp2|subexp3|...\) lists alternative patters separated by the sign '|'. These subpatterns are tried in sequence; the one leading to the longest match is then selected for the overall match.

    A first example of this is

     program test;         -- regular expressions example 9a
       use rx;
       
       regcomp(regx,"c\\(a\\+\\)",0); 
                   -- compiles 'c\(a\+\)' into 'regx'
       regexec(regx,"caaaabb",0,pos);
                   -- matches compiled 'c\(a\+\)' to "caaaabb"
       
       print(pos);
    	
     end test;

    which we shall compare to

     program test;         -- regular expressions example 9b
       use rx;
       
       regcomp(regx,"ca\\+",0);
                        -- compiles 'ca\+' into 'regx'
       regexec(regx,"caaaabb",0,pos);
                   -- matches compiled 'c\(a\+\)' to "caaaabb"
       
       print(pos);
    	
     end test;

    The difference between these two examples is that in the second form the regular expression used is simply 'ca\+', which matches 'c' followed by indefinitely many a', while in the first version the regular expression is the more sophisticated 'c\(a\+\)', in which the same repeated a's are matched, but as a group. The second example generates the expected match tuple [[1, 6]], but the first example generates the less expected [[1, 6], [2, 6]]. This is because besides the first, widest pair representing the entire range of characters matched, each subgroup on a regular expression generates an additional pair showing the range matched by the subgroup. In the example shown, this is the run of characters 'aaaa' in the second through fifth positions. However, if the group as a whole is repeated in the regular expression, only the last of its matches generates such a pair. The effect of this can be seen in the example

     program test;        -- regular expressions example 9c
    	   use rx;
    	   
    	   regcomp(regx,"c\\(a\\)\\+",0);
    	            -- compiles 'c(a)\+' into 'regx'
    	   regexec(regx,"caaaabb",0,pos);
    	            -- matches compiled 'c(a)\+' to "caaaabb"
    	   
    	   print(pos);
          
     end test;

    Here, since we match the repeated group '(a)\+', only its final match generates a pair into the resulting tuple, which is [[1, 6], [5, 6]], indicating that the last match of the repeated group is to the character 'a' appearing 5th in the target string. (Compare this to the [[1, 6], [2, 6]] returned when 'c\(a\+\)' is matched instead.

    The example

     program test;        -- regular expressions example 9b
    	   use rx;
    	   
    	   regcomp(regx,"c\\(a\\+\\)\\(b\\+\\)",0);
    	    -- compiles 'c(a)\+(b)\+' into 'regx'
    	   regexec(regx,"caaaabb",0,pos);
    	       -- matches compiled 'c(a)\+' to "caaaabb"
    	   
    	   print(pos);
          
     end test;

    which returns [[1, 8], [5, 6], [6, 8]], makes another rule apparent. This is the fact that a series of successively matched subexpressions generates a series of successive pairs into the result returned. Each subexpression generates a pair indicating the range of characters that it has matched, or, if it is repeated, the range of characters matched by the last of its repetitions.

    The general rule for construction of the tuple returned upon successful match of a regular expression involving (possibly nested) subexpressions and repetitions can now be stated as follows. Successful match generates a series of pairs corresponding to character ranges. The first such pair defines the total range covered by the match. This is followed by a sequence of pairs, grouped into successive subsequences corresponding to the series of top-level subgroups of the regular expression. For each such subgroup, the run of characters matched by the (final repetition of) the subgroup (if repeated) is indicated. If the subgroup contains nested sub-subgroups, the same rule is applied recursively, the pairs generated by any single subgroup always following successively in the overall list of pairs returned.

    All this is shown in the example

    	program test;      -- regular expressions example 10
       use rx;
       
       regcomp(regx,"\\(\\(a\\+b\\+\\)\\+\\(c\\+\\)\\)\\(d\\+\\)",0);     
       regexec(regx,"aaaabbaabbbcccddd",0,pos);         
    	
       print(pos);
    	
    	end test;

    which returns the list of pairs

    		[[1, 18], [1, 15], [7, 12], [12, 15], [15, 18]]
    and repays careful study. The value returned reflects the following facts:

    1. the full regular expression '\(\(a\+b\+\)\+\(c\+\)\)\(d\+\)' matches the whole target string "aaaabbaabbbcccddd";

    2. its regular subexpression '\(\(a\+b\+\)\+\(c\+\)\)' is not repeated, and matches the first 14 characters of the target string;

    3. within this subexpression, the last repetition of the repeated sub-subexpression \(a\+b\+\) matches characters 7-11, and is followed by the sub-subexpression \(c\+\), which matches characters 12-14;

    4. the final part '\(d\+\)' of the regular expression matches characters 15-17.

    5. Within a parenthesised subexpression of a regular expression, any number of alternatives can be separated by the sign '\|', designating 'or'. For example,
       program test;      -- regular expressions matching example 11
         use rx;
         
         regcomp(regx,"\\(\\(ab\\)\\+\\|\\(cd\\)\\+\\)\\+",0);     
         regexec(regx,"ababcdcdcd",0,pos);         
      	
         print(pos);
      	
       end test;

      returns [[1, 11], [5, 11], [5, 5], [9, 11]] since the alternative '\(\(ab\)\+\\(cd\)\+\)\+' matches either runs of 'ab' or runs of 'cd',and so the whole target string "ababcdcdcd" is matched.

    6. As matching proceeds, the matcher keeps track of the subexpressions it encounters and the substrings they match. This makes it possible to use the construct '/n', where n can be any digit from 1 to 9, to refer to 'the substring matched by the n-th subexpression' (but of course this construct can only be used in positions following the n-th subexpression.) This gives a way of making later matches depend on the outcome of previously matched regular subexpressions. A simple example is '\(.\+)\1', which matches any string that is composed of two identical halves, since `\(.*\)' matches the first half, which can be anything, but then '\1' must match exactly the same string. Another example is
       program test;      -- regular  expressions  matching example 12
         use rx;
         
         regcomp(regx,"\\(cd\\+\\)\\+\\(ab\\+\\)\\+z\\1\\(\\2\\)",0);     
         regexec(regx,"cdcddababbzcddabb",0,pos);         
      	
         print(pos);
      	
       end test;

      which returns [[1, 18], [3, 6], [8, 11], [15, 18]] since its first part '\(cd\+\)\+\\(ab\+\)\+' matches 'cdcddababb' and associates the two subexpressions it contains with 'cdd' and 'abb' (the last character ranges these repeated regular subexpressions match); then '\1' and '\(\2\)' match the second 'cdd' and 'abb' respectively. The latter is a subgroup match, and so generates the pair [15, 18] into the result tuple. This example also shows that the construct '/n' can be included in grouped subexpressions, etc/

    7. The special characters '^' and '$' are used to pin matches to the start and end of a target string respectively. Both match null strings, but '^' matches only at the very start, and '$' only at the very end, of the target string. This is shown in the following example:
       program test;                 -- regular expressions example 13
         use rx;
      	   
         regcomp(regx,"car$",0);          -- compiles 'car$' into 'regx'
         regexec(regx,"carcar",0,pos);
             -- matches compiled 'car$' to "carcar"
      	   
         print(pos);
            
       end test;

      returns [[4, 7]] since it can only match the second occurrence of 'car' in 'carcar'. With the '$' removed it would instead match the first occurrence, and so returns [[1, 4]]. Similarly

       program test;                 -- regular expressions example 14
      	use rx;
      	   
      	  regcomp(regx,"^car",0); 
      	  regexec(regx,"acar",0,pos);    
      	  print(pos);
            
       end test;

      fails to match, since it can only match at the start of the target string. Without the '^' it would return [[2, 5]], since a match starting at the second character would be found.

    8. A few other special constructs which pin matches to specific position at the start, end, or middle of words are provided. These are `\>' (end of word), `\<' (beginning of word) ,`\b' (beginning or end of word), and `\B' (not beginning or end of word). All of these constructs match null strings in their indicated positions. This is shown in the composite example
      program test;      -- regular expressions example 15
         use rx;
         
          regcomp(regx,"cat\\>",0);     
          regexec(regx,"thecatcat catcatcat",0,pos);         
          print(pos);
         
      	 regcomp(regx,"\\<cat",0);     
          regexec(regx,"thecatcat catcatcat",0,pos);         
          print(pos);
         
          regcomp(regx,"\\Bcat\\B",0);     
          regexec(regx,"the catcat catcatcat",0,pos);         
         print(pos);
         
          regcomp(regx,"\\bcat\\B",0);     
          regexec(regx,"thecatcat catcatcat",0,pos);         
         print(pos);
      	
      	end test;

      which returns

      	[[7, 10]]
      	[[11, 14]]
      	[[15, 18]]
      	[[11, 14]]

      reflecting the fact that

      1. 'cat\>' can only match at the end of a word, hence the second occurence of 'cat';

      2. '\<cat' can only match at the start of a word, hence the third occurence of 'cat';

      3. '\Bcat\B' can only match in the middle of a word, hence the fourth occurence of 'cat';

      4. '\bcat\B' can only match in the start of a word of more than 3 letters, hence the third occurence of 'cat'.

Regular expression compilation and matching mode flags

The 'flags' parameters appearing in 'regcomp' and in 'regexec' allow regular expression compilation and matching to proceed in a variaety of modes, e.g. case-sensitive and case-insensitive maching. To set these flags, one passes appropriate integer 'flags' parameter values to 'regcomp' and 'regexec' respectively. Each flag corresponds to a bit position in this integer, which must be 1 for the corresponding flag to be set. The following table lists the 'regcomp' flags.

EXTENDED (value = 1)Treat the pattern as an extended regular expression, rather than as a basic regular expression. Basic regular expressions are regular expressions having the sytax detailed in the preceding paragrphs. Extended regular expressions differ from these in that instances of the characters

( ) { } + ? |

do not need to be preceded by the backslash character '/'. That is, we can write '(', ')', '+', '?', '|', '{', and '}' instead of '\(', '\)', '\+', '\?', '\|', '\{', and '\}' respectively, a welcome relief. In extended regular expressions, we can also use {m} as an abbreviation for {m,m}, and can use {m,} to mean 'at least m occcurences'. Examples are given below.

ICASE (value = 2)Ignore case when matching letters.
NEWLINE (value = 4)Don't treat the end-of-line character as an ordinary character; instead, treat each line as a target to be matched. This allows '^' to match at the start of any line, and '$' to match at the end of any line, but prevents '.' from matching the end-of-line character. Examples are given below.

Here are examples of the use of these flags.

   program test;         -- regular expressions example 9a.1
       use rx;
       
       regcomp(regx,"c(a+)",1);                  -- compiles 'c(a+)' into 'regx'
       regexec(regx,"caaaabb",0,pos);
                   -- matches compiled 'c(a+)' to "caaaabb"
       
       print(pos);

   end test;/

is the extended regular expression version of Example 9a, and returns the same result. To use and ordinary regular expression we would have to write "c\\(a\\+\\)", as in Example 9a, instead. Similarly

   program test;      -- regular  expressions  matching example 11.1
       use rx;
    
       regcomp(regx,"((ab)+|(cd)+)+",1);     
       regexec(regx,"ababcdcdcd",0,pos);         
                         
    print(pos);
                         
   end test;

is the extended regular expression version of Example 9a, and returns the same result. Plainly "((ab)+|(cd)+)+" is much more readable than "\\(\\(ab\\)\\+\\|\\(cd\\)\\+\\)\\+".

The example

   program test;      -- regular expressions example 16
       use rx;
    
       regcomp(regx,"\\(abc\\)\\+",2);     
       regexec(regx,"ABCabc",0,pos);         
                         
       print(pos);
                         
   end test;

which returns [[1, 7], [4, 7]], or still better its extended regular expression version

   program test;      -- regular expressions example 16b
       use rx;
    
       regcomp(regx,"(abc)+",3);     
       regexec(regx,"ABCabc",0,pos);         
                         
       print(pos);
                         
   end test;

shows the use of case-insensitive matching.

The example

	program test;      -- regular  expressions  matching example 17
	    use rx;
	 
	    regcomp(regx,"e$",4);     
	     regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos);         
	    print(pos);
	 
	    regcomp(regx,"^J",4);     
	     regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos);         
	    print(pos);
	 
	    regcomp(regx,"e$.+^J",7);     
	     regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos);         
	    print(pos);
	                      
	end test;

which returns

		[[18, 19]]
		[[20, 21]]
		[]

shows the alternative treatment of end-of-line characters, and the failure of the third match in this example shows also that if this option is used matches can never extend over more than one line.

10.2.2.1 class rxp (A SETL front-end for using Regular Expressions)

It is often useful to simplify the use of a native package by representing the raw functions it provides in some more convenient object syntax. In this section we present such a 'wrapper class', called 'rxp', for the regular expressions native package described in the preceding section. This allows us to write the regular expression and matching operation of the preceding section as 'pattern_object * target', where the 'pattern_object' is created from a regular expression string by writing

pattern_object := rxp(pattern,flags);

Here 'pattern' is a regular expression string of the kind described in the preceding section, and 'flags' is an integer composed of bitflags, as described there. If the matching operation pattern_object * target succeeds it returns exactly the kind of position tuple described in the preceding section.

This object class also allows 'pattern_object ** target' to be used to match a regular expression repeatedly to string

Note that the class seen below interprets the error codes returned by 'regcomp' and so explains their meaning.

The class code is as follows.

class rxp;        -- SETL regular SETL regular-expression class

    procedure create(pattern,flags);
            -- compile pattern into regular expression
    
end rxp;

class body rxp;        -- SETL regular SETL regular-expression class
    use rx;
    var native_rx;     -- the native compiled form of the regular expression
    
    procedure create(pattern,flags);   -- compile pattern into regular expression
        const REG_NOSUB := 2;
        flags ?:= REG_NOSUB;        -- use no flags as default
        if (errc := regcomp(native_rx,pattern,flags)) = 0 then return; end if;
        
        message_string := case errc                -- otherwise there is an error
            when 1 => "Bad brackets \\{...\\}"
            when 2 => "Bad regular expression syntax"
            when 3 => "Bad use of repetition operator, e.g. *,?,+ \\{...\\}"
            when 4 => "Bad collating element"
            when 5 => "Bad character class name"
            when 6 => "Terminal occurrence of \\"
            when 7 => "Bad digit in digit construct"
            when 8 => "Unbalanced square brackets"
            when 9 => "Unbalanced parentheses \\(...\\)"
            when 10 => "Unbalanced braces \\{...\\}"
            when 11 => "Bad endpoint in a range expression"
            when 12 => "Ran out of memory during compilation"
        end case;
        
        abort("**** REGULAR EXPRESSION COMPILATION ERROR: " + message_string);
        
    end create;

    procedure self * stg; 
           -- match regular expression to string, returning [] or match-tuple

        if (errc := regexec(native_rx,stg,0,pos_tuple)) > 1 then
                -- ran out of memory
            abort("**** REGULAR EXPRESSION EXECUTION ERROR: "
            	 + "Ran out of memory");
        end if;
        
        return if errc = 0 then pos_tuple else [] end if;

    end;

    procedure self ** stg;
         -- match regular expression repeatedly to string,
         -- returning [] or match-tuple
        
        matches := [];        -- these will be collected
        match_offset := 0;    -- offset to apply to match tuple
        
        while stg /= "" loop

            if (errc := regexec(native_rx,stg,0,pos_tuple)) > 1 then
                    -- ran out of memory
                abort("**** REGULAR EXPRESSION EXECUTION ERROR: "
                	 + "Ran out of memory");
            end if;
        
            if errc > 0 or (nexc := pos_tuple(1)(2)) = 1 then 
            	return matches; 
            end if;
        -- nothing matched

            matches with:= 
            	[[x + match_offset,y + match_offset]: [x,y] in pos_tuple];
            	        -- apply the match offset to the tuple returned
            match_offset +:= (nexc - 1);
                     -- advance the match offset to the last matched character
            stg := stg(nexc..);
                    -- continue with the remaining part of the string
            
        end loop;
        
        return matches;
        
    end;
    
end rxp;

The following small test program illustrates the use of this object class.

	program test;      -- regular expressions class example
	    use rxp;
	    
	   print(rxp("ab\\+",OM) * "aabbabbabbb");  
	   print(rxp("ab\\+",OM) ** "aabbabbabbb");  
	
	end test;

The following small test program illustrates the use of this object class. It returns

		[[2, 5]]
		[[[2, 5]], [[5, 8]], [[8, 12]]],

the first pair corresponding to the three characters 'abb' matched by the first, uniterated, match operation, and the list of three pairs corresponding to the threefold match iteration triggered by the second match operation.

10.2.2.3 package extended_SETL_lexer (A lexical analysis package for SETL)

Next we give a larger example of the use of regular expressions, and use it to discuss some issues of style in regular expression usage. Our example implements a full 'lexical analyzer' for SETL. This decomposes any SETL program (or package, or class) into the series of tokens, blanks, and line_ends of which it is composed. Comments are not decomposed, but appear as simple strings. The analyzer allows the following list of 'extra operation signs' to appear "~!@$%^&". Words beginning with such signs

The full regular expression for this entire lexical analyzer would be a totally incomprehensible coded string of several hundred characters. To have a chance of writing this correctly (which will certainly involve some amount of debugging) we must find some way of decomposing it into the string-sum of meaningful submodules. A technique generally applicable to large regular expressions is to think of it as a (perhaps recursive) alternation of concatenated subparts which can be written separately, debugged, and then combined. To make an alternation out of its separate alternatives we have then only to write

		"("alternative1 + "|" + alternative2 + "|" + ... + alternativek")"
To make a concatenation out of its separate pieces we write
		"piece1 + piece2 + ... + piecek
This is done to construct the following lexical analyzer. Note that the analyzer is written as an extended regular expression to reduce clutter. At the top level, it sees the string to be analyzed as a concatenation of whitespace, variable names, quoted strings, based numerals, ordinary numerals, operation words, comments, and end-line characters. If a line ends with a comment, the following end-line character is included in the comment. Otherwise the end-line character which terminates a line counts as whitespace.

The lexical analyzer appears as the routine 'get_tokens'in the following package, which is supplied with the SETL system.

package extended_SETL_lexer;
    const extra_opsigns := "~!@$%^&"    

    procedure get_tokens(stg);
            -- decompose a string into generalized SETL tokens

end extended_SETL_lexer;

package body extended_SETL_lexer;
    use rx,string_utility_pak;
    var lexer  := OM;

        -- the 'escaped' characters allowed within SETL strings
        -- are \\, \0, \n, \r, \f, \t, \",\xdd

    const D_hex := "\\\\x[0-9a-f][0-9a-f]";
                -- regular expression for hex strings
    const D_escaped := "\\\\f|\\\\0|\\\\t|\\\\n|\\\\r|\\\\\\\\|\\\\\\\"";
        -- regular expression for escaped special characters
    const D_opsigns_and_parens "`~!@#$%^&*()+={}[|':/?><,.[`;";
        -- regular expression for generalized opsigns/parens
    const D_opsigns := "([~!@#$%^&*+=:(){}[/?><.]|])";
        -- regular expression for generalized opsigns


    const quote := "\"";              -- quote character
    const quote_eol := "[\"\n\r]";
        -- quote and endline characters
    const not_quote_eol_char := "[^\"\n\r]";
            -- characters other than quotes and endlines
    const escaped_quote := "\\\\\\\"";
                     -- regular expression for quote mark preceded by backslash
    const start_quoted_string := 
    	quote + "(" + not_quote_eol_char + "|" + escaped_quote + ")+";
            -- regular expression for start of complete or incomplete 
            -- quoted string, without terminating quote or endline character

    const D_quoted := start_quoted_string + quote_eol;
        -- regular expression for complete and incomplete quoted strings
    const D_name := "[a-zA-Z]([[:alnum:]]|_)*";
            -- regular expression for SETL variable names
    const D_white := "( |\t|\r|\n)*";
                    -- regular expression for whitespace
    const D_num := "[0-9][_0-9]*(.[0-9][_0-9]*((E|e|e-|e-)[0-9]+)?)?";
            -- regular expression for real and floating numerals
    const D_based_num := "[0-9]+#[0-9a-z][_0-9a-z]*(.[0-9a-z]"
    		 + "[_0-9a-z]*#((E|e|e-|e-)[0-9]+)?)?";
          -- regular expression for real and floating numerals
          -- with non-decimal bases
    const D_not_monadic_end := "([~!@#$%^&*+=-?><.:/]*[*+=:?><./])";
    const D_opword := "([~!@#$%^&*+=(){},;[:/?><.]|]|" + 
			D_not_monadic_end + "|[a-zA-Z][[:alnum:]]*)";
							
    const ppe := {"procedure","program","class","package","use","var","const","end"};
                                        -- declaration headers and tails
    const icl:= {"if","case","loop"};
        -- control structure headers (omitting headers for iterative loops,
        -- which, as tokens, do not differ from variable names)

    procedure get_tokens(stg);
            -- decompose a string into generalized SETL tokens

        if  #stg = 0 then return []; end if;
                            -- null string case
        if  stg(#stg) notin "\n\r" then stg +:= "\n"; end if;
            -- add a terminating endline if needed
        
        if lexer = OM then
                -- we must compile the regular expression for the lexer
            regcomp(lexer,"(" + D_white + "|" + D_name + "|" + D_quoted + "|" + 
                    D_based_num + "|" + D_num + "|" + D_opword + ")",1);
                            -- use extended regular expression syntax
        end if;
        
        tokens := [ ];            -- will collect
        
        while stg /= "" loop
            if #stg > 1 and stg(1..2) = "--" then
                            -- we have a comment
                comment := break(stg,"\n\r");
                                     -- this runs to the line end
                comment := comment + span(stg,"\n\r \t");
                        -- and includes the line-end character
                tokens with := comment;
                                -- collect the comment string as a token
            elseif regexec(lexer,stg,1,pos) = 0 then 
                       -- the regular expression finds a token
                [strt,endd] := pos(1);
                                            -- get token start and end
                tokens with := stg(strt..(endd - 1) min #stg);
                    -- collect the token
                if endd = 1 then endd := 2;; end if;
                      -- if the token is a null string then take 1 char
                stg := stg(endd min (#stg + 1)..);
                                -- continue  with remainder of input string
            else                          
                                    -- if no token can be found,then exit
                exit;
            end if;
        end loop;

        return tokens;        -- return the list of tokens
        
    end get_tokens;        

end extended_SETL_lexer;

10.2.3 STRINGM_PAK (Exact and inexact string matching and alignment)

This native package provides a library of exact string matching algorithms (for matching either single patterns or patterns chosen from sets of such patterns), including all the best-known algorithms of this type. It also makes a variety of efficient match-with-errors and string-alignment procedures available.

The package specifier is

native package stringm;				-- string matching package
	
------------------------------------
-- Exact String Matching
------------------------------------

procedure kmp(string,pattern);
		-- Knuth-Morris-Pratt match of a pattern to a string
		-- returns tuple of all starting points of matches
procedure kmp_compile(pattern);	
	-- optimized Knuth-Morris-Pratt match; returns compiled pattern
procedure kmp_exec(string,pat);	
	-- optimized Knuth-Morris-Pratt match using pre-compiled pattern
	-- returns tuple of all starting points of matches

procedure bm(string,pattern);	
	-- Boyer-Moore match of a pattern to a string
	-- returns tuple of all starting points of matches
procedure bm_compile(pattern);	
	-- optimized Boyer-Moore match; returns compiled pattern
procedure bm_exec(string,pat);	
	-- optimized Boyer-Moore match using pre-compiled pattern
	-- returns tuple of all starting points of matches

procedure kr(string,pattern);	
	-- Karp-Rabin probabilistic match of a pattern to a string
	-- returns tuple of all starting points of matches

------------------------------------
-- Exact Matching to a Set of Strings
------------------------------------

procedure ac_compile(tuple);	
	-- Aho-Corasik match of list of patterns to a string; compiles pattern
procedure ac_init(pattern,text);
	-- Aho-Corasik match of list of patterns to a string; binds pattern to  text
procedure ac_next_match(pattern);
	-- Aho-Corasik match of list of patterns to a string; 
	-- returns next match 
	-- (as a triple [tuple_index_of_pattern_matched, lic,len]), or OM

------------------------------------
-- Exact Matching using Suffix Trees
------------------------------------

procedure st_compile(tuple);	
	-- suffix-tree match of list of patterns to a string; compiles pattern
procedure st_match(pattern,text);
	-- suffix-tree match of list of patterns to a string
	-- returns tuple of all match points 

------------------------------------
-- Edit Distance
------------------------------------
procedure edist(s1,s2);			
	-- edit distance between two strings
procedure exedist(s1,s2,w);		
	-- weighted edit distance between two strings
	-- the weights w should be supplied as an integer quadruple:
	-- [insert_weight,delete_weight,substitute_weight,match_weight]
procedure etrans(s1,s2);		
	-- best edit sequence between two strings
	-- returns a descriptor consisting of characters M (matching character)
	-- S (replace character) D (delete), I (insert), followed by integer distance 
procedure exetrans(s1,s2,w);	
	-- best edit sequence between two strings, using weights
	-- returns a descriptor consisting of characters M (matching character)
	-- S (replace character) D (delete), I (insert), followed by integer distance
	-- the weights w should be supplied as an integer quadruple:
	-- [insert_weight,delete_weight,substitute_weight,match_weight]

------------------------------------
-- Longest matching substring
------------------------------------
procedure lcseq(s1,s2);			
	-- finds longest matching subsequence of two strings; returns a pair
	-- [common_string,length]. Note that this is an increasing sequence, not
	-- necessarily contiguous

------------------------------------
-- Approximate matching operations
------------------------------------
procedure pwscores(a);
	-- compile a list of character-score triples into the form
	-- expected by simil and simlt
procedure simil(s1,s2,pws);	
	-- finds the 'optimal alignment distance' between two strings
procedure similt(s1,s2,pws);	
	-- finds the 'optimal alignment' of two strings, representing this
	-- as a descriptor of the same kind that 'exetrans' returns

end stringm;

Exact String Matching. Three exact matching procedures for matching to single strings, namely the Knuth-Morris-Pratt, the roughly equivalent Boyer-Moore procedure, and the potentially more efficient Karp-Rabin probabilistic algorithm constitute the first group of operations provided. The first two of these are made available in two forms, one of which matches a pattern string directly to a target, while the second begins by compiling the pattern into a set of tables and then matches this compiled form of the pattern to the target string. This second form is intended for use in situations in which a single pattern is to be matched repeatedly to the members of a collection of target strings.

A first example of the use of these routines is

	program test;      -- string  matching example 1
	    use stringm;
	    
	   print(kmp("ababaaba","aba"));  
	   print(bm("ababaaba","aba"));  
	   print(kr("ababaaba","aba"));  
	
	end test;

which returns

		[1, 3, 6]
		[1, 3, 6]
		[1, 3, 6]

in each case the starting characters of the three match positions. Note that the position '3' is found by all these procedures, even though it lies within the zone covered by the match to its left.

The 'precompiled' form of this same example is

	program test;      -- string  matching example 1
	    use stringm;
	   
	    print(kmp_exec("ababaaba",kmp_compile("aba")));  
	    print(bm_exec("ababaaba",bm_compile("aba")));  
	 
	end test;

which returns the same result.

Exact Matching to a Set of Strings. Two methods, the Aho-Corasik method and a method using suffix Trees, are provided to match a list of patterns to a target string. As seen in the following example, the Aho-Corasik scheme is set up as a triple of procedures. The first of these compiles a list of strings into a form suitable for efficient matching, the second binds the resulting pattern object to the target string to be matched, and the third acts something like a SET iterator, retuning a new match each time it is called, until a final call returns OM, indicating that no more matches exist.

	program test;      -- exact string matching to a list of patterns
	    use stringm;
	 
	    compiled_pat := ac_compile(["abc","aa","cde"]);
	            -- compile list of patterns       
	    ac_init(compiled_pat,"abaabcde");
	                          -- bind  to target string
	    print(ac_next_match(compiled_pat));
	                 -- extract successive matches
	    print(ac_next_match(compiled_pat));
	    print(ac_next_match(compiled_pat));
	    print(ac_next_match(compiled_pat));
	                    
	end test;

Since each successful call to 'ac_next_match' returns a triple of the form [index_of_matching_pattern,match_start,match_length] this example generates the output

		[2, 3, 2]
		[1, 4, 3]
		[3, 6, 3]
		<om>

Note that, even in the presence of overlapping matches, the algorithm reports every point at which one of the strings in the pattern list can be matched to the target string.

Approximate Matching.

This group of functions can be used to compute edit distances, longest common subsequences and string similarity.

Edit Distance. The edit distance between two strings is defined as the minimum number of edit operations needed to transform the first string into the second.

The edit operations considered In our implementation are insertions, deletions and substitutions. Two versions 'edist' and 'exedist' of the edit distance primitive are provided. The first of these computes edit distance using standard weights assigned to the edit operation. The second, extended version allows explicit assignment of a weight to each edit operation. These weights must be passed as a tuple of four integer weights having the form:

[cost_of_insertion,cost_of_deletion,cost_of_replacement,cost_of_match]

The default weights used in the standard version are simply [1,1,1,0].

The two routines 'etrans' and 'exetrans' work with the same edit distance as 'edist' and 'exedist', but also generate an edit transcript. This is a sequence of one-character edit operation codes in which D, I, S, and M designate Deletion, Insertion, Replacement, and Match respectively. The following example illustrates the use of 'etrans' and the way in which the edit transcript that it returns can be used to set up an alignment between two similar but not identical strings.

	program test;      -- approximate string matching
	    use stringm;
	 
	     print([transcript,dist] := etrans(s1 := "aaababab",s2 := "aababaaa")); 
	     print;	 -- get edit distance and transcript
	    
	    loc_in_s1 := 0; revised_s1 := "";
	    loc_in_s2 := 0; revised_s2 := "";
	    indicator_stg := "";
	 
	    for c in transcript loop
	
	        case c
	            when "M" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	            	revised_s2 +:= s2(loc_in_s2 +:= 1);
	                        indicator_stg +:= ".";
	            when "S" => revised_s1 +:= s1(loc_in_s1 +:= 1);
	            	revised_s2 +:= s2(loc_in_s2 +:= 1);
	                        indicator_stg +:= "|";
	              when "D" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	              	revised_s2 +:= "-";
	                           indicator_stg +:= "|";
	               when "I" => revised_s1 +:= "-"; 
	               		revised_s2 +:= s2(loc_in_s2 +:= 1);
	                    indicator_stg +:= "|";
	          end case;
	        
	    end loop;              
	
	    print(revised_s1); print(indicator_stg); 
	    print(revised_s2); 
	    
	end test;

This produces the output

			["MMIMMMSMD", 3]

			aa-ababab
			..|...|.|
			aababaaa-

The first line is the standard pair [edit_transcript,edit_dist] returned by 'etrans'. The three remaining lines are (i) the first string, with insertions marked by dashes; (ii) an indicator string, showing where the aligned first and second strings differ; (iii) the second string, with insertions marked by dashes.

By increasing the cost of insertions and deletions, as in the following variant, we can of course force a different alignment.

	program test;      -- approximate string matching
	    use stringm;
	 
	     print([transcript,dist] 
	     	:= exetrans(s1 := "aaabababb",s2 := "aababaaa",[3,3,1,0]));
	     	    -- triple the cost of insertions and deletions
	     print;         -- get edit distance and transcript
	    
	    loc_in_s1 := 0; revised_s1 := "";
	    loc_in_s2 := 0; revised_s2 := "";
	    indicator_stg := "";
	 
	    for c in transcript loop
	
	        case c
	            when "M" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	            revised_s2 +:= s2(loc_in_s2 +:= 1);
	                        indicator_stg +:= ".";
	            when "S" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	            revised_s2 +:= s2(loc_in_s2 +:= 1);
	                        indicator_stg +:= "|";
	              when "D" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	              revised_s2 +:= "-";
	                           indicator_stg +:= "|";
	               when "I" => revised_s1 +:= "-"; 
	               revised_s2 +:= s2(loc_in_s2 +:= 1);
	               indicator_stg +:= "|";
	          end case;
	        
	    end loop;              
	
	    print(revised_s1); print(indicator_stg); print(revised_s2); 
	    
	end test;

This generates the alignment

			["MMDMMMMSS", 5]

			aaabababb
			..|....||
			aa-babaaa
String comparison by maximum similarity after alignment. A more flexible way of measuring the relatedness of two strings is to the two strings must be align them in all possible ways, possibly inserting spaces. Then, given a similarity score for each possible pair of characters (including the nominal character '0', indicating a blank), we can use the sum of the scores for the alignment as the value of the alignment. The alignment of maximum score then measures the similarity of the two strings.

This approach is that embodied in the group of three procedures 'pwscores', 'simil', and 'similt'. pwscores(score_mat) accepts a scoring matrix as parameter and returns a compiled form of this matrix as a opaque 'score matrix object'. The scoring matrix should be supplied as a list of triples [char_1,char_2,similarity_value], where scores for the nominal character '0' indicating a blank are also required. As seen in the next example, the procedures 'simil', and 'similt' then use this 'score matrix object' to find the optimal alignment of two strings containing the characters for which similarity values have been assigned.

	program test;      -- approximate string matching using a similarity matrix
	    use stringm;
	 
	     pws_obj := pwscores([["a","b",1],["a","a",2],
	     	["b","b",2],[0,"a",0],[0,"b",0]]);
	                     -- compile character similarity matrix
	    
	    print([transcript,score] := similt(s1 := "aaabababb",
	    	s2 := "aababaaa",pws_obj));        -- use this to align strings
	      print;
	   
	    loc_in_s1 := 0; revised_s1 := "";
	    loc_in_s2 := 0; revised_s2 := "";
	    indicator_stg := "";
	         
	    for c in transcript loop
	        
	        case c

	             when "A" => revised_s1 +:= (c1 := s1(loc_in_s1 +:= 1)); 
	               revised_s2 +:= (c2 := s2(loc_in_s2 +:= 1));
	               indicator_stg +:= 
	                  if c1 = c2 then  "." else "|" end if;

	             when "D" => revised_s1 +:= s1(loc_in_s1 +:= 1); 
	                revised_s2 +:= "-";
	                indicator_stg +:= "|";

	              when "I" => revised_s1 +:= "-"; 
	                revised_s2 +:= s2(loc_in_s2 +:= 1);
	               indicator_stg +:= "|";

	          end case;
	        
	    end loop;      
	        
	    print(revised_s1); print(indicator_stg); print(revised_s2); 

	end test;

The code seen here generates the alignment

			["AADAAAAAA", 14]

			aaabababb
			..|....||
			aa-babaaa

Note that 'simil' returns only an optimal similarity score; 'similt', like 'exetrans', returns a pair [transcript,score].

The 'similt' alignment technique is commonly used in computational genomics to align sequences of characters representing the amino acids out of which proteins are built. The standard characters for the 23 amino acids occuring in ordinary proteins are "ABCDEFGHIKLMNPQRSTVWXYZ". A commonly used protein-alignment scoring matrix, the so-called 'Blosum-62' matrix derived by collecting statistics from large numbers of related proteins, is (in the same order of amino acids, and with very low scores understood for pairs containing blanks)

     4, 
    -2,  4, 
     0, -3,  9, 
    -2,  4, -3,  6, 
    -1,  1, -4,  2,  5, 
    -2, -3, -2, -3, -3,  6, 
     0, -1, -3, -1, -2, -3,  6, 
    -2,  0, -3, -1,  0, -1, -2,  8, 
    -1, -3, -1, -3, -3,  0, -4, -3,  4, 
    -1,  0, -3, -1,  1, -3, -2, -1, -3,  5, 
    -1, -4, -1, -4, -3,  0, -4, -3,  2, -2,  4, 
    -1, -3, -1, -3, -2,  0, -3, -2,  1, -1,  2,  5, 
    -2,  3, -3,  1,  0, -3,  0,  1, -3,  0, -3, -2,  6, 
    -1, -2, -3, -1, -1, -4, -2, -2, -3, -1, -3, -2, -2,  7, 
    -1,  0, -3,  0,  2, -3, -2,  0, -3,  1, -2,  0,  0, -1,  5, 
    -1, -1, -3, -2,  0, -3, -2,  0, -3,  2, -2, -1,  0, -2,  1,  5, 
     1,  0, -1,  0,  0, -2,  0, -1, -2,  0, -2, -1,  1, -1,  0, -1,  4, 
     0, -1, -1, -1, -1, -2, -2, -2, -1, -1, -1, -1,  0, -1, -1, -1,  1,  5, 
     0, -3, -1, -3, -2, -1, -3, -3,  3, -2,  1,  1, -3, -2, -2, -3, -2,  0,  4, 
    -3, -4, -2, -4, -3,  1, -2, -2, -3, -3, -2, -1, -4, -4, -2, -3, -3, -2, -3, 11,

     0, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -1, -1,  0,  0,
     		 -1, -2, -1, 
    -2, -3, -2, -3, -2,  3, -3,  2, -1, -2, -1, -1, -2, -3, -1, -2, -2, -2,
    		 -1,  2, -1,  7, 
    -1,  1, -3,  1,  4, -3, -2,  0, -3,  1, -3, -1,  0, -1,  3,  0,  0, -1,
    		 -2, -3, -1, -2,  4

Longest common subsequence. The final native procedure supplied by STRINGM_PAK is lcseq(s1,s2), which finds longest common subsequence of two strings, returning a pair [common_string,length]. Note that the sequence found is an ordered sequence of characters, not necessarily contiguous in either of the two strings being compared. That is, as the following example shows, the procedure finds a longest common subsequence, not the longest common run.

	program test;      -- longest common subsequence
    use stringm;
 
     print(lcseq("Now is the time for all good men","What hath God wrought?"));
            
	end test;

The pair ["ht a od ", 8] is returned.

String alignment using 'similt'.

	program test;                        -- string alignment using 'similt'
	    use stringm;
	    
	    scores_list := [["a","b",1],["a","c",1],["a","c",1],["a",0,-2],
	                    ["b","a",1],["b","c",1],["b","c",1],["b",0,-2],
	                    ["c","a",1],["c","b",1],["c","c",1],["c",0,-2],
	                    ["d","a",1],["d","b",1],["d","c",1],["d",0,-2]];
	    scores_obj := pwscores(scores_list);
	    print(similt("aabbccdd","aabbccdd",scores_obj));
	
	end test;

10.2.4 PARSER_PAK (Access to internal SETL Parse Functions)

PARSER_PAK provides a small but important collection of functions which give access to SETL's internal parse operations. Its specifier is

 native package parser;      -- package of SETL parse functions
	 procedure parse(str);
	    -- parse a string, returning its parse tree as a nested set of tuples
	    -- this handles all SETL 'compilation units'
	 procedure parse_expr(str);
	       -- parse a string, returning its parse tree as a nested set of tuples
	       -- this handles any string that is a valid 'program body'
	 procedure compile(str);
	         -- compile a string, writing result to current library
	 procedure setl_num_errors(); 
	      -- read number of errors generated during last previous
	      -- 'parse' or 'compile' operation
	 procedure setl_err_string(err_num);
	   -- get text of indicated error message generated during
	   -- last previous 'parse' or 'compile' operation
 end parser;
The 'compile' function can be used to compile any SETL string dynamically that could otherwise be compiled as a SETL program, package, or class.If compilation is successful, the byte-code produced by the compilation is written to the current library in the standard manner, as explained in Section XXX.

The 'parse' and 'parse_expr' functions. The 'parse' and 'parse_expr' functions give more intimate access to the internal structure of the parse trees generated by the SETL system as a compilation intermediate, and so can be used to support variant forms of SETL and other quite different systems which can make use of the SETL syntax. When a compilation succeeds, the 'parse' function returns the syntax tree of the code it has compiled. This is returned as a nested set of tuples, structured in the manner explained below. 'parse' can handle any SETL 'compilation unit', i.e. anything that the ordinary SETL compiler can handle. 'parse_expr' is a variant form of 'parse', designed to make it a bit easier to work with isolated expressions. 'parse_expr' can handle sequences of expressions and statements, which must end in semicolons. It works as follows: the string submitted to as argument to 'parse_expr' is 'wrapped' by prefixing it with an initial line 'program test;' and appending a final line 'end;'. The resulting string is then parsed as a SETL program. If this operation succeeds, the relevant part of the resulting parse tree is returned. Otherwise OM is returned,in which case the number of error messages which are generated is available via the procedure 'setl_num_errors', and the individual error messages via 'setl_err_string'. The following small example shows the kind of constructs that can be parsed successfully:

 program test;                        -- examples of successful parses
    use parser;

    print(parse_expr("x + 1; y + 1;"));
                  -- expressions can be parsed
    print(parse_expr("x + 1; x := y + 1;"));
             -- statements can be parsed
    print(parse_expr("for j  in [1..3] loop x + 1; x := {w * w: w in  y}; end loop;"));
    	                -- loops can be parsed
    print(parse_expr("x + 1; y + 1; procedure f(z); return  {z}; end f;"));
	   -- any 'program body' (hence package body and  class body) can be parsed
	
 end test;
The output produced by the preceding code is seen below. Details concerning the manner in which this output represents parse trees are explained a bit later.
	["ast_list", ["ast_add", "X", "1"], ["ast_add", "Y", "1"]]
	["ast_list", ["ast_add", "X", "1"], ["ast_assign", "X", 
		["ast_add", "Y", "1"]]]
	["ast_list", ["ast_for", ["ast_iter_list", 
		["ast_in", "J", ["ast_arith_tup", ["ast_list", "1"], "3"]]], 
			["ast_null"], ["ast_list", ["ast_add", "X", "1"], 
			["ast_assign", "X", ["ast_genset", ["ast_mult", "W", "W"], 
			["ast_iter_list", ["ast_in", "W", "Y"]], ["ast_null"]]]]]]
	["ast_list", ["ast_add", "X", "1"], ["ast_add", "Y", "1"]]
'parse' handles error messages in the same way as 'parse_expr' and 'compile'. Here is an example showing how error messages generated by failed parses can be recovered.
	program test;
       -- example of failed parse, and error message recovery
	  use parser;
	
	  print(parse_expr("for j in [1..3] loop"
	  	 + " x +, 1; x := {w * w: w inn y}; endloop;"));
	  print(ne := setl_num_errors());
	  
	  for j in [0..ne] loop print(setl_err_string(j)); end loop;

	end test;
The output produced by this example is
	
	3
	[2:26]         *ERROR* => syntax error at ,
	
	[2:29]         *ERROR* => syntax error at ;
	
	[2:46]         *ERROR* => syntax error at INN 

Parse tree structure. The parse trees returned by the 'parse' function reflect the nested structure of the inputs analyzed.

The overall list of statements is represented by a top-level list node of the form

[_list, repn_of_statement_1, repn_of_statement_2, ...]

Constructs involving built-in SETL operators and syntactic forms are represented by special nodes.

Variables and calls to functions other than built-in infix and prefix operators are represented using nodes of the form "_of, followed by the name of the function and its list of arguments. Examples are:

	b(x,y)		is represented by	["_of", "B", ["_list", "X", "Y"]]
	b()		is represented by	["_of", "B"]
	b;		is represented by	the list element "B"

As these examples show, in parse trees all names are capitalized; this gives SETL its case independence.

The node-types used are:

	[x,y]	is represented by	["_enum_tup", "X", "Y"]
	{x,y}	is represented by	["_enum_set", "X", "Y"]
	[x]	is represented by	["_enum_tup", "X", "Y"], etc.
	[ ]	is represented by	["_nulltup"], etc.
	[x..y]	is represented by
		["_arith_tup", ["_list", "X"], "Y"], etc.
	[x,y,z..w]	is represented by
		["_arith_tup", ["_list", "X", "Y", "Z"], "W"], etc.
			-- (note that the parser handles this general case)
	b(x..y)	is represented by	["_slice", "B", "X", "Y"]
	b(x..)	is represented by	["_end", "B", "X", "Y"]

	b := y	is represented by	["_assign", "B", "Y"]
			--- all the other assignment forms are derived from this.
			-- For example, 

	b(x..) := y	is represented by	["_assign", ["_end", "B", "X"], "Y"]
	b(x..y) := z	is represented by
		["_assign", ["_slice", "B", "X", "Y"], "Z"]
	b(x) := z	is represented by
		["_assign", ["_of", "B", ["_list", "X"]], "Z"]
	b(x,y)		is represented by
		["_of", "B", ["_list", "X", "Y"]]
	b(x,y) := z	is represented by
			["_assign", ["_of", "B", ["_list", "X", "Y"]], "Z"]
		-- superfluous parentheses are elided in the parse tree,
		-- simplifying it but creating some problems
		-- when parse trees must be back-correlated with
		-- the token tuples from which they originate.
The setl binary operators ** * / mod ? + - max min = /= < <= > >= in notin subset incs and or from fromb frome npow are represented by the following nodes:
 "_EXPON" "_MULT" "_DIV" "_MOD" "_QUESTION" "_ADD" "_SUB" "_MAX" "_MIN"
 "_EQ" "_NE" "_LT" "_LE" "_GT" "_GE" "_IN" "_NOTIN" "_SUBSET" "_INCS" "_AND" 
 "_OR" "_FROM" "_FROMB" "_FROME" "_NPOW"
The SETL monadic operators 'not - # arb domain range pow' are represented by the following nodes:
	"_UMINUS" "_NELT" "_ARB" "_DOMAIN" "_RANGE" "_POW"
Set and tuple formers like [x: y in s | condition] are represented by structures like
 	["_gentup", "X", ["_iter_list", ["_in", "Y", "S"], "CONDITION"],
		 etc.
If the CONDITION is omitted in the source text it is omitted in the parse tree also. If the leading expression is omitted, "_gentup" and "_genset" are replaced by "_gentup_noexp" and "_genset_noexp" respectively.

All such iterative constructs involve an '"_iter_list' node, whose elements are the successive iterators. These are all expressions, e.g. [y: y = f(x) | condition] is represented by

 	["_gentup", "Y", ["_iter_list", ["_eq", "Y", 
		["_of", "F", ["_list", "X"]]]], "CONDITION"].
Conversely the parser will accept arbitrary expressions in this position, e.g. [y: u + v = f/x | condition] is represented by
 	["_list", ["_gentup", "Y", ["_iter_list", ["_eq", ["_add", "U", "V"],
		 ["_div", "F", "X"]]], "CONDITION"]]
In the full SETL compiler such constructs are rejected during a post-parse phase; but programs which use the parser directly may find use for them.

Similar remarks apply to other allowed iterators, e.g. 'exists y = f(x) | condition' is represented by

 	["_exists", ["_ex_iter", ["_eq", "Y", 
		["_of", "F", ["_list", "X"]]]], "CONDITION"]
and likewise for 'forall y = f(x) | condition'. The 'for' iterator construct
 	for y = f(x) | condition loop a; b; c; end loop;
is represented by
 ["_for", ["_iter_list", ["_eq", "Y", ["_of", "F", ["_list", "X"]]]], 
		"CONDITION", ["_list", "A", "B", "C"]],
the body of the loop being represented by a final component of "_list" type. 'while' and 'until' iterators like
 	while condition loop a; b; c; end loop;
are similarly represented by
 	["_while", "CONDITION", ["_list", "A", "B", "C"]], etc.
Conditionals like
 	if c1 then a; b; c; elseif c2 then d; else e; end if;
are represented by
	["_if_stmt", "C1", ["_list", "A", "B", "C"], 
		["_if_stmt", "C2", ["_list", "D"], ["_list", "E"]]]
and the corresponding conditional expressions, e.g.
 	if c1 then a elseif c2 then d else e end if;
by
	["_if_expr", "C1", "A", ["_if_expr", "C2", "D", "E"]].
A 'case' statement like
 		case a when b,b2 => c; otherwise => d; end case;
is represented by
 ["_case_stmt", "A", 
		["_list", ["_when", ["_list", "B", "B2"], ["_list", "C"]]], 
			["_list", "D"]]
If the 'otherwise' clause is omitted so is the final component of the subtree representing the case statement, e.g
 	case a when b,b2 => c; when b3 => d; end case;
is represented by
 ["_case_stmt", "A", 
		["_list", ["_when", ["_list", "B", "B2"], ["_list", "C"]], 
			["_when", ["_list", "B3"], ["_list", "D"]]]]
similarly case expressions like
 	case a when b,b2 => c otherwise => d end case
are represented by trees like
 	["_case_expr", "A", ["_list", ["_when", ["_list", "B", "B2"], "C"]], "D"]
Alternative case statement forms like case when b => c; otherwise => d; end case; are represented by trees like
 	["_guard_stmt", ["_list", ["_when", "B", ["_list", "C"]]], ["_list", "D"]];
Again, omission of the 'otherwise' clause simply causes omission of the final component of the corresponding subtree.

Alternative case expression forms like

 	case when b => c otherwise => d end case;
are represented by trees like
 	["_guard_expr", ["_list", ["_when", "B", "C"]], "D"]
The 'return' statement is treated like a monadic operator, so
	return x	is represented by	["_return", "X"]
	return 		is represented by	["_return"]		
Similarly
	exit 		is represented by	["_exit"]		
	continue 	is represented by	["_continue"]	
	stop 		is represented by	["_stop"]	
Assignment targets like [a,-,b,-,-] containing placeholders are represented by nodes like
 	["A", _placeholder, "B", _placeholder, _placeholder]
Since no semantic checks are performed by the parser, constructs of this kind are allowed in any position.

Quoted strings within Setl Expressions are represented by themselves, with their enclosing double quotes.

Using the Parser as a front end to other systems requiring parsed input

Since no semantic checks are performed by the parser, it can handle a somewhat wider range of constructs than the SETL language proper allows, including various expression forms useful in a logic system.

This includes Iterators without restrictions in set formers and quantifiers, e.g {x + y: x, y in z | x > 0}, unrestricted existentals and universals like 'exists x, y in z | x > 0' and 'forall x, y in z | x > 0', and even mover general constructs like 'forall OM | expn(x,y,z,u,v)', which can be used to represent universal quantification all the variables in a general expression expn(x,y,z,u,v). These are seen in the following example

	program test;     -- parsing of generalized SETL-like expressions
	    use parser;
	
	    print(parse_expr("{x + y: x, y in z | x > 0};"));
	            -- set former with unrestricted 'iterator'
	    print(parse_expr("exists x, y in z | x > 0;"));
	                -- existential with unrestricted 'iterator'
	    print(parse_expr("forall x, y in z | x > 0;"));
	                -- universal with unrestricted 'iterator'
	    print(parse_expr("forall OM | expn(x,y,z,u,v);"));
	            -- universal over 'all free variables'
	end test;
The parse-trees produced are
	["ast_list", ["ast_genset", ["ast_add", "X", "Y"], 
	["ast_iter_list", "X", ["ast_in", "Y", "Z"]], ["ast_gt", "X", "0"]]]
	["ast_list", ["ast_exists", ["ast_ex_iter", "X", ["ast_in", "Y", "Z"]], 
		["ast_gt", "X", "0"]]]
	["ast_list", ["ast_forall", ["ast_iter_list", "X", ["ast_in", "Y", "Z"]], 
		["ast_gt", "X", "0"]]]
	["ast_list", ["ast_forall", ["ast_iter_list", "OM"], ["ast_of", "EXPN", 
		["ast_list", "X", "Y", "Z", "U", "V"]]]]

Dynamic compile using the 'compile' function.

The 'compile' function in PARSER_PAK compiles its string argument, which can be any program, package, package body, class, or class body. If compilation succeeds, the resulting SETL bytecode is entered in the normal way into the current setl library (see Section XXX), from which it can be loaded dynamically using the 'library_package' (see Section XXX) and 'reload_library_package' (see Section 10.2.5) operations. This makes it possible to compile new bytecode form a dynamically created string and then execute the result as part of the SETL program which created the string.

If compilation fails, error messages can be recovered using 'setl_num_errors' and 'setl_err_string' in the manner already explained in connection with our discussion of the 'parse' operation.

Here is an example of the use of 'compile' to generate and execute a series of entirely different procedures, all called 'changes_dynamically'.

	package test_pak;
	    procedure changes_dynamically(x);
	            -- declare a procedure which  will change dynamically
	end test_pak;
	
	program test;    -- parsing of generalized SETL-like expressions
	    use parser,ide_pak;
	            -- define header and trailer lines for
	            -- packages and procedures to be compiled
	    var changes_dynamically;
	        -- declare global name for the procedure to be created dynamically
	    
	    header_line := 
	       "package body test_pak;\nprocedure changes_dynamically(x);\n";
	    trailer_line := "end changes_dynamically;\nend test_pak;";
	
	    proc_bodies := ["return x + x;\n",
	    	"return str(x) + str(x);\n",
	    	"return cos(float(x));\n"];
	            -- these three strings  will be used as procedure bodies
	
	    for pbody in proc_bodies loop
	        res := compile(header_line  + pbody + trailer_line);
	            -- wrap the string containing the procedure body,
	            -- and compile it
	        pak := reload_library_package("TEST_PAK");
	            -- force load of the newly compiled package body
	        changes_dynamically := pak("CHANGES_DYNAMICALLY");
	                -- extract the 'changes_dynamically' routine
	        print(my_procedure(2));
	                -- call a static routine that uses 'changes_dynamically'
	    end loop;
	    
	    procedure my_procedure(x);       -- auxiliary static routine
	    	return changes_dynamically(x);
	    end my_procedure; 
	
	end test;
The three entirely different results produced by the three quite different routines ultimately called are:
		4
		22
		-0.41614683655
Note the following facts about the preceding example: (i) Both the name "TEST_PAK" of the package with which we work and the name "CHANGES_DYNAMICALLY" of the procedure retrieved from it must be given as uppercase strings, since (a) this is what 'reload_library_package' and (b) 'reload_library_package' produces a map from capitalized procedure names to the corresponding procedure object. (ii) in order for 'library_package' or 'reload_library_package' to find a package, the package's specifier must have been compiled previously.



10.2.5 IDE_PAK (Interface to the SETL Development Environment)

IDE_PAK provides a interface to the SETL development environment, making it possible for SETL programs to (i) read and manipulate the development environment edit window, including mouse selections made in that window; (ii) read and manipulate the current SETL library list, and to force reload of packages newly compiled into such libraries; (iii) trigger reruns of previously compiled SETL programs.

These operations allow the IDE edit window to be used as an input console and are crucial to the programming of SETL Help Scripts of the kind described in Section XXX.

The IDE_PAK specifier is

native package ide_pak;			
	       -- procedures useful only in programs invoked by Help scripts

    procedure get_buffer();
		-- get contents of edit window
    procedure get_length();
		-- get length of string in edit window
    procedure set_buffer(from_pos,to_pos,string);
		-- insert string into edit window
    procedure get_selection();
		-- get boundaries of current selection in edit window
    procedure set_selection(from_pos,to_pos);
		-- set boundaries of current selection in edit window
    procedure get_position();
    procedure get_parameter();
		-- Help Scripts, especially those of the form <SCRIPT=Personal4>,
        -- can be preceded by a text line which serves as a personalizing
        -- parameter. 'get_parameter()' returns this line
    
	                -- procedures of general utility
    procedure ide_rerun(commandline);
    procedure get_library_list();
		-- get current library list, as a comma-separated string
		-- of file names
    procedure set_library_list(liblist);
        -- set current library list, from a comma-separated string
        -- of file names
    procedure reload_library_package(pak);
		-- force  reload of the named package,which may have been
		-- newly compiled

end ide_pak;

We now give a variety of examples illustrating the use of these operations. Since the operations of the first group all refer implicitly to the contents of a SETL IDE edit window, they must be executed as SETL Help Scripts, by dragging a Help item into such a window, in the manner explained in Section XXX. For this it is convenient to use one of the 4 special help items available under the Help Panel tools tab as 'Personal1'-'Personal4'. Each of these is set up so that when dropped on a SETL IDE edit window it executes a program of the same name (which must have been precompiled). For example, the 'Personal1' Help Script is set up as

	<Personal1>
	ANY PARAMETER STRING YOU LIKE
	<SCRIPT=Personal1>

so that when dropped on an edit window it executes the program named 'Personal1' (which must have been precompiled). Suppose that this has been supplied as

 program Personal1;                        -- 13:59:58
    use ide_pak;

    stg := get_buffer();               -- read edit window contents
    print(get_length()," ",get_selection());
         -- read length  of edit window contents, plus selection boundaries
    print(get_position()," ",get_parameter());
       -- read insert cursor position, plus Help script parameter
       -- (from Help script)
    set_buffer(28,35,time());
       -- write time to edit window
 end Personal1;

and that the 'Personal1' Help item is dropped on an Edit window containing the text of this small program when characters 87-97 of the Edit window have been selected. Then the output printed will be

	461 [87, 97]
	462 ANY PARAMETER STRING YOU LIKE

and the current time will replace the original '13:59:58' in the edit window.

As a second example we give the script of the 'capitalize' Help tool item, which is simply

 program test;
	use ide_pak,string_utility_pak;
	    [ix1,ix2] := get_selection();
	            -- read edit window selection boundaries
	    if abs(ix2 - ix1) /= (ix2 - ix1) then
	     	[ix1,ix2] := [1,get_length()];
	     end if;
	                -- if no selection, take whole buffer
	    buffer_part := get_buffer()(ix1..ix2);
	            -- read selected part of edit window
	    set_buffer(ix1,ix2,case_change(buffer_part,"lu"));
	            -- replace by capitalized version
 end test;

The following example shows the use of 'set_library_list' and 'get_library_list'. In order to minimize the risk of including a nonexistent library file in the library list (which would disable SETL) we set the library list to the appropriately comma-separated but redundant 'setl2.lib,setl2.lib'. The output printed by 'print(get_library_list());' is then of course 'setl2.lib,setl2.lib'.

 program test;                        -- library-list manipulation
    use ide_pak;

    set_library_list("setl2.lib,setl2.lib");
    print(get_library_list());

 end test;
An example of the use of reload_library_package is given in the preceding section.

10.2.6 SETL's Interface to UNIX

SETL's native package 'unix_pak' makes the facilities of Unix available to SETL programs on systems allowing this. For example, Unix operations are available through SETL on the Macintosh OS X operating systems and in various unix environment which use X-windows as the basis for their graphical user interface.

10.2.7 PYTHON_PAK (Python language Interface)

This small package typifies the small interfaces that can be used to give SETL access to other fully dynamic languages, i.e. languages which support dynamic execution of statements in them. Unless very high efficiency is needed, the interface can consist of just one command which must be added to the external language, and two SETL routines. The command C added to the external language can simply pass a string back to SETL. One of the two SETL interface procedures should then which pass commands from SETL to the external language. The second must capture the string passed by the command C added to the external language.

Note that to use this scheme, one must prepare a slightly modified version of the external language, embodying the added command C mentioned above.

python_pak is just such an interface, for the increasingly popular Python language. The modified version of the Python interpreter that is needed is supplied with the SETL system as the binary module 'PythonCore'. This must be placed in the same folder as the SETL interpreter.

Note that when running in default mode, Python opens an annoying console window that SETL does not use and that the SETL-associated version of Python will only use when it needs to write error messages. To prevent this window from appearing unnecessarily you should invoke the 'EditPythonPrefs' utility that comes with Python and check the 'Delay Console Window Until Needed' option offered under the 'Default Startup Options' section of the Python preferences dialog.

The python_pak specifier is simply

	native package python;        -- SETL interface to Python language
	    procedure python_exec(command);
	           -- pass command string to Python for execution      
	    procedure python_getstring();
	           -- gets string returned to SETL by setl.pass_string(stg)
	end python;

The command added to the modified 'PythonCore' Python interpreter supplied with SETL is

			setl.pass_string(stg)
Here, 'stg' can be any string_valued expression in the Python language. This command converts it to a SETL string, which is returned to SETL. Note that this command implicitly uses Python's object syntax; 'setl.pass_string' is a normal Python method call, addressed to the object 'setl' which is built-in to 'PythonCore'.

Using these primitives it is easy to set up SETL procedures which make it easy to use Python, and in particular its extensive built-in libraries. Two such procedures are

	procedure python_calc(expn);
        -- evaluate a Python expression, returning the value as a string
	   -- request Python to execute 'result = expn', passing 'result' to SETL
	   python_exec("result = " + expn + "\nsetl.pass_string(str(result))");
	    return python_getstring();
	              -- return value of Python expression 'str(result)'  to SETL
	end python_calc;

	procedure python_val(vname);
	            -- get the value of the indicated Python variable
	    python_exec("setl.pass_string(str(" + vname + "))"); 
	    return python_getstring();
	             -- return value of Python expression 'str(vname)'  to SETL
	end python_val;

An extensive collection of Internet access an file/folder manipulation utilities, built on the corresponding Python operations, is then available in the package net_utilities supplied with the SETL system, The specifier for this package is

 package net_utilities;
            -- package of Python-derived internet utilities
    
    var telnet_prompt := "%";            -- the telnet prompt character 

    procedure net_init();
           -- obligatory initialization
    procedure python_calc(expn);
           -- evaluate a Python expression, returning the value as a string
    procedure python_val(vname);
           -- get the value of the indicated Python variable
    procedure http_get(url);
           -- transmit http request in 'get' form
    procedure http_post(url,stg);          
           -- transmit http request in 'post' form ('old' version)
    procedure http_post_multi(url,map);  
           -- transmit http request in 'post' form ('multi' version)
    procedure ftp_read(url);             
           -- transmit request for ftp file read by blocks
    procedure ftp_read_lines(url);         
           -- transmit request for ftp file read by lines
    procedure ftp_list(url);             
           -- transmit request for ftp file-listing of directory
    procedure mail_send(host,who_from,whoall_to,subj,msg);     
           -- transmit request for mail send
    procedure mail_summary(mail_host_user_pwd,nlines); 
           -- transmit request for summary of mailbox
    procedure read_gzip(file_name);      
           -- read a Gzipped file

                        
           -- ******** telnet routines ********

    procedure telnet_open(host);        
           -- open connection to telnet host; name can be qualified by port
    procedure telnet_close(host);        
           -- close connection to telnet host; 
    procedure telnet_read_until(host,stg);
           -- read telnet stream up to indicated  marker
    procedure telnet_expect(host,regex_list,secs);
           -- read telnet stream up to regexp, or timeout
    procedure telnet_read_timeout(host,stg,secs);
           -- read telnet stream up to marker, or timeout
    procedure telnet_write(host,stg);    
           -- write telnet command
    procedure telnet_response(host,stg);
           -- write telnet command and get response
    procedure telnet_put_file(host,file_name,dest_file_name);    
           -- send a text file via telnet
    procedure telnet_put_stg(host,dest_file_name,stg);            
           -- send a string to start a file via telnet
    procedure telnet_add_stg(host,dest_file_name,stg);            
           -- send a string to start adding to a file via telnet
    procedure telnet_end_stg(host);                                
           -- end a string sent via telnet
    procedure telnet_continue_stg(host,dest_file_name,stg);            
           -- add a string to the end of a file via telnet
    procedure telnet_get_file(host,source_file_name,dest_file_name);    
           -- get a text file via telnet
    procedure login(host,who,pwd);        
           -- telnet login procedure
--   procedure sanitize(stg);            
           -- sanitize a string containing quotes and backslashes

                        
           -- ******** file utilities ********

    procedure file_chdir(path);            
           -- change current working directory to "path"
    procedure file_getcwd();            
           -- get current working directory
    procedure file_listdir(path);        
           -- get contents of indicated directory
    procedure file_mkdir(path);            
           -- make a new directory
    procedure file_remove(path);        
           -- remove a file
    procedure file_rmdir(path);            
           -- remove a directory
    procedure file_rename(src,dst) ;    
           -- rename a file or directory
    procedure file_stat(path);            
           -- file status tuple, in order st_mode, ino, dev, 
           --nlink, uid, gid, size, atime, mtime, ctime
    procedure file_kind(path);            
           -- returns 1 of directory, 0 for path
    procedure file_set_times(path,creation,modif);
           -- set file creation and modification times
    procedure file_set_creator_type(path,creator,ftype);
           --set file creator and type 4-char Macintosh fields

    procedure python_ctime(floating_secs);    
           -- convert python floating point time to standard rep 

 end net_utilities;
(The commented-out lines in this list document a few procedures available in the body of package net_utilities, but not declared for external use). As seen, these fall into roughly a half-dozen categories. The 'http' routines 'http_get' and 'http_post' retrieve Internet files identified by URL's, the first of these by the 'get' and the second by the 'post' method. (The 'post' method is often associated with retrieval of files generated dynamically on the server supplying them.) The strings that these operations retrieve are essentially identical with the text files that would appear if the same url is examined using the 'view text' option of a standard browser.

The 'http' routines. It is generally not hard to examine HTML pages visually and unravel their structure. Done successfully, this makes many Web information services available for use as a kind of subprocedure for programs you may wish to construct. For example, the well-known 'CNNfn' web site makes frequently updated stock- and bond-market quotes available in 4 forms: Dow-Jones average, NASDAQ average, Standard&Poor average, and Bond average. Suppose we want to access this data for continual use within a program (perhaps a Web application) that we are writing. As of Feb. 2001, CNN is putting this data into an HTML table with the HTML comment of the form '' near the start of the table (as can be seen by inspecting the HTML source text of their page). Hence we can begin to locate our desired data by taking the page text between the first occurrence of '' and the next following occurrence of ''. These occurrences are located by the code lines

	stg :=  http_get("www.cnnfn.com");		-- download cnnfn home page
	dowlocs := kmp(stg,"");		-- get occurrence locations of '' 
	etlocs := kmp(stg,"");			-- get occurrence locations of '' 
If we print the variables 'dowlocs' and 'etlocs' the result is
	[14844, 19200] 
	[2043, 3186, 3270, 3913, 12619, 14589, 20521, 26362, 30704, 30722, 30950, 
	   33395, 33417, 34608, 35112, 35181, 42182, 42206, 51780, 53415, 54100, 
		56451, 56474, 56513, 56535, 57498]
This makes it clear that '' has only two occurrences, both within the HTML table containing the data we are interested in. Hence
	exists loc in etlocs | loc > dowlocs(2)
will locate the end of this table. Breaking the range of text from the first occurrence of '' to the end of the table into lines and examining the result, we easily see that the desired data is found in lines 4,8,12,16,27,28,30,31,33,34,36 and 37 of the table. The following program assembles these observations into code that extracts the stock-market data we want from the 'CNNfn' home page.
program test;                    -- getting market quotes from cnnfn website
    use python;
    use net_utilities,stringm,string_utility_pak;
    const market_names:= ["DOW","NASDAQ","S&P","BOND"];

    stg :=  http_get("www.cnnfn.com");        -- download cnnfn home page
    dowlocs := kmp(stg,"");        -- get occurrence locations of '' 
    etlocs := kmp(stg,"");            -- get occurrence locations of '' 
    if #dowlocs /= 2 or not (exists loc in etlocs | loc > dowlocs(2))  then
        -- assumed cnnfn page format  has changed
        print(" ********** CNNFN page format has changed **********"); stop;
            -- give warning and stop
    end if;

    tup := breakup(stg(dowlocs(1)..loc),"\n\r");
            -- break page section between '' and '' into lines

    tup := [tup(j): j in [4,8,12,16,27,28,30,31,33,34,36,37]];
            -- get lines containing desired data
    data := [back3(line): line in tup];
                     -- extract numerical data from right end of lines
    
    for j in [1..4] | data(2 * j + 4)(1) = "-" loop
        data(j) := "-" + data(j);       -- capture signs for first 4 data items
     end loop; 
    
    for mname = market_names(j) loop 
       j2 := 2 * j + 3; 
       print(mname," ",data(j2)," ",data(j)," ",data(j2 + 1)); 
    end loop;
    
    procedure back3(stg);
            -- extract numerical data from right end of lines;
            -- remove other HTML clutter
        for j in  [1..3] loop
            rbreak(stg,"<");    -- back over 3 HTML tags
            rmatch(stg,"<"); 
        end loop;

        data := rbreak(stg,">");
               -- extract numerical data, back to next tag on right
        data := suppress_chars(data,"nbsp;");
                          -- suppress 'nbsp;'
        return "" +/ [if c = "&" then " "  else c end if: c in data]; 
                -- change remaining '&' signs to  blanks
    end back3;

end test;
Note that the data tuple as first constructed has the form
	[" 43.45", " 61.94", " 11.51", " 1/8", "10903.32", 
		"-0.40%", "2427.72", "-2.49%", "1318.80", "-0.87%",
		 "99 9/32", " 5.42%"]
The first 4 items in this tuple lack the '-' sign that they need if the market is falling. (This is because these signs appear in cnnfn's page as icons). But the necessary signs are readily extracted from the corresponding percentage changes in data items 6, 8, 10,and 12. Doing this, our data tuple reads
	["- 43.45", "- 61.94", "- 11.51", " 1/8", "10903.32", "-0.40%",
		 "2427.72", "-2.49%", "1318.80", "-0.87%",
		  "99 9/32", " 5.42%"]
As of the early morning of Feb. 14, 2001 the output of the above program is
	DOW 10903.32 - 43.45 -0.40%
	NASDAQ 2427.72 - 61.94 -2.49%
	S&P 1318.80 - 11.51 -0.87%
	BOND 99 7/32  3/16  5.42%
As of 10:13 AM of Feb. 14 it produces
	DOW 10855.29 - 48.03 -0.44%
	NASDAQ 2421.29 - 6.43 -0.26%
	S&P 1312.06 - 6.74 -0.51%
	BOND 99 5/16  3/32  5.42%
(This illustrates the remark allegedly made by J.P. Morgan when asked to predict the future course of the stock market; "The market will fluctuate.").

Here is another example, The 'cnnfn' page we have been discussing also provides a stock quotation service. One can enter comma-separated 3-letter codes for several stocks (e.g. IBM,CNN) into a small text box and click on a 'Go' button. An HTML page containing the requested data is then generated and returned. If this is done in a browser it will be seen that the URL being requested is simply

qs.cnnfn.cnn.com/tq/stockquote?symbols=ibm%2Ccnn

That is, the parameter for the request (the part of this URL that follows the separating '?' character) is simply 'symbols=ibm%2Ccnn', essentially the two stock symbols separated by the 'Internet sanitized' form '%2C' of the comma character. This easy visual inspection makes it plain that the 'http_get' call

http_get("qs.cnnfn.cnn.com/tq/stockquote?symbols=ibm%2Ccnn")

should retrieve this same data, as indeed it does. It is generally easy to read 'get'-generated URLs in just this way and so to write small routines which make equivalent Web services available as SETL procedures built using 'http_get'.

http_post The http_post operation has the form

http_post(url,stg)

It is generally used when a relatively large 'stg' parameter needs to be transmitted along with a Web URL request. In interactive pages, this data is commonly collected from For example, the cnnfn pare we have been discussing

The 'ftp' routines. 'ftp', or 'file transfer protocol' is a standard communication convention for the download of substantial files. Many free file collections accessible by ftp are available on the Web. The ftp_read, ftp_read_lines, and ftp_list procedures make these files available to SETL. The first two of these routines have the form ftp_read(url) and ftp_read_lines(url), and make the requested file available as a SETL string. For example, many files relevant to the Python system (including full source files) are available via ftp at www.python.org/pub/python/, and either of the expressions

		ftp_read("www.python.org/pub/python/README")
or
		ftp_read_lines("www.python.org/pub/python/README")
will retrieve the small README file available at that location. The two procedures differ only in the manner that the file is fetched over the Web, and so both produce the same result. ftp_read_lines requests that the file be read by lines, and ftp_read that it be read by blocks. The second method should be substantially more efficient for large files and more appropriate for binary files.

The ftp_list procedure has the form ftp_list (url), but here the url should reference a directory of other files rather than a bottom-level file. For example, a directory of python binaries is available at ftp://www.python.org/bin/ftp-exec/, so

		ftp_list("www.python.org/bin/ftp-exec"
can be used to retrieve a UNIX-form listing of this directory, which appears as
	drwxr-xr-x   2 root     bin          512 May 19  1998 .
	drwxr-xr-x   3 root     wheel        512 May 19  1998 ..
	-r-xr-xr-x   1 root     bin        17500 Jun 11  1997 ls
	-rwxr-xr-x   1 root     bin          163 Jun 11  1997 rmake
	-rwxr-xr-x   1 root     bin         1022 Jun 11  1997 sigjob
	-rwxr-xr-x   1 root     bin          851 Jun 11  1997 sigjoblog
	-rwxr-xr-x   1 root     bin           19 Jun 11  1997 tme
	-r-xr-xr-x   1 root     bin         6816 Jun 11  1997 touch
The 'mail' routines. The 'telnet' routines.

As noted in in Section 10.19 of the preceding Chapter, all higher-level Internet communication function sare built upon the 'socket' mechanism, which provides the basic capability of sending byte-streams reliably between two computers, with certain (but not detailed) notification of communication failures.

Once socket communication has been opened between two computers, communication proceeds by the exchange of byte streams, which can be thought of as strings.

To understand the issues which any such conversation must involve we can think of the computer-to-computer socket communication as an exchange between two persons, initiated when a phone rings and one of them picks it up. The person called is normally expected to say something, as an indication that communication has been initiated and that further responses will be forthcoming. Whatever is said is expected to be in an understandable language. If after calling someone what you hear on the telephone is a mysterious series of clicks or beeps, or something in a completely unknown language, for example, 'Nov Schmoz kapop', you will normally hang up rather than trying to proceed with the conversation. Likewise, if the person called says 'Hello, who is this?', and the caller responds only with things like 'Nov Schmoz Kapop', the person called will normally terminate the conversation. Only when the conversation proceeds in conformity with mutually-understood conventions can it successfully continue.

A second issue has to do with the termination of messages coming from one or the other side. Suppose somebody telephones you and that you respond by saying 'Hello, who is there?' The caller then begins to speak in perfectly comprehensible English, but then goes on endlessly without ever stopping, or still worse, pauses occasionally, but without ever clearly finishing and then resumes. In this case, sensible conversation is impossible since one never knows when a response is called for. This shows that productive bilateral communication must depend on clear, mutually understood signals defining the termination of messages.

A final consideration relates to communication delays. Suppose a phone conversation goes like this:

To recover from such situations the most sensible procedure is to wait for some reasonable length of time and then hang up if no response is received.

The telnet routines lie closer to the raw process of socket communication than do the other Web-related routines provided by NET-UTILITIES and therefore must handle all of these conversation-coordination problems. Telnet communication proceeds by the exchange of punctuated strings. Since the punctuation characters used play a special role they cannot be used in quite the ordinary way and so must be specially 'sanitized' when they appear in ordinary strings. A subroutine 'sanitize' is provided for this purpose, (but is used internally by all the other routines, and so need not be applied explicitly.)

'telnet_open' initiates communication with a specified host; 'telnet_close' ends this communication. 'telnet_write' sends a string to a designated host and 'telnet_read_until' reads the response from this host. However, since we must be able to recognize the punctuation mark which ends the host's response 'telnet_read_until' has a second parameter 'stg' defining the sequence of characters in the host's response which signal the end of this response. This operation is provided in three variant forms: 'telnet_read_until' waits for the occurrence of a single specified string in the response stream and is prepared to wait forever. (Waiting forever is only a sensible thing to do in a multi-threaded environment. In a single threaded SETL environment waiting forever will hang SETL and possible also the system under which it is running.) The 'telnet_read_timeout' version of 'telnet_read_until' has a third time-limit parameter, given in seconds which indicates how long one should wait until hanging up if a complete response has not yet been received. The third variant 'telnet_expect' of the same primitive allows one to wait, not just for the occurrence of a specific string in the response stream, but for any string matching a specified regular expression. This gives one a way of waiting for the occurrence of any one of a collection of strings, any one of which might signal the end of a response. For example, if one is communicating with a UNIX system and uses the UNIX 'more' operation, the response generated will end with one kind of prompt character if the whole of a requested file is returned, but with a different prompt character if only part of the requested file has been sent. So the resulting communication can end with either of these characters; one must be prepared to resume one's own end of the conversation in either case.

One additional procedure in the group of telnet routines, 'telnet_put_file' transmits a text file via telnet, simply by sending its lines one after another.

It is instructive to trace through the series of communications which occur when one accesses a remote computer via telnet and logs in the ordinary way. One begins by signaling the remote computer via 'telnet_open'. It is then the remote computer's responsibility to send back a message, which can consist of any amount of announcement and advertising material that it wants to send, but which must end with the word 'login:'. This word is understood to mark the end of the remote computer's initial response, and so cannot occur elsewhere in this response. Thus to get this initial response one executes 'telnet_read_until' with 'login:' as its second parameter. Once the whole of this login request has been received, it is the calling client's computer's responsibility to send an account name. The remote computer will respond to this account name by sending a second message, which again can be anything, but which must end in the word 'Password:', which signals that a password is expected. The 'client' computer must then reply to this response by sending an acceptable password. Once this is done the remote computer will check the password, and respond with one of two types of messages. The login request may be rejected because the account does not exist or the password is wrong, but if the account exists and the password is correct it is the remote computer's responsibility to reply with a message ending in a mutually-understood prompt character or sequence, typically a '$' or '%'. At this point the client will understand that it has been put in communication with whatever type of 'shell program' the remote computer uses to deal with remote logins. Typically this is a UNIX shell. From this point on communication must proceed in conformity with the command line conventions defined by this UNIX shell. That is, the client computer is expected to send valid UNIX commands or other character sequences defined by UNIX, and the remote computer is expected to respond in the standard UNIX fashion using messages that always end with standard prompt characters or strings.

The remaining routines of the 'telnet' group are utility composites which could be programmed using the routines we have described already, but which are provided to facilitate common operations. 'telnet_put_stg' opens a file on the remote computer and writes a string to it; 'telnet_add_stg' adds a specified string to a file that has already been created. 'telnet_continue_stg' . 'telnet_end_stg' signals the end of a string being sent via telnet. Examples illustrating the use of these operations are given below. 'telnet_get_file' reads an entire file via telnet. 'login' is a utility procedure which handles all of the steps of a login operation, given the Web address of the remote host, plus an account name and password.

The following example shows the use of some of the main telnet primitives. Note that to execute it successfully, you will need to change the values shown for 'host', 'account_name', and 'password' to whatever values are appropriate for somehost, account, and password to which you have access. The telnet_prompt character will also have to be changed if the host system you are accessing uses a different prompt.

 program test;   -- communication  via telnet
     use net_utilities,get_lines_pak;
 
     host := "falce.mrl.nyu.edu"; account_name := "setl";   
	  -- define host to  be  used,  and account name
     password := "nine98"; telnet_prompt := "$";    
	  -- define password,  and expected prompt
 
     telnet_open(host);    
	  -- open communication with the host
     print(telnet_read_until(host,"login:"));
	  -- wait for the "login:" response
     telnet_write(host,account_name); 
	  -- send the account name
     print(telnet_read_until(host,"Password:"));
	  -- wait for the "Password:" response
     print(telnet_response(host,password));    
	  -- send the password and wait for the standard "$" prompt
 
     print(telnet_response(host,"ls")); 
	  -- send a Unix 'list' command
     telnet_put_stg(host,"myfile","contents  for myfile!!!");      
	  -- create a Unix file and  write  to it
     telnet_add_stg(host,"myfile"," more contents for myfile!!!");  
	  -- add to the Unix file
     telnet_end_stg(host);      
	  -- close the Unix file
     print(telnet_response(host,"cat < myfile"));
	  -- read the Unix file
     print(telnet_response(host,"ls")); 
	  -- send a Unix 'list' command

     telnet_get_file(host,"myfile","myfile_here");   
	  -- fetch the Unix file
     print(get_lines("myfile_here"));  
	  -- print the local  copy of the file
     print(telnet_response(host,"rm myfile"));
	  -- delete the Unix file
     print(telnet_response(host,"ls")); 
	  -- send a Unix 'list' command

     telnet_close(host);   
	  -- close the telnet  link, to log off

 end test;

The code shown above log