AUTOMATED ERROR CHECKING OF BATCH JOBS WITH MPEX/3000
by Adrian Partridge, GAINSBOROUGH SOFTWARE LTD
Published by INTERACT Magazine, Apr 1995.

Checking $STDLISTs for errors and filing them for archive purposes is an important function of the Information Systems department. It is also an unloved chore. Sites with XL machines and NMSPOOLER have the benefit of having all their spool files held as disk files. These can be searched and printed with normal MPE commands. HP even supply a JOBABORT condition for use with the SPOOLF/LISTSPF commands to display all $STDLISTs that have JOBABORTed during execution.

Using the excellent MPEX/3000 package from VESOFT I have created a $STDLIST Management Tool (SMT) which provides significant advantages over a simple SPOOLF @;SELEQ=[JOBABORT=TRUE];SHOW issued when required.

The finished $STDLIST Management Tool is a good example of the features and power provided by the MPEX software. The following are the major elements that make up a comprehensive and useful SMT.

$STDLIST SELECTION

It is important for our SMT to select only $STDLISTs from the spool queue. These $STDLISTs are filtered to make sure that they are in a READY state (not still OPENed or LOCKed by a job or utility). In this simple SMT any $STDLIST that has been checked is altered to a priority of 4, so only $STDLISTs above this priority are selected.

    REPEAT
      ..
      ..
      ..
    FORFILES O@.OUT.HPSPOOL(SPOOL.FILE="$STDLIST" AND SPOOL.ISREADY AND SPOOL.OUTPRI>=5)

REPEAT...FORFILES is an MPEX construct that lets us repeat a number of commands on every file that matches the FORFILES selection condition.

There are several other ways to make this selection. One alternative is to have a logon UDC for batch jobs that does a SPOOLF @;SELEQ=[FILEDES=$STDLIST];PRI=1. This defers the $STDLIST down to a priority of 1, so the SMT need only check $STDLISTs at this priority.
The $STDLIST priority can then be increased once the file has been checked. This could even cause the $STDLISTs to print automatically if the new priority is above the OUTFENCE. Because $STDLISTs are deferred down to 1 there is no way that they can be accidentally printed if the outfence is below 8.

Whatever method you choose to select the $STDLISTs, the important thing is that each $STDLIST is checked only once.

$STDLIST DIAGNOSIS

A batch management tool must be able to tell whether a job has completed or not. This SMT uses a number of basic principles to deduce the status of the job.

JOBABORTed $STDLISTs have definitely failed to complete and need immediate attention. The JOBABORT condition has one major drawback: if the command preceding the line with the error is a CONTINUE statement, JOBABORT will not pick the error up. If an error occurs while CONTINUE is in effect, JOBABORT is rendered useless.

A $STDLIST in our SMT is treated as having finished in one of three states:

The $STDLIST did not complete. This could be caused by the job being ABORTJOBed, or by an error in one of the commands within the JCL causing a flush during execution. If the job does complete then there should be an :EOJ in the $STDLIST (it's always a good idea to end your JCLs with !EOJ). This can easily be searched for by placing the following line in the SMTERROR file:

    UPS(R[1:4]) <> ':EOJ' AND RECNUM = VEFINFO(FNUM).EOF - 2

This line instructs SMT to deem any file without :EOJ in the second from last line as terminated in error.

The $STDLIST contains an error. If an MPE command causes an error there is a good chance that either a CIERR or an FSERR will be returned, but not every command reports failure this way. For example, if you did a STORE of files and some were not stored correctly, you would want SMT to point out that the STORE was not 100% successful. The following strings will cater for the bulk of MPE failures:

    '(CIERR' OR '(FSERR' OR 'NOT STORED'

Your own custom errors can also be added to the SMTERROR file.
The $STDLIST completed successfully. If the $STDLIST passes the two checks above, it is deemed OK.

The SMTERROR file contains the error conditions and is read each time a $STDLIST is checked. Take care not to put an excessive number of strings in this file. Remember that strings can be ORed together to reduce checking time. Any string that you wish to look for MUST be enclosed in single quotes (' '). Other %PRINT functions available for use are CL (CaseLess), DL (DeLimited), RECNUM and virtually any MPEX operator such as OR, AND, NOT, MATCHING, BETWEEN, etc. These keywords should not be enclosed in quotes.

The SMTERROR file is read using the REPEAT...FORRECS construct, which lets us repeat a number of commands for each line read from the file named in the FORRECS selection. Each line is passed through a PRINT ;SEARCH= statement. If a PRINT command finds something, that $STDLIST is deemed to have ended in error, and this is reported immediately. The PRINT variable MPEXPRINTLINESFOUND is used to accomplish this task.

$STDLIST HANDLING

What do you do with these $STDLISTs? What special treatment do you give to $STDLISTs that have ended in error? How do you reduce the paper consumed by unnecessary $STDLIST printing? How do you make your operators' time more productive, instead of having them check $STDLISTs all day long? Our SMT, that's how!

At the moment our SMT does only the most basic $STDLIST handling, because different sites might want to do something slightly different. Some possible options for $STDLIST handling that you can easily build into the SMT are:

* By copying $STDLISTs selected as being in an error state to a different name (for example, STDERROR), you can then delete $STDLISTs when they become READY.

* Using different spool file priorities you could create a daily tier system. The $STDLIST priorities reflect the day on which the $STDLIST was created (5=today, 0=5 days previous).
  Each day you roll the $STDLISTs down a priority, deleting those at 0.

* Because spool files are on disk you could copy each $STDLIST to a group created each day. This group would contain a contents file which is written to each time a spool file is copied into the group. An on-line system could easily be written to pull back any contents file and, from it, pull back $STDLISTs from any number of days previous.

The SMT we run uses a number of the above options. All $STDLISTs are first copied to a log group created daily. OK $STDLISTs are deleted from the queue. $STDLISTs that are in error are copied to a new printout called STDERROR and the original $STDLIST is purged.

CONTINUAL EXAMINATION OF $STDLISTs

One of the following scenarios might apply to you:

On big sites, where batch jobs run continually throughout the day, $STDLISTs that have ended in error might not be found for anything from 10 minutes to 1 hour.

When you arrive in the morning you have the overnight batch run to check through before any users can log on, just in case some files have not been backed up, or a job which updates your data hasn't run successfully.

In both cases it is essential that $STDLISTs be checked quickly. The easiest way of doing this is to read the $STDLISTs in a loop and continually cycle around that loop until it is broken. Keeping the job looping around and around until you wish to stop it posed a few problems at first. Many people might want to pause between checks for up to 10 minutes. A check for a stop flag file could only be made once per cycle of the SMT's JCL. This means that if a pause had just started and you asked the SMT to stop by BUILDing the stop flag, it would not finish until the pause had completed.

I have devised a control mechanism that allows the user to stop the job instantly after a check has completed. This mechanism will also instruct the SMT to do a check as and when it is requested.
This control mechanism uses good old message files, a background task and all sorts of other trickery:

CONTROL SKELETON -

    FILE SMTMESS=SMTMESS,OLD;SHR;GMULTI
    PURGE SMTMESS
    BUILD SMTMESS;REC=-10,,,ASCII;DISC=1;MSG
    SETVAR OPTION "CHECK"
    SETVAR GOONPIN 0
    WHILE TRUE DO
      IF OPTION = "STOP" THEN
        RETURN
      ELSEIF OPTION = "CHECK" THEN
        < $STDLIST CHECKER ROUTINES >
      ENDIF
      IF SONALIVE(GOONPIN) = FALSE THEN
        RUN MAIN.PUB.VESOFT;PARM=1;INFO="PAUSESMT";GOON;PRI=DS;STDLIST=$NULL
        SETVAR GOONPIN MPEXPIN
      ENDIF
      INPUT OPTION < *SMTMESS
    ENDWHILE

PAUSESMT -

    PAUSE 300
    ECHO CHECK >*SMTMESS

Well, what does all this do? A message file is unusual in that anything reading an empty message file will wait until something is written to it. This principle is applied in the control skeleton above.

First, a message file called SMTMESS is created. This file is passed instructions either by users or by the program itself. In each loop of SMT we read SMTMESS by way of the INPUT command, which in turn causes SMT to wait. Before this, a background process (PAUSESMT) is started that pauses for 300 seconds (5 minutes) and then writes to SMTMESS. This causes SMT to continue processing.

The INPUT command is also used to set a variable (OPTION) to whatever value was written into the message file. With this variable we can instruct SMT either to stop or to do another check of the $STDLISTs in the queue.

Because you can also write to the SMTMESS file yourself, a check is made to see whether PAUSESMT is still running. If it is, a new PAUSESMT is not started. This ensures that a check will begin every 5 minutes regardless of any outside user intervention!

INFORM OF SUSPECTED ERRORS

OK, so this fancy SMT has found errors in some $STDLISTs. What now?
This utility must make as much noise, and flash as much highlighted text, as possible to tell the console operator that a job has aborted or needs checking, so that he or she can promptly inform the necessary personnel.

To inform the console of any errors within $STDLISTs we use TELLOPs and the ;FORMAT keyword on the PRINT line. A variable called R holds the current record that PRINT is processing. Combine this with a TELLOP and we can send lines from the $STDLIST straight to the console. This tells the job's creator almost instantly why the job aborted, without anyone having to print out the $STDLIST.

This is done by writing the current record (R), with TELLOP in front of it, to a file and then executing that file:

    BUILD TELERROR;TEMP;REC=-60,1,F,ASCII;NOCCTL
    FILE TELERROR = TELERROR,OLDTEMP
    PRINT < FILE AND CONDITION>;FORMAT="TELLOP "+R[1:50];OUT=*TELERROR
    XEQ TELERROR

Only the first 50 characters of each $STDLIST line are copied across, so that lines do not wrap around when the message appears on the console.

If you don't want errors to go to the console but to some dedicated device reserved for SMT messages, the TELLOP command could be replaced with the WARN or TELL command:

    PRINT < FILE AND CONDITION>;FORMAT="WARN LDEV=nnn " + R[1:50];OUT=*TELERROR

where 'nnn' is the number of the device that the messages are to be WARNed to.

An alternative idea is to have errors sent not only to the console but also to your PAGER or BEEPER! This SMT could then be used to report errors 24 hours a day.

    FILE TOMOD;DEV=nnn
    ECHO ATDTttt >>*TOMOD

where 'nnn' is the device number of the MODEM and 'ttt' is your PAGER or BEEPER number!

To save time, the SMT should not search each $STDLIST for every entry in the SMTERROR file. As soon as an error is detected we want to inform the console and then continue with the next $STDLIST.
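The stop-at-first-match idea can be illustrated outside MPE. The Python sketch below is only an analogy, not MPEX code: it assumes the $STDLIST has already been read into a list of text lines, treats the SMTERROR entries as plain substrings, and applies the :EOJ rule to the second from last line. The names check_stdlist, missing_eoj and ERROR_STRINGS are hypothetical and do not come from the SMT itself.

```python
# Illustrative analogy only. The real SMT does this with MPEX PRINT ;SEARCH=.
# All names here are hypothetical.

ERROR_STRINGS = ["(CIERR", "(FSERR", "NOT STORED"]  # strings quoted in the article


def missing_eoj(lines):
    """Return True when the second from last line does not start with ':EOJ'."""
    if len(lines) < 2:
        return True
    return lines[-2][:4].upper() != ":EOJ"


def check_stdlist(lines, error_strings=ERROR_STRINGS):
    """Return (entry, matching lines cut to 50 chars) for the first entry
    that matches, or None when the listing looks OK."""
    if missing_eoj(lines):
        tail = [lines[-2][:50]] if len(lines) >= 2 else []
        return (":EOJ missing", tail)
    for entry in error_strings:
        hits = [line[:50] for line in lines if entry in line]
        if hits:
            # Stop here: the remaining entries are not searched.
            return (entry, hits)
    return None


if __name__ == "__main__":
    listing = [":JOB X", "FILE NOT FOUND (FSERR 52)", ":EOJ", "trailer"]
    print(check_stdlist(listing))
```

Returning at the first match plays the role that ESCAPE plays in the MPEX routine: once one entry has fired, there is nothing to gain from testing the rest against the same $STDLIST.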
TRAPERROR...IFERROR/ENDIFERROR is another unique MPEX construct. It catches errors raised within a command file or JCL and then executes the commands between IFERROR and ENDIFERROR to rectify the problem. In our SMT, TRAPERROR...IFERROR/ENDIFERROR is used not just to trap errors but also as a means of using the ESCAPE command. ESCAPE sets the CIERROR variable to a number and forces a jump from the TRAPERROR routine to the IFERROR/ENDIFERROR subroutine. This feature serves in MPEX as the equivalent of the GOTO command found in languages such as BASIC. Whenever an error is found, instead of proceeding with the next string we jump from that routine to another.

GOTO CONSTRUCT -

    REPEAT
      TRAPERROR
      REPEAT
        PRINT <$STDLIST>
        IF ERROR THEN
          ESCAPE 1
        ENDIF
      FORRECS RECORD=SMTERROR,OLD
      TELLOP $STDLIST OK
      IFERROR
        IF CIERROR = 1 THEN
          TELLOP ERROR FOUND !!!
        ENDIF
      ENDIFERROR
    FORFILES $STDLISTs

The complete SMT job stream looks like:-

    !JOB SMT,BATCH.SYS
    !SETVAR VESOFTCONTINUENOSPACE 1
    !RUN MAIN.PUB.VESOFT;INFO="SMTCHECK"
    !EOJ

The complete SMTCHECK routine looks like:-

    FILE SMTMESS=SMTMESS,OLD;SHR;GMULTI
    CONTINUE
    PURGE SMTMESS
    BUILD SMTMESS;REC=-10,,,ASCII;DISC=1;MSG
    CONTINUE
    PURGE TELERROR,TEMP
    BUILD TELERROR;TEMP;REC=-51,,,ASCII
    FILE TELERROR = TELERROR,OLDTEMP
    SETVAR OPTION "CHECK"
    SETVAR GOONPIN 0
    WHILE TRUE DO
      IF OPTION = "STOP" THEN
        RETURN
      ELSEIF OPTION = "CHECK" THEN
        COMMENT *** START OF CHECK SECTION ***
        REPEAT
          TRAPERROR
          REPEAT
            PRINT ![MPEXCURRENTFILE];SEARCH=!RECORD;CONTEXT=2,2;&
              FORMAT=" TELLOP "+R[1:50];OUT=*TELERROR
            IF MPEXPRINTLINESFOUND > 0 THEN
              ESCAPE 1
            ENDIF
          FORRECS RECORD = SMTERROR,OLD
          COMMENT *** $STDLIST IS O.K ***
          TELLOP ![SPOOL.JOBNUMBER] - ![SPOOL.JSNAME]&
            ,![SPOOL.USER].![SPOOL.ACCOUNT]
          TELLOP $STDLIST (![SPOOL.SPOOLFILENUM]) IS O.K
          TELLOP
          ECHO ![SPOOL.JOBNUMBER] - O.K
          IFERROR
            IF CIERROR = 1 THEN
              TELLOP *********************
              TELLOP * A T T E N T I O N *
              TELLOP *********************
              TELLOP
              TELLOP ![SPOOL.JOBNUMBER] - ![SPOOL.JSNAME]&
                ,![SPOOL.USER].![SPOOL.ACCOUNT]
              TELLOP MAY CONTAIN A POSSIBLE ERROR
              TELLOP
              XEQ TELERROR
              TELLOP
              TELLOP PLEASE CHECK THIS $STDLIST IMMEDIATELY
            ENDIF
          ENDIFERROR
          COMMENT *** ALTER $STDLISTS TO PRI OF 4 ***
          SPOOLF ![SPOOL.SPOOLFILENUM];PRI=4 NOMSG
        FORFILES O@.OUT.HPSPOOL(SPOOL.FILE="$STDLIST" AND &
          SPOOL.ISREADY AND SPOOL.OUTPRI >= 5)
      ENDIF
      COMMENT *** START BACKGROUND PAUSE ***
      IF SONALIVE(GOONPIN) = FALSE THEN
        RUN MAIN.PUB.VESOFT;PARM=1;INFO="PAUSESMT";STDLIST=$NULL;GOON
        SETVAR GOONPIN MPEXPIN
      ENDIF
      COMMENT *** START WAIT ***
      INPUT OPTION < *SMTMESS
      SETVAR OPTION RTRIM(OPTION)
    ENDWHILE

The PAUSESMT routine is simply:-

    PAUSE 60
    ECHO CHECK >*SMTMESS

To shut down the SMT job you need a little command file, which I have called TOSMT. It interfaces directly with the SMT job and instructs it either to STOP or to CHECK.

    PARM SMTMODE
    FILE SMTMESS=SMTMESS,OLD;SHR;GMULTI
    ECHO !SMTMODE > *SMTMESS

With this command file, typing TOSMT STOP will shut down the SMT job almost immediately. Typing TOSMT CHECK will instruct SMT to check the available $STDLISTs now.

Some essential lines for the SMTERROR file are:

    '(FSERR' OR '(CIERR' OR 'LDERR'
    UPS(R[1:4]) <> ':EOJ' AND RECNUM = VEFINFO(FNUM).EOF - 2
    'NOT STORED' OR 'NOT RESTORED'

The above is a good example of the power and flexibility of the MPEX/3000 software from VESOFT.

A further enhancement to the package which we have developed allows viewing and printing of $STDLISTs for all jobs which have run in the last 30 days. This is a function-key-driven online system, again written entirely in MPEX/3000.

If you require any assistance in implementing any of the above at your site, please call me (Adrian Partridge) at VESOFT in the UK, on UK - 121-352-0707.