Notes on submitting jobs:
LoadLeveler -------------------- Job Management
Introduction to LoadLeveler
This tutorial provides an introduction to IBM's batch scheduling system - LoadLeveler. The Institute uses the Maui Scheduler instead of the default LoadLeveler scheduler. The Maui Scheduler is a backfill scheduler, which means it schedules a time for jobs requiring a large number of processors first, and then uses smaller jobs to fill in the gaps created by reserving a time for the large job. Jobs using an accurate wall clock time limit can be scheduled promptly. The Maui Scheduler also uses "fair-share" when scheduling jobs. Groups requesting to use the machine will get a higher priority when jobs are scheduled up to their fair-share and a lower priority when their jobs exceed their fair-share. Fair-share is based on the number of Service Units that each group has been awarded.
I. Overview of LoadLeveler
LoadLeveler is a batch scheduling system available from IBM for the SP, Power4 and workstation clusters for submitting serial and parallel jobs. LoadLeveler matches job requirements with the machine resources. These job requirements are specified in a job command file.
The collection of computers available for job scheduling with LoadLeveler is called a pool. This term should not be confused with node pools discussed in the POE tutorial. LoadLeveler pools may consist of all the nodes on a machine or only a subset. On our machine, the LoadLeveler pool consists of all the batch nodes. Every machine in a pool has a LoadLeveler daemon running on it.
II. Queue Definition
LoadLeveler is used on the Institute's IBM SP and on the Institute's IBM Power4.
II-A. IBM SP Queue Definition
Currently, there are two SPs. The SP consists of 74 WinterHawk+ nodes, 11 NightHawk nodes, 4 NightHawk nodes for interactive use, and 4 additional WinterHawk+ nodes used as file servers. Each of the 74 WinterHawk+ nodes has 4 processors (375 MHz, Pwr3); 13 of the nodes have 8 GB of memory and 62 of the nodes have 4 GB. Each of the 15 NightHawk nodes has 4 processors (222 MHz, Pwr3) sharing 16 GB of memory.
The following table summarizes the queues on the SP.
Queue: q1
Description: The default queue on the WinterHawk+ nodes. For this queue you do not have to specify a class (i.e. #@ class = q1) in your command file. All but 14 nodes have 4 GB of memory; the remaining 14 nodes have 8 GB. The maximum ConsumableMemory is 3.5 GB and 7.5 GB respectively (some of the memory is reserved for the operating system). Some of the nodes are configured with a maximum wall clock time of 24 hours; the remaining nodes have a maximum wall clock time of 150 hours. Jobs requiring no more than 24 hours can run on all the nodes; jobs that require more than 24 hours run only on the nodes set up specifically for longer running jobs. Type llavail to get information on the current availability, including the number of nodes set aside for short jobs. To use the larger memory WinterHawk+ nodes, request more than 3.5 GB of ConsumableMemory per node. For example, for a single process job, #@ resources = ConsumableMemory(4000) will schedule the job on a larger memory WinterHawk+ node. If you ask for more than one task per node and the total memory per node is greater than 3.5 GB, the job will go to the larger memory nodes.

Queue: nighthawk
Description: For the NightHawk nodes. To use this queue, add #@ class = nighthawk to your command file. Please do not use these nodes unless you have large memory requirements. The maximum ConsumableMemory is 15.5 GB. The maximum wall clock time is 300 hours; however, only 7 nodes are set up to handle 300 hour jobs, and the other nodes are reserved for jobs requesting at most 150 hours. Before you submit jobs, load the nhloadl module (module add nhloadl). To then use the other queues, unload this module (module delete nhloadl).
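For example, a minimal sketch of the queue-related lines in a command file targeting the NightHawk queue might look like this (the memory and time values here are illustrative, chosen to stay within the limits above):
# Run on the NightHawk nodes (remember: module add nhloadl before submitting)
#@ class = nighthawk
# 12 GB per process, within the 15.5 GB NightHawk limit (value is illustrative)
#@ resources = ConsumableMemory(12000)
# Within the 300 hour NightHawk maximum
#@ wall_clock_limit = 200:00:00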
Note: The wall_clock_limit specified in the script is the maximum wall clock time of each process. For example with 4 processors and a 10 hour limit, a perfectly balanced job that uses 100% of the processors will use a total of 40 CPU-hours.
To submit jobs that use commercial software managed by a license manager (e.g. MATLAB, HyperMesh), you need to specify the Ethernet feature in your command file (a sketch is shown after the list below). The nodes with the Ethernet feature are:
· All of the NightHawk nodes.
· All of the interactive nodes.
· Nodes reserved for jobs running 24 hours or less.
There are no Ethernet nodes for jobs requesting more than 24 hours.
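As a hedged sketch, a job that needs a license manager and runs within 24 hours might add a requirements line such as the following; the feature string "ethernet" is an assumption here, so check with the system administrators for the exact name used on this machine:
#@ wall_clock_limit = 24:00:00
# Feature name below is assumed; verify the exact string locally
#@ requirements = (Feature == "ethernet")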
LoadLeveler uses a "backfill" scheduler, which means smaller jobs are used to fill in the gaps between larger jobs. By specifying an accurate wall clock time limit, you help LoadLeveler schedule your jobs promptly.
II-B. IBM Power4 Queue Definition
There is one queue for the IBM Power4. For more information, see the IBM Power4 Quickstart Guide.
Note: The maximum number of jobs that a person can have in the queue on the IBM Power4 is 32.
III. Basic Tasks
The following tasks are discussed:
· Creating a Command File
· Submitting a Job
· Monitoring a Job's Status
· Holding and Releasing a Job
· Displaying a Machine's Status
· Canceling a Job
Creating a Command File
As stated above, all LoadLeveler jobs must be submitted with a command file. This file resembles a shell script and specifies information such as the executable name, class, resource requirements, input/output files, number of processors required, and job type. An example is test1.cmd, shown below, which uses an input file test1.in. Use an input file if your program reads its input from standard input. If your program requires no such input, delete the line
#@ input = test1.in
from the example below.
#!/bin/csh
# Note: the above line is needed if C shell commands follow
# the @queue command, as is the case below.
#
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ input = test1.in
#@ output = test1.out
#@ error = test1.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
test1
echo job done
The following rules apply to job command files:
· Statements begin with #@
· Statements such as #@ class = nighthawk or #@ requirements = (Memory > 4096) need to be placed before the #@ queue statement.
· You need to specify the maximum amount of memory per process. In the above example, 500 MB of memory is requested.
· Comment lines begin with a #
· Keywords are case insensitive
· The job command file may be given any name, but it is common to add the suffix ".cmd"
For all the queues, you are required to specify the maximum amount of wall clock time in your command file. Jobs not specifying a wall clock time limit will be rejected.
You need to specify the maximum amount of memory per process that your job requires. To request 500 MB of memory per process, add
#@ resources = ConsumableMemory(500)
to your command file. ConsumableMemory is per process, so if you are running two tasks per node and requesting 500 MB of ConsumableMemory, you are really asking for 1000 MB of memory per node.
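For instance, a minimal fragment for a parallel job running two tasks on one node, each needing 500 MB (so 1000 MB on the node), might read:
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 2
# 500 MB per process, i.e. 1000 MB per node for the two tasks
#@ resources = ConsumableMemory(500)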
Submitting a Job
To submit jobs, simply type: llsubmit 'name of command file'
e.g. llsubmit test1.cmd
LoadLeveler will respond with:
llsubmit: The job "sp81css0.msi.umn.edu.672" has been submitted.
The first part of this - sp81css0 (node 81 using css0 i.e. SP switch interface) - is the name of the machine that submitted the job, and the second part - 672 - counts the number of jobs submitted from sp81.
Monitoring a Job's Status
Our preferred command for monitoring jobs is showq (a Maui Scheduler command). It displays active jobs, idle jobs, and non-queued jobs.
runesha@sp65 [~] % showq
ACTIVE JOBS ----------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
sp81css0.114704.0 umemoto Running 8 1:03:33 Tue Jul 16 12:57:21
sp81css0.114722.0 tyler Running 1 1:57:59 Tue Jul 16 14:51:47
sp81css0.114616.0 duany Running 16 4:24:50 Mon Jul 15 19:18:38
sp81css0.114196.0 mcgrathm Running 4 1:01:35:27 Sun Jul 14 16:29:15
sp83css0.223993.0 khchang Running 1 1:02:11:17 Thu Jul 11 11:05:05
...
100 Active Jobs 293 of 312 Processors Active (93.91%)
74 of 78 Nodes Active (94.87%)
IDLE JOBS ----------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
sp83css0.226796.0 druguet Idle 28 10:00:00 Tue Jul 16 13:58:08
sp81css0.114696.0 fagernes Idle 9 23:00:00 Tue Jul 16 11:54:59
sp83css0.226743.0 seethara Idle 16 12:00:00 Tue Jul 16 00:09:05
sp81css0.114679.0 olaf Idle 1 1:00:00:00 Tue Jul 16 09:48:53
sp83css0.226810.0 tyler Idle 1 1:00:00:00 Tue Jul 16 14:52:56
...
47 Idle Jobs
NON-QUEUED JOBS --------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
sp83css0.207270.0 elorin BatchHold 32 20:00:00 Tue Jun 4 14:55:14
sp83css0.201907.0 pradeep BatchHold 64 21:00:00 Mon May 20 09:58:17
sp81css0.90952.0 pkiprof BatchHold 4 6:06:00:00 Thu May 23 10:42:32
sp81css0.114705.0 jjyi Idle 1 1:12:00:00 Tue Jul 16 13:09:23
...
Total Jobs: 180 Active Jobs: 100 Idle Jobs: 47 Non-Queued Jobs: 33
To view only your own jobs (for example, jobs owned by runesha):
runesha@sp65 [~] % showq | grep runesha
Holding and Releasing a Job
Jobs may be put on hold using the llhold command by typing llhold job_id (e.g. sp81css0.672). Held jobs can be released using the "-r" option: llhold -r job_id. As with the llprio command, llhold applies only to jobs still in the queue.
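For example (the job identifier is illustrative):
# Put the job on hold
llhold sp81css0.672
# Release the held job
llhold -r sp81css0.672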
Displaying a Machine's Status
You can display the status of the SP. You can often use this information to customize the size of your job so that it runs immediately. The two commands llavail and showbf are extremely useful. The command llavail summarizes the number of processors available on the nodes. For example,
sp68 [~] % llavail
                     CPUs
Class       #Nodes   Avail
q1              36       0
                10       1
                 7       2
                 9       3
nighthawk        8       0
                 3       2
                 1       3
                 3       4
shows that in the default queue (q1), 10 nodes have 1 processor available, 7 nodes have 2 processors, and 9 nodes have three processors available. No nodes currently have all 4 processors available. On the NightHawk nodes, 2 nodes have 2 processors available, 4 nodes have 3 processors available, and 3 nodes have all 4 processors available.
The showbf command can be used to give some indications as to how long these processors are available. For example,
sp68 [~] % showbf
backfill window (user: 'user01' group: 'user01' partition: ALL) Thu Feb 8 09:50:56
92 procs available for 4:01:01:27
73 procs available with no timelimit
This shows that the 92 available processors can be used for up to 4 days, 1 hour, 1 minute and 27 seconds, and that 73 processors are available with no time limit.
Therefore, if you submit a job requesting 10 nodes with one task per node, and specifying a walltime of 4 days, this job will run immediately. This is just one example of a configuration that would run immediately.
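A sketch of the keywords such a job might use (the memory value is illustrative; the 96-hour limit is 4 days expressed in hh:mm:ss):
#@ job_type = parallel
#@ node = 10
#@ tasks_per_node = 1
# 4 days expressed as hours:minutes:seconds
#@ wall_clock_limit = 96:00:00
#@ resources = ConsumableMemory(500)
#@ queue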
For information on the NightHawk nodes, use the showbf -m '>8000' command. This command asks for information on the nodes which have more than 8000 MB of memory available, which are the NightHawk nodes. For example,
sp68 [~] % showbf -m '>8000'
backfill window (user: 'schaudt' group: 'tech' partition: ALL) Thu Feb 8 09:56:00
20 procs available for 4:00:56:23
1 proc available with no timelimit
Used together with the llavail output, this shows that a job requesting 7 nodes with two tasks per node and less than 4 days of walltime will run immediately.
Canceling a Job
The llcancel command cancels running or queued jobs. Simply type llcancel job_id to cancel a job where job_id is say sp81css0.672. There are options to cancel jobs by user id, host name, or process id. Of course, only the LoadLeveler System Administrator can cancel other people's jobs. Type man llcancel for more information.
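For example (the job identifier is illustrative; see man llcancel for the full list of options):
llcancel sp81css0.672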
IV. Advanced Tasks
The following advanced tasks are discussed:
· Using the Command File like a Unix Shell
· Submitting Parallel MPI Jobs
· Submitting Parallel OpenMP Jobs
· Submitting Multiple Jobs
· Submitting Multiple Jobs with Dependency
· How node_usage Options Affect Charging of CPU Time
· How to Request a Dedicated Node for a Serial Job
Using the Command File like a Unix Shell
Suppose one needs to do some "pre-execution" setup work, for instance copying input files to a scratch directory. This can be done in the command file, just as you would in an ordinary Unix shell script. Consider the following example:
#!/bin/csh
#
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ input = test2.in
#@ output = test2.out
#@ error = test2.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
# Beginning of script commands
cp /homes/sp1/user_name/Loadleveler_examples/test2.in /scratch/user_name/test2.in
cp /homes/sp1/user_name/Loadleveler_examples/test2 /scratch/user_name/test2
/scratch/user_name/test2
cp /scratch/user_name/test2.out /homes/sp1/user_name/Loadleveler_examples/test2.out
cp /scratch/user_name/test2.err /homes/sp1/user_name/Loadleveler_examples/test2.err
rm /scratch/user_name/test2
rm /scratch/user_name/test2.in
rm /scratch/user_name/test2.out
rm /scratch/user_name/test2.err
echo job done
In this example, the input file and executable are copied to scratch space on the executing host, the job is run, the output is copied back to the home directory, and finally the scratch space is cleaned up. Doing this pre-execution work avoids the performance cost of accessing the input file over the network while the job runs; using the local disk is more efficient.
Submitting Parallel (OpenMP) Jobs
Here is an example of an OpenMP LoadLeveler submission script:
#!/bin/csh
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ job_type = parallel
#@ output = test.out
#@ error = test.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ node = 1
#@ tasks_per_node = 2
#@ node_usage = shared
#@ queue
setenv OMP_NUM_THREADS 2
a.out < input
This script will allow you to run the job using 2 CPUs on a single node.
The essential keywords for OpenMP jobs are job_type and tasks_per_node. Since OpenMP is a shared memory parallel paradigm, the job can only run on a single node, which requires node = 1.
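To use all four processors of a node instead, the same pattern simply scales up; a sketch of the lines that change:
# Request all 4 CPUs of the node (before the #@ queue statement)
#@ tasks_per_node = 4
# Match the thread count to the requested CPUs (after the #@ queue statement)
setenv OMP_NUM_THREADS 4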
Submitting Parallel (MPI) Jobs
Here is an example of an MPI Loadleveler submission script:
#!/bin/csh
#@ output = test.out
#@ error = test.err
#@ job_type = parallel
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ network.MPI = css0,shared,US
#@ blocking = unlimited
#@ total_tasks = 4
#@ node_usage = shared
#@ queue
/home/sp1/user_name/a.out
The above command file requests a total of 2000 MB of memory (500 MB times 4 tasks).
The following LoadLeveler keywords are used for MPI LoadLeveler submission scripts:
· network - specifies communication protocols, adapters, and their characteristics. The syntax is:
network.protocol = network_type[,usage][,mode]
For an MPI job with communication across the SP switch where processes can share the switch in User Space mode, network will be specified in the job command file as
network.MPI = css0,shared,US
Note that in the previous versions of LoadLeveler, it was not possible to use the switch in User Space mode with multiple processes from the same job step or different job steps communicating across the switch at the same time. The network specification example given above allows tasks from different job steps to share the switch. The shared option is the default. Thus
network.MPI = css0,shared,US
is equivalent to
network.MPI = css0,,US
The second option for usage is not_shared. However, if a user specifies the not_shared option, then only tasks from the same job step can communicate across the switch. In effect, using the switch in not_shared mode in User Space prevents other communicating parallel jobs from being scheduled to the same nodes, although serial jobs may still be scheduled there.
· node - specifies the minimum and maximum number of nodes requested by a job step. The syntax is
node = [min][,max]
where min is the minimum and max is the maximum. The default value for min is 1. The default value for max is the min value. You may specify just one of min and max. For example
node =6,12
for a range of 6 to 12 nodes
and
node = ,14
for a maximum of 14 nodes.
Both min and max may have the same value, for example
node = 6,6
means the job will run on exactly 6 nodes. In this case, you can also just specify a single value, that is
node = 6
· node_usage - specifies whether this job step shares nodes with other job steps. The syntax is
node_usage = shared | not_shared
If set to shared (which is the default), the nodes can be shared with tasks from other job steps. If set to not_shared, no other job steps are scheduled on this node. It is IMPORTANT to note that in not_shared mode, you are CHARGED 4 times the wall_clock, as explained below under How node_usage Options Affect Charging of CPU Time. In shared mode, you are charged the CPU time accumulated over the number of CPUs used.
· total_tasks - specifies the total number of tasks of a parallel job you want to run on all available nodes. If possible, use tasks_per_node instead, since there can be problems scheduling jobs when total_tasks is used. This keyword is used together with the node keyword, for which a single value must be specified. This keyword cannot be used with the tasks_per_node keyword in the same job step.
Note that this keyword applies to the job step in which it is specified. If you are using the same command file for multiple job steps, specify this keyword (or the tasks_per_node keyword which is described above) for each job step. The syntax is
total_tasks = number
where number is the total number of tasks you want to run. For example, to run 3 tasks on each of 7 available nodes for a total of 21 tasks, include the following in your command file:
node = 7
total_tasks = 21
For uneven distribution of tasks per node, LoadLeveler allocates tasks on the nodes in a round-robin fashion. For example, for 5 tasks on 3 nodes, two tasks will run on each of the first two nodes and 1 task will run on the third node.
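The corresponding command-file lines for that 5-task example would be:
# 5 tasks spread round-robin over 3 nodes: 2 + 2 + 1
#@ node = 3
#@ total_tasks = 5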
Submitting Multiple Jobs
A single job command file can be used to submit multiple jobs by using multiple #@ queue statements. This is particularly useful if the same executable is used for multiple jobs that have different input and output files.
An example command file is:
#!/bin/csh
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ executable = test
#@ input = test1.in
#@ output = test1.out
#@ error = test1.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
#@ executable = test
#@ input = test2.in
#@ output = test2.out
#@ error = test2.err
#@ wall_clock_limit = 15:00:00
#@ resources = ConsumableMemory(500)
#@ queue
#@ executable = test
#@ input = test3.in
#@ output = test3.out
#@ error = test3.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
Submitting Multiple Jobs with Dependency
A useful extension to the multiple-job script described above handles the case where some of the jobs depend on the outcome of other jobs. For example, assuming that in our 3-job-step example above we want (i) job step 2 to run after job step 1 and (ii) job step 3 to run after job step 2, the command file can be modified as follows:
#!/bin/csh
#@ step_name = step1
#@ initialdir = /homes/sp1/user_name/Loadleveler_examples
#@ executable = test
#@ input = test1.in
#@ output = test1.out
#@ error = test1.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
#@ step_name = step2
#@ dependency = step1 == 0
#@ executable = test
#@ input = test2.in
#@ output = test2.out
#@ error = test2.err
#@ wall_clock_limit = 20:00:00
#@ resources = ConsumableMemory(500)
#@ queue
#@ step_name = step3
#@ dependency = step2 == 0
#@ executable = test
#@ input = test3.in
#@ output = test3.out
#@ error = test3.err
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ queue
Please note the different input file names in the dependency example above. One can also use a single input file for a multi-step job; however, that input file must then contain the inputs for all job steps (duplicate the data if all job steps use the same data).
How node_usage Options Affect Charging of CPU Time
There are 4 CPUs per node of the SP. For effective usage of the machine, it is desirable to have up to four tasks per node. Thus the default setting for node_usage is shared. It is however recognised that for some applications, for example large-memory serial jobs and shared-memory multi-threaded (via directives or pthreads) jobs or even MPI jobs, a user may want to have a node exclusively to himself. This is where the node_usage option not_shared becomes useful. However, using this option is not without a price. To avoid system abuse and to ensure that the option is used only when necessary, a user that chooses to use the option will be CHARGED 4 times the wall-clock usage.
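For example, a not_shared job with a 10 hour wall_clock_limit is charged 40 CPU-hours (10 hours times the 4 CPUs of the node), whereas the same work run shared on a single CPU is charged roughly the 10 CPU-hours it actually accumulates.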
How to Request a Dedicated Node for a non-MPI (e.g. Serial or Shared-Memory Multi-threaded) Job
The LoadLeveler keyword node_usage can be used to ensure that a node is not shared with another user, by setting node_usage to not_shared. However, the node_usage keyword applies only to jobs with job_type set to parallel (parallel in this sense refers to MPI jobs). Therefore, running a serial or shared-memory multi-threaded job with node_usage set to not_shared also requires setting job_type to parallel, even though the job is not MPI parallel. The example script below shows how to run a serial job called serial_exec where the node assigned by LoadLeveler will be used in dedicated mode:
#!/bin/csh
#@ output = serial_out.$(jobid)
#@ error = serial_err.$(jobid)
#@ job_type = parallel
#@ wall_clock_limit = 10:00:00
#@ resources = ConsumableMemory(500)
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ queue
/home/sp1/user_name/serial_exec
VERY IMPORTANT: Use a node in dedicated mode only when you need it because you will be charged 4 times the wall-clock, as explained under How node_usage Options Affect Charging of CPU Time above.