SGE Array Job tasks mysteriously disappear
- From: Chris Dagdigian <dag (at) sonsorol.org>
- Date: Thu, 2 Mar 2006 19:03:00 -0500
Debugging odd failures on clusters can really be hard.
For SGE clusters the best source of debug/failure info is always going
to be the STDOUT/STDERR files produced by the jobs themselves.
Nine times out of ten this is where you'll find the most useful info.
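
As a rough illustration of checking the job output files first, the Python sketch below scans a directory for array task output files and reports the tasks whose output lacks a final marker line. It assumes the default Grid Engine output naming of <job_name>.o<job_id>.<task_id> and that your job script prints a marker as its last step (the string "DONE" here is hypothetical); neither detail comes from this thread.

```python
import re
from pathlib import Path

# Assumption: default SGE naming for array task stdout files,
# i.e. <job_name>.o<job_id>.<task_id>
TASK_FILE = re.compile(r"^(?P<name>.+)\.o(?P<job>\d+)\.(?P<task>\d+)$")


def incomplete_tasks(directory, job_id, marker="DONE"):
    """Return sorted task ids whose stdout file does not contain the marker."""
    missing = []
    for path in Path(directory).iterdir():
        match = TASK_FILE.match(path.name)
        if not path.is_file() or not match:
            continue
        if int(match.group("job")) != job_id:
            continue
        # errors="replace" keeps the scan going if a task wrote binary junk
        text = path.read_text(errors="replace")
        if marker not in text:
            missing.append(int(match.group("task")))
    return sorted(missing)
```

Calling incomplete_tasks("/path/to/output", 123) would list the task ids of job 123 that never reached the marker, which narrows down which execution hosts are worth inspecting.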
Since it seems that you are not getting anything useful from those
files, the next place to look is the sge_execd logs from the machines
where the array tasks ran. The execd spool files will either be local
to the compute node or under your "$SGE_ROOT/<cell>/spool/<machineName>"
directory if you are running everything off of a shared filesystem.
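
A minimal sketch of locating that log in the shared-filesystem case and pulling out the lines for one job. It assumes the file is named "messages" inside the host's spool directory, that SGE_ROOT and SGE_CELL are set in the environment, and that a crude text match on the job id is good enough; the fallback path /opt/sge and the host name in the usage note are hypothetical.

```python
import os
import re
from pathlib import Path


def execd_messages_path(host, sge_root=None, cell=None):
    """Build $SGE_ROOT/<cell>/spool/<host>/messages for a shared install."""
    # The /opt/sge fallback is only an illustrative assumption.
    sge_root = sge_root or os.environ.get("SGE_ROOT", "/opt/sge")
    cell = cell or os.environ.get("SGE_CELL", "default")
    return Path(sge_root) / cell / "spool" / host / "messages"


def lines_mentioning_job(messages_file, job_id):
    """Return log lines containing the job id as a standalone number.

    This is a crude filter: any other field that happens to hold the
    same number (part of a timestamp, for example) will also match.
    """
    pattern = re.compile(r"(?<!\d)%d(?!\d)" % job_id)
    with open(messages_file, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if pattern.search(line)]
```

For example, lines_mentioning_job(execd_messages_path("node01"), 123) would return whatever the execd on node01 logged about job 123. If the spool directory is local to the compute node, the path has to be adjusted to your site's execd_spool_dir.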
After the execd spool logs, the qmaster and schedd messages files may
also be of use, although they rarely give good info on job-level issues.
Another place to look is "/tmp" on the compute nodes -- when all else
fails and grid engine is in a panic situation and unable to spool
normally, it will log to /tmp/ on the host.
Something you should also try:
- Alter the value for "loglevel" in your grid engine configuration
-- you may want to temporarily set "loglevel=log_info"
This was discussed recently on the SGE users mailing list. The thread is
here:
http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137
The sge_conf man page has this to say about loglevel:
loglevel
    This parameter specifies the level of detail that Grid Engine
    components such as sge_qmaster(8) or sge_execd(8) use to produce
    informative, warning or error messages which are logged to the
    messages files in the master and execution daemon spool directories
    (see the description of the execd_spool_dir parameter above). The
    following message levels are available:
    log_err
        All error events being recognized are logged.
    log_warning
        All error events being recognized and all detected signs of
        potentially erroneous behavior are logged.
    log_info
        All error events being recognized, all detected signs of
        potentially erroneous behavior and a variety of informative
        messages are logged.
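
Before changing the setting it can help to confirm what the cluster is currently using. The sketch below parses the text printed by "qconf -sconf" (which shows the global configuration) and extracts the loglevel value. It assumes the output is one "name value" pair per line, and the function names are illustrative only.

```python
import subprocess

# The three levels listed in the sge_conf man page excerpt above.
VALID_LEVELS = ("log_err", "log_warning", "log_info")


def parse_loglevel(conf_text):
    """Return the loglevel value from configuration text, or None if absent."""
    for line in conf_text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2 and parts[0] == "loglevel":
            return parts[1].strip()
    return None


def current_loglevel():
    """Ask qconf for the global configuration; needs qconf on the PATH."""
    result = subprocess.run(
        ["qconf", "-sconf"], capture_output=True, text=True, check=True
    )
    return parse_loglevel(result.stdout)
```

The value itself is normally changed with "qconf -mconf", which opens the configuration in an editor; remember to set it back once the debugging is done.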
The final troubleshooting step is to look into the Grid Engine
"KEEP_ACTIVE" execd parameter setting -- this will temporarily
disable deletion of the active_jobs/ directories that Grid Engine
uses to stage info while the job is active. Normally these
directories are deleted when the job drains from the system. Quite a
bit of useful environment, pid, trace and other information can be
found in these directories. This is one you'll have to watch out for
though -- disabling the cleanup function could consume disk space
rapidly.
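
Since the main risk with KEEP_ACTIVE is disk usage, a small watcher is worth having while it is enabled. The following is a sketch only: it assumes you pass it the active_jobs/ directory under the execd spool directory for a host, and that each job or task has its own subdirectory there. The function names are hypothetical.

```python
import os
from pathlib import Path


def directory_size(path):
    """Sum the sizes in bytes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # A file may vanish while a job is still running; skip it.
                pass
    return total


def active_jobs_usage(active_jobs_dir):
    """Map each subdirectory of active_jobs/ to its size in bytes."""
    base = Path(active_jobs_dir)
    return {
        entry.name: directory_size(entry)
        for entry in sorted(base.iterdir())
        if entry.is_dir()
    }
```

Running active_jobs_usage() periodically on a node shows which retained job directories are growing, so they can be inspected and removed by hand before the spool filesystem fills up.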
Regards,
Chris
On Mar 2, 2006, at 5:12 PM, Shane Brubaker wrote:
Hi, Shane from the JGI here.
We are finding some strange behavior in which a few tasks of an
array job never seem to complete.
The tasks do not go into an Error state, and they are listed as
finished with an exit status of 0, and they
have a valid start and end time for the task.
However, in the output log, the output clearly stops in between two
print statements near the top of the script.
Has anyone seen this? Any ideas?
_______________________________________________
Bioclusters maillist - Bioclusters (at) bioinformatics.org
https://bioinformatics.org/mailman/listinfo/bioclusters