
Bioclusters Digest, Vol 17, Issue 4



Hi, Shane here from the JGI. I wanted to post back and try to answer some of these questions about our "disappearing" array job tasks.
I don't know the answers to all of them, but the question about NIS errors stands out: we have been having quite a few NIS and NFS problems,
so I suspect that is the cause.


Soon we will be moving our cluster onto a better network switch, and we have also increased a cache size on our LDAP server. We've been
working on our NFS problems too. That seems to be helping: lately the problems seem to have gone away. I've also implemented
a "cleanup" step in our workflow system which re-submits missing tasks one at a time, just in case.
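
For anyone wanting to try something similar, the cleanup idea can be sketched roughly as below. This is a minimal illustration and not our actual workflow code: it assumes each task writes a hypothetical "<task_id>.done" marker file as its very last action (SGE's own exit status is no help here, since the lost tasks report 0), and the function names are made up for the example. The only Grid Engine piece is "qsub -t", which selects array task IDs.

```python
from pathlib import Path


def missing_tasks(first, last, done_dir, suffix=".done"):
    """Return task IDs in [first, last] that left no completion marker.

    Assumption: every task touches '<task_id>.done' in done_dir as its
    final step. This marker convention is hypothetical.
    """
    done = {p.stem for p in Path(done_dir).glob(f"*{suffix}")}
    return [t for t in range(first, last + 1) if str(t) not in done]


def resubmit_commands(task_ids, script):
    """Build one 'qsub -t <id> <script>' command per missing task."""
    return [["qsub", "-t", str(t), script] for t in task_ids]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    for cmd in resubmit_commands(missing_tasks(1, 100, "markers"), "job.sh"):
        print(" ".join(cmd))
```

Submitting the stragglers one at a time keeps the load on NIS/NFS low, which matters if the original failures came from hammering those services.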



Did I isolate it down to just an issue with the job array?
        Yes

Does this only happen with this program or all programs I execute?
        Various programs

What is the code doing?
        Usually trying to run some Perl code. For instance, in one case it was a Perl program
        which logs something to a database, but it failed in between two print statements at the top,
        before it could really get very far.


Are there "core" files in my output directory?
        No

Are the binaries on an NFS server? If so is it having issues? Check the logs for NFS timeouts.
        Yes, Yes, and Yes


Is a directory filling up /tmp /scratch what ever?
        I don't think we are using /tmp/scratch

What do the syslogs on the remote machine say?
        Did not see any unusual messages

Is there a network issue that I've caused by running too much stuff at the
same time, broken NIS/NFS?
        Yes

Is the OOM killer running on the remote node, have I filled up all the
memory?
        I think so, quite possibly.

Is it only happening on one node, some nodes or a subset?
        Can happen on various nodes.

Am I writing to a database and not catching an error?
        It was not getting to this stage.

Does it happen with a really simple example?
        Yes

Does it only happen on a Tuesday evening (system maint for example)
        I think so, it seems to happen sporadically in spikes.


Thanks again for your help!


At 04:04 PM 3/2/2006, you wrote:
Send Bioclusters mailing list submissions to
        bioclusters (at) bioinformatics.org

To subscribe or unsubscribe via the World Wide Web, visit
        https://bioinformatics.org/mailman/listinfo/bioclusters
or, via email, send a message with subject or body 'help' to
        bioclusters-request (at) bioinformatics.org

You can reach the person managing the list at
        bioclusters-owner (at) bioinformatics.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Bioclusters digest..."


Today's Topics:

   1. Announcement: Sun Discovery Cluster for the Life  Sciences
      (Stefan Unger)
   2. RE: Announcement: Sun Discovery Cluster for the   LifeSciences
      (Kathleen)
   3. SGE Array Job tasks mysteriously disappear (Shane Brubaker)
   4. RE: quick look see at fractal computing. (James Cuff)
   5. Re: SGE Array Job tasks mysteriously disappear (James Cuff)
   6. Re: SGE Array Job tasks mysteriously disappear (Chris Dagdigian)


----------------------------------------------------------------------

Message: 1
Date: Thu, 02 Mar 2006 13:59:59 -0800
From: Stefan Unger <Stefan.Unger (at) Sun.COM>
Subject: [Bioclusters] Announcement: Sun Discovery Cluster for the
        Life    Sciences
To: bioclusters (at) bioinformatics.org
Message-ID: <44076ADF.3070102 (at) sun.com>
Content-Type: text/plain; charset=windows-1252; format=flowed

I'm not sure if this is ok, or not. Please let me know:

************

Sun Microsystems™ Announces the Discovery Cluster for the Life Sciences

Exceptional Price/Performance in a Pre-Assembled Rack




Sun Microsystems announces the "Discovery Cluster for the Life Sciences". The Discovery Cluster is a pre-assembled, base-level configuration of a Sun Grid Rack System (SGRS) with components selected especially for the Life Science HPC market.


The Discovery Cluster is Sun's solution approach to the compute needs for the drug discovery process. It is based on the Sun Fire™ X2100 64-bit x64 server, powered by the AMD Opteron™ dual core processor. The X2100 delivers up to one-and-a-half times the performance, and uses about one-third of the power of competing systems, yet costs a fraction of their price. Bioinformatics and molecular modeling benchmarks confirm the exceptional price/performance advantages of the Sun Fire X2100 over Intel Xeon based clusters. These highly reliable and energy efficient X2100 servers are also the fastest enterprise x64 servers in their class.


At under $94,000 (US list price) per fully populated, pre-assembled rack, the Discovery Cluster provides 1 TeraFlop of theoretical peak performance in three racks for under $282,000. In addition, the power, cooling and management requirements are substantially less than Intel Xeon based clusters.


The Discovery Cluster comes pre-assembled, with hardware, cabling, Solaris™ 10 and Sun Grid Engine. Multiple operating systems (Solaris 10 x64, Linux (Red Hat, Suse), and Windows) are supported. Many alternative configurations are available, and Sun's solution partners provide a range of software options.


For more information, listen to a NetTalk webinar on the Sun Discovery Cluster for Life Sciences, featuring the designer of the Sun Fire "Galaxy" series servers, Andy Bechtolsheim, Sun Chief Architect and Senior Vice President, Network Systems. For more information visit www.sun.com/nettalk or www.sun.com/discoverycluster, or email discoverycluster (at) sun.com.


Media contacts:


Stefan Unger, PhD
stefan.unger (at) sun.com
Business Development Manager
Life Sciences

Ulrich Meier, PhD
ulrich.meier (at) sun.com
Industry Marketing Manager
Life Sciences


Sun, Sun Microsystems, the Sun logo, Sun Fire, Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. AMD and Opteron are trademarks or registered trademarks of Advanced Micro Devices.


--
*!*
Stefan Unger, PhD
Business Development Manager
Life Sciences
949-682-4388 (x41821) AccessLine
http://www.sun.com/edu/commofinterest/compbio
http://www.sun.com/lifesciences
http://www.sun.com/discoverycluster
CB-SIG: to JOIN/DROP/POST email compbio-sig-info (at) sun.com

* BioIT World, Boston, April 3-5, 2006
* CB-SIG and HPC Consortium, GridAsia, May 14-15, 2006

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE:  This email message is for the sole use of the intended
recipient(s) and may contain confidential and privileged
information.  Any unauthorized review, use, disclosure or
distribution is prohibited.  If you are not the intended
recipient, please contact the sender by reply email and destroy
all copies of the original message.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*!*




------------------------------

Message: 2
Date: Thu, 2 Mar 2006 15:08:54 -0700
From: "Kathleen" <kathleen (at) massivelyparallel.com>
Subject: RE: [Bioclusters] Announcement: Sun Discovery Cluster for the
        LifeSciences
To: "'Clustering,       compute farming & distributed computing in life
        science informatics'"   <bioclusters (at) bioinformatics.org>
Message-ID: <005c01c63e45$e682b8b0$0300a8c0 (at) KMElaptop>
Content-Type: text/plain;       charset="us-ascii"

Does it come pre-loaded with applications?  If so, which ones? -K








------------------------------

Message: 3
Date: Thu, 02 Mar 2006 14:12:49 -0800
From: Shane Brubaker <brubaker2 (at) llnl.gov>
Subject: [Bioclusters] SGE Array Job tasks mysteriously disappear
To: bioclusters (at) bioinformatics.org
Message-ID: <6.0.0.22.2.20060302141100.037184e0 (at) mail.llnl.gov>
Content-Type: text/plain; charset="us-ascii"; format=flowed

Hi, Shane from the JGI here.

We are finding some strange behavior in which a few tasks of an array job
never seem to complete.

The tasks do not go into an Error state, and they are listed as finished
with an exit status of 0, and they
have a valid start and end time for the task.

However, in the output log, the output clearly stops in between two print
statements near the top of the script.


Has anyone seen this? Any ideas?


Thanks, Shane



------------------------------

Message: 4
Date: Thu, 2 Mar 2006 18:17:50 -0500 (EST)
From: James Cuff <jcuff (at) broad.mit.edu>
Subject: RE: [Bioclusters] quick look see at fractal computing.
To: Nick Robertson <nick (at) massivelyparallel.com>
Cc: "'Clustering, compute farming & distributed computing in life
science informatics'" <bioclusters (at) bioinformatics.org>, 'Kevin Howard'
<kevin (at) massivelyparallel.com>
Message-ID:
<Pine.OSF.4.64.0603021718060.91263 (at) phosphorus.broad.mit.edu>
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed


On Thu, 2 Mar 2006, Nick Robertson wrote:

> It is still unclear to me why your results are markedly different
> from NCBI and MPT, but it's probably related to search parameters or some
> other difference.

Ahem, that could be my bad, I guess I should have explained, I thought it
was clear from the example command line I supplied.

-nT is the answer you are looking for here.

I used it quickly here to show the missing sub optimals.  My reasoning
being that if MegaBlast with its large word size and greedy algorithm
approach could find the suboptimals, the standard version ought to nail
it.

I tend to use it automatically for near exact DNA/DNA searching, which is
what this example test was set to do.  So that clears up changes in the
ordering.

However, you are _still_ not reporting the sub optimal alignments in your
report.

This is clear alone from just the sizes of the two files you provided me
with via your website.  I guess it's just a printing error, you must be
calculating them.  Probably a simple tweak for you to fix.

node221 /2ndrun/ du -sh ncbi_results.txt
3.4M    ncbi_results.txt

node221 /2ndrun/ du -sh qid1597_results_1.txt
516K    qid1597_results_1.txt


The example gi|27657458|emb|AL844150.6| on that web link I sent before shows this.

MegaBlast (jcuff_results_1.txt) finds two such sub alignments, and regular
blast (jcuff2.blastn, ncbi_results.txt) finds a whopping 16.

However qid1597_results_1.txt only shows the first alignment from bases
682 to 1330, with _no_ sub optimals being reported.

Thanks for the update.  We probably ought to kill this thread and take it
off line if you want to discuss it further.  I doubt it is very
interesting for folk.

Best,

J.


------------------------------

Message: 5
Date: Thu, 2 Mar 2006 18:34:40 -0500 (EST)
From: James Cuff <jcuff (at) broad.mit.edu>
Subject: Re: [Bioclusters] SGE Array Job tasks mysteriously disappear
To: "Clustering,        compute farming & distributed computing in life
        science informatics"    <bioclusters (at) bioinformatics.org>
Message-ID:
        <Pine.OSF.4.64.0603021824410.91263 (at) phosphorus.broad.mit.edu>
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed


Hi Shane,

So you might want to give us a bit more information.

As to seeing weird stuff on clusters, yeah we see a lot of it, *way* too
much of it sometimes :)

Here come a bunch of questions I would ask myself if it happened to me:

Did I isolate it down to just an issue with the job array?
Does this only happen with this program or all programs I execute?
What is the code doing?
Are there "core" files in my output directory?
Are the binaries on an NFS server?  If so is it having issues?  Check the
logs for NFS timeouts.
Is a directory filling up /tmp /scratch what ever?
What do the syslogs on the remote machine say?
Is there a network issue that I've caused by running too much stuff at the
same time, broken NIS/NFS?
Is the OOM killer running on the remote node, have I filled up all the
memory?
Is it only happening on one node, some nodes or a subset?
Am I writing to a database and not catching an error?
Does it happen with a really simple example?
Does it only happen on a Tuesday evening (system maint for example)

etc. etc.  It is a pain to debug things like this on a cluster, I feel
your pain.

Maybe have another look at what is going wrong and post back with some
more information.  There are lots of people who can probably help, at the
moment there is not really enough for us to go on, as you see it could be
lots of things.
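
A couple of the checks in the list above (stray core files, a filling /tmp or scratch area) are easy to script so they can be run on every suspect node. The following is only a minimal sketch: the function names are invented for illustration, and it assumes core files are named "core" or "core.<pid>", which depends on how the node's kernel is configured.

```python
import shutil
from pathlib import Path


def find_core_files(out_dir):
    """List files named 'core' or 'core.<something>' under out_dir.

    Assumption: the default core naming scheme is in use on the node.
    """
    return sorted(
        p for p in Path(out_dir).rglob("core*")
        if p.is_file() and (p.name == "core" or p.name.startswith("core."))
    )


def percent_full(path):
    """Percentage of the filesystem holding 'path' that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print("core files:", find_core_files("."))
    print("/tmp is %.1f%% full" % percent_full("/tmp"))
```

Running this from within a trivial array job on the affected nodes gives a quick first answer to two of the questions before digging into syslogs.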

Best,

J.

On Thu, 2 Mar 2006, Shane Brubaker wrote:

> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an array job
> never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as finished
> with an exit status of 0, and they have a valid start and end time for
> the task.
>
> However, in the output log, the output clearly stops in between two print
> statements near the top of the script.
>
>
> Has anyone seen this?  Any ideas?
>
>
> Thanks,
> Shane



------------------------------

Message: 6
Date: Thu, 2 Mar 2006 19:03:00 -0500
From: Chris Dagdigian <dag (at) sonsorol.org>
Subject: Re: [Bioclusters] SGE Array Job tasks mysteriously disappear
To: "Clustering,        compute farming & distributed computing in life
        science informatics"    <bioclusters (at) bioinformatics.org>
Message-ID: <15B721DA-E4FA-44C7-BEB1-F99919DD39A1 (at) sonsorol.org>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed


Debugging odd failures on clusters can really be hard.

For SGE clusters the best source of debug/failure info is always going
to be the STDOUT/STDERR files produced by the jobs themselves.

Nine times out of ten this is where you'll find the most useful info.

Since it seems that you are not getting anything useful from those
files, the next place to look is the sge_execd logs from the machines
where the array tasks ran. The execd spool files will either be local
to the compute node or under your "$SGE_ROOT/<cell>/spool/<machineName>"
directory if you are running everything off of a shared filesystem.

After the execd spool logs, the qmaster and schedd messages files may
also be of use although they rarely give good info on job level issues.

A third place to look is "/tmp" on the compute nodes -- when all else
fails and grid engine is in a panic situation and unable to spool
normally it will log to /tmp/ on the host.
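
Searching those messages files for a particular job ID is easy to automate. Here is a minimal sketch, assuming only that the job ID appears somewhere in the relevant log lines; it does not parse the messages file format, and the function name and example path are placeholders for illustration.

```python
import re


def lines_mentioning_job(messages_path, job_id):
    """Return lines of a messages file that mention job_id as a whole number.

    The word-boundary match keeps job 12345 from matching 123456, while
    still matching array-task notation such as '12345.7'.
    """
    pattern = re.compile(rf"\b{re.escape(str(job_id))}\b")
    with open(messages_path, errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if pattern.search(line)]


if __name__ == "__main__":
    # Placeholder path: substitute the execd spool "messages" file
    # for the node where the lost task ran.
    for line in lines_mentioning_job("messages", 12345):
        print(line)
```

The same function can be pointed at the qmaster and schedd messages files, or at anything Grid Engine dropped in /tmp on the host.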

Something you should also try:

  - Alter the value for "loglevel" in your grid engine configuration
-- you may want to temporarily set "loglevel=log_info"

This was discussed recently on the SGE users mailing list. The thread is
here:
http://gridengine.sunsource.net/servlets/BrowseList?list=users&by=thread&from=8137

The sge_conf man page has this to say about loglevel:

> loglevel
>        This parameter specifies the level of detail that Grid Engine
>        components such as sge_qmaster(8) or sge_execd(8) use to produce
>        informative, warning or error messages which are logged to the
>        messages files in the master and execution daemon spool
>        directories (see the description of the execd_spool_dir parameter
>        above). The following message levels are available:
>
>        log_err
>               All error events being recognized are logged.
>
>        log_warning
>               All error events being recognized and all detected signs
>               of potentially erroneous behavior are logged.
>
>        log_info
>               All error events being recognized, all detected signs of
>               potentially erroneous behavior and a variety of
>               informative messages are logged.


The final troubleshooting step is to look into the Grid Engine "KEEP_ACTIVE" execd parameter setting -- this will temporarily disable deletion of the active_jobs/ directories that Grid Engine uses to stage info while the job is active. Normally these directories are deleted when the job drains from the system. Quite a bit of useful environment, pid, trace and other information can be found in these directories. This is one you'll have to watch out for though -- disabling the cleanup function could consume disk space rapidly.
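
If KEEP_ACTIVE is left on for a while, it may be worth watching how much space the retained directories take up. A minimal sketch follows, assuming you pass in the path to the active_jobs directory yourself (its location depends on where the execd spools); the function name is made up for the example.

```python
import os


def dir_size_bytes(root):
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # A file may vanish while a job is draining; skip it.
                pass
    return total


if __name__ == "__main__":
    # Placeholder path: point this at the execd's active_jobs directory.
    print(dir_size_bytes("active_jobs"), "bytes")
```

Run periodically, this gives an early warning before the retained job directories fill the spool filesystem.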

Regards,
Chris




On Mar 2, 2006, at 5:12 PM, Shane Brubaker wrote:

> Hi, Shane from the JGI here.
>
> We are finding some strange behavior in which a few tasks of an
> array job never seem to complete.
>
> The tasks do not go into an Error state, and they are listed as
> finished with an exit status of 0, and they
> have a valid start and end time for the task.
>
> However, in the output log, the output clearly stops in between two
> print statements near the top of the script.
>
>
> Has anyone seen this?  Any ideas?


------------------------------



End of Bioclusters Digest, Vol 17, Issue 4 ******************************************
