[VoxBo] voxbo scheduler error : *** glibc detected *** double free or corruption (out): 0x0855a200 ***

Joonkoo Park joonkoo at umich.edu
Tue Feb 13 15:33:08 EST 2007


Hi Dan,

This is a new install.
I just noticed that VoxBo works fine (it has run for about 24 hours so far)
if I run it on only one server (which is also the central file server).
However, as soon as I add another machine (any of the 3 extra machines we
newly built), the scheduler goes down with the *** glibc detected *** error
message.

Also, telnetting to port 6004 works normally on the server I am trying to add.
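
For reference, the port check can be scripted so that every new machine is tested the same way. This is only a minimal sketch, assuming each host answers TEST on port 6004 with a reply containing ACK, as described in the quoted message below. The function name host_responds and the CR LF line ending (what a telnet client normally sends for Enter) are assumptions, not part of VoxBo.

```python
import socket


def host_responds(host, port=6004, timeout=5.0):
    """Return True if the host answers TEST with a reply containing ACK.

    Assumption: the server replies within one recv() call, as it does
    when tested by hand over telnet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Mimic typing TEST and pressing Enter in a telnet session.
            sock.sendall(b"TEST\r\n")
            reply = sock.recv(1024)
    except OSError:
        # Refused, unreachable, unresolvable, or timed out.
        return False
    return b"ACK" in reply
```

Calling host_responds("linux-tpolk02") returns False on any network failure instead of raising, so the result can be printed for each host in a loop.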

This is the output when I run voxbo again (without -d) after the scheduler dies:

[voxbo@linux-tpolk03 root]$ voxbo
[I] 2007_02_13_15:21:29 added host linux-tpolk02 (linux-tpolk02.psych.lsa.umich.edu)
[I] 2007_02_13_15:21:29 added host linux-tpolk03 (linux-tpolk03.psych.lsa.umich.edu)
==============================================
[I] 2007_02_13_15:21:29 VoxBo scheduler (1.8.4Oct 26 2006) started on host linux-tpolk03
[I] 2007_02_13_15:21:29 Queue located at /usr/local/VoxBo/queue
Now listening to port 6005
[I] 2007_02_13_15:21:29 emailing joonkoo at umich.edu
[I] 2007_02_13_15:21:29 sending job 00451-00005 to host linux-tpolk02
*** glibc detected *** double free or corruption (out): 0x093d9c00 ***
Aborted

Now, after I remove the added server (linux-tpolk02) and run the scheduler
on linux-tpolk03 only, this is the output I get:

[voxbo@linux-tpolk03 etc]$ voxbo
[I] 2007_02_13_15:28:16 added host linux-tpolk03 (linux-tpolk03.psych.lsa.umich.edu)
==============================================
[I] 2007_02_13_15:28:16 VoxBo scheduler (1.8.4Oct 26 2006) started on host linux-tpolk03
[I] 2007_02_13_15:28:16 Queue located at /usr/local/VoxBo/queue
Now listening to port 6005
[E] 2007_02_13_15:28:16 couldn't find host for running job 00000451-00005
[E] 2007_02_13_15:28:16 invalid host update from linux-tpolk02.psych.lsa.umich.edu
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20396
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20400
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20406
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20465
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20467
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20469
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20470
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 bad jobrunning message: 00000451-00000005, 20395 20471
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[E] 2007_02_13_15:28:16 couldn't update 00000451-00005 on any host, even linux-tpolk02
[I] 2007_02_13_15:28:16 emailing joonkoo at umich.edu
[E] 2007_02_13_15:28:16 invalid host update from linux-tpolk02.psych.lsa.umich.edu
[I] 2007_02_13_15:28:16 sending job 00452-00000 to host linux-tpolk03
[I] 2007_02_13_15:28:22 sending job 00452-00001 to host linux-tpolk03

In this case, I get no response from the scheduler when I type vq -l
or vq -c:

[voxbo@linux-tpolk03 etc]$ vq -l
[voxbo@linux-tpolk03 etc]$ vq -c
[E] voxq: couldn't retrieve list of servers

If I now erase all the files in the queue directory except vb.num and then
run voxbo -d (only on linux-tpolk03), it works again.
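
The cleanup step can be sketched as a small script. It assumes the queue directory is the one shown in the log above (/usr/local/VoxBo/queue) and that vb.num is the only entry worth keeping. The function name clear_queue and its dry_run switch are hypothetical helpers, not VoxBo tools, and the scheduler should be stopped before running it.

```python
from pathlib import Path
import shutil


def clear_queue(queue_dir, keep=("vb.num",), dry_run=True):
    """Remove every entry in queue_dir except the names listed in keep.

    Returns the paths that were removed, or that would be removed when
    dry_run is True (the default, so nothing is deleted by accident).
    """
    removed = []
    for entry in sorted(Path(queue_dir).iterdir()):
        if entry.name in keep:
            continue
        removed.append(entry)
        if dry_run:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # job directories and their contents
        else:
            entry.unlink()         # plain files and symlinks
    return removed
```

Running clear_queue("/usr/local/VoxBo/queue") first lists what would go; passing dry_run=False performs the deletion.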

It's frustrating...
All the servers run the same version of Red Hat Linux.

Thanks

Joon



On 2/13/07, Daniel Y Kimberg <kimberg at mail.med.upenn.edu> wrote:
>
> Joonkoo Park wrote:
>
> > We are newly setting up the voxbo environment with new linux systems.
> > Voxbo scheduler worked fine until it returned this error.
> >
> > more [voxdir]/etc/logs/voxbo.logs show this at the very end:
> > *** glibc detected *** double free or corruption (out): 0x0855a200 ***
>
> Did you do a completely new VoxBo install, or did you just add a few
> new machines to the cluster and then restart the scheduler?  Can you
> double-check that the newly-added machines are responding, by
> telnetting to port 6004 and typing TEST followed by a carriage return.
> You should get an ACK in return.  If not, something is not working
> network-wise.
>
> > After this point, voxbo scheduler will shut down automatically, and won't
> > start up with command "voxbo -d".
>
> It would help if you could run the scheduler without the -d flag and
> let me know if it produces any output before crashing.
>
> > Before this error and the shut-down,
> > I tried to modify [voxdir]/etc/defaults file
> > to include the path to idl:
> >
> > I modified
> > setenv PATH=/bin:/usr/bin:/usr/local/bin:/net/VoxBo/bin
> > to
> > setenv PATH=/bin:/usr/bin:/usr/local/bin:/net/VoxBo/bin:/usr/local/rsi/idl_6.3/idl/bin
> >
> > But I doubt that this modification has caused the *** glibc detected ***
> > problem.
>
> I doubt it too.  I wonder if there's some issue with the host
> configuration with the new machines.  Can you double-check the server
> files in VoxBo/etc/servers to make sure they don't have conflicting
> names?  There's both a "name" field and a "hostname" field that need
> to be unique (I'm going to try to make this more intuitive in an
> upcoming release).  You might also want to try clearing out the queue
> directory entirely, except for the vb.num file, which is just a text
> file with the number of the last sequence run.
>
> dan
>
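
The uniqueness check Dan describes for the "name" and "hostname" fields can be sketched as follows. The layout of the files in VoxBo/etc/servers is not shown in this thread, so parsing is deliberately left out: the sketch assumes the two fields have already been read into (file, name, hostname) tuples. The function name find_conflicts is hypothetical.

```python
from collections import defaultdict


def find_conflicts(servers):
    """Report name/hostname values shared by more than one server file.

    servers: iterable of (filename, name, hostname) tuples, already
    extracted from the server files by whatever means fits their format.
    Returns {(field, value): [filenames]} for duplicated values only.
    """
    seen = defaultdict(list)
    for filename, name, hostname in servers:
        seen[("name", name)].append(filename)
        seen[("hostname", hostname)].append(filename)
    return {key: files for key, files in seen.items() if len(files) > 1}
```

An empty result means every name and hostname is unique; any entry in the result lists the server files that collide on that value.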