The story began with the discovery that the registrar (on the
registry_ha branch) hangs at startup. Here is the backtrace:
(gdb) bt
#0  0x00e85402 in ?? ()
#1  0x00ce5449 in semop () from /lib/libc.so.6
#2  0x00add601 in wait_semaphore (sem=@0x9824764, msec=4294967295,
    sops=0xb09140, n_sops=1)
    at ../../src/fastdb/sync.cpp:190
#3  0x00add645 in dbSemaphore::wait (this=0x9824764, msec=4294967295)
    at ../../src/fastdb/sync.cpp:203
#4  0x00ac17b1 in dbDatabase::beginTransaction (this=0x98246a8,
    lockType=dbExclusiveLock)
    at ../../src/fastdb/database.cpp:4370
#5  0x00ac7f7d in dbDatabase::select (this=0x98246a8, cursor=0xb762d728,
    query=@0xb762d78c)
    at ../../src/fastdb/database.cpp:3230
#6  0x00a8832c in dbAnyCursor::select (this=0xb762d728, query=@0xb762d78c,
    aType=dbCursorForUpdate, paramStruct=0x0)
    at ../../include/fastdb/cursor.h:128
#7  0x00a883ae in dbAnyCursor::select (this=0xb762d728, query=@0xb762d78c,
    paramStruct=0x0)
    at ../../include/fastdb/cursor.h:143
#8  0x00aa5bb9 in SIPDBManager::getDatabase (this=0x9824fa0,
    tablename=@0xb762dbcc)
    at ../../src/sipdb/SIPDBManager.cpp:569
#9  0x00a968ba in HuntgroupDB (this=0x9842ec0, name=@0xb762dbcc)
    at ../../src/sipdb/HuntgroupDB.cpp:51
#10 0x00a969f3 in HuntgroupDB::getInstance (name=@0xb762dbcc)
    at ../../src/sipdb/HuntgroupDB.cpp:451
#11 0x0806219b in SipRedirectorHunt::initialize (this=0x9846858,
    configParameters=@0xb762e134, configDb=@0xbf85fb6c,
    pSipUserAgent=0x9833598, redirectorNo=4)
    at ../../src/SipRedirectorHunt.cpp:58
#12 0x0805b9d8 in SipRedirectServer::initialize (this=0x983da60,
    configDb=@0xbf85fb6c)
    at ../../src/SipRedirectServer.cpp:158
#13 0x0805bfa1 in SipRedirectServer (this=0x983da60,
    pOsConfigDb=0xbf85fb6c, pSipUserAgent=0x9833598)
    at ../../src/SipRedirectServer.cpp:65
#14 0x08054100 in SipRegistrar::startRedirectServer (this=0x98248d8)
    at ../../src/SipRegistrar.cpp:465
#15 0x0805446b in SipRegistrar::operationalPhase (this=0x98248d8)
    at ../../src/SipRegistrar.cpp:220
#16 0x080544b0 in SipRegistrar::run (this=0x98248d8, pArg=0x0)
    at ../../src/SipRegistrar.cpp:110
#17 0x00225843 in OsTaskLinux::taskEntry (arg=0x98248d8)
    at ../../src/os/linux/OsTaskLinux.cpp:678
#18 0x00324b80 in start_thread () from /lib/libpthread.so.0
#19 0x00ce3dee in clone () from /lib/libc.so.6
SipRedirectServer::initialize is trying to read the HuntgroupDB and
getting stuck. After a lengthy debugging session, I found the source of
the problem. In SipRegistrarServer::SipRegistrarServer, we added this
HA code to restore the local DbUpdateNumber:
    RegistrationDB* imdb = mRegistrar.getRegistrationDB();
    mDbUpdateNumber =
        imdb->getMaxUpdateNumberForRegistrar(mRegistrar.primaryName());
This code accesses the DB and appears to leave the DB state variable
monitor->nReaders set to 1, so the next thread to come through (the one
shown above) concludes that someone else is still reading the DB and
waits forever for a semaphore that will never be signalled.
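To make the suspected failure mode concrete, here is a toy model of a reader count that is never decremented. It is only a sketch under assumptions: ToyDbMonitor, leakyRead and balancedRead are made-up names, not fastdb code, and the real lock blocks on a semaphore (the backtrace shows msec=4294967295, the 32-bit unsigned form of -1) where this model simply returns false.

```cpp
#include <cassert>

// Toy stand-in for the lock state described above. Hypothetical code,
// NOT fastdb's monitor; it only mimics the nReaders bookkeeping.
class ToyDbMonitor {
public:
    // Take a shared (read) lock by counting ourselves as a reader.
    void beginRead() {
        assert(!writerActive);
        ++nReaders;
    }

    // Release a shared lock.
    void endRead() {
        assert(nReaders > 0);
        --nReaders;
    }

    // Returns true if an exclusive lock can be granted right now.
    // A blocking implementation would sleep on a semaphore here
    // instead of returning false.
    bool tryBeginExclusive() {
        if (nReaders > 0 || writerActive) {
            return false;
        }
        writerActive = true;
        return true;
    }

    // Release the exclusive lock.
    void endExclusive() { writerActive = false; }

    int readers() const { return nReaders; }

private:
    int nReaders = 0;
    bool writerActive = false;
};

// A reader that never releases its lock, like the suspected startup code.
void leakyRead(ToyDbMonitor& monitor) {
    monitor.beginRead();
    // ... query would run here; the matching endRead() is missing ...
}

// A well-behaved reader for comparison.
void balancedRead(ToyDbMonitor& monitor) {
    monitor.beginRead();
    // ... query would run here ...
    monitor.endRead();
}
```

After balancedRead, tryBeginExclusive succeeds. After leakyRead, readers() stays at 1 and every later exclusive request is refused, which in a blocking lock is an indefinite wait like the one in frame #4.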
I fixed the immediate problem by moving that code into a new method,
SipRegistrarServer::restoreLocalUpdateNumber, which is called from the
RegistrarInitialSync thread. The code may still be broken, but it is
now called only when HA replication is configured, so mainline registrar
functionality works again. Previously this code was being
called in the SipRegistrarServer constructor regardless of the HA
configuration. (Since configurePeers hasn't been called yet at that
point, the mReplicationConfigured variable of SipRegistrar has not been
set yet, so the constructor can't tell whether HA is configured or
not.) Maybe there is a better way to do this -- it's a bit ugly to have
yet another init function to be called externally on
SipRegistrarServer. But it's nice to have this called in
RegistrarInitialSync because it is logically part of startup processing.
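The shape of that change can be sketched as follows. This is a simplified illustration, not the actual patch: the classes are stubs that reuse the names from this message, runInitialSync stands in for the body of the RegistrarInitialSync thread, and the isReplicationConfigured accessor, the bool argument to configurePeers, the host name and the value 42 are all placeholders I made up.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stub for RegistrationDB that counts how often the DB is touched.
struct RegistrationDB {
    int accessCount = 0;
    std::int64_t getMaxUpdateNumberForRegistrar(const std::string& /*name*/) {
        ++accessCount;
        return 42;  // placeholder, not real data
    }
};

// Stub for SipRegistrar with only the members this sketch needs.
class SipRegistrar {
public:
    RegistrationDB* getRegistrationDB() { return &mDb; }
    const std::string& primaryName() const { return mPrimaryName; }
    // Hypothetical signature: the real configurePeers reads the config.
    void configurePeers(bool haEnabled) { mReplicationConfigured = haEnabled; }
    bool isReplicationConfigured() const { return mReplicationConfigured; }

private:
    RegistrationDB mDb;
    std::string mPrimaryName = "registrar.example.com";  // hypothetical
    bool mReplicationConfigured = false;
};

class SipRegistrarServer {
public:
    // The constructor no longer touches the DB.
    explicit SipRegistrarServer(SipRegistrar& registrar)
        : mRegistrar(registrar), mDbUpdateNumber(0) {}

    // The DB access that used to live in the constructor.
    void restoreLocalUpdateNumber() {
        RegistrationDB* imdb = mRegistrar.getRegistrationDB();
        assert(imdb != nullptr);
        mDbUpdateNumber =
            imdb->getMaxUpdateNumberForRegistrar(mRegistrar.primaryName());
    }

    std::int64_t dbUpdateNumber() const { return mDbUpdateNumber; }

private:
    SipRegistrar& mRegistrar;
    std::int64_t mDbUpdateNumber;
};

// Stand-in for the RegistrarInitialSync thread body: by the time it runs,
// configurePeers has been called, so the HA setting is known.
void runInitialSync(SipRegistrar& registrar, SipRegistrarServer& server) {
    if (registrar.isReplicationConfigured()) {
        server.restoreLocalUpdateNumber();
    }
}
```

With HA off, constructing the server and running the initial sync never reach the DB, which is why the mainline path is no longer affected by whatever is wrong with the lookup.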
The big question now is: why does this code screw up the DB state?
-Walter