
HA: fixed the registrar hanging at startup, but only for the non-HA case



Figured it out. The answer is that all the RegistrationDB methods have to call attach/detach to avoid messing up the DB state, as RegistrationDB::getUnexpiredContacts does, for example. The function that was causing the problem in this case is shown below with the attach/detach calls in place. I'll create a simple smart pointer class to make this more robust and will fix the other new RegistrationDB methods as well.

-Walter

intll
RegistrationDB::getMaxUpdateNumberForRegistrar(const char* primaryRegistrar) const
{
   intll maxUpdateForPrimary = 0LL;

   if (m_pFastDB != NULL)
   {
      // ensure process/tls integrity
      m_pFastDB->attach();

      dbCursor<RegistrationRow> cursor;
      dbQuery query;
      query = "primary = ", primaryRegistrar, "order by update_number desc";
  
      int numRows = cursor.select(query);
      if (numRows > 0)
      {
         maxUpdateForPrimary = cursor->update_number;
      }

      // ensure process/tls integrity
      m_pFastDB->detach(0);

   }
   else
   {
      OsSysLog::add(FAC_DB, PRI_ERR, "RegistrationDB::getMaxUpdateNumberForRegistrar failed - no DB");
   }
  
   return maxUpdateForPrimary;
}
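
The "simple smart pointer class" mentioned above is not shown in this message. As a rough illustration of the idea only, here is a minimal sketch of a scope guard that calls attach() on construction and detach(0) on destruction, so the two calls stay balanced on early returns and exceptions. The names DbAttachGuard, FakeDatabase and lookup are hypothetical, and FakeDatabase is a stand-in that only counts calls; it is not the FastDB API.

```cpp
#include <stdexcept>

// Stand-in for the database handle. Only the attach()/detach(int) calls seen
// in the function above are modelled; the counters exist for illustration.
struct FakeDatabase
{
   int attachCalls = 0;
   int detachCalls = 0;
   void attach() { ++attachCalls; }
   void detach(int /*flags*/) { ++detachCalls; }
};

// Hypothetical guard: attaches in the constructor, detaches in the destructor.
template <typename Db>
class DbAttachGuard
{
public:
   explicit DbAttachGuard(Db* db) : m_db(db)
   {
      if (m_db != nullptr) m_db->attach();
   }
   ~DbAttachGuard()
   {
      if (m_db != nullptr) m_db->detach(0);
   }
   // Non-copyable, so each attach is matched by exactly one detach.
   DbAttachGuard(const DbAttachGuard&) = delete;
   DbAttachGuard& operator=(const DbAttachGuard&) = delete;

   // True when there is a database to work with.
   explicit operator bool() const { return m_db != nullptr; }

private:
   Db* m_db;
};

// Example method body with an early return and an exception path.
// The value 42 is a placeholder for a real query result.
long long lookup(FakeDatabase* db, bool failMidway)
{
   DbAttachGuard<FakeDatabase> guard(db);
   if (!guard) return 0LL;                               // no DB: nothing attached
   if (failMidway) throw std::runtime_error("query failed");
   return 42LL;                                          // guard detaches on the way out
}
```

With a guard like this, a RegistrationDB method would declare the guard once at the top of its body instead of pairing the attach and detach calls by hand.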


Walter Gillett wrote:
The story began with the discovery that the registrar (on the registry_ha branch) hangs at startup. Here is the backtrace:
(gdb) bt
#0  0x00e85402 in ?? ()
#1  0x00ce5449 in semop () from /lib/libc.so.6
#2  0x00add601 in wait_semaphore (sem=@0x9824764, msec=4294967295, sops=0xb09140, n_sops=1)
    at ../../src/fastdb/sync.cpp:190
#3  0x00add645 in dbSemaphore::wait (this=0x9824764, msec=4294967295)
    at ../../src/fastdb/sync.cpp:203
#4  0x00ac17b1 in dbDatabase::beginTransaction (this=0x98246a8, lockType=dbExclusiveLock)
    at ../../src/fastdb/database.cpp:4370
#5  0x00ac7f7d in dbDatabase::select (this=0x98246a8, cursor=0xb762d728, query=@0xb762d78c)
    at ../../src/fastdb/database.cpp:3230
#6  0x00a8832c in dbAnyCursor::select (this=0xb762d728, query=@0xb762d78c,
    aType=dbCursorForUpdate, paramStruct=0x0) at ../../include/fastdb/cursor.h:128
#7  0x00a883ae in dbAnyCursor::select (this=0xb762d728, query=@0xb762d78c, paramStruct=0x0)
    at ../../include/fastdb/cursor.h:143
#8  0x00aa5bb9 in SIPDBManager::getDatabase (this=0x9824fa0, tablename=@0xb762dbcc)
    at ../../src/sipdb/SIPDBManager.cpp:569
#9  0x00a968ba in HuntgroupDB (this=0x9842ec0, name=@0xb762dbcc)
    at ../../src/sipdb/HuntgroupDB.cpp:51
#10 0x00a969f3 in HuntgroupDB::getInstance (name=@0xb762dbcc)
    at ../../src/sipdb/HuntgroupDB.cpp:451
#11 0x0806219b in SipRedirectorHunt::initialize (this=0x9846858, configParameters=@0xb762e134,
    configDb=@0xbf85fb6c, pSipUserAgent=0x9833598, redirectorNo=4)
    at ../../src/SipRedirectorHunt.cpp:58
#12 0x0805b9d8 in SipRedirectServer::initialize (this=0x983da60, configDb=@0xbf85fb6c)
    at ../../src/SipRedirectServer.cpp:158
#13 0x0805bfa1 in SipRedirectServer (this=0x983da60, pOsConfigDb=0xbf85fb6c,
    pSipUserAgent=0x9833598) at ../../src/SipRedirectServer.cpp:65
#14 0x08054100 in SipRegistrar::startRedirectServer (this=0x98248d8)
    at ../../src/SipRegistrar.cpp:465
#15 0x0805446b in SipRegistrar::operationalPhase (this=0x98248d8)
    at ../../src/SipRegistrar.cpp:220
#16 0x080544b0 in SipRegistrar::run (this=0x98248d8, pArg=0x0)
    at ../../src/SipRegistrar.cpp:110
#17 0x00225843 in OsTaskLinux::taskEntry (arg=0x98248d8)
    at ../../src/os/linux/OsTaskLinux.cpp:678
#18 0x00324b80 in start_thread () from /lib/libpthread.so.0
#19 0x00ce3dee in clone () from /lib/libc.so.6
SipRedirectServer::initialize is trying to read the HuntgroupDB and getting stuck. After a lengthy debugging session, I found the source of the problem. In SipRegistrarServer::SipRegistrarServer, we added this HA code to restore the local DbUpdateNumber:
RegistrationDB* imdb = mRegistrar.getRegistrationDB();
mDbUpdateNumber = imdb->getMaxUpdateNumberForRegistrar(mRegistrar.primaryName());
This code accesses the DB and looks like it is leaving the DB state variable monitor->nReaders set to 1, so that the next guy to come through (the thread shown above) ends up thinking someone else is reading the DB, and waits forever for a semaphore that will never be signalled.
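
To make the suspected failure mode concrete, here is a toy model of a reader count gating an exclusive lock. It is a sketch of the hypothesis only, not FastDB's implementation: ToyMonitor and every function name below are made up for illustration, and where the real code would block on a semaphore, the model just returns false.

```cpp
// Toy model: an exclusive transaction may start only when no reader is
// registered. All names here are hypothetical.
struct ToyMonitor
{
   int  nReaders     = 0;
   bool writerActive = false;
};

void beginRead(ToyMonitor& m) { ++m.nReaders; }
void endRead(ToyMonitor& m)   { --m.nReaders; }

// Returns true if an exclusive transaction can start right now.
// A blocking implementation would wait here instead of returning false.
bool tryBeginExclusive(ToyMonitor& m)
{
   if (m.nReaders > 0 || m.writerActive) return false;
   m.writerActive = true;
   return true;
}

void endExclusive(ToyMonitor& m) { m.writerActive = false; }

// A read that releases its reader slot when it is done.
void balancedRead(ToyMonitor& m)
{
   beginRead(m);
   // ... run the query ...
   endRead(m);
}

// A read that never releases its slot: nReaders stays at 1 afterwards.
void leakyRead(ToyMonitor& m)
{
   beginRead(m);
   // ... run the query, then return without calling endRead ...
}
```

After balancedRead, tryBeginExclusive succeeds. After leakyRead, nReaders is stuck at 1 and tryBeginExclusive fails every time, which in a blocking implementation corresponds to waiting forever as in the backtrace above.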

I fixed the immediate problem by moving that code into a new method, SipRegistrarServer::restoreLocalUpdateNumber, which is called from the RegistrarInitialSync thread. The code may still be broken, but now it is only called when HA replication is configured, so mainline registrar functionality is now working again. Previously this code was being called in the SipRegistrarServer constructor regardless of the HA configuration. (Since configurePeers hasn't been called yet at that point, the mReplicationConfigured variable of SipRegistrar has not been set yet, so the constructor can't tell whether HA is configured or not.) Maybe there is a better way to do this -- it's a bit ugly to have yet another init function to be called externally on SipRegistrarServer. But it's nice to have this called in RegistrarInitialSync because it is logically part of startup processing.

The big question now is why does this code screw up the DB state?

-Walter






