SAP Knowledge Base Article - Preview

3283487 - HANA System Replication status stuck in a 'syncing' and 'Preparing persistence' state on primary site

Symptom

    • The System Replication overview tile in HANA Studio and/or HANA Cockpit shows a hanging replication status for the indexserver, stuck in "syncing" with the detail "Preparing persistence".

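      The same hanging state can usually be cross-checked on SQL level on the primary site via the monitoring view M_SERVICE_REPLICATION, for example (a minimal sketch; the exact REPLICATION_STATUS_DETAILS text may vary by revision):

      -- Per-service replication status; a stuck indexserver typically shows
      -- REPLICATION_STATUS = 'SYNCING' with REPLICATION_STATUS_DETAILS
      -- mentioning 'Preparing persistence'.
      SELECT HOST, PORT, SITE_NAME, SECONDARY_SITE_NAME,
             REPLICATION_MODE, REPLICATION_STATUS, REPLICATION_STATUS_DETAILS
        FROM M_SERVICE_REPLICATION;
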
    • The "Related Alerts" tab reveals disk full events, closed replication channels and a growing log replay backlog :

               

               




      

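      The same alerts can be listed on SQL level from the statistics service, for example (a sketch assuming the standard _SYS_STATISTICS.STATISTICS_CURRENT_ALERTS view):

      -- Currently open alerts (disk full, closed replication channels,
      -- log replay backlog), most severe first.
      SELECT ALERT_ID, ALERT_TIMESTAMP, ALERT_RATING, ALERT_DETAILS
        FROM _SYS_STATISTICS.STATISTICS_CURRENT_ALERTS
       ORDER BY ALERT_RATING DESC, ALERT_TIMESTAMP DESC;
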
    • The log volume on both primary and secondary sites continues to grow:

      Primary site: the log segments are mostly in "RetainedFree" state

      mnt / hdb00002.00003 : 77 RetainedFree
      mnt / hdb00003.00003 : 409 RetainedFree

      Secondary site: the log segments are in "BackedUp" state

      mnt / hdb00002.00003 : 76 BackedUp
      mnt / hdb00003.00003 : 663 BackedUp
      
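      The segment counts above can be reproduced with a query on M_LOG_SEGMENTS, run on each site, for example (a sketch; TOTAL_SIZE is assumed to be the segment size in bytes):

      -- Log segments per service and state; growing 'RetainedFree' (primary)
      -- or 'BackedUp' (secondary) counts indicate log that cannot be released.
      SELECT HOST, PORT, STATE, COUNT(*) AS SEGMENTS,
             ROUND(SUM(TOTAL_SIZE)/1024/1024/1024, 2) AS SIZE_GB
        FROM M_LOG_SEGMENTS
       GROUP BY HOST, PORT, STATE
       ORDER BY HOST, PORT, SEGMENTS DESC;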

    • The primary site shows continuously broken replication channels to the secondary site in the indexserver trace (a query on the corresponding reconnect counters follows the trace excerpt below):
    indexserver_host.port.xxx.trc

    [47504]{-1}[-1/-1] 2022-12-10 13:08:49.665830 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [74097]{-1}[-1/-1] 2022-12-10 13:14:54.127726 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [6222]{-1}[-1/-1] 2022-12-10 13:29:15.048868 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [101353]{-1}[-1/-1] 2022-12-10 13:35:26.048575 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [44038]{-1}[-1/-1] 2022-12-10 14:10:16.016944 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [44923]{-1}[-1/-1] 2022-12-10 14:16:37.330570 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [97531]{-1}[-1/-1] 2022-12-10 14:20:44.467939 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [75933]{-1}[-1/-1] 2022-12-10 14:26:56.391342 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [126072]{-1}[-1/-1] 2022-12-10 14:41:18.204055 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [61492]{-1}[-1/-1] 2022-12-10 14:47:11.767504 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [26573]{-1}[-1/-1] 2022-12-10 14:51:14.127506 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [1980]{-1}[-1/-1] 2022-12-10 14:57:21.368718 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [84282]{-1}[-1/-1] 2022-12-10 15:21:43.920825 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'
    [107070]{-1}[-1/-1] 2022-12-10 15:28:00.917153 i EventHandler     EventManagerImpl.cpp(00951) : Event 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred' set to state 'handled'
    [15992]{-1}[-1/-1] 2022-12-10 15:32:14.891279 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'


    [15992]{-1}[-1/-1] 2022-12-10 15:32:08.651058 e sr_dataaccess    DisasterRecoveryPrimaryImpl.cpp(00595) : Closing connection to siteID 1. LogShipping was waiting for 42 seconds (logshipping_timeout = 30)! TimeoutInfo[count=2/2, 
    [21917]{-1}[-1/-1] 2022-12-10 15:32:14.033374 e SQLSession       sm_handler.cc(01305) : (sockfd:1238, part:6, channel:0x00007e55ed5b8d80, event:0x00007dd8eacc9680) was closed because consecutive requests detected (current state:1)


    [21917]{-1}[-1/-1] 2022-12-10 15:32:14.033712 e SQLSession       sm_handler.cc(03215) : ===== END OF SESSION DUMP ========================
    [15992]{-1}[-1/-1] 2022-12-10 15:32:14.891167 i sr_dataaccess    DisasterRecoveryPrimaryImpl.cpp(01011) : Replication session closed (site=1, remoteHost=xx.xx.xx.xxx, invalidateSession=0, eventReported=0) <<, ptr=0x00007f0bfb2e5c00
    [15992]{-1}[-1/-1] 2022-12-10 15:32:14.891279 i EventHandler     EventManagerImpl.cpp(00675) : New event reported: 'SystemReplicationEvent: site=1, Site 1: Log shipping timeout occurred'


    [47504]{-1}[-1/-1] 2022-12-10 13:08:49.665726 i sr_dataaccess    DisasterRecoveryPrimaryImpl.cpp(01011) : Replication session closed (site=1, remoteHost=, invalidateSession=0, eventReported=0) <<, ptr=0x00007f0bfb2e5c00
    [44593]{-1}[-1/-1] 2022-12-10 13:08:49.671662 i sr_dataaccess    DisasterRecoveryPrimaryImpl.cpp(01011) : Replication session closed (site=1, remoteHost=, invalidateSession=0, eventReported=1) <<, ptr=0x00007f0bfb2e5c00
    [74120]{-1}[-1/-1] 2022-12-10 13:19:09.767958 i sr_dataaccess    DisasterRecoveryPrimaryImpl.cpp(01011) : Replication session closed (site=1, remoteHost=, invalidateSession=0, eventReported=0) <<, ptr=0x00007f0bfb2e5c00

    ....

     
     
     
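      The frequency of these channel closures can be quantified on the primary site, for example (a sketch assuming the standard counter columns of M_SERVICE_REPLICATION):

      -- A steadily increasing SECONDARY_RECONNECT_COUNT matches the repeated
      -- 'Log shipping timeout occurred' events in the trace.
      SELECT HOST, PORT, SECONDARY_SITE_NAME, SECONDARY_CONNECT_TIME,
             SECONDARY_RECONNECT_COUNT, SECONDARY_FAILOVER_COUNT
        FROM M_SERVICE_REPLICATION;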

    • logshipping_max_retention_size appears to have already been exceeded, so the primary site is freeing retained log segments. This leads to missing log segments (log holes) on the secondary site, making it impossible to get back in sync with the primary without a full data replication (full replica). The retained log size can be compared against the configured limit with the queries shown after the trace excerpt below.

    [65073]{-1}[-1/-1] 2022-12-10 13:10:56.043070 i Logger           LogSegment.cpp(04566) : Try to free RETAINED_FREE segment as logshipping_max_retention_size reached
    [65073]{-1}[-1/-1] 2022-12-10 13:10:56.043123 i Logger           LogSegment.cpp(01180) : Free segment kept for replication, LogSegment[0/239:0xd26ccd3f00-0xd26dcd3c40(0x3fff5000B)/GUID=bc9d7b9f-0078-20221119-184430-28601c2481/PrevGUID=bc9d7b9f-0078-20221119-184430-28601c2430/PersGUID=7a598376-0078-20221116-171532-2ed0000011,TS=2022-12-07 20:38:17.123060,Hole=0xd26ccd3f00,IsEncrypted=1/Free/0x0]@0x00007f1af736a900

      

     

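      As a rough cross-check (a sketch; logshipping_max_retention_size is assumed to be maintained in MB in the [system_replication] section of global.ini), the configured limit can be compared with the size of the segments retained on the primary:

      -- Configured retention limit (VALUE assumed in MB):
      SELECT FILE_NAME, LAYER_NAME, SECTION, KEY, VALUE
        FROM M_INIFILE_CONTENTS
       WHERE SECTION = 'system_replication'
         AND KEY = 'logshipping_max_retention_size';

      -- Log currently retained for the secondary on the primary site:
      SELECT HOST, PORT, ROUND(SUM(TOTAL_SIZE)/1024/1024, 0) AS RETAINED_MB
        FROM M_LOG_SEGMENTS
       WHERE STATE = 'RetainedFree'
       GROUP BY HOST, PORT;
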
    • The indexserver on the secondary site has been constantly crashing and restarting, resulting in continuous replication channel closures towards the primary site (a row-count check follows the trace excerpt below).

    [93315]{-1}[-1/-1] 2022-12-10 18:16:19.487360 i Basis            TraceStream.cpp(00718) : ==== Starting hdbindexserver, version 2.00.055.00.1615413201 (fa/hana2sp05), build linuxx86_64 05fd01c3749db22de9f4eea75e45ed541c4b9a22 2021-03-10 22:59:24 ld4554 gcc (SAP release 20200227, based on SUSE gcc9-9.2.1+r275327-1.3.7) 9.2.1 20190903 [gcc-9-branch revision 275330]



    [120582]{-1}[-1/-1] 2022-12-10 18:22:19.039328 i PersistenceManag PersistenceManagerImpl.cpp(06722) : triggerSavepointDuringLogReplay vid=3 - start LogReplaySavepointJob with spv=152460, snapshotForSecondary=true
    [120515]{-1}[-1/-1] 2022-12-10 18:22:19.061961 i CSARedo          RedoMerge.cpp(02754) : MergeInfoPersistenceManager instance for log replay created: instance @ 0x00007f1ad011f000, associated persistence manager @ 0x00007f435562a018
    [120515]{-1}[-1/-1] 2022-12-10 18:22:19.061990 i CSARedo          RedoMerge.cpp(01309) : Merge and optimize compression configuration during log replay: active = 1, merge type = D2M_CONCURRENT_MERGE, optimize compression type = CONCURRENT_OPTIMIZE_COMPRESSION, cancel jobs before going online = 1, delta row limit for enforced merges = 500000000, delta row limit for enforced synchronous merges = 1610612736, timeout for index handle lock = 120000, ignored tables = 
    [120515]{-1}[-1/-1] 2022-12-10 18:22:19.061998 i CSARedo          Configuration.cpp(00166) : redo replay settings: UseDmlExecutorDataRedo = 1, AllowCachingOfRedoDmlContext = 1, UseDmlExecutorBatchRedo = 1, UseDmlExecutorBatchSortRedo = 1, UseDmlExecutorBatchUpdateRedo = 0, AllowMergeAndOcDuringRecoveryToFinishAnytime = 1, MaxRowsToLog = 0, UsePreferredNumaNodeForPreload = 1, MaxSizeToLog = 209715200, ForceDeltaLoad = 0, DisabledTableConsistencySubChecks = 12, ExpectWillCrash = 0, WarmupperLoadDeltaFragments = 1, TreatFirstNonDdlRedoOfContainerAsNoOpenCch = 1, MaxLoadUnitConversionThreads = 8, OptimizeReplicaForOnlineDdlInLogReplay = 0, UseOperationModeLogReplayReadAccess = 0, FirstDataItemToTrace = 0, NumDataItemsToTrace = 10, ParallelWriteThreshold = 2000

    ......

    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213024 e CSARedo          RedoHandler.cpp(01395) : Exception during REDO of ; TableName='SAPABAP1:P2RX_CRT' (tableOID=45210), ContainerID=0xfe000fd40f
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213090 e CSARedo          RedoMerge.cpp(04319) : Emergency info, cId = 0xfe000fd40f
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213092 e CSARedo          RedoMerge.cpp(04320) : Emergency info, tId = 3860329916
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213094 e CSARedo          RedoMerge.cpp(03185) : Emergency info, pos = 900259584831, prev redo did not finish cch = 0, no open cch = 0, allow concurrnt execution = 1, synchronized with cleanup lock = 0
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213096 e CSARedo          RedoMerge.cpp(03194) : Emergency info, queue = 42
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213097 e CSARedo          RedoMerge.cpp(03199) : Emergency info, type = 33, redo data version = 1, cId = 0xfe000fd40f
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213098 e CSARedo          RedoMerge.cpp(03208) : Emergency info, table = SID::SAPABAP1:P2RX_CRT (t 45210)
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213110 e CSARedo          RedoMerge.cpp(04329) : Emergency info, table name from container = SID::SAPABAP1:P2RX_CRT (t 45210)
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.213129 e Logger           RecoveryHandlerImpl.cpp(02809) : ERROR REPLAYING LOG, last potential recoverable position: 0xd19ba6fb72, message: exception  1: no.3100025  (CSAccessor/Redo/OverflowCheck.cpp:148) TID: 120536
      Allowed rowcount exceeded: was rowPosCount [2147483624] by current rowCount [129] in [SID::SAPABAP1:P2RX_CRT (t 45210)] with fragmentSet
    Fragment ID: 1889
    Fragment Type: delta
    Fragment Row Count: 12989544
    Fragment Base RowID: 10202338188
    Fragment Max RowID: 10215327731
    Fragment Column Count: 16
    Fragment ID: 1891
    Fragment Type: main
    Fragment Row Count: 2134494081
    Fragment Base RowID: 1
    Frag
    exception throw location:
     1: 0x00007f442dd26b71 in CSAccessor::Exception::RowPosOverflowException::RowPosOverflowException(char const*, int, int, unsigned long, ltt::basic_string<char, ltt::char_traits<char>, ltt::integral_constant<int, 64> > const&, TrexBase::IndexName const&)+0x40 at Exception.cpp:276 (libhdbcsaccessor.so)
     2: 0x00007f442dcd741c in CSAccessor::Redo::OverflowCheck::throwRowPosOverflowException(char const*, int, char const*, int, unsigned long)+0x4a8 at OverflowCheck.cpp:221 (libhdbcsaccessor.so)
     3: 0x00007f442dcd7a49 in CSAccessor::Redo::OverflowCheck::incSumOfRowsForInsert(unsigned long)+0x85 at OverflowCheck.cpp:148 (libhdbcsaccessor.so)
     4: 0x00007f442dcd7a62 in CSAccessor::Redo::OverflowCheck::incSumOfRowsForUpdateUpsert(unsigned long)+0x10 at OverflowCheck.cpp:200 (libhdbcsaccessor.so)
     5: 0x00007f442dccb5dd in CSAccessor::Redo::RedoHandler::redoDmlInternal(DataAccess::PersistenceSession&, ltt::refcounted_handle<UnifiedTable::TableContainer>&, void const*, unsigned long, CSAccessor::Dml::RedoDmlContext&, bool) const+0x5e9 at RedoHandler.cpp:1238 (libhdbcsaccessor.so)
     6: 0x00007f442dcce8c3 in CSAccessor::Redo::RedoHandler::redoDmlBatchInternal(CSAccessor::Redo::RedoDmlBatchAccess const&, DataAccess::PersistenceSession&, DataContainer::ContainerID const&, ltt::vector<UnifiedTable::RedoData const*> const&, ltt::vector<unsigned long> const&, unsigned char, ltt::allocator&) const+0xbd0 at RedoHandler.cpp:778 (libhdbcsaccessor.so)
     7: 0x00007f442dccf9c1 in CSAccessor::Redo::RedoHandler::redoDmlBatch(ltt::vector_iterator<DataAccess::RedoLogHandler::RedoBatchEntry const>, ltt::vector_iterator<DataAccess::RedoLogHandler::RedoBatchEntry const>, CSAccessor::Redo::RedoDmlBatchAccess const&, DataAccess::RedoLogHandler::RedoBatchStatisticsCallback&) const+0x460 at RedoHandler.cpp:716 (libhdbcsaccessor.so)
     8: 0x00007f442dcd6230 in CSAccessor::Redo::RedoHandler::redoBatch(ltt::vector<DataAccess::RedoLogHandler::RedoBatchEntry>&, DataAccess::RedoLogHandler::RedoBatchStatisticsCallback&) const+0xad0 at RedoHandler.cpp:594 (libhdbcsaccessor.so)
     9: 0x00007f4417e67677 in DataRecovery::RecoveryQueue::run(void*&)+0x2383 at RecoveryQueue.cpp:562 (libhdbdataaccess.so)
    10: 0x00007f4414a8df85 in Execution::Thread::staticMainImp(Execution::Thread*)+0x531 at Thread.cpp:522 (libhdbbasis.so)
    11: 0x00007f4414a7d9e8 in Execution::pthreadFunctionWrapper(Execution::Thread*)+0x1e4 at ThreadInterposition.cpp:639 (libhdbbasis.so)
    12: 0x00007f44140df71a in start_thread+0xc6 (libpthread.so.0)
    13: 0x00007f44127bb17d in __clone+0x69 (libc.so.6)
    exception type information:
     - 0: public CSAccessor::Exception::RowPosOverflowException@0x7f442dde01c0 SI
       - 0: public CSAccessor::Exception::CSAccessorException@0x7f442dddf4c8 SI
         - 0: public ltt::exception@0x7f4414044c40
    [120536]{-1}[-1/-1] 2022-12-10 18:25:39.221213 e Basis            Crash.cpp(00759) : Crash at DataAccess/DataRecovery/impl/RecoveryHandlerImpl.cpp:2814
    Reason:
    exception  1: no.1000000  (DataAccess/DataRecovery/impl/RecoveryHandlerImpl.cpp:2814) TID: 120536
        Error during log replay. See trace for details.
    exception throw location:
     1: 0x00007f44184e379a in DataRecovery::RecoveryHandlerImpl::reportReplayError(DataRecovery::RecoveryQueue*, DataAccess::LogRecoveryCallback::ErrorReason, unsigned long, ltt::exception const&) [clone .constprop.0]+0x14c at RecoveryHandlerImpl.cpp:2814 (libhdbdataaccess.so)
     2: 0x00007f44184f8459 in DataRecovery::RecoveryQueue::run(void*&) [clone .cold]+0x37 at RecoveryQueue.cpp:624 (libhdbdataaccess.so)
     3: 0x00007f4414a8df85 in Execution::Thread::staticMainImp(Execution::Thread*)+0x531 at Thread.cpp:522 (libhdbbasis.so)
     4: 0x00007f4414a7d9e8 in Execution::pthreadFunctionWrapper(Execution::Thread*)+0x1e4 at ThreadInterposition.cpp:639 (libhdbbasis.so)
     5: 0x00007f44140df71a in start_thread+0xc6 (libpthread.so.0)
     6: 0x00007f44127bb17d in __clone+0x69 (libc.so.6)
    exception type information:
     - 0: public Diagnose::AssertError@0x7f4414ff0ee0 SI
       - 0: public ltt::logic_error@0x7f44140446a0 SI
         - 0: public ltt::exception@0x7f4414044c40
    Possible root cause:
    --- Pending assertion:
    exception  1: no.1000000  (DataAccess/DataRecovery/impl/RecoveryHandlerImpl.cpp:2814) TID: 120536
        Error during log replay. See trace for details.

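      The rowcount overflow above corresponds to the 2,147,483,648 (2^31) record limit per column-store table partition; the dump shows roughly 2,134,494,081 main plus 12,989,544 delta rows for SAPABAP1:P2RX_CRT, i.e. right at that limit. Tables approaching the limit can be identified in advance, for example (a sketch using the M_CS_TABLES raw record counters; the 1.9 billion threshold is chosen arbitrarily):

      -- Column-store tables/partitions whose main + delta raw record count
      -- approaches the 2^31 rowPos limit.
      SELECT SCHEMA_NAME, TABLE_NAME, PART_ID,
             RAW_RECORD_COUNT_IN_MAIN, RAW_RECORD_COUNT_IN_DELTA,
             RAW_RECORD_COUNT_IN_MAIN + RAW_RECORD_COUNT_IN_DELTA AS TOTAL_ROWPOS
        FROM M_CS_TABLES
       WHERE RAW_RECORD_COUNT_IN_MAIN + RAW_RECORD_COUNT_IN_DELTA > 1900000000
       ORDER BY TOTAL_ROWPOS DESC;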


    Environment

    SAP HANA, platform edition 2.0

    SAP HANA, platform edition 1.0

    Product

    SAP HANA 1.0, platform edition; SAP HANA, platform edition 2.0

    Keywords

    retained_free, initializing, log volume full, replication stuck, KBA, HAN-DB-HA, SAP HANA High Availability (System Replication, DR, etc.), HAN-DB-PER, SAP HANA Database Persistence, Problem
