Issues after upgrade - Questions for the community / dev team

Questions for the community / dev team

  1. Has anyone else seen registrar-path stalls specifically on 20.19.0 (or recent 20.x) that don’t trigger taskprocessor alerts?
  2. Is there a known issue between VitalPBX 4.5.3 R4 changes and Asterisk 20.19’s PJSIP that could cause inbound REGISTER serializers to stall while qualify keeps running?
  3. Would you recommend a downgrade path (back to the 4.5.3 R3 PBX Engine, whatever version of Asterisk that shipped) while we capture forensic data? In a hospital environment we obviously want the most stable code possible.
  4. Are there any internal counters or pjsip show subcommands that would be especially valuable to capture during an active incident, beyond what I listed above?
  5. Any specific pjsip.conf tunables you’d suggest investigating — for instance, default_outbound_endpoint, mwi_disable_initial_unsolicited, or per-transport thread settings — that interact in a non-obvious way on the new engine?

I’m happy to share full logs, tarballs from the next capture, and any config snippets via private message or ticket if helpful. Will update this thread when the next incident triggers an automated capture.

Thanks in advance — and thanks to the VitalPBX team for the platform, this is the first issue we’ve hit in years of running it.

Juliano

Run core show taskprocessors and look for any pjsip/distributor or endpoint serializer with a queue length that is non-zero and slowly creeping up.

For what it’s worth, here’s the Asterisk full changelog:
ChangeLog for asterisk-20.19.0 ChangeLog for asterisk-20.19.0

Thanks
I will analyse

I recommend executing the command vitalpbx optimize-serve to optimize the Task Pool settings based on your server specifications.

RESOLVED — Root cause: Vitxi panel running un-indexed CDR queries that block MySQL → cascading PJSIP taskprocessor degradation

Hi all,

Quick update on this thread. After ~10 days of investigation, automated forensic capture, and slow-log analysis, I found the root cause and the system is now stable. Posting the full diagnosis here so that anyone running into the same pattern can fix it without going through the same archaeology I did.

TL;DR: This was not an Asterisk 20.19 regression. It was the Vitxi/ombutel panel running per-extension CDR lookups without a proper composite index, causing 1.5-second MySQL full-table scans on a ~2M row CDR table. Multiplied across ~400 extensions, this generates daily windows where MySQL is saturated for several minutes. During those windows the Asterisk devicestate, mwi, and pjsip/distributor taskprocessors back up, contacts fail to renew on time, and endpoints cascade into Unreachable. The core reload I discussed in my previous reply was a secondary trigger — it knocked an already-pressured system over the edge by breaking PJSIP transports.

The fix is one MySQL index. Reduced the offending query from 1,948,679 rows examined / 1.5s to 54 rows examined / 11ms — a 136× speedup. Combined with PJSIP profile adjustments and a stasis taskpool increase, the system has been rock-stable since.

Full details below in case it helps the dev team confirm a pattern or other operators recognize the symptoms.


The smoking gun

I enabled MariaDB slow query log (long_query_time = 1, log_queries_not_using_indexes = on) and let it run. Within hours the log showed the same query pattern repeating, one execution per extension, twice per day:

SELECT uniqueid
FROM   asterisk.cdr
WHERE  tenant      = 'vitalpbx'
  AND  (source = '498' OR destination = '498')
  AND  calltype   = '3'
  AND  calldate BETWEEN from_unixtime(1779850800)
                    AND from_unixtime(1779937199)
ORDER BY cdr_id DESC
LIMIT 1;

Slow log entries for the same query, different extensions, over several days:

Query_time: 1.585925   Rows_examined: 1948679   Rows_sent: 0
Query_time: 1.612436   Rows_examined: 1955142   Rows_sent: 0
Query_time: 1.605568   Rows_examined: 1958752   Rows_sent: 0
Query_time: 1.613179   Rows_examined: 1961435   Rows_sent: 0
Query_time: 1.752981   Rows_examined: 1963787   Rows_sent: 0
Query_time: 1.553399   Rows_examined: 1964511   Rows_sent: 0

Every execution scanned the entire CDR table (~99% of the 1.97M rows) and returned nothing. With ~400 extensions and this pattern running per-extension, that’s ~10 minutes of cumulative MySQL pressure per cycle. Two cycles per day.

User@Host: ombutel[ombutel] @ localhost — the Vitxi panel backend.

Why no existing index helped

The cdr table already has a healthy set of indexes shipped with VitalPBX:

KEY tenant                          (tenant, calldate)
KEY tenant_2                        (tenant, uniqueid)
KEY tenant_3                        (tenant, uniqueid, calltype)
KEY cdr_tenant_source_uniqueid      (tenant, source, uniqueid)
KEY cdr_tenant_destination_uniqueid (tenant, destination, uniqueid)

None of them work for this specific predicate. The OR (source = ? OR destination = ?) combined with calltype and a calldate range prevents the optimizer from using tenant + calldate (the two source/destination indexes don’t carry calldate, and tenant_3 doesn’t carry calldate either). The optimizer falls back to a full scan.

The cascade mechanism

This took the most time to understand because the symptom (PJSIP endpoint cascade) is several layers removed from the cause (one slow MySQL query). The path is:

  1. Vitxi cron triggers ~400 sequential CDR lookups
  2. Each lookup holds an asterisk ODBC connection for ~1.5s
  3. Asterisk’s res_odbc consumers (used by realtime + CDR backends + MWI mailbox lookups) start queuing behind those connections
  4. Stasis taskprocessors that touch DB-backed operations slow down — particularly stasis/m:devicestate:all-* (HighTime growing into the seconds range) and pjsip/mwi-*
  5. devicestate HighWater is 500 by default; once MaxDepth reaches the ceiling, device state events start being dropped — BLFs go out of sync and re-subscriptions trigger more load
  6. pjsip/distributor-* serializers begin building InQueue (typically up to ~25 in my captures), well below the 500 high-water mark that would have triggered a taskprocessor overload NOTICE — but high enough that REGISTER processing latency increases past the contact expiration window
  7. Contacts fail to renew, expire silently, and the system reports a cascade of Unreachable endpoints with no preceding WARNING or ERROR in the log

Two things that surprised me:

  • The taskprocessor overload alert never fires because the queues don’t hit the high-water threshold — they just stay elevated long enough to push registrations past their renewal deadline. The classic alarm path is silent during this failure mode.
  • The mwi-* taskprocessor was the most impacted in my system (HighTime of 5.28 seconds on a single task) because each MWI subscription requires a DB-backed mailbox lookup. Disabling MWI Subscription Enforcement in the PJSIP profile cut a huge amount of subscription churn and helped a lot — but that was a symptom mitigation, not the root cause.

How core reload (and probably “Apply Changes”) makes things worse

This is what I posted earlier in this thread. On Asterisk 20.19.0, core reload (or module reload res_pjsip.so) prints:

NOTICE res_pjsip/config_transport.c: Transport 'transport-udp-...'
       is not fully reloadable, not reloading: protocol, bind, TLS,
       TCP, ToS, or CoS options.

…on every PJSIP transport. The “not reloading” outcome leaves transports in a state where the registrar path can’t recover properly under pressure. On an already-loaded system (e.g. one currently inside one of the MySQL-saturation windows above), this is the final straw. On a healthy system, the same reload often appears to be a no-op — which is why the failure mode is intermittent and looked spontaneous to me originally.

I now treat any core reload or module reload res_pjsip.so as effectively requiring a full Asterisk restart, and I strongly suspect (but haven’t confirmed with the source) that the panel’s “Apply Changes” can dispatch a similar reload path. Worth checking with the dev team.

The right command for logger changes is asterisk -rx 'logger reload', not core reload. I should have known that, and I’m posting it explicitly here because I’d guess I’m not the only operator who has run core reload “just to be safe” after editing a config file.

The fix

1. Composite index on the CDR table

ALTER TABLE asterisk.cdr
ADD INDEX cdr_tenant_calltype_calldate (tenant, calltype, calldate);

This took about 8 seconds to build on the ~2M row table. No service interruption. After the index:

Before:                                After:
type:           ALL                    type:           range
key:            NULL                   key:            cdr_tenant_calltype_calldate
key_len:        NULL                   key_len:        248
rows:           ~1,948,679             rows:           54
Extra:          Using where            Extra:          Using index condition; Using where; Using filesort

Query_time:     ~1.5s                  Query_time:     ~11ms

The OR (source = ? OR destination = ?) is now applied to a few dozen rows post-index-lookup instead of to two million rows in a full scan. The filesort for ORDER BY cdr_id DESC LIMIT 1 over 54 rows is trivial.

Slow query log has been zero entries for this pattern since the index was added.

2. PJSIP profile changes

In the VitalPBX panel → PJSIP Settings → Profile:

Setting Old New
Qualify Timeout 3 s 10 s
Qualify Frequency 120 s 240 s
MWI Subscription Enforcement On Off

The MWI enforcement change was the biggest single behavioral improvement — eliminated ~400 permanent MWI subscriptions and cut the corresponding pjsip/mwi-* and stasis/p:mwi:all/*@vitalpbx-voicemail-* taskprocessor activity by ~95%.

3. Stasis taskpool

/etc/asterisk/stasis.conf:

[taskpool]
minimum_size    = 5
initial_size    = 10
idle_timeout_sec = 20
max_size        = 200    ; was default 50

Gives the stasis layer more workers to absorb device-state and event bursts.

4. MariaDB slow log made persistent

/etc/mysql/mariadb.conf.d/99-slow-query.cnf:

[mysqld]
slow_query_log              = 1
slow_query_log_file         = /var/lib/mysql/hospital-slow.log
long_query_time             = 1
log_queries_not_using_indexes = 1

I recommend this on every VitalPBX install of any size — the cost is nearly zero and it’s the only practical way to discover this class of problem before it manifests as a registration outage.

Validation

Metric Pre-fix Post-fix Δ
stasis/m:devicestate MaxDepth 499–500 / 500 (at ceiling) 32–34 / 500 93 % headroom restored
Endpoints Unavailable (steady state) 86, climbing during incidents 52 (baseline ghosts) back to normal
pjsip/distributor-* MaxDepth up to 25 InQueue during cascade 0 InQueue continuously resolved
MySQL queries > 1 s / day (CDR) ~14 0 resolved
Slow query for that specific predicate 1.5 s 11 ms 136×
Incidents in 48 h since fix n/a 0 :white_check_mark:

Why I’m posting this here

Three reasons:

  1. For other operators: if you’re on VitalPBX + Vitxi with a CDR table somewhere above ~500k rows and you see PJSIP endpoint cascades with no preceding log warnings, check your MariaDB slow log first. The query above is what to grep for.

  2. For the VitalPBX dev team: would it make sense to ship cdr_tenant_calltype_calldate (or equivalent) as part of the default schema? The current set of indexes handles most reporting patterns but misses this one, and the cost of the additional index is small relative to the operational damage when it bites a busy deployment. Also worth considering whether the Vitxi/ombutel “last call per extension” rollup could be restructured as a single aggregate query rather than N per-extension scans — the current pattern is O(extensions × table_size), which doesn’t scale.

  3. For the Asterisk community: the core reload / transport-not-fully-reloadable interaction on 20.19 is real and easy to trip into. Whatever the right path is for the dev team to handle that (better warning, refuse reload, automatic full reload of res_pjsip…), I think it’s worth a look. I’m happy to share my forensic tarballs with anyone investigating.

Forensic monitoring kept in place

For future incidents I’m keeping the automated capture script I described earlier running. It has two triggers:

  • Cause-side trigger (max distributor InQueue ≥ 8) — catches the cascade before endpoints expire
  • Effect-side trigger (Δ Unavailable ≥ 30 in 120 s) — catches it after, in case the cause trigger somehow misses

On capture, it tar.gz’s core show taskprocessors, pjsip show endpoints/contacts/registrations/transports, core show threads/locks, gstack of the Asterisk PID, MySQL processlist + InnoDB status + slow log tail, network / firewall / dmesg / journal, and the last chunk of full. Plus a WhatsApp notification via our internal API so we hear about it within seconds. Happy to share the script — let me know if it’d help.


Thanks again to Miguel and everyone who responded earlier in the thread. The pointer to look at the taskprocessors was exactly what got me onto the trail — the rest unfolded from there.

— Juliano, Netplay Telecomunicações (Brazil)

1 Like