Making Background Automation Observable

The Andorra CAP feed was added to the public Alert Hub sources register as ad-meteo-xx. The expected behaviour was straightforward: the running node should poll the configured register, create the source row, and make it visible in the CAP Aggregator sources list.

When that did not happen, the missing source was not the most important problem. The bigger problem was that the operator could not answer a more basic question:

Did the system even try?

There were logs and metrics around parts of the pipeline, but no durable operator-facing record of the authority register sync run: which instance acquired the lease, which URL it fetched, whether parsing worked, whether the source code was seen, and why a source was created, updated, skipped, or failed.

A small system audit model

The response is deliberately modest. The implementation adds a compact system_audit_run and system_audit_event model rather than a large audit framework.

Run records group multi-step work such as authority register sync. Event records capture individual observations such as:

application instance started
scheduled register sync started
register URL fetch started or completed
source code created, updated, skipped, or failed
admin manual sync, export, or prune action

The event records are filterable by time, severity, category, event type, subject, instance, and result code. That makes the important support workflow direct: filter for REGISTER_SYNC and subject key ad-meteo-xx, then inspect what happened.
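The support workflow above can be sketched as a small in-memory model. This is an illustrative sketch only: the field names follow the filter dimensions listed in this post, but the class name, the `filter_events` helper, and the example event type strings are assumptions, not the actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    # Hypothetical shape; one field per filter dimension named in the post.
    occurred_at: datetime
    severity: str       # e.g. "INFO", "WARN", "ERROR"
    category: str       # e.g. "REGISTER_SYNC"
    event_type: str     # e.g. "SOURCE_CREATED" (illustrative value)
    subject_key: str    # e.g. "ad-meteo-xx"
    instance_name: str
    result_code: str
    run_id: Optional[str] = None  # links the event to its audit run, if any


def filter_events(events, *, category=None, subject_key=None, since=None):
    """Return events matching every supplied filter, oldest first."""
    matched = [
        e for e in events
        if (category is None or e.category == category)
        and (subject_key is None or e.subject_key == subject_key)
        and (since is None or e.occurred_at >= since)
    ]
    return sorted(matched, key=lambda e: e.occurred_at)
```

Calling `filter_events(events, category="REGISTER_SYNC", subject_key="ad-meteo-xx")` returns the timeline for that one source, which is the question the operator could not answer before.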

Instance identity matters

The existing code had several node identity concepts. Some paths used HOSTNAME, some used older node-id properties, and Snowflake id leasing had its own owner string.

That is confusing during a rollout. A useful audit line needs a stable operational name and a way to distinguish a restarted process from the previous process with the same name.

The new model uses:

instance_name: the configured or environment-derived operational name
instance_run_id: a generated id for this process start
hostname: retained as diagnostic context

The numeric Snowflake node id remains separate. It is an id-generation concern, not the human-readable instance name.
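A minimal sketch of how the three identity fields might be resolved at process start. The `INSTANCE_NAME` environment variable, the fallback order, and the function name are assumptions for illustration; only the field names and the use of `HOSTNAME` come from the description above.

```python
import os
import socket
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceIdentity:
    instance_name: str     # stable operational name
    instance_run_id: str   # new for every process start
    hostname: str          # diagnostic context only


def resolve_instance_identity(configured_name=None, env=None):
    """Build the identity once at startup and reuse it for every audit record."""
    env = os.environ if env is None else env
    hostname = env.get("HOSTNAME") or socket.gethostname()
    # Assumed precedence: explicit config, then INSTANCE_NAME (hypothetical), then hostname.
    instance_name = configured_name or env.get("INSTANCE_NAME") or hostname
    return InstanceIdentity(
        instance_name=instance_name,
        instance_run_id=uuid.uuid4().hex,  # random, so a restart never reuses it
        hostname=hostname,
    )
```

Two starts of the same replica share an `instance_name` but get different `instance_run_id` values, which is what lets an audit line separate a restarted process from its predecessor.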

Keeping the log useful

The goal is not to fill the database with routine noise.

Startup, shutdown, and daily register scan events are low volume and useful. Per-source events are recorded for creates, updates, skips, and failures. A dry-run probe can answer whether a specific source code appears in the register without storing an unchanged “seen” row for every source on every daily scan.
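The recording rule can be sketched as follows: changes produce a per-source event, unchanged sources produce nothing, and the probe only reads. The entry shape (`code`, `url`), the outcome strings, and the rule that an entry without a feed URL is skipped are all assumptions made for this example; failure handling is left out for brevity.

```python
def classify(entry, existing):
    """Return the outcome worth auditing, or None when nothing changed."""
    code = entry["code"]
    if not entry.get("url"):
        return "SKIPPED"   # assumption: entries without a feed URL are skipped
    if code not in existing:
        return "CREATED"
    if existing[code] != entry["url"]:
        return "UPDATED"
    return None            # unchanged: no per-source audit row


def sync(entries, existing):
    """Apply register entries to `existing` (code -> url); return audit events."""
    events = []
    for entry in entries:
        outcome = classify(entry, existing)
        if outcome is None:
            continue
        events.append((entry["code"], outcome))
        if outcome in ("CREATED", "UPDATED"):
            existing[entry["code"]] = entry["url"]
    return events


def probe(entries, source_code):
    """Dry-run probe: is the code in the register payload? Writes nothing."""
    return any(e["code"] == source_code for e in entries)
```

A second `sync` over an unchanged register yields no create or update events, so the daily scan does not grow the log, while `probe(entries, "ad-meteo-xx")` still answers the "was it seen?" question on demand.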

The admin UI also includes export and prune controls. Scheduled housekeeping prunes INFO audit events older than the configured cutoff, defaulting to seven days. More severe records can be kept longer. Manual prune actions are restricted to admins, require preview/confirmation, and are themselves audited.
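The prune rule might look like the sketch below, with preview as the default and deletion only on explicit confirmation. The dictionary event shape and function names are hypothetical, and this sketch never prunes non-INFO events, which is a simplification of "can be kept longer".

```python
from datetime import datetime, timedelta

DEFAULT_INFO_RETENTION = timedelta(days=7)  # the seven-day default described above


def select_prunable(events, now: datetime, retention=DEFAULT_INFO_RETENTION):
    """INFO events strictly older than the cutoff are eligible for pruning."""
    cutoff = now - retention
    return [
        e for e in events
        if e["severity"] == "INFO" and e["occurred_at"] < cutoff
    ]


def prune(events, now: datetime, confirm=False, retention=DEFAULT_INFO_RETENTION):
    """Return (remaining_events, affected_count).

    Without confirm=True this is a preview: nothing is removed and the count
    reports how many events a confirmed prune would delete.
    """
    doomed = select_prunable(events, now, retention)
    if not confirm:
        return list(events), len(doomed)
    doomed_ids = {id(e) for e in doomed}
    kept = [e for e in events if id(e) not in doomed_ids]
    return kept, len(doomed)
```

In a real system the confirmed prune would itself write an audit event, in line with the rule that manual prune actions are audited.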

What this changes operationally

After this work, the same failure should be diagnosable rather than silent.

If a future source appears in the public register but not in the source list, an operator should be able to see:

when the register sync last ran
which instance ran it
which URL was fetched
whether the fetch, parse, and persistence path succeeded
whether the missing source code appeared in the payload
whether a manual dry-run probe sees the source now

That is a better answer than searching transient logs after the fact. It also gives the system status page a shared view of the last authority register sync and the next eligible run.

Remaining caveats

This is still an operational audit log, not a compliance archive. It is intentionally pruned, and very large payloads are not stored. Forced container termination can also prevent graceful shutdown events from being written, so shutdown records are evidence rather than proof: a missing shutdown record does not show that the instance kept running.

The permission model remains simple for now: ADMIN is the privileged role for export, prune, and manual sync actions. A separate security audit backlog item will review the wider route and permission model before introducing more roles.

The useful test for this change is practical: the next time a source is expected to auto-load, the operator should be able to answer “what happened?” from the admin UI without shell access and without guessing which replica had the lease.

