Written by Technical Team | Last updated 25.10.2025 | 11 minute read
High availability and intelligent failover are no longer “nice to haves” in NHS integration infrastructure — they are operational safety nets for clinical services. Trust integration engines such as Rhapsody, InterSystems Ensemble, InterSystems HealthShare, Mirth Connect, and others sit at the centre of clinical messaging. They route pathology results into the EPR, push discharge summaries to GP systems, broker PAS and theatre updates, and exchange critical referral, order, and observation data between local and national systems. If that flow stops, clinicians lose visibility, backlogs build, and patient safety can be put at risk.
High availability in this context refers to keeping integration services continuously usable, even when components fail. True availability is not simply a server staying “up”; it is the continued ability to transform, route, and deliver messages in line with clinical workflows and national obligations. In an NHS Trust environment, this means: are ADT messages still being processed? Are pathology results still being delivered to the right downstream systems? Are Spine-facing services still sending and receiving messages as required? An integration engine that is technically running but unable to process key interfaces is functionally unavailable.
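To make "functionally unavailable" concrete, here is a minimal sketch in Python of a check that asks whether critical interfaces are actually doing work, rather than whether a process is running. The interface names, thresholds, and metrics source are illustrative assumptions, not taken from any particular engine:

```python
from dataclasses import dataclass

@dataclass
class InterfaceStats:
    name: str                 # e.g. "ADT inbound from PAS"
    processed_last_5min: int  # messages successfully delivered downstream
    expected_min_5min: int    # baseline agreed with the service owner

def functionally_available(stats: list[InterfaceStats]) -> list[str]:
    """Return the critical interfaces that are 'up' but not doing work.

    The engine process may be running, yet if a key interface has
    stopped delivering messages it is functionally unavailable.
    """
    return [
        s.name
        for s in stats
        if s.processed_last_5min < s.expected_min_5min
    ]

# Illustrative snapshot: the server is 'up', but pathology has stalled.
snapshot = [
    InterfaceStats("ADT from PAS", processed_last_5min=240, expected_min_5min=50),
    InterfaceStats("Pathology results to EPR", processed_last_5min=0, expected_min_5min=20),
]
print(functionally_available(snapshot))  # ['Pathology results to EPR']
```

The baselines themselves matter as much as the check: they should come from observed traffic patterns agreed with each clinical system owner, not from guesswork.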
Failover is the controlled, automated movement of service workload to an alternative instance or node when something goes wrong. In a well-designed Trust integration architecture, this might include automatic promotion of a secondary Rhapsody node if the primary node becomes unreachable, controlled redirection of inbound traffic to a standby Mirth Connect cluster, or rerouting of high-priority interfaces via a pre-approved contingency path. The aim is not merely recovering eventually; it is absorbing disruption without clinicians noticing it at all.
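One way to picture that kind of controlled failover is a small supervisor loop: probe the primary, and only after several consecutive failures promote the standby and redirect traffic. The sketch below is a hedged illustration; `probe_primary`, `promote_standby`, and `redirect_traffic` are placeholders for whatever the engine and network actually provide (a cluster admin API, a load-balancer pool change, a DNS or VIP move):

```python
import time

FAILURE_THRESHOLD = 3       # consecutive failed probes before acting
PROBE_INTERVAL_SECS = 10

def probe_primary() -> bool:
    """Placeholder: in practice an engine health endpoint or heartbeat."""
    ...

def promote_standby() -> None:
    """Placeholder: cluster promotion via the engine's own mechanism."""
    ...

def redirect_traffic() -> None:
    """Placeholder: load-balancer pool change, DNS update, or VIP move."""
    ...

def supervise() -> None:
    failures = 0
    while True:
        if probe_primary():
            failures = 0                       # a healthy probe resets the count
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:  # avoid failing over on a blip
                promote_standby()
                redirect_traffic()
                break                          # hand over to the incident runbook
        time.sleep(PROBE_INTERVAL_SECS)
```

The consecutive-failure threshold is the point: a single missed heartbeat should never trigger a full switchover.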
The reality in many Trusts is that integration engines have evolved over time, often under pressure, and carry a mix of legacy flows, tactical workarounds, and bespoke point-to-point logic created to “get something live” for an urgent project. This organic growth can introduce single points of failure in interface logic, message queues, VPN links, or even firewall rules. Where internal teams are stretched, proactive resilience engineering often trails behind immediate delivery work. Digital health managed services step in here: they harden the operating model around the integration engine, not just the engine itself, by applying structured monitoring, governance, capacity planning, and recovery design.
From a board or CIO perspective, improving availability is not just an infrastructure task. It is a strategic continuity objective: keeping clinical systems interoperable 24/7, protecting data quality, and sustaining the Trust’s ability to meet statutory reporting, elective recovery targets, and patient flow KPIs. From an integration lead’s perspective, it is about sleeping at night knowing that if a node fails at 02:00, messages will continue to flow, alerts will trigger, and a known, tested runbook is already in motion — even if in-house resource is not physically on site.
A modern managed service for Trust integration engines is not limited to “being on the end of the phone”. It is a structured, standards-driven operational model where the provider assumes responsibility for stability, monitoring, incident handling, and lifecycle maintenance of one or more integration engines. For many Trusts, this is attractive because internal integration teams are typically small, highly skilled, and already committed to live project delivery, upgrades, FHIR enablement, and national compliance work. Outsourcing availability engineering does not replace that team — it protects it.
At its core, managed service support introduces discipline. It gives the Trust clearly defined SLAs, escalation paths, performance thresholds, and service review cycles. That structure matters because availability is measurable: if failover is untested, backups are inconsistent, or queues silently climb past agreed thresholds, you do not have high availability; you have luck. A managed service formalises the difference between "it usually works" and "it is contractually assured and operationally proven".
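That claim can be taken literally. As a simple illustration, monthly availability can be computed from minutes of functional unavailability and compared with an agreed target; the 99.9% figure below is an example, not a quoted SLA:

```python
def availability_pct(minutes_in_period: int, minutes_unavailable: int) -> float:
    """Availability over a period, counting functional unavailability:
    minutes where a critical interface was not processing messages,
    not merely minutes where the server was down."""
    return 100.0 * (minutes_in_period - minutes_unavailable) / minutes_in_period

MONTH_MINUTES = 30 * 24 * 60   # 43,200 minutes in a 30-day month
sla_target = 99.9              # example target, not a quoted SLA

# 90 minutes of stalled pathology delivery in the month:
measured = availability_pct(MONTH_MINUTES, 90)
print(f"{measured:.3f}% vs target {sla_target}%")  # 99.792% -> target breached
```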
In practice, Trusts gain resilience in several connected ways: live monitoring across every interface, clearly defined escalation when thresholds are breached, capacity planning grounded in real traffic patterns, and controlled configuration states across primary and secondary nodes.
The value for high availability is cumulative. When monitoring is live, escalation is defined, capacity is understood, and configuration states are controlled across primary and secondary nodes, the Trust gains true service continuity in a way that internal teams alone often struggle to maintain under pressure.
There is also an important workforce dimension. High availability depends on repeatable response. A Trust may have exceptional in-house integration specialists, but few can maintain 24/7 cover, cross-platform expertise, and still drive forward interoperability projects. A managed service brings pooled expertise across multiple engine technologies and multiple Trust environments. That breadth is essential when diagnosing obscure failure conditions: unexpected TLS handshake errors, malformed HL7 segments, or queue deadlocks triggered by message bursts after downtime. With pooled expertise and tested runbooks, these are handled as known operational events rather than as first-time emergencies.
From a strategic perspective, managed services also support internal governance. NHS Trusts are accountable for DSPT compliance, IG obligations, and adherence to change management expectations. A structured managed service strengthens the Trust’s audit position by providing documented monitoring coverage, incident trails, and configuration baselines that demonstrate not only how availability is maintained but how it is evidenced. That is increasingly relevant to boards and regulators who expect verifiable operational resilience in digital health infrastructure.
At the technical level, high availability for an integration engine is not a single feature. It is an architectural characteristic that needs to be designed, deployed, and maintained. Managed service providers focused on NHS integration engines work across four main layers: infrastructure, application, interface, and operations. All four matter, because failure can occur at any of them.
The infrastructure layer forms the foundation: clustered or load-balanced nodes, redundant VMs or containers, resilient storage, and network path redundancy. Where Trusts run integration engines on-premises, this may involve dual data centres or at minimum separate hosts with replicated configuration and message persistence. In hybrid or cloud-backed models, this may include orchestrated failover between availability zones. However, infrastructure redundancy alone does not ensure message continuity. You can fail over the compute node, but if message state or transformation logic is not synchronised, risk re-emerges during switchover.
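One practical guard against that switchover risk is to verify continuously that the standby holds exactly the same configuration as the primary. The sketch below compares checksums of exported configuration; how such an export is produced (file copy, engine export, version label) is platform-specific, so the paths here are illustrative:

```python
import hashlib
from pathlib import Path

def digest_of(path: Path) -> str:
    """Hash an exported engine configuration file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def nodes_in_sync(primary_export: Path, standby_export: Path) -> bool:
    """True only when both nodes would behave identically after failover."""
    return digest_of(primary_export) == digest_of(standby_export)

# Illustrative usage; real exports would come from each node's config store.
# if not nodes_in_sync(Path("primary.cfg"), Path("standby.cfg")):
#     raise RuntimeError("Standby drifted from primary: failover is unsafe")
```

A drift check like this, run on a schedule and alerted on, turns "we believe the standby matches" into an evidenced fact.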
The application layer concerns the integration engine itself: clustering, state synchronisation, and licensing. Not every engine supports identical resilience models. Rhapsody, for example, can run in active-active or active-passive configurations; InterSystems platforms support mirrored instances and distributed deployments; Mirth Connect can cluster with shared repositories but requires careful configuration. A managed service that understands these nuances can design and continuously validate the correct topology for each engine, even in mixed estates where several platforms coexist.
The interface layer is where theoretical resilience often fails in practice. Interfaces differ: some are synchronous and require immediate responses; others are asynchronous or batched. Managed services map and classify these interfaces by criticality, dependency, retry behaviour, and recovery complexity. This classification drives prioritised failover plans. During an outage, not all interfaces are equal — ADT and pathology flows may require instant restoration, while bulk analytics feeds can wait. Embedding that prioritisation into the failover plan maintains safe patient care under incident conditions.
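Such a classification can live in a simple structured register that the failover plan sorts by. A minimal sketch, with illustrative tiers and interface names:

```python
from dataclasses import dataclass

@dataclass
class Interface:
    name: str
    tier: int          # 1 = restore immediately, 3 = can wait
    synchronous: bool  # synchronous flows block senders while down
    depends_on: list[str]

register = [
    Interface("ADT to EPR",             tier=1, synchronous=True,  depends_on=["PAS"]),
    Interface("Pathology results",      tier=1, synchronous=False, depends_on=["LIMS"]),
    Interface("Discharge summaries",    tier=2, synchronous=False, depends_on=["EPR"]),
    Interface("Bulk analytics extract", tier=3, synchronous=False, depends_on=["EPR"]),
]

def restoration_order(interfaces: list[Interface]) -> list[str]:
    """Restore by tier first; within a tier, synchronous flows first,
    because blocked senders fail visibly on the wards."""
    ordered = sorted(interfaces, key=lambda i: (i.tier, not i.synchronous))
    return [i.name for i in ordered]

print(restoration_order(register))
# ['ADT to EPR', 'Pathology results', 'Discharge summaries', 'Bulk analytics extract']
```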
Finally, the operational layer brings resilience to life. High availability is only as strong as the runbooks that execute under stress. Managed services deliver rehearsed failover procedures, pre-approved access controls, and named escalation contacts who know the Trust’s architecture. During a 3 a.m. outage, there is no time to locate a missing VPN credential or firewall exception. Proper onboarding into a managed service ensures readiness well before issues arise.
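"Rehearsed" implies the runbook is explicit enough to execute under stress: each step with a named role and a precondition. Purely as an illustration of that structure (the steps and roles below are invented, not a real Trust runbook):

```python
from dataclasses import dataclass

@dataclass
class RunbookStep:
    action: str
    owner: str        # a named role, not an individual
    precondition: str

failover_runbook = [
    RunbookStep("Confirm primary node is genuinely down, not just slow",
                owner="on-call integration engineer",
                precondition="3 consecutive failed health probes"),
    RunbookStep("Promote standby node",
                owner="on-call integration engineer",
                precondition="standby config digest matches primary"),
    RunbookStep("Redirect inbound feeds to standby",
                owner="network on-call",
                precondition="standby accepting connections"),
    RunbookStep("Notify clinical system owners via agreed template",
                owner="service manager",
                precondition="failover confirmed"),
]

for n, step in enumerate(failover_runbook, start=1):
    print(f"{n}. [{step.owner}] {step.action} (when: {step.precondition})")
```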
Within this architecture, three practices make the greatest impact: keeping configuration and message state synchronised across nodes so that switchover is safe, classifying interfaces so that failover restores the most critical flows first, and maintaining rehearsed runbooks so that recovery never depends on improvisation.
A mature managed service treats these patterns as living processes — revisited, refined, and documented through every service review cycle. This keeps high availability from being a one-time project deliverable and makes it part of business-as-usual operations.
High availability is only meaningful if failover works under real pressure. Intelligent failover preserves message integrity, maintains data continuity, and restores service state with full auditability. In today’s NHS, where 24/7 acute care, electronic prescribing, digital pathology, and virtual wards all depend on constant data exchange, Trusts increasingly rely on managed service partners to build and run these intelligent failover patterns.
Intelligent failover begins with observability. Managed services monitor health indicators across the integration estate and detect degradation before clinicians are affected. Detection is contextual — a brief CPU spike may be harmless, but halted ACK responses from a critical downstream system trigger targeted rerouting rather than a full system failover. This selectivity maintains stability while isolating faults.
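That selectivity can be expressed as simple rules over per-interface signals rather than one global "engine down" alarm. A hedged sketch, with invented signal names and thresholds:

```python
from dataclasses import dataclass

@dataclass
class InterfaceHealth:
    name: str
    critical: bool
    ack_timeouts_5min: int   # unacknowledged sends to the downstream system
    queue_depth: int

def decide(health: InterfaceHealth) -> str:
    """Contextual response: reroute one stalled critical feed rather than
    failing over the whole engine."""
    if health.critical and health.ack_timeouts_5min >= 5:
        return f"reroute {health.name} via contingency path"
    if health.queue_depth > 10_000:
        return f"page on-call: {health.name} queue climbing"
    return "no action"

print(decide(InterfaceHealth("Pathology to EPR", critical=True,
                             ack_timeouts_5min=12, queue_depth=300)))
# reroute Pathology to EPR via contingency path
```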
Backup and message persistence strategies form the next layer. Failures can interrupt transactions mid-flow, creating partially processed messages. Without robust persistence and replay policies, data can be lost or duplicated. Managed services define authoritative queues, retention durations, and duplicate-suppression mechanisms, ensuring a clean data state after recovery.
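Duplicate suppression in particular benefits from a concrete key. For HL7 v2 flows, one common choice is the message control ID (field MSH-10); the sketch below uses illustrative parsing and an in-memory set standing in for a durable store:

```python
def control_id(hl7_message: str) -> str:
    """MSH-10 (Message Control ID) from a pipe-delimited HL7 v2 message.
    MSH-1 is the field separator itself, so MSH-10 sits at index 9."""
    msh = hl7_message.split("\r")[0].split("|")
    return msh[9]

seen: set[str] = set()

def replay(messages: list[str]) -> list[str]:
    """Replay persisted messages after recovery, suppressing duplicates
    so downstream systems never receive the same event twice."""
    delivered = []
    for msg in messages:
        cid = control_id(msg)
        if cid in seen:
            continue          # already delivered before the failure
        seen.add(cid)
        delivered.append(msg)
    return delivered

sample = "MSH|^~\\&|LIMS|LAB|EPR|TRUST|202501010830||ORU^R01|MSG0001|P|2.4"
print(control_id(sample))  # MSG0001
```

In production the seen-set would need to be durable and bounded by the agreed retention window, which is exactly the kind of policy decision the managed service documents.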
Structured disaster recovery (DR) exercises transform theory into practice. Managed service teams regularly rehearse failover scenarios, produce after-action reports, and refine runbooks based on lessons learned. This ensures that real-world incidents unfold in a predictable, controlled way.
Security is inseparable from availability. Role-based access, credential management, VPN integrity, and audit logging must support — not obstruct — recovery. Managed service onboarding verifies RBAC alignment with NHS security frameworks (including DSPT and ISO 27001), ensuring that escalation engineers can act immediately and compliantly during incidents.
Finally, effective failover relies on clear communication. Integration incidents ripple across many teams — integration specialists, clinical system owners, IT operations, and sometimes clinical leads. Managed services pre-define escalation routes and update templates so the right people receive concise, relevant updates. This clarity prevents overreaction and maintains trust during high-pressure moments.
Modern NHS Trusts must simultaneously maintain operational stability, meet interoperability mandates, support new digital initiatives, and uphold governance standards. Integration engines sit at the core of these demands. Partnering with a digital health managed service enables Trusts to enhance high availability and failover capability while freeing internal teams to focus on innovation and patient-facing projects.
The measurable benefits include higher interface uptime and fewer clinically visible outages, faster detection and resolution of incidents, documented evidence for DSPT and audit obligations, and internal teams freed to concentrate on innovation and patient-facing delivery.
Ultimately, managed service partnerships enable NHS Trusts to convert integration from a fragile dependency into a resilient, continuously available foundation for digital health. Clinical teams benefit from uninterrupted data flow; IT leaders gain confidence in compliance, performance, and continuity; and patients experience safer, more connected care.
Is your team looking for help with Trust Integration Engines? Get in touch.