Monitoring and Alerts
Reliable operation of a production Hyperswitch deployment requires comprehensive observability across infrastructure, application behavior, and transaction performance.
Observability in a Hyperswitch installation is based on two primary data sources:
Logs
Logs provide a chronological record of events and operations occurring within the system.
They are used for debugging, auditing, troubleshooting failures, and security analysis.
Metrics
Metrics are numerical measurements collected over time representing system performance and behavior.
They are used for performance monitoring, capacity planning, alerting, and operational analysis.
Together, logs and metrics provide the visibility needed to maintain operational reliability, detect anomalies, and troubleshoot incidents in production environments.
Observability Stack
Hyperswitch integrates with a standard observability stack that combines logging, metrics collection, and visualization.
Promtail: log scraping agent that collects logs from application pods
Grafana Loki: centralized log storage and query engine
OpenTelemetry Collector: collects and exports application metrics
CloudWatch (or equivalent system monitoring platform): collects infrastructure and system metrics
Grafana: visualization platform used to build monitoring dashboards
This stack enables unified dashboards where operators can correlate:
application metrics
system resource utilization
transaction behavior
error patterns
application logs
This correlation significantly reduces the time required to diagnose issues in production systems.
Monitoring Architecture
Application Metrics
Hyperswitch services emit application metrics such as:
request rates
latency
authorization success rates
error rates
These metrics are collected through the OpenTelemetry Collector, which forwards them to a metrics backend such as Prometheus.
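Once metrics reach Prometheus, they can be read back through its HTTP API. The following is a minimal sketch that builds a URL for the instant-query endpoint (`/api/v1/query`). The server address and the metric name `http_requests_total` are assumptions for illustration; check the names your deployment actually exports.

```python
from urllib.parse import urlencode

# Assumed in-cluster address of the Prometheus server; adjust for your setup.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"

# Hypothetical metric name, used only to illustrate a request-rate query.
REQUEST_RATE = "sum(rate(http_requests_total[5m]))"


def instant_query_url(promql: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


print(instant_query_url(REQUEST_RATE))
```

Fetching the resulting URL with any HTTP client returns a JSON document with the current value of the expression.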
Infrastructure Metrics
Infrastructure metrics are typically collected from the cloud provider's monitoring system (for example, CloudWatch or an equivalent). They include:
node CPU usage
memory consumption
disk utilization
network usage
Log Collection
Application and system logs are collected by Promtail, which scrapes logs from containers and nodes and sends them to Loki for centralized storage.
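Logs stored in Loki can be queried with LogQL through the `/loki/api/v1/query_range` endpoint. The sketch below builds such a request URL. The Loki address, the `app` label value, and the `ERROR` filter string are assumptions; the labels available depend on how Promtail is configured in your cluster.

```python
from urllib.parse import urlencode

# Assumed in-cluster address of Loki; adjust for your setup.
LOKI_URL = "http://loki.monitoring.svc:3100"

# Hypothetical label selector and line filter.
ERROR_LOGS = '{app="hyperswitch-server"} |= "ERROR"'


def loki_query_range_url(logql, start_ns, end_ns, limit=100, base_url=LOKI_URL):
    """Build a range-query URL; start and end are Unix timestamps in nanoseconds."""
    params = {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


print(loki_query_range_url(ERROR_LOGS, 1700000000000000000, 1700000600000000000))
```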
Visualization Layer
Grafana serves as the unified observability interface and provides:
infrastructure dashboards
application performance dashboards
transaction success rate monitoring
latency analysis
error analysis
Grafana can visualize both:
metrics (Prometheus / cloud metrics)
logs (Loki)
Optional Enterprise Remote Monitoring
For Enterprise Edition merchants, a subset of metrics may be exported using Prometheus Remote Write to a Hyperswitch-managed monitoring environment to enable:
proactive operational support
performance analysis
infrastructure optimization recommendations.
Monitoring Dashboards
The monitoring package includes pre-built Grafana dashboards that provide real-time visibility into infrastructure health and payment processing performance.
The dashboards typically cover the following areas.
Kubernetes Cluster Monitoring
Infrastructure-level monitoring ensures that the underlying cluster remains healthy and capable of handling transaction workloads.
Tracked metrics include:
CPU utilization by pods
Memory utilization by pods
CPU utilization by nodes
Memory utilization by nodes
Pod restart counts
Node health status
These metrics help detect the following conditions before they affect transaction processing:
resource exhaustion
pod instability
infrastructure bottlenecks
Application Monitoring
Application dashboards provide visibility into API behavior and system performance.
Tracked metrics include:
API route performance
Request throughput by endpoint
API latency (P95 / P99)
HTTP status code distribution
Error rates across services
These metrics allow operators to quickly identify:
degraded API performance
abnormal error patterns
traffic spikes
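To make the P95 / P99 latency figures concrete, the sketch below computes a percentile from raw latency samples using the nearest-rank method. This is for intuition only: a Prometheus-backed dashboard usually estimates percentiles from histogram buckets rather than from individual samples.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank covering at least p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


latencies_ms = [120, 95, 110, 480, 105, 98, 101, 130, 99, 2500]
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95))
```

A single slow request dominates the high percentiles in a small sample, which is why P95 and P99 reveal tail latency that an average would hide.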
Payment and Transaction Monitoring
Hyperswitch dashboards also provide deep insight into payment processing performance.
Tracked metrics include:
Total transactions initiated
Total transactions processed
Authorization success rate
Transaction status distribution
Error codes returned by processors
Payment method distribution
Payment-method-level authorization rate
Acquirer-level authorization rate
Metric segmentation by organization, account, and profile
These dashboards allow operators to detect:
processor outages
authorization rate degradation
routing inefficiencies
payment method performance issues
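The segmented authorization rates above amount to a grouped ratio: authorized transactions divided by total transactions per segment. The sketch below shows that calculation over in-memory records. The field names (`payment_method`, `connector`, `status`) and the `"authorized"` status value are hypothetical and do not reflect the actual Hyperswitch schema.

```python
from collections import defaultdict


def authorization_rates(transactions, key):
    """Return {segment: authorized / total} grouped by the given record field."""
    totals = defaultdict(int)
    authorized = defaultdict(int)
    for txn in transactions:
        segment = txn[key]
        totals[segment] += 1
        if txn["status"] == "authorized":
            authorized[segment] += 1
    return {segment: authorized[segment] / totals[segment] for segment in totals}


# Hypothetical sample records.
sample = [
    {"payment_method": "card", "connector": "acquirer_a", "status": "authorized"},
    {"payment_method": "card", "connector": "acquirer_a", "status": "failed"},
    {"payment_method": "card", "connector": "acquirer_b", "status": "authorized"},
    {"payment_method": "card", "connector": "acquirer_b", "status": "failed"},
    {"payment_method": "wallet", "connector": "acquirer_a", "status": "authorized"},
]

print(authorization_rates(sample, "payment_method"))
print(authorization_rates(sample, "connector"))
```

Comparing the same rate across segments is what makes a single degraded processor or payment method stand out from the overall figure.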
Monitoring Configuration
Grafana dashboards are deployed as part of the Hyperswitch Helm installation.
The deployment process involves the following steps:
Deploy the monitoring stack components as part of the Kubernetes cluster.
Import the provided Grafana dashboard JSON files.
Configure the appropriate data sources for each dashboard.
Each dashboard requires selecting the appropriate data source at the top of the dashboard configuration.

Once configured, data typically begins appearing within a few minutes, provided that the monitoring agents are running correctly and metrics are being collected.
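Dashboard import can also be scripted against the Grafana HTTP API, which accepts dashboards at `POST /api/dashboards/db`. The sketch below prepares the request body. It assumes the provided JSON files reference their data source through a placeholder such as `${DS_PROMETHEUS}`, as dashboards exported for sharing commonly do; inspect the files to confirm before relying on this.

```python
import json


def build_import_payload(dashboard_json: str, datasource_inputs: dict) -> dict:
    """Substitute data source placeholders and wrap the dashboard for import."""
    for placeholder, datasource in datasource_inputs.items():
        # Replace "${NAME}" with the data source name or UID used in your Grafana.
        dashboard_json = dashboard_json.replace("${" + placeholder + "}", datasource)
    dashboard = json.loads(dashboard_json)
    # A null id tells Grafana to create the dashboard rather than update by id.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": True}


# Hypothetical, heavily trimmed dashboard file content.
raw = '{"id": 7, "title": "Payments", "panels": [{"datasource": "${DS_PROMETHEUS}"}]}'
payload = build_import_payload(raw, {"DS_PROMETHEUS": "Prometheus"})
print(json.dumps(payload, indent=2))
```

The payload would then be sent with an authenticated HTTP client; authentication and folder placement are left out of this sketch.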
Remote Monitoring (Enterprise Edition)
For merchants using the Enterprise Edition of Hyperswitch, optional remote monitoring support is available.
In this configuration, selected operational metrics are exported from the merchant's environment to a monitoring system maintained by the Hyperswitch team.
This allows the Hyperswitch team to:
monitor transaction patterns and infrastructure utilization
assist in troubleshooting production issues
identify performance optimization opportunities
provide proactive recommendations to improve authorization rates
Remote Monitoring Configuration
Remote monitoring is implemented using the Prometheus Remote Write API.
The merchant's Prometheus server (or compatible monitoring system such as VictoriaMetrics) pushes selected metrics to a Prometheus instance maintained by the Hyperswitch team.
This configuration allows merchants to retain full control over which metrics are shared externally.
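In Prometheus, this kind of control is usually expressed with `write_relabel_configs` on a `remote_write` entry, keeping only series whose metric name matches a regular expression. The sketch below models that structure as a Python dictionary and checks which metric names would pass the filter. The endpoint URL and the metric name pattern are placeholders; the actual values come from the snippet supplied by the Hyperswitch team.

```python
import re


def remote_write_config(url, metric_regex):
    """Mirror the shape of a Prometheus remote_write entry with a keep filter."""
    return {
        "remote_write": [
            {
                "url": url,
                "write_relabel_configs": [
                    {
                        "source_labels": ["__name__"],
                        "regex": metric_regex,
                        "action": "keep",
                    }
                ],
            }
        ]
    }


def is_exported(metric_name, metric_regex):
    """Prometheus anchors relabel regexes at both ends; re.fullmatch mirrors that."""
    return re.fullmatch(metric_regex, metric_name) is not None


# Placeholder endpoint and hypothetical metric name pattern.
config = remote_write_config("https://monitoring.example.invalid/api/v1/write", "hyperswitch_.*")
print(is_exported("hyperswitch_payments_total", "hyperswitch_.*"))
print(is_exported("node_cpu_seconds_total", "hyperswitch_.*"))
```

Series that do not match the pattern are dropped before they leave the merchant's environment.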
Setup Process
Before configuring remote monitoring, ensure that:
Prometheus is running correctly in the Kubernetes cluster
application metrics are visible in Grafana dashboards
The configuration process typically involves:
Obtaining the Prometheus remote-write configuration snippet from the Hyperswitch team.
Updating the Prometheus configuration to export selected metrics.
Reloading the Prometheus configuration.
Configuration reload can be triggered by sending a SIGHUP signal to the Prometheus process.
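As a minimal sketch, the signal can be sent from Python on a Unix-like host. The process ID is something you look up yourself (the value in the usage note is hypothetical), and the `kill` parameter exists only so the function can be exercised without signalling a real process.

```python
import os
import signal


def reload_prometheus(pid, kill=os.kill):
    """Ask the Prometheus process with the given PID to reload its configuration."""
    kill(pid, signal.SIGHUP)
```

For example, `reload_prometheus(12345)` signals the process with PID 12345. In Kubernetes this has to run where the Prometheus process is visible, typically inside its container. Prometheus also accepts an HTTP POST to `/-/reload` when started with `--web.enable-lifecycle`.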
Firewall or network egress rules may also need to be updated to allow outbound connections to the Hyperswitch monitoring endpoint.
After configuration, verify that:
remote write metrics are being exported successfully
the Hyperswitch team confirms receipt of the metrics
Note: Remote monitoring is available only for merchants subscribing to the Enterprise Edition.
Alerting
Alerting should be configured to proactively notify operators about abnormal system conditions.
Alerts should be categorized into Infrastructure Alerts, Application Alerts, and Business Alerts.
Infrastructure Alerts
Infrastructure alerts monitor system health and resource utilization.
CPU or memory usage > 80%: Sev 1
CPU or memory usage > 70%: Sev 2
CPU or memory usage > 60%: Sev 3
Health check endpoint returning non-2xx: Sev 1
Infrastructure alerts should apply to:
application pods
database instances
Redis instances
cluster nodes
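The infrastructure thresholds overlap, so a reading should be assigned the most severe level it exceeds. The sketch below encodes that rule; the function names are illustrative, and in practice the thresholds would live in the alerting system's rule definitions.

```python
def infrastructure_severity(utilization_percent):
    """Map CPU or memory utilization to a severity, checking the highest threshold first."""
    if utilization_percent > 80:
        return "Sev 1"
    if utilization_percent > 70:
        return "Sev 2"
    if utilization_percent > 60:
        return "Sev 3"
    return None


def health_check_severity(status_code):
    """Any non-2xx response from the health check endpoint is treated as Sev 1."""
    return None if 200 <= status_code < 300 else "Sev 1"


print(infrastructure_severity(85), infrastructure_severity(65), health_check_severity(503))
```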
Application Alerts
Application alerts monitor API behavior.
5xx API responses: Sev 1 (if >1 request)
Abnormal 4xx response rate: configurable based on merchant traffic
These alerts help identify:
application crashes
integration failures
misconfigurations
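The application alert rules can be sketched as a check over the status codes seen in an evaluation window. This reads "if >1 request" as more than one 5xx response within the window, which is an interpretation rather than something the rule states. The 4xx threshold is a parameter because it depends on merchant traffic.

```python
def application_alerts(status_codes, max_4xx_rate):
    """Return the alerts triggered by the HTTP status codes in one evaluation window."""
    total = len(status_codes)
    count_5xx = sum(1 for code in status_codes if 500 <= code <= 599)
    count_4xx = sum(1 for code in status_codes if 400 <= code <= 499)

    alerts = []
    if count_5xx > 1:
        alerts.append("Sev 1: 5xx responses")
    if total and count_4xx / total > max_4xx_rate:
        alerts.append("4xx rate above threshold")
    return alerts


print(application_alerts([200, 500, 502, 404], max_4xx_rate=0.5))
```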
Business Alerts
Business alerts monitor the performance of payment processing.
Success rate drop > 30%: Sev 1
Success rate drop > 20%: Sev 2
Success rate drop > 10%: Sev 3
These alerts should be configured for:
overall authorization success rate
payment-method-level success rate
processor-level or acquirer-level success rate
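A success rate drop has to be measured against some baseline. The sketch below assumes the drop is relative to a baseline rate (for example, a trailing average), which is an assumption: the thresholds could equally be read as percentage points, so choose the definition that matches your alerting policy.

```python
def success_rate_drop_severity(baseline_rate, current_rate):
    """Classify the relative drop from baseline_rate to current_rate (both in 0..1)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    drop_percent = (baseline_rate - current_rate) / baseline_rate * 100
    if drop_percent > 30:
        return "Sev 1"
    if drop_percent > 20:
        return "Sev 2"
    if drop_percent > 10:
        return "Sev 3"
    return None


print(success_rate_drop_severity(0.90, 0.60))
```

The same function can be applied per payment method or per acquirer by passing the baseline and current rate for that segment.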
Logging and Log Archival
Logging in production environments serves the following purposes:
troubleshooting and debugging
security monitoring
compliance and auditing
operational analysis
It is recommended to centralize logs using platforms such as:
Elasticsearch / Logstash / Kibana (ELK Stack)
Splunk
Grafana Loki
The following practices should be implemented in production environments.
Log Retention
Application logs must be archived and retained in accordance with compliance requirements such as PCI DSS.
Retention policies should be defined based on organizational security policies.
Log Rotation
Automated log rotation must be configured to prevent excessive disk usage and ensure long-term system stability.
Security and Access Control
Logs should be protected through:
encryption at rest
encryption in transit
restricted access for authorized personnel only
Although Hyperswitch does not log sensitive payment data, logs must still be handled securely to maintain compliance and prevent information leakage.
Log Review
Access logs and operational logs should be periodically reviewed to detect:
unauthorized access
suspicious system activity
operational anomalies
This process is commonly required during security audits and PCI compliance reviews.