Application Monitoring
Application monitoring instruments software systems to measure response times, error rates, throughput, and user experience. You implement this monitoring when deploying new applications, when existing applications lack visibility, or when service level targets require measurement. The procedures on this page produce dashboards showing application health, alerts for degraded performance, and diagnostic data for troubleshooting.
- Application Performance Monitoring (APM)
- Instrumentation that traces requests through application code, measuring execution time for each component and capturing errors with stack traces.
- Synthetic Monitoring
- Automated scripts that simulate user interactions at scheduled intervals, measuring availability and performance from external vantage points.
- Real User Monitoring (RUM)
- JavaScript instrumentation in browsers or SDKs in mobile applications that captures actual user experience metrics including page load times, interaction delays, and errors.
- Distributed Tracing
- Correlation of requests across multiple services using trace identifiers, enabling end-to-end visibility through microservices architectures.
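Distributed tracing relies on the W3C Trace Context `traceparent` header to carry the trace identifier between services. As a minimal illustration of what that header contains (a sketch of the format only, not tied to any particular SDK), a parser looks like this:

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,        # shared by every span in the trace
        "parent_span_id": span_id,   # the span that issued this request
        "sampled": bool(int(flags, 16) & 0x01),
    }

parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed["trace_id"], parsed["sampled"])
```

Every service that receives this header attaches its spans to the same 16-byte trace ID, which is what makes end-to-end correlation possible.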
Prerequisites
Before implementing application monitoring, confirm you have the following access and information.
You need administrative access to the application deployment environment, whether that is SSH access to servers, Kubernetes cluster credentials with permissions to deploy DaemonSets and modify deployments, or platform-as-a-service console access. For APM agents, you need the ability to modify application startup parameters or add dependencies to the application package. For RUM, you need the ability to modify HTML templates or JavaScript bundles served to users.
Gather the application architecture documentation showing the services involved, their communication patterns, and the technology stack for each component. You need this to select appropriate instrumentation methods. Identify the programming languages and frameworks in use: a Python Django application requires different APM configuration than a Java Spring Boot service or a Node.js Express application.
Obtain baseline performance data if available. Review existing logs for response time patterns, check infrastructure monitoring for resource utilisation during normal operation, and collect any available user feedback about performance. If no baseline exists, the first monitoring implementation establishes one.
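When deriving a baseline from existing logs, a percentile summary is more useful than an average. The sketch below assumes a hypothetical access-log format whose last field is the response time in seconds; adapt the parsing to your actual log layout:

```python
import statistics

# Hypothetical access-log lines: method, path, status, response time (s)
LOG_LINES = [
    "GET /api/health 200 0.042",
    "POST /api/applications 201 0.310",
    "GET /api/applications 200 0.120",
    "GET /api/health 200 0.038",
    "POST /api/applications 500 1.940",
]

times = sorted(float(line.rsplit(" ", 1)[1]) for line in LOG_LINES)

p50 = statistics.median(times)
# Nearest-rank p95: the smallest value covering 95% of observations
p95 = times[min(len(times) - 1, int(0.95 * len(times)))]
print(f"baseline p50={p50:.3f}s p95={p95:.3f}s")
```

Record the p50 and p95 values per endpoint; they become the reference points for alert thresholds once monitoring is in place.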
Verify network connectivity between application hosts and the monitoring platform. APM agents transmit telemetry data over HTTPS, requiring outbound access to the collector endpoint (commonly port 443, or 4318 for an OTLP collector as in the examples below). For self-hosted monitoring platforms like SigNoz or Jaeger, confirm the collector service is reachable from application hosts.
Install the monitoring platform or confirm access to the monitoring service. The procedures below reference open source options (SigNoz, Prometheus, Grafana) and note where commercial alternatives differ. Ensure the platform is operational before instrumenting applications.
| Prerequisite | Verification command | Expected result |
|---|---|---|
| Monitoring platform reachable | curl -s https://collector.example.org/health | HTTP 200 response |
| Application deployment access | kubectl auth can-i update deployments | "yes" |
| Language runtime identified | python --version or java -version | Version string |
| Network egress permitted | nc -zv collector.example.org 443 | "Connection succeeded" |
Procedure
Application monitoring implementation proceeds through layers: APM for internal application behaviour, synthetic monitoring for external availability, RUM for actual user experience, and custom instrumentation for business-specific metrics. Implement APM first as it provides the foundation for understanding application behaviour, then add synthetic and RUM based on requirements.
Configuring Application Performance Monitoring
APM agents attach to your application runtime, intercepting method calls and HTTP requests to measure timing and capture errors. The agent transmits this telemetry to a collector service that aggregates traces and makes them queryable.
- Select the APM agent matching your application’s primary language. For OpenTelemetry-based monitoring with SigNoz or Jaeger, download the language-specific SDK. For Python applications:
```bash
pip install opentelemetry-distro opentelemetry-exporter-otlp \
  --break-system-packages
opentelemetry-bootstrap -a install
```

For Java applications, download the agent JAR:

```bash
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
```

For Node.js applications:

```bash
npm install @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http
```

- Configure the agent to connect to your collector. Create an environment file or modify your application's startup configuration. The critical parameters are the collector endpoint and the service name that identifies this application in traces:
```
OTEL_SERVICE_NAME=grants-management-api
OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.org:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
```

The sampler configuration controls what percentage of traces the system retains. A value of 0.1 samples 10% of traces, appropriate for high-traffic applications generating over 1000 requests per minute. For lower-traffic applications or during initial debugging, increase to 1.0 to capture all traces.
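Ratio-based sampling decides per trace, not per request, so all services in a distributed trace agree on whether to keep it. The following is a simplified sketch of that decision (the real OpenTelemetry sampler compares the lower 8 bytes of the trace ID against a bound derived from the ratio; exact details vary by SDK):

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1

def should_sample(trace_id: int, ratio: float) -> bool:
    # Keep the trace when its low 64 bits fall below ratio * 2^64.
    # Because the decision depends only on the trace ID, every service
    # seeing the same trace makes the same choice.
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LIMIT) < bound

rng = random.Random(0)
kept = sum(should_sample(rng.getrandbits(128), 0.1) for _ in range(10_000))
print(f"kept {kept}/10000 traces")  # roughly 10%
```

This is also why raising OTEL_TRACES_SAMPLER_ARG on one service has no effect on traces whose sampling decision was already made upstream by the parent.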
- Modify the application startup to load the agent. For Python with a WSGI server:
```bash
opentelemetry-instrument gunicorn grants_app.wsgi:application \
  --workers 4 --bind 0.0.0.0:8000
```

For Java applications, add the agent to the JVM arguments:

```bash
java -javaagent:/opt/opentelemetry-javaagent.jar \
  -Dotel.service.name=case-management-service \
  -Dotel.exporter.otlp.endpoint=https://collector.example.org:4318 \
  -jar case-management.jar
```

For Node.js, require the instrumentation before your application code:
```javascript
// tracing.js - load before application
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'beneficiary-portal',
  traceExporter: new OTLPTraceExporter({
    url: 'https://collector.example.org:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Start Node.js with:

```bash
node --require ./tracing.js app.js
```
- For containerised deployments in Kubernetes, add the agent configuration to your deployment manifest. This example shows a Python application with the OpenTelemetry agent:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grants-api
spec:
  template:
    spec:
      containers:
        - name: grants-api
          image: registry.example.org/grants-api:1.4.2
          command: ["opentelemetry-instrument"]
          args: ["gunicorn", "grants_app.wsgi:application",
                 "--workers", "4", "--bind", "0.0.0.0:8000"]
          env:
            - name: OTEL_SERVICE_NAME
              value: "grants-management-api"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector.monitoring:4318"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```

- Restart the application with the agent attached. Monitor the application logs during startup to confirm the agent initialises without errors:
```bash
# For systemd-managed services
sudo systemctl restart grants-api
journalctl -u grants-api -f | grep -i opentelemetry
```

Successful initialisation shows messages indicating the exporter connected and instrumentation loaded. Error messages at this stage indicate configuration problems; see Troubleshooting.
- Generate test traffic and verify traces appear in your monitoring platform. Make several requests to the application:
```bash
for i in {1..10}; do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    https://grants.example.org/api/health
done
```

In SigNoz or Jaeger, navigate to the traces view and filter by the service name you configured. Traces should appear within 30 seconds of the requests.
```
                     APM DATA FLOW

+-------------------+              +-------------------+
|   Application     |              |   Application     |
|   Instance 1      |              |   Instance 2      |
| +---------------+ |              | +---------------+ |
| |   APM Agent   | |              | |   APM Agent   | |
| | (in-process)  | |              | | (in-process)  | |
| +-------+-------+ |              | +-------+-------+ |
+---------+---------+              +---------+---------+
          |                                  |
          | OTLP/HTTP                        | OTLP/HTTP
          | (port 4318)                      | (port 4318)
          v                                  v
+---------+----------------------------------+---------+
|                    OTEL Collector                    |
|                                                      |
|  +----------+     +-----------+     +----------+     |
|  | Receiver | --> | Processor | --> | Exporter |     |
|  +----------+     +-----------+     +----------+     |
+--------------------------+---------------------------+
                           |
                           v
+--------------------------+---------------------------+
|                  Monitoring Backend                  |
|     (SigNoz / Jaeger / Tempo / Commercial APM)       |
|                                                      |
|  +---------+      +---------+      +--------+        |
|  | Trace   |      | Metrics |      | Query  |        |
|  | Storage |      | Storage |      | Engine |        |
|  +---------+      +---------+      +--------+        |
+------------------------------------------------------+
```

Figure 1: APM telemetry flows from agents through the collector to the monitoring backend
Configuring Synthetic Monitoring
Synthetic monitoring executes scripted checks against your application from external locations, measuring availability and response time independent of real user traffic. This provides consistent baseline measurements and detects outages before users report them.
Identify the critical user journeys to monitor synthetically. For a grants management system, this includes the login flow, grant application submission, and report generation. For each journey, document the HTTP requests involved and the expected response characteristics.
Start with availability checks for key endpoints before implementing complex transaction monitors:
```yaml
# synthetic-checks.yaml for Prometheus Blackbox Exporter
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    timeout: 15s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": true}'
      valid_status_codes: [200, 201]
```

- Deploy the Blackbox Exporter for Prometheus-based synthetic monitoring. On a monitoring host or as a Kubernetes deployment:
```bash
# Download and install
curl -LO https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar xzf blackbox_exporter-0.24.0.linux-amd64.tar.gz
sudo mv blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/

# Create configuration directory
sudo mkdir -p /etc/blackbox_exporter
sudo mv synthetic-checks.yaml /etc/blackbox_exporter/config.yaml

# Create systemd service
sudo tee /etc/systemd/system/blackbox_exporter.service << 'EOF'
[Unit]
Description=Blackbox Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/blackbox_exporter \
  --config.file=/etc/blackbox_exporter/config.yaml
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now blackbox_exporter
```

- Configure Prometheus to scrape the Blackbox Exporter for each target endpoint. Add to your Prometheus configuration:
```yaml
# prometheus.yml - add to scrape_configs
scrape_configs:
  - job_name: 'synthetic-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://grants.example.org/health
          - https://grants.example.org/api/v1/status
          - https://beneficiary-portal.example.org/
          - https://case-management.example.org/api/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # Blackbox exporter address
```

- For multi-step transaction monitoring requiring login or form submission, use a dedicated synthetic monitoring tool. Checkly (commercial with free tier) or an open source Playwright-based approach provides browser-based synthetic tests:
```javascript
// synthetic-login-test.js using Playwright
const { chromium } = require('playwright');

async function testLoginFlow() {
  const startTime = Date.now();
  const browser = await chromium.launch();
  const page = await browser.newPage();

  try {
    // Navigate to login page
    await page.goto('https://grants.example.org/login');
    const navigationTime = Date.now() - startTime;

    // Fill credentials
    await page.fill('#username', process.env.SYNTHETIC_USER);
    await page.fill('#password', process.env.SYNTHETIC_PASS);

    // Submit and wait for dashboard
    const submitStart = Date.now();
    await page.click('#login-button');
    await page.waitForSelector('#dashboard', { timeout: 30000 });
    const loginTime = Date.now() - submitStart;

    const totalTime = Date.now() - startTime;

    // Export metrics
    console.log(`navigation_time_ms=${navigationTime}`);
    console.log(`login_time_ms=${loginTime}`);
    console.log(`total_time_ms=${totalTime}`);
    console.log(`status=success`);
  } catch (error) {
    console.log(`status=failure`);
    console.log(`error=${error.message}`);
  } finally {
    await browser.close();
  }
}

testLoginFlow();
```

Schedule this script to run every 5 minutes from a monitoring host outside your application network. Use cron or a container scheduler:

```
*/5 * * * * monitoring /usr/local/bin/node /opt/synthetic/login-test.js >> /var/log/synthetic/login.log 2>&1
```

- Verify synthetic checks execute and metrics appear. Query Prometheus for the probe metrics:
```
probe_success{job="synthetic-http"}
probe_duration_seconds{job="synthetic-http"}
```

All targets should show a probe_success value of 1. Response times appear in probe_duration_seconds.
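The Playwright script prints key=value pairs rather than Prometheus metrics. One way to bridge that gap is the node_exporter textfile collector: convert the script's output into textfile format and write it to the collector directory. This is a hedged sketch; the `synthetic_login` metric prefix and file path are assumptions, not established conventions:

```python
def to_textfile(output: str, prefix: str = "synthetic_login") -> str:
    """Convert key=value lines from the synthetic check into
    Prometheus textfile-collector format."""
    lines = []
    for raw in output.strip().splitlines():
        key, _, value = raw.partition("=")
        if key == "status":
            lines.append(f"{prefix}_success {1 if value == 'success' else 0}")
        elif key.endswith("_ms"):
            # Export durations in seconds, per Prometheus naming conventions
            lines.append(f"{prefix}_{key[:-3]}_seconds {float(value) / 1000}")
    return "\n".join(lines) + "\n"

sample = "navigation_time_ms=840\nlogin_time_ms=1290\ntotal_time_ms=2130\nstatus=success"
print(to_textfile(sample))
```

Write the result to a file such as `/var/lib/node_exporter/textfile/synthetic_login.prom` (an assumed path; use whatever directory your node_exporter's `--collector.textfile.directory` points at) after each scheduled run.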
Configuring Real User Monitoring
Real user monitoring captures performance data from actual user sessions, revealing how application performance varies across geographies, devices, and network conditions. RUM data reflects the true user experience rather than synthetic simulations.
- Add the RUM instrumentation script to your application’s HTML. For OpenTelemetry-based RUM compatible with SigNoz:
```html
<!-- Add to <head> section of your base template -->
<script src="https://unpkg.com/@opentelemetry/api@1.4.1/build/bundles/opentelemetry-api.min.js"></script>
<script src="https://unpkg.com/@opentelemetry/sdk-trace-web@1.15.0/build/bundles/opentelemetry-sdk-trace-web.min.js"></script>
<script src="https://unpkg.com/@opentelemetry/instrumentation-document-load@0.31.0/build/bundles/opentelemetry-instrumentation-document-load.min.js"></script>
<script src="https://unpkg.com/@opentelemetry/instrumentation-fetch@0.41.0/build/bundles/opentelemetry-instrumentation-fetch.min.js"></script>
<script src="https://unpkg.com/@opentelemetry/exporter-trace-otlp-http@0.41.0/build/bundles/opentelemetry-exporter-trace-otlp-http.min.js"></script>

<script>
  const provider = new opentelemetry.sdk.trace.WebTracerProvider({
    resource: new opentelemetry.resources.Resource({
      'service.name': 'beneficiary-portal-frontend',
    }),
  });

  provider.addSpanProcessor(
    new opentelemetry.sdk.trace.BatchSpanProcessor(
      new opentelemetry.exporter.OTLPTraceExporter({
        url: 'https://collector.example.org:4318/v1/traces',
      })
    )
  );

  provider.register();

  // Register auto-instrumentations
  opentelemetry.instrumentation.registerInstrumentations({
    instrumentations: [
      new opentelemetry.instrumentation.DocumentLoadInstrumentation(),
      new opentelemetry.instrumentation.FetchInstrumentation({
        propagateTraceHeaderCorsUrls: [/example\.org/],
      }),
    ],
  });
</script>
```

- For single-page applications built with React, Vue, or Angular, integrate RUM instrumentation into the application build rather than loading scripts dynamically. For a React application:
```javascript
// telemetry.js
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { ZoneContextManager } from '@opentelemetry/context-zone';
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { getWebAutoInstrumentations } from '@opentelemetry/auto-instrumentations-web';
import { Resource } from '@opentelemetry/resources';

const provider = new WebTracerProvider({
  resource: new Resource({
    'service.name': 'grants-portal-spa',
    'deployment.environment': process.env.NODE_ENV,
  }),
});

provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({
      url: process.env.REACT_APP_OTEL_ENDPOINT + '/v1/traces',
    })
  )
);

provider.register({
  contextManager: new ZoneContextManager(),
});

registerInstrumentations({
  instrumentations: [
    getWebAutoInstrumentations({
      '@opentelemetry/instrumentation-fetch': {
        propagateTraceHeaderCorsUrls: [
          new RegExp(process.env.REACT_APP_API_URL),
        ],
      },
    }),
  ],
});

export default provider;
```

Import this module at your application entry point before other code:

```javascript
import './telemetry'; // Must be first import
import React from 'react';
import ReactDOM from 'react-dom';
import App from './App';

ReactDOM.render(<App />, document.getElementById('root'));
```

- Configure Cross-Origin Resource Sharing headers on your collector to accept RUM data from browser origins. In your OTEL Collector configuration:
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins:
            - https://grants.example.org
            - https://beneficiary-portal.example.org
          allowed_headers:
            - content-type
            - x-requested-with
          max_age: 7200
```

Deploy the updated application and verify RUM data arrives. Open your application in a browser, navigate through several pages, and check the monitoring platform for traces with the frontend service name. Browser developer tools (Network tab) should show successful POST requests to your collector endpoint.
Create dashboards for core web vitals derived from RUM data. The key metrics to track:
```
# Largest Contentful Paint (target: under 2.5 seconds)
histogram_quantile(0.75,
  sum(rate(browser_lcp_bucket[5m])) by (le, page_path)
)

# First Input Delay (target: under 100ms)
histogram_quantile(0.75,
  sum(rate(browser_fid_bucket[5m])) by (le, page_path)
)

# Cumulative Layout Shift (target: under 0.1)
histogram_quantile(0.75,
  sum(rate(browser_cls_bucket[5m])) by (le, page_path)
)
```

Configuring API Monitoring
API endpoints require specific monitoring beyond general APM to track request volumes, error rates by endpoint, and latency distributions. This instrumentation enables identification of slow or failing endpoints affecting downstream consumers.
- Expose application metrics in Prometheus format from your API. Most web frameworks have middleware that generates standard HTTP metrics. For Python Flask or Django with the prometheus-client library:
```python
from prometheus_client import Counter, Histogram, generate_latest
from flask import request, Response
import time

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

def before_request():
    request.start_time = time.time()

def after_request(response):
    latency = time.time() - request.start_time
    endpoint = request.endpoint or 'unknown'
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(latency)
    return response

def metrics_endpoint():
    return Response(generate_latest(), mimetype='text/plain')
```

Register these in your Flask application:

```python
from flask import Flask
from metrics import before_request, after_request, metrics_endpoint

app = Flask(__name__)
app.before_request(before_request)
app.after_request(after_request)
app.add_url_rule('/metrics', 'metrics', metrics_endpoint)
```

- For Java Spring Boot applications, add the Micrometer Prometheus registry:
```xml
<!-- pom.xml -->
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

Then enable the Prometheus endpoint and request histograms in application.yml:

```yaml
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health
  metrics:
    tags:
      application: case-management-api
    distribution:
      percentiles-histogram:
        http.server.requests: true
      slo:
        http.server.requests: 100ms,250ms,500ms,1s
```

- Configure Prometheus to scrape your API metrics endpoints:
```yaml
scrape_configs:
  - job_name: 'grants-api'
    scrape_interval: 15s
    static_configs:
      - targets: ['grants-api.example.org:8000']
    metrics_path: /metrics

  - job_name: 'case-management-api'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: ['production']
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: case-management-api
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: metrics
        action: keep
```

- Create API-specific dashboards showing the RED metrics (Rate, Errors, Duration) for each endpoint. Example Grafana panel queries:
```
# Request rate by endpoint (requests per second)
sum(rate(http_requests_total[5m])) by (endpoint)

# Error rate by endpoint (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (endpoint)
  / sum(rate(http_requests_total[5m])) by (endpoint) * 100

# 95th percentile latency by endpoint
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)
```

Configuring Database Performance Monitoring
Database queries often dominate application response time. Database monitoring identifies slow queries, connection pool exhaustion, and capacity constraints before they affect users.
- Enable the database’s built-in performance metrics. For PostgreSQL, install the pg_stat_statements extension:
```sql
-- Connect as superuser
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Verify installation
SELECT * FROM pg_stat_statements LIMIT 1;
```

Configure PostgreSQL to track query statistics in postgresql.conf:

```
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
track_activity_query_size = 2048
```

- Deploy the Prometheus PostgreSQL exporter to expose database metrics:
```bash
# Create connection string file
echo "DATA_SOURCE_NAME=postgresql://monitor:password@localhost:5432/grants?sslmode=disable" > /etc/default/postgres_exporter

# Download and install exporter
curl -LO https://github.com/prometheus-community/postgres_exporter/releases/download/v0.13.2/postgres_exporter-0.13.2.linux-amd64.tar.gz
tar xzf postgres_exporter-0.13.2.linux-amd64.tar.gz
sudo mv postgres_exporter-0.13.2.linux-amd64/postgres_exporter /usr/local/bin/

# Create systemd service
sudo tee /etc/systemd/system/postgres_exporter.service << 'EOF'
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network.target

[Service]
Type=simple
User=prometheus
EnvironmentFile=/etc/default/postgres_exporter
ExecStart=/usr/local/bin/postgres_exporter \
  --collector.stat_statements
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now postgres_exporter
```

- For MySQL or MariaDB, deploy the MySQL exporter:
```sql
-- Create monitoring user
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'strong_password_here';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
```

```bash
# Configure and start exporter
cat > /etc/.mysqld_exporter.cnf << 'EOF'
[client]
user=exporter
password=strong_password_here
EOF
chmod 600 /etc/.mysqld_exporter.cnf

mysqld_exporter --config.my-cnf=/etc/.mysqld_exporter.cnf
```
```
# Active connections vs maximum
pg_stat_activity_count{state="active"}
pg_settings_max_connections

# Query execution time (top slow queries)
topk(10, pg_stat_statements_mean_time_seconds)

# Cache hit ratio (target: above 99%)
pg_stat_database_blks_hit /
  (pg_stat_database_blks_hit + pg_stat_database_blks_read)

# Transaction rate
rate(pg_stat_database_xact_commit[5m])
```

```
          APPLICATION MONITORING LAYERS

+-----------------------------------------------+
|             USER EXPERIENCE LAYER             |
|  +------------------+   +------------------+  |
|  | Real User        |   | Synthetic        |  |
|  | Monitoring       |   | Monitoring       |  |
|  | - Page load      |   | - Availability   |  |
|  | - Interactions   |   | - Transaction    |  |
|  | - Core Web Vitals|   |   success        |  |
|  +------------------+   +------------------+  |
+-----------------------+-----------------------+
                        |
                        v
+-----------------------+-----------------------+
|               APPLICATION LAYER               |
|  +------------------+   +------------------+  |
|  | APM Tracing      |   | API Metrics      |  |
|  | - Request traces |   | - Request rate   |  |
|  | - Error capture  |   | - Error rate     |  |
|  | - Dependencies   |   | - Latency p50/95 |  |
|  +------------------+   +------------------+  |
+-----------------------+-----------------------+
                        |
                        v
+-----------------------+-----------------------+
|                  DATA LAYER                   |
|  +------------------+   +------------------+  |
|  | Database Metrics |   | Cache Metrics    |  |
|  | - Query time     |   | - Hit ratio      |  |
|  | - Connections    |   | - Memory usage   |  |
|  | - Lock waits     |   | - Evictions      |  |
|  +------------------+   +------------------+  |
+-----------------------+-----------------------+
                        |
                        v
+-----------------------+-----------------------+
|             INFRASTRUCTURE LAYER              |
|        (see Infrastructure Monitoring)        |
+-----------------------------------------------+
```

Figure 2: Application monitoring spans multiple layers, each providing different visibility
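The cache hit ratio target above is easy to sanity-check by hand. The sketch below applies the same hit/(hit+read) formula to hypothetical counter values from pg_stat_database:

```python
def cache_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of block requests served from shared buffers rather
    than disk: blks_hit / (blks_hit + blks_read)."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

# Hypothetical counter values from pg_stat_database
ratio = cache_hit_ratio(blks_hit=995_000, blks_read=5_000)
print(f"cache hit ratio: {ratio:.2%}")  # 99.50% -- meets the >99% target
```

A ratio persistently below 99% on an OLTP workload usually means the working set no longer fits in shared_buffers, which shows up in traces as slow query spans.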
Adding Custom Instrumentation
Auto-instrumentation captures framework-level metrics but misses business-specific operations. Custom instrumentation adds spans for important business logic, enabling correlation between technical metrics and business outcomes.
Identify business-critical operations that warrant explicit instrumentation. Examples include payment processing, beneficiary registration, grant application submission, and report generation. These operations should appear as named spans in traces.
Add custom spans around business operations. For Python with OpenTelemetry:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_grant_application(application_id: str, applicant_data: dict):
    with tracer.start_as_current_span("process_grant_application") as span:
        span.set_attribute("application.id", application_id)
        span.set_attribute("applicant.type", applicant_data.get("type"))

        # Validation phase
        with tracer.start_as_current_span("validate_application"):
            validation_result = validate_application_data(applicant_data)
            span.set_attribute("validation.passed", validation_result.is_valid)

        # Eligibility check
        with tracer.start_as_current_span("check_eligibility"):
            eligibility = check_funding_eligibility(applicant_data)
            span.set_attribute("eligibility.score", eligibility.score)

        # Storage
        with tracer.start_as_current_span("store_application"):
            stored = store_application(application_id, applicant_data)

        span.set_attribute("application.status", "submitted")
        return stored
```

- Add custom metrics for business KPIs. Create counters and histograms for operations you need to track:
```python
from prometheus_client import Counter, Histogram

APPLICATIONS_SUBMITTED = Counter(
    'grant_applications_submitted_total',
    'Total grant applications submitted',
    ['grant_type', 'applicant_type', 'status']
)

APPLICATION_PROCESSING_TIME = Histogram(
    'grant_application_processing_seconds',
    'Time to process grant application',
    ['grant_type'],
    buckets=[1, 5, 10, 30, 60, 120, 300]
)

def submit_application(grant_type: str, applicant_type: str, data: dict):
    with APPLICATION_PROCESSING_TIME.labels(grant_type=grant_type).time():
        result = process_application(data)

    APPLICATIONS_SUBMITTED.labels(
        grant_type=grant_type,
        applicant_type=applicant_type,
        status=result.status
    ).inc()

    return result
```

- For JavaScript frontends, add custom spans for user interactions:
```javascript
import { trace, context } from '@opentelemetry/api';

const tracer = trace.getTracer('grants-portal');

async function submitGrantApplication(formData) {
  const span = tracer.startSpan('submit_grant_application');

  try {
    span.setAttribute('grant.type', formData.grantType);
    span.setAttribute('form.field_count', Object.keys(formData).length);

    // Parent the validation span to the submission span via context
    const parentContext = trace.setSpan(context.active(), span);
    const validationSpan = tracer.startSpan('validate_form', undefined, parentContext);
    const isValid = validateForm(formData);
    validationSpan.setAttribute('validation.passed', isValid);
    validationSpan.end();

    if (!isValid) {
      span.setAttribute('submission.status', 'validation_failed');
      span.end();
      return { success: false, error: 'Validation failed' };
    }

    const response = await fetch('/api/applications', {
      method: 'POST',
      body: JSON.stringify(formData)
    });

    span.setAttribute('submission.status', response.ok ? 'success' : 'failed');
    span.setAttribute('http.status_code', response.status);
    span.end();

    return await response.json();
  } catch (error) {
    span.recordException(error);
    span.setAttribute('submission.status', 'error');
    span.end();
    throw error;
  }
}
```

Verification
After implementing each monitoring layer, verify data flows correctly through the system and dashboards display meaningful information.
Confirm APM traces appear by executing a request and locating it in your tracing backend. The trace should show the complete request path including database queries, external HTTP calls, and custom spans:
```bash
# Generate a traced request
curl -H "X-Request-ID: test-$(date +%s)" https://grants.example.org/api/applications/

# In SigNoz or Jaeger, search for traces from the last 15 minutes
# Filter by service name and verify the trace shows:
# - HTTP handler span
# - Database query spans
# - Any external service calls
```

Verify synthetic monitoring executes and records results:
```bash
# Query Prometheus for synthetic check results
curl -s 'http://prometheus.example.org:9090/api/v1/query?query=probe_success' \
  | jq '.data.result[] | {target: .metric.instance, success: .value[1]}'

# Expected output shows all targets with success value "1"
```

Confirm RUM data arrives from browser sessions:
```bash
# Check collector logs for incoming browser traces
kubectl logs -l app=otel-collector -n monitoring | grep "service.name=.*frontend"

# Verify traces appear in the UI with browser-specific attributes:
# - browser.name
# - browser.version
# - user_agent.original
```

Test API metrics export by querying the metrics endpoint directly:
```bash
curl -s https://grants.example.org/metrics | grep http_request

# Expected output includes:
# http_requests_total{method="GET",endpoint="applications",status="200"} 1547
# http_request_duration_seconds_bucket{method="GET",endpoint="applications",le="0.1"} 1423
```

Validate database metrics collection:
```bash
# Query Prometheus for database metrics
curl -s 'http://prometheus.example.org:9090/api/v1/query?query=pg_stat_activity_count' \
  | jq '.data.result'

# Verify pg_stat_statements metrics show query data
curl -s 'http://prometheus.example.org:9090/api/v1/query?query=topk(5,pg_stat_statements_mean_time_seconds)' \
  | jq '.data.result'
```

Troubleshooting
| Symptom | Cause | Resolution |
|---|---|---|
| No traces appear in monitoring platform | Agent not initialised or collector unreachable | Check application logs for agent startup messages; verify network connectivity to collector with curl -v https://collector.example.org:4318/v1/traces |
| Traces appear but show only single span | Auto-instrumentation not detecting framework | Verify framework is supported; add explicit instrumentation for unsupported libraries |
| High latency reported but application feels fast | Clock skew between services | Synchronise time with NTP across all hosts; verify with chronyc tracking |
| RUM data missing from some users | Content Security Policy blocking collector | Add collector domain to CSP connect-src directive; check browser console for CSP errors |
| Synthetic checks fail intermittently | Network instability or timeout too short | Increase timeout in Blackbox Exporter config; add retry logic; check from multiple locations |
| Database exporter shows no data | Insufficient privileges for monitoring user | Grant required permissions: GRANT pg_monitor TO exporter_user; for PostgreSQL |
| Metrics endpoint returns 404 | Metrics middleware not registered or wrong path | Verify middleware registration in application startup; check configured metrics path |
| Trace sampling drops important requests | Sampling rate too low | Increase OTEL_TRACES_SAMPLER_ARG value; implement custom sampler for critical paths |
| Custom spans not appearing in traces | Span not properly parented or ended | Ensure spans are created within context of parent; call span.end() in all code paths including exceptions |
| Browser traces not correlating with backend | Trace context not propagating in CORS requests | Configure propagateTraceHeaderCorsUrls in RUM instrumentation; add traceparent to allowed headers in backend CORS config |
| Collector rejecting data with 413 error | Batch size exceeds collector limit | Reduce batch size in span processor configuration: maxExportBatchSize: 256 |
| Metrics cardinality explosion causing storage issues | Unbounded labels on metrics (user IDs, request IDs) | Remove high-cardinality labels; use tracing for request-specific data instead of metrics |
| APM agent causing application startup failure | Agent version incompatible with runtime | Check agent compatibility matrix; downgrade agent or upgrade runtime |
| Missing database query details in traces | Statement sanitisation removing query text | Configure agent to include query text: OTEL_INSTRUMENTATION_DB_STATEMENT=true (caution: may log sensitive data) |