Chapter 12: Operations & Maintenance

Day-to-day operations, monitoring dashboards, certificate lifecycle management, maintenance schedules, and incident response procedures for identity authentication systems

Effective operations and maintenance (O&M) of a network identity authentication system requires a combination of proactive monitoring, scheduled maintenance activities, and well-rehearsed incident response procedures. Because the system is a critical security control, any degradation in availability or accuracy has an immediate impact on both security posture and user productivity. This chapter provides the operational framework, monitoring requirements, maintenance schedules, and incident response playbooks needed to sustain the system at production quality throughout its operational life.

12.1 Operations Center and Monitoring Dashboard

The identity authentication system must be monitored continuously from a centralized operations center. The monitoring dashboard should provide real-time visibility into authentication success rates, active sessions, certificate expiry status, RADIUS server health, and VLAN assignment statistics. Figure 12.1 illustrates an operations center setup with the recommended dashboard layout, and the table that follows defines the metrics, alert thresholds, and data sources for each panel.

Figure 12.1: Operations Center. Enterprise authentication monitoring dashboard showing real-time auth success/failure rates (98.5%), active sessions (15,402), a certificate expiry calendar, RADIUS server health metrics for three nodes, VLAN assignment statistics, and a SIEM log view, monitored by an engineer at a multi-screen workstation.

Dashboard Panel	Key Metrics	Alert Threshold	Data Source
Authentication Overview	Auth/s, success rate, failure rate, top failure reasons	Success rate < 99.5% → P2 alert; < 99% → P1 alert	RADIUS accounting logs; SIEM
Active Sessions	Total active sessions, by VLAN, by user type, by location	Sudden drop > 20% → P2 alert (possible outage)	RADIUS accounting; switch SNMP
Certificate Expiry Calendar	Certs expiring in 30/7/1 days; expired certs	Any cert expiring in < 7 days → P2; expired → P1	CA database; LDAP cert attributes
RADIUS Server Health	CPU%, RAM%, auth/s per node, queue depth, error rate	CPU > 80% → P2; any node down → P1	SNMP; RADIUS internal metrics
OCSP/CRL Status	OCSP responder availability; CRL freshness; response time	OCSP unavailable → P1; CRL stale > 2× validity → P1	OCSP monitoring probe; CRL download check
SIEM Authentication Events	Auth event volume, anomaly score, failed auth by source IP	Brute-force pattern detected → P1 security alert	SIEM correlation engine
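
The alert thresholds in the table above can be expressed as simple rules. The following is a minimal Python sketch, not a reference implementation: the function names, the use of percentages as plain floats, and the convention that a missing CPU reading means "node down" are all assumptions made for illustration. Only the numeric thresholds come from the table.

```python
from __future__ import annotations

from typing import Optional


def auth_success_priority(success_rate_pct: float) -> Optional[str]:
    """Map an authentication success rate (percent) to an alert priority.

    Thresholds from the Authentication Overview row:
    below 99% is P1, below 99.5% is P2, otherwise no alert.
    """
    if success_rate_pct < 99.0:
        return "P1"
    if success_rate_pct < 99.5:
        return "P2"
    return None


def session_drop_priority(previous: int, current: int) -> Optional[str]:
    """Flag a sudden drop of more than 20% in active sessions as P2."""
    if previous <= 0:
        return None  # no baseline to compare against
    drop = (previous - current) / previous
    return "P2" if drop > 0.20 else None


def radius_health_priority(cpu_pct_by_node: dict[str, Optional[float]]) -> Optional[str]:
    """Evaluate the RADIUS Server Health row.

    Assumption: a CPU reading of None means the node did not respond
    and is treated as down (P1). Any node above 80% CPU is P2.
    """
    readings = list(cpu_pct_by_node.values())
    if any(cpu is None for cpu in readings):
        return "P1"
    if any(cpu > 80.0 for cpu in readings):
        return "P2"
    return None


if __name__ == "__main__":
    print(auth_success_priority(99.2))
    print(session_drop_priority(previous=15402, current=12000))
    print(radius_health_priority({"radius-1": 42.0, "radius-2": 85.5}))
```

In practice these rules would normally live in the monitoring or SIEM platform's own alerting configuration; the sketch only makes the threshold logic explicit so it can be reviewed and tested.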

12.2 Scheduled Maintenance Activities

Scheduled maintenance activities must be planned and documented, and any activity that the table below marks as requiring a maintenance window must be executed during an approved window. Every activity should have a rollback plan in place before it starts. The following table defines the recommended maintenance schedule for all major components of the identity authentication system.

Activity	Frequency	Duration	Maintenance Window Required?	Procedure Reference
RADIUS server OS patching	Monthly	30–60 min/node	Yes (rolling; no downtime if HA)	Runbook: RADIUS-PATCH-001
CA / PKI software updates	Quarterly	2–4 hours	Yes (offline root CA: annual)	Runbook: PKI-UPDATE-001
Certificate revocation list (CRL) publication	Daily (auto)	Automated; < 1 min	No	Automated via CA schedule
RADIUS configuration backup	Daily (auto)	Automated; < 5 min	No	Automated via backup script
HA failover test	Quarterly	30 min	Yes (off-peak hours)	Runbook: HA-TEST-001
Certificate inventory audit	Monthly	1–2 hours	No	Runbook: CERT-AUDIT-001
Security hardening review	Semi-annual	4–8 hours	No (read-only audit)	Runbook: SEC-REVIEW-001
Disaster recovery drill	Annual	4–8 hours	Yes (full DR environment)	Runbook: DR-DRILL-001
Hardware inspection and cleaning	Annual	2–4 hours	Yes	Runbook: HW-INSPECT-001
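
To track the non-automated activities in the schedule above, the next due date can be derived from the date an activity was last completed. The sketch below is one possible approach under stated assumptions: the frequency labels are mapped to calendar months, and the helper names (add_months, next_due, is_overdue) are hypothetical rather than part of any runbook.

```python
import calendar
from datetime import date

# Frequency labels from the maintenance table, expressed in calendar months.
FREQUENCY_MONTHS = {
    "Monthly": 1,
    "Quarterly": 3,
    "Semi-annual": 6,
    "Annual": 12,
}


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_due(last_done: date, frequency: str) -> date:
    """Return the date by which the activity should next be completed."""
    return add_months(last_done, FREQUENCY_MONTHS[frequency])


def is_overdue(last_done: date, frequency: str, today: date) -> bool:
    """True if today is past the next due date."""
    return today > next_due(last_done, frequency)


if __name__ == "__main__":
    # Example: the quarterly HA failover test was last run on 15 Nov 2024.
    print(next_due(date(2024, 11, 15), "Quarterly"))
```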

12.3 Incident Response Playbooks

Incident response playbooks define the step-by-step actions that operations staff must take when specific incident types are detected. Playbooks must be reviewed and updated at least annually, and all operations staff must be trained on the playbooks before being authorized to respond to incidents independently. The following table summarizes the key playbooks for the most critical incident types.

Incident Type	Priority	Initial Response (0–15 min)	Escalation (15–60 min)	Resolution Target
RADIUS server cluster outage (all nodes down)	P1 — Critical	1) Verify outage scope; 2) Activate break-glass accounts; 3) Page on-call engineer; 4) Notify stakeholders	1) Attempt restart; 2) Failover to DR site; 3) Engage vendor support if needed	RTO: 30 min; RPO: 4 hours
Mass certificate expiry (batch expiry event)	P1 — Critical	1) Identify scope of expiry; 2) Trigger emergency renewal via SCEP/EST; 3) Notify affected users	1) Manual renewal for critical certs; 2) Reissue certs with a new validity period from the CA if automated renewal fails (the validity of an already-issued cert cannot be extended); 3) Engage PKI team	All certs renewed within 4 hours
Brute-force / credential stuffing attack	P1 — Security	1) Block source IPs at firewall; 2) Lock targeted accounts; 3) Notify CISO; 4) Preserve logs	1) Engage SOC for full investigation; 2) Check for successful breaches; 3) Reset compromised credentials	Attack contained within 15 min; investigation complete within 24 hours
OCSP responder unavailable	P2 — High	1) Verify OCSP availability from multiple locations; 2) Check if CRL is available as fallback; 3) Monitor auth failure rate	1) Restart OCSP service; 2) Failover to secondary OCSP; 3) Enable CRL fallback if needed	OCSP restored within 1 hour; CRL fallback within 15 min
AD/LDAP connectivity failure	P2 — High	1) Verify AD connectivity from RADIUS; 2) Check if cached credentials allow auth; 3) Notify AD team	1) Failover to secondary DC; 2) Verify LDAPS certificate; 3) Check firewall rules	Connectivity restored within 1 hour; cached auth as interim
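
The first step of the brute-force playbook above, blocking source IPs, presupposes a way to identify offending sources from failed-authentication events. The following Python sketch shows one common approach, a per-source sliding window. The window length (60 seconds) and failure limit (20) are illustrative assumptions, not values defined in this chapter, and the event format is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Set, Tuple


def find_bruteforce_sources(
    events: Iterable[Tuple[float, str]],
    window_s: float = 60.0,   # assumed window length, tune per environment
    max_failures: int = 20,   # assumed limit, tune per environment
) -> List[str]:
    """Return source IPs with more than max_failures failed auths in any window.

    events: (timestamp_in_seconds, source_ip) pairs for FAILED
    authentications, sorted by timestamp in ascending order.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    flagged: Set[str] = set()
    for ts, ip in events:
        window = recent[ip]
        window.append(ts)
        # Drop failures that are older than the window relative to this event.
        while window and ts - window[0] > window_s:
            window.popleft()
        if len(window) > max_failures:
            flagged.add(ip)
    return sorted(flagged)


if __name__ == "__main__":
    # One source failing once per second, another once every 30 seconds.
    noisy = [(float(i), "10.0.0.5") for i in range(25)]
    quiet = [(i * 30.0, "10.0.0.9") for i in range(25)]
    print(find_bruteforce_sources(sorted(noisy + quiet)))
```

A fixed per-IP threshold will miss distributed credential stuffing that spreads attempts across many sources, so it complements rather than replaces the SIEM correlation rules.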

12.4 Certificate Lifecycle Operations

Certificate lifecycle management is one of the most operationally intensive aspects of running an identity authentication system. The following table defines the operational procedures for each stage of the certificate lifecycle, from enrollment through revocation and archival. Automation should be implemented for all stages where possible to reduce operational burden and the risk of human error.

Lifecycle Stage	Trigger	Automated?	Procedure	Verification
Initial Enrollment	New device/user onboarding	Yes (SCEP/EST/NDES)	MDM/SCEP pushes cert request to CA; CA issues cert; cert deployed to endpoint	Verify cert in endpoint cert store; test auth
Renewal (scheduled)	Cert within renewal window (e.g., 30 days before expiry)	Yes (auto-renewal)	SCEP/EST renewal request sent automatically; new cert issued; old cert replaced	Monitor renewal success rate in CA logs; alert on failures
Revocation (user offboarding)	HR offboarding trigger or IT request	Partial (manual approval)	IT submits revocation request; CA revokes cert; CRL/OCSP updated within 5 min	Verify OCSP returns revoked; test auth with revoked cert fails
Revocation (device lost/stolen)	Security incident report	No (manual; immediate)	Security team submits emergency revocation; CA revokes immediately; OCSP updated	OCSP returns revoked within 5 min; verify auth fails
CA Certificate Renewal	Sub-CA cert within 1 year of expiry	No (manual; planned)	Plan renewal 6 months in advance; issue new sub-CA cert; update trust stores on all RADIUS servers and endpoints	All RADIUS servers trust new sub-CA; test auth with new-CA-issued cert
Archival	Cert expired and no longer needed	Yes (automated cleanup)	Expired certs archived to cold storage after 90 days; removed from active database	Verify archived certs accessible for audit; verify active DB clean
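
The renewal window in the table above and the certificate expiry thresholds from Section 12.1 can be combined into a single classification step for a certificate inventory audit. This is a minimal sketch under assumptions: the inventory is represented as a mapping from a certificate identifier to its notAfter timestamp, the 30-day renewal window is the example value from the table, and the status labels are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW_DAYS = 30  # example renewal window from the lifecycle table


def classify_cert(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by the time remaining until it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"          # P1 per the dashboard thresholds
    if remaining < timedelta(days=7):
        return "expiring-soon"    # P2 per the dashboard thresholds
    if remaining <= timedelta(days=RENEWAL_WINDOW_DAYS):
        return "renewal-window"   # auto-renewal should already be in progress
    return "ok"


def summarize(inventory: dict[str, datetime], now: datetime) -> Counter:
    """Count certificates per status for an inventory audit."""
    return Counter(classify_cert(not_after, now) for not_after in inventory.values())


if __name__ == "__main__":
    now = datetime(2025, 1, 1, tzinfo=timezone.utc)
    inventory = {
        "laptop-001": now + timedelta(days=200),
        "laptop-002": now + timedelta(days=20),
        "printer-07": now + timedelta(days=3),
        "kiosk-12": now - timedelta(days=1),
    }
    print(summarize(inventory, now))
```

Timestamps are compared as timezone-aware values; mixing naive and aware datetime objects raises a TypeError in Python, so the inventory source should normalize to UTC.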

12.5 Capacity Management and Growth Planning

Capacity management ensures that the identity authentication system continues to meet performance requirements as the organization grows. Capacity reviews should be conducted quarterly, and capacity upgrades should be planned at least 6 months in advance to allow for procurement and testing lead times. The following table provides the capacity growth triggers and recommended actions.

Resource	Current Capacity Metric	Warning Threshold	Critical Threshold	Recommended Action
RADIUS throughput	Peak auth/s vs. rated capacity	> 60% utilization	> 80% utilization	Add RADIUS node to cluster; re-balance load
CA certificate database	Issued certs vs. CA license/capacity	> 70% of capacity	> 85% of capacity	Archive expired certs; expand CA license or add sub-CA
LDAP/AD query load	LDAP queries/s vs. DC capacity	> 50% DC CPU	> 70% DC CPU	Add read-only DC; implement RADIUS LDAP caching
SIEM log storage	Daily log volume vs. storage capacity	> 70% storage used	> 85% storage used	Expand storage; implement log tiering (hot/warm/cold)
Network bandwidth (RADIUS)	RADIUS traffic vs. link capacity	> 40% link utilization	> 60% link utilization	Upgrade link; implement RADIUS proxy to reduce WAN traffic
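
Because upgrades should be planned at least 6 months ahead, a quarterly review benefits from estimating when each resource will cross its thresholds. The sketch below encodes the warning and critical thresholds from the table and adds a simple linear projection. The resource keys, the function names, and the assumption of constant monthly growth are illustrative; real growth is rarely linear, so the projection is a rough planning aid only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    warning: float   # utilization fraction, e.g. 0.60 for 60%
    critical: float


# Thresholds from the capacity table; the dictionary keys are hypothetical names.
THRESHOLDS = {
    "radius_throughput": Thresholds(0.60, 0.80),
    "ca_database": Thresholds(0.70, 0.85),
    "dc_cpu": Thresholds(0.50, 0.70),
    "siem_storage": Thresholds(0.70, 0.85),
    "radius_bandwidth": Thresholds(0.40, 0.60),
}


def capacity_status(resource: str, utilization: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a utilization fraction."""
    t = THRESHOLDS[resource]
    if utilization > t.critical:
        return "critical"
    if utilization > t.warning:
        return "warning"
    return "ok"


def months_until(utilization: float, threshold: float, monthly_growth: float) -> Optional[float]:
    """Linear estimate of months until utilization reaches a threshold.

    monthly_growth is the assumed increase in utilization fraction per month.
    Returns 0.0 if the threshold is already reached, None if there is no growth.
    """
    if utilization >= threshold:
        return 0.0
    if monthly_growth <= 0:
        return None
    return (threshold - utilization) / monthly_growth


if __name__ == "__main__":
    util = 0.50
    print(capacity_status("radius_throughput", util))
    eta = months_until(util, THRESHOLDS["radius_throughput"].critical, 0.05)
    # With a 6-month planning lead time, an estimate near 6 months means
    # procurement should start now.
    print(eta)
```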