Chapter 12: Operations & Maintenance

Day-to-day operations, monitoring dashboards, certificate lifecycle management, maintenance schedules, and incident response procedures for identity authentication systems

Effective operations and maintenance (O&M) of a network identity authentication system requires a combination of proactive monitoring, scheduled maintenance activities, and well-rehearsed incident response procedures. The system's role as a critical security control means that any degradation in availability or accuracy has immediate impact on both security posture and user productivity. This chapter provides the operational framework, monitoring requirements, maintenance schedules, and incident response playbooks needed to sustain the system at production quality throughout its operational life.

12.1 Operations Center and Monitoring Dashboard

The identity authentication system must be monitored continuously from a centralized operations center. The monitoring dashboard should provide real-time visibility into authentication success rates, active sessions, certificate expiry status, RADIUS server health, and VLAN assignment statistics. Figure 12.1 illustrates an operations center setup with the recommended dashboard layout.

Figure 12.1: Network Identity Authentication Operations Center Dashboard. Enterprise authentication monitoring dashboard showing real-time auth success/failure rates (98.5%), active sessions (15,402), certificate expiry calendar, RADIUS server health metrics for three nodes, VLAN assignment statistics, and SIEM log view. Engineer monitoring all systems from a multi-screen workstation.
| Dashboard Panel | Key Metrics | Alert Threshold | Data Source |
|---|---|---|---|
| Authentication Overview | Auth/s, success rate, failure rate, top failure reasons | Success rate < 99.5% → P2 alert; < 99% → P1 alert | RADIUS accounting logs; SIEM |
| Active Sessions | Total active sessions, by VLAN, by user type, by location | Sudden drop > 20% → P2 alert (possible outage) | RADIUS accounting; switch SNMP |
| Certificate Expiry Calendar | Certs expiring in 30/7/1 days; expired certs | Any cert expiring in < 7 days → P2; expired → P1 | CA database; LDAP cert attributes |
| RADIUS Server Health | CPU%, RAM%, auth/s per node, queue depth, error rate | CPU > 80% → P2; any node down → P1 | SNMP; RADIUS internal metrics |
| OCSP/CRL Status | OCSP responder availability; CRL freshness; response time | OCSP unavailable → P1; CRL stale > 2× validity → P1 | OCSP monitoring probe; CRL download check |
| SIEM Authentication Events | Auth event volume, anomaly score, failed auth by source IP | Brute-force pattern detected → P1 security alert | SIEM correlation engine |
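
The alert thresholds listed for the dashboard panels can be expressed as small, testable rules. The following is a minimal sketch in Python, not a reference implementation: the function names are hypothetical, and it assumes that a "sudden drop" in sessions is measured between two consecutive samples and that CRL age and validity are supplied in hours. Neither assumption is specified in this chapter.

```python
from typing import Optional


def auth_success_alert(success_rate_pct: float) -> Optional[str]:
    """Map an authentication success rate (percent) to an alert priority."""
    if success_rate_pct < 99.0:
        return "P1"
    if success_rate_pct < 99.5:
        return "P2"
    return None


def session_drop_alert(previous: int, current: int) -> Optional[str]:
    """Flag a drop of more than 20% between two consecutive samples (assumed window)."""
    if previous <= 0:
        return None  # no baseline to compare against
    drop = (previous - current) / previous
    return "P2" if drop > 0.20 else None


def cert_expiry_alert(days_until_expiry: float) -> Optional[str]:
    """Expired certificates are P1; certificates expiring in under 7 days are P2."""
    if days_until_expiry <= 0:
        return "P1"
    if days_until_expiry < 7:
        return "P2"
    return None


def crl_staleness_alert(crl_age_hours: float, crl_validity_hours: float) -> Optional[str]:
    """A CRL older than twice its validity period is P1."""
    return "P1" if crl_age_hours > 2 * crl_validity_hours else None
```

Keeping each rule in its own function makes it straightforward to unit-test the thresholds whenever they are retuned, independently of whichever monitoring platform evaluates them in production.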

12.2 Scheduled Maintenance Activities

Scheduled maintenance activities must be planned, documented, and executed during approved maintenance windows. All maintenance activities should be performed with a rollback plan in place. The following table defines the recommended maintenance schedule for all major components of the identity authentication system.

| Activity | Frequency | Duration | Maintenance Window Required? | Procedure Reference |
|---|---|---|---|---|
| RADIUS server OS patching | Monthly | 30–60 min/node | Yes (rolling; no downtime if HA) | Runbook: RADIUS-PATCH-001 |
| CA / PKI software updates | Quarterly | 2–4 hours | Yes (offline root CA: annual) | Runbook: PKI-UPDATE-001 |
| Certificate revocation list (CRL) publication | Daily (auto) | Automated; < 1 min | No | Automated via CA schedule |
| RADIUS configuration backup | Daily (auto) | Automated; < 5 min | No | Automated via backup script |
| HA failover test | Quarterly | 30 min | Yes (off-peak hours) | Runbook: HA-TEST-001 |
| Certificate inventory audit | Monthly | 1–2 hours | No | Runbook: CERT-AUDIT-001 |
| Security hardening review | Semi-annual | 4–8 hours | No (read-only audit) | Runbook: SEC-REVIEW-001 |
| Disaster recovery drill | Annual | 4–8 hours | Yes (full DR environment) | Runbook: DR-DRILL-001 |
| Hardware inspection and cleaning | Annual | 2–4 hours | Yes | Runbook: HW-INSPECT-001 |
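
To keep the manually executed activities from slipping, the due date of each one can be derived from its frequency and the date it was last completed. The sketch below is illustrative only: the helper names are hypothetical, the daily automated tasks are deliberately excluded, and it assumes that "monthly", "quarterly", "semi-annual", and "annual" mean 1, 3, 6, and 12 calendar months respectively.

```python
import calendar
from datetime import date

# Assumed mapping of schedule frequencies to calendar months.
FREQUENCY_MONTHS = {"monthly": 1, "quarterly": 3, "semi-annual": 6, "annual": 12}


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_due(last_done: date, frequency: str) -> date:
    """Return the date by which the activity should next be completed."""
    return add_months(last_done, FREQUENCY_MONTHS[frequency])


def is_overdue(last_done: date, frequency: str, today: date) -> bool:
    """True if the activity has passed its due date."""
    return today > next_due(last_done, frequency)
```

For example, an HA failover test last run on 15 November 2024 with a quarterly frequency would next be due on 15 February 2025 under these assumptions.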

12.3 Incident Response Playbooks

Incident response playbooks define the step-by-step actions that operations staff must take when specific incident types are detected. Playbooks must be reviewed and updated at least annually, and all operations staff must be trained on the playbooks before being authorized to respond to incidents independently. The following table summarizes the key playbooks for the most critical incident types.

| Incident Type | Priority | Initial Response (0–15 min) | Escalation (15–60 min) | Resolution Target |
|---|---|---|---|---|
| RADIUS server cluster outage (all nodes down) | P1 (Critical) | 1) Verify outage scope; 2) Activate break-glass accounts; 3) Page on-call engineer; 4) Notify stakeholders | 1) Attempt restart; 2) Failover to DR site; 3) Engage vendor support if needed | RTO: 30 min; RPO: 4 hours |
| Mass certificate expiry (batch expiry event) | P1 (Critical) | 1) Identify scope of expiry; 2) Trigger emergency renewal via SCEP/EST; 3) Notify affected users | 1) Manual renewal for critical certs; 2) Reissue affected certs from the CA if renewal fails (the validity of an issued cert cannot be extended in place); 3) Engage PKI team | All certs renewed within 4 hours |
| Brute-force / credential stuffing attack | P1 (Security) | 1) Block source IPs at firewall; 2) Lock targeted accounts; 3) Notify CISO; 4) Preserve logs | 1) Engage SOC for full investigation; 2) Check for successful breaches; 3) Reset compromised credentials | Attack contained within 15 min; investigation complete within 24 hours |
| OCSP responder unavailable | P2 (High) | 1) Verify OCSP availability from multiple locations; 2) Check if CRL is available as fallback; 3) Monitor auth failure rate | 1) Restart OCSP service; 2) Failover to secondary OCSP; 3) Enable CRL fallback if needed | OCSP restored within 1 hour; CRL fallback within 15 min |
| AD/LDAP connectivity failure | P2 (High) | 1) Verify AD connectivity from RADIUS; 2) Check if cached credentials allow auth; 3) Notify AD team | 1) Failover to secondary DC; 2) Verify LDAPS certificate; 3) Check firewall rules | Connectivity restored within 1 hour; cached auth as interim |
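
The time boxes in the playbook table (initial response in the first 15 minutes, escalation from 15 to 60 minutes, and a per-incident resolution target) lend themselves to a simple timer that an incident tool could evaluate. The following Python sketch is one possible way to model this. The dictionary keys and function names are hypothetical, and it assumes that the 15-minute boundary belongs to the escalation phase, which the table leaves ambiguous.

```python
from datetime import datetime, timedelta

# Resolution targets from the playbook table; the keys are hypothetical
# identifiers chosen for this sketch.
RESOLUTION_TARGETS = {
    "radius_cluster_outage": timedelta(minutes=30),     # RTO
    "mass_certificate_expiry": timedelta(hours=4),
    "brute_force_containment": timedelta(minutes=15),
    "ocsp_unavailable": timedelta(hours=1),
    "ad_ldap_failure": timedelta(hours=1),
}


def response_phase(detected_at: datetime, now: datetime) -> str:
    """Return which playbook phase applies at time `now`."""
    elapsed = now - detected_at
    if elapsed < timedelta(0):
        raise ValueError("now is earlier than detected_at")
    if elapsed < timedelta(minutes=15):
        return "initial_response"
    if elapsed < timedelta(minutes=60):
        return "escalation"
    return "beyond_escalation"


def target_breached(incident_type: str, detected_at: datetime, now: datetime) -> bool:
    """True once the elapsed time exceeds the incident's resolution target."""
    return now - detected_at > RESOLUTION_TARGETS[incident_type]
```

A check such as `target_breached("radius_cluster_outage", detected_at, now)` could drive an automatic reminder to the incident commander once the 30-minute RTO has passed.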

12.4 Certificate Lifecycle Operations

Certificate lifecycle management is one of the most operationally intensive aspects of running an identity authentication system. The following table defines the operational procedures for each stage of the certificate lifecycle, from enrollment through revocation and archival. Automation should be implemented for all stages where possible to reduce operational burden and the risk of human error.

| Lifecycle Stage | Trigger | Automated? | Procedure | Verification |
|---|---|---|---|---|
| Initial Enrollment | New device/user onboarding | Yes (SCEP/EST/NDES) | MDM/SCEP pushes cert request to CA; CA issues cert; cert deployed to endpoint | Verify cert in endpoint cert store; test auth |
| Renewal (scheduled) | Cert within renewal window (e.g., 30 days before expiry) | Yes (auto-renewal) | SCEP/EST renewal request sent automatically; new cert issued; old cert replaced | Monitor renewal success rate in CA logs; alert on failures |
| Revocation (user offboarding) | HR offboarding trigger or IT request | Partial (manual approval) | IT submits revocation request; CA revokes cert; CRL/OCSP updated within 5 min | Verify OCSP returns revoked; test auth with revoked cert fails |
| Revocation (device lost/stolen) | Security incident report | No (manual; immediate) | Security team submits emergency revocation; CA revokes immediately; OCSP updated | OCSP returns revoked within 5 min; verify auth fails |
| CA Certificate Renewal | Sub-CA cert within 1 year of expiry | No (manual; planned) | Plan renewal 6 months in advance; issue new sub-CA cert; update trust stores on all RADIUS servers and endpoints | All RADIUS servers trust new sub-CA; test auth with new-CA-issued cert |
| Archival | Cert expired and no longer needed | Yes (automated cleanup) | Expired certs archived to cold storage after 90 days; removed from active database | Verify archived certs accessible for audit; verify active DB clean |
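
The date-driven stages of the lifecycle (scheduled renewal and archival) reduce to comparing a certificate's expiry time against the current time. The sketch below shows one way an automation job might classify a certificate. The function name and return labels are hypothetical, the 30-day renewal window is the example value from the table, and timestamps are assumed to be timezone-aware UTC values read from the certificate's notAfter field.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)  # example renewal window from the table
ARCHIVE_AFTER = timedelta(days=90)   # expired certs are archived after 90 days


def lifecycle_action(not_after: datetime, now: datetime) -> str:
    """Classify a certificate by its expiry time.

    Returns "archive", "expired", "renew", or "valid".
    """
    if now >= not_after + ARCHIVE_AFTER:
        return "archive"   # expired for 90 days or more: move to cold storage
    if now >= not_after:
        return "expired"   # expired but still within the 90-day retention period
    if not_after - now <= RENEWAL_WINDOW:
        return "renew"     # inside the renewal window: trigger auto-renewal
    return "valid"


# Example usage with a hypothetical certificate expiring on 30 June 2025 (UTC).
expiry = datetime(2025, 6, 30, tzinfo=timezone.utc)
action = lifecycle_action(expiry, datetime(2025, 6, 10, tzinfo=timezone.utc))
```

Revocation is intentionally left out of this classification because it is triggered by events (offboarding, loss, theft) rather than by dates.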

12.5 Capacity Management and Growth Planning

Capacity management ensures that the identity authentication system continues to meet performance requirements as the organization grows. Capacity reviews should be conducted quarterly, and capacity upgrades should be planned at least 6 months in advance to allow for procurement and testing lead times. The following table provides the capacity growth triggers and recommended actions.

| Resource | Current Capacity Metric | Warning Threshold | Critical Threshold | Recommended Action |
|---|---|---|---|---|
| RADIUS throughput | Peak auth/s vs. rated capacity | > 60% utilization | > 80% utilization | Add RADIUS node to cluster; re-balance load |
| CA certificate database | Issued certs vs. CA license/capacity | > 70% of capacity | > 85% of capacity | Archive expired certs; expand CA license or add sub-CA |
| LDAP/AD query load | LDAP queries/s vs. DC capacity | > 50% DC CPU | > 70% DC CPU | Add read-only DC; implement RADIUS LDAP caching |
| SIEM log storage | Daily log volume vs. storage capacity | > 70% storage used | > 85% storage used | Expand storage; implement log tiering (hot/warm/cold) |
| Network bandwidth (RADIUS) | RADIUS traffic vs. link capacity | > 40% link utilization | > 60% link utilization | Upgrade link; implement RADIUS proxy to reduce WAN traffic |
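
Because upgrades need roughly 6 months of lead time, a quarterly review benefits from knowing both the current status of each resource and how soon it is likely to reach its critical threshold. The following Python sketch encodes the thresholds from the table and adds a simple projection. The resource keys and function names are hypothetical, and the projection assumes linear growth in percentage points per month, which is a simplification and not a method prescribed by this chapter.

```python
import math

# (warning %, critical %) thresholds from the capacity table; the keys are
# hypothetical identifiers chosen for this sketch.
THRESHOLDS = {
    "radius_throughput": (60.0, 80.0),
    "ca_certificate_db": (70.0, 85.0),
    "ldap_dc_cpu": (50.0, 70.0),
    "siem_log_storage": (70.0, 85.0),
    "radius_link_utilization": (40.0, 60.0),
}


def capacity_status(resource: str, utilization_pct: float) -> str:
    """Return "ok", "warning", or "critical" for the given utilization."""
    warning, critical = THRESHOLDS[resource]
    if utilization_pct > critical:
        return "critical"
    if utilization_pct > warning:
        return "warning"
    return "ok"


def months_until_critical(resource: str, utilization_pct: float,
                          growth_points_per_month: float) -> float:
    """Project months until the critical threshold, assuming linear growth."""
    _, critical = THRESHOLDS[resource]
    if utilization_pct > critical:
        return 0.0
    if growth_points_per_month <= 0:
        return math.inf  # flat or shrinking load never reaches the threshold
    return (critical - utilization_pct) / growth_points_per_month
```

A RADIUS cluster at 62% utilization that grows by 3 percentage points per month would reach the 80% critical threshold in about 6 months under this model, which means planning should start immediately to fit the recommended lead time.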