Chapter 12: Operations & Maintenance
Effective operations and maintenance (O&M) of a network identity authentication system requires a combination of proactive monitoring, scheduled maintenance, and well-rehearsed incident response procedures. Because the system is a critical security control, any degradation in availability or accuracy has an immediate impact on both security posture and user productivity. This chapter provides the operational framework, monitoring requirements, maintenance schedules, and incident response playbooks needed to sustain the system at production quality throughout its operational life.
12.1 Operations Center and Monitoring Dashboard
The identity authentication system must be monitored continuously from a centralized operations center. The monitoring dashboard should provide real-time visibility into authentication success rates, active sessions, certificate expiry status, RADIUS server health, and VLAN assignment statistics. The image below illustrates a professional operations center setup with the recommended dashboard layout. The table that follows defines the recommended dashboard panels, with the key metrics, alert thresholds, and data sources for each.
| Dashboard Panel | Key Metrics | Alert Threshold | Data Source |
|---|---|---|---|
| Authentication Overview | Auth/s, success rate, failure rate, top failure reasons | Success rate < 99.5% → P2 alert; < 99% → P1 alert | RADIUS accounting logs; SIEM |
| Active Sessions | Total active sessions, by VLAN, by user type, by location | Sudden drop > 20% → P2 alert (possible outage) | RADIUS accounting; switch SNMP |
| Certificate Expiry Calendar | Certs expiring in 30/7/1 days; expired certs | Any cert expiring in < 7 days → P2; expired → P1 | CA database; LDAP cert attributes |
| RADIUS Server Health | CPU%, RAM%, auth/s per node, queue depth, error rate | CPU > 80% → P2; any node down → P1 | SNMP; RADIUS internal metrics |
| OCSP/CRL Status | OCSP responder availability; CRL freshness; response time | OCSP unavailable → P1; CRL stale > 2× validity → P1 | OCSP monitoring probe; CRL download check |
| SIEM Authentication Events | Auth event volume, anomaly score, failed auth by source IP | Brute-force pattern detected → P1 security alert | SIEM correlation engine |
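
The alert thresholds in the table above can be expressed as simple classification rules. The following is a minimal sketch, not a definitive implementation: the function names are hypothetical, the inputs are assumed to be already-aggregated values from the listed data sources, and a certificate with zero or fewer days remaining is assumed to count as expired.

```python
from typing import Optional


def auth_success_priority(success_rate_pct: float) -> Optional[str]:
    """Map an authentication success rate (percent) to an alert priority."""
    if success_rate_pct < 99.0:
        return "P1"
    if success_rate_pct < 99.5:
        return "P2"
    return None  # within the normal range, no alert


def cert_expiry_priority(days_until_expiry: float) -> Optional[str]:
    """Map the remaining lifetime of a certificate to an alert priority."""
    if days_until_expiry <= 0:  # assumption: zero or negative means expired
        return "P1"
    if days_until_expiry < 7:
        return "P2"
    return None


def session_drop_priority(previous: int, current: int) -> Optional[str]:
    """Raise P2 when active sessions fall by more than 20% between samples."""
    if previous <= 0:
        return None  # no baseline to compare against
    drop = (previous - current) / previous
    return "P2" if drop > 0.20 else None
```

For example, `auth_success_priority(99.2)` returns `"P2"`, and `session_drop_priority(1000, 790)` returns `"P2"` because the drop is 21%.
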
12.2 Scheduled Maintenance Activities
Scheduled maintenance activities must be planned, documented, and executed during approved maintenance windows. All maintenance activities should be performed with a rollback plan in place. The following table defines the recommended maintenance schedule for all major components of the identity authentication system.
| Activity | Frequency | Duration | Maintenance Window Required? | Procedure Reference |
|---|---|---|---|---|
| RADIUS server OS patching | Monthly | 30–60 min/node | Yes (rolling; no downtime if HA) | Runbook: RADIUS-PATCH-001 |
| CA / PKI software updates | Quarterly | 2–4 hours | Yes (offline root CA: annual) | Runbook: PKI-UPDATE-001 |
| Certificate revocation list (CRL) publication | Daily (auto) | Automated; < 1 min | No | Automated via CA schedule |
| RADIUS configuration backup | Daily (auto) | Automated; < 5 min | No | Automated via backup script |
| HA failover test | Quarterly | 30 min | Yes (off-peak hours) | Runbook: HA-TEST-001 |
| Certificate inventory audit | Monthly | 1–2 hours | No | Runbook: CERT-AUDIT-001 |
| Security hardening review | Semi-annual | 4–8 hours | No (read-only audit) | Runbook: SEC-REVIEW-001 |
| Disaster recovery drill | Annual | 4–8 hours | Yes (full DR environment) | Runbook: DR-DRILL-001 |
| Hardware inspection and cleaning | Annual | 2–4 hours | Yes | Runbook: HW-INSPECT-001 |
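
To track the manual activities in the schedule above, the next due date can be derived from the last completion date and the stated frequency. The sketch below assumes calendar-month intervals (Monthly = 1, Quarterly = 3, Semi-annual = 6, Annual = 12) and clamps to the last day of shorter months; the helper names are hypothetical, and the daily automated tasks are left out.

```python
import calendar
from datetime import date

# Assumed mapping from the table's frequency labels to calendar months.
INTERVAL_MONTHS = {"Monthly": 1, "Quarterly": 3, "Semi-annual": 6, "Annual": 12}


def add_months(d: date, months: int) -> date:
    """Add whole months to a date, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_due(last_done: date, frequency: str) -> date:
    """Return the date by which the activity should next be performed."""
    return add_months(last_done, INTERVAL_MONTHS[frequency])


def is_overdue(last_done: date, frequency: str, today: date) -> bool:
    """True when the activity has passed its next due date."""
    return today > next_due(last_done, frequency)
```

A report built on `is_overdue` could feed the monthly certificate inventory audit or the quarterly HA failover test into the same alerting path as the dashboard panels.
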
12.3 Incident Response Playbooks
Incident response playbooks define the step-by-step actions that operations staff must take when specific incident types are detected. Playbooks must be reviewed and updated at least annually, and all operations staff must be trained on the playbooks before being authorized to respond to incidents independently. The following table summarizes the key playbooks for the most critical incident types.
| Incident Type | Priority | Initial Response (0–15 min) | Escalation (15–60 min) | Resolution Target |
|---|---|---|---|---|
| RADIUS server cluster outage (all nodes down) | P1 — Critical | 1) Verify outage scope; 2) Activate break-glass accounts; 3) Page on-call engineer; 4) Notify stakeholders | 1) Attempt restart; 2) Failover to DR site; 3) Engage vendor support if needed | RTO: 30 min; RPO: 4 hours |
| Mass certificate expiry (batch expiry event) | P1 — Critical | 1) Identify scope of expiry; 2) Trigger emergency renewal via SCEP/EST; 3) Notify affected users | 1) Manual renewal for critical certs; 2) Reissue certs with a new validity period via CA if renewal fails (the validity of an issued cert cannot be extended in place); 3) Engage PKI team | All certs renewed within 4 hours |
| Brute-force / credential stuffing attack | P1 — Security | 1) Block source IPs at firewall; 2) Lock targeted accounts; 3) Notify CISO; 4) Preserve logs | 1) Engage SOC for full investigation; 2) Check for successful breaches; 3) Reset compromised credentials | Attack contained within 15 min; investigation complete within 24 hours |
| OCSP responder unavailable | P2 — High | 1) Verify OCSP availability from multiple locations; 2) Check if CRL is available as fallback; 3) Monitor auth failure rate | 1) Restart OCSP service; 2) Failover to secondary OCSP; 3) Enable CRL fallback if needed | OCSP restored within 1 hour; CRL fallback within 15 min |
| AD/LDAP connectivity failure | P2 — High | 1) Verify AD connectivity from RADIUS; 2) Check if cached credentials allow auth; 3) Notify AD team | 1) Failover to secondary DC; 2) Verify LDAPS certificate; 3) Check firewall rules | Connectivity restored within 1 hour; cached auth as interim |
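
The first step of the brute-force playbook depends on knowing which source IPs to block. The following sketch shows one way to flag them with a per-source sliding window over failed authentication events. The window length and failure limit are illustrative assumptions, not values from this chapter, and the event format (timestamp in seconds, source IP) is hypothetical; in practice this logic would normally live in the SIEM correlation engine.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, Set, Tuple


def find_bruteforce_sources(
    failed_events: Iterable[Tuple[float, str]],
    window_s: float = 60.0,   # assumed window length, tune per environment
    max_failures: int = 10,   # assumed limit, tune per environment
) -> List[str]:
    """Return source IPs with more than max_failures failures in any window.

    failed_events must contain only failed authentications, as
    (timestamp_seconds, source_ip) pairs sorted by timestamp.
    """
    recent: Dict[str, Deque[float]] = defaultdict(deque)
    flagged: List[str] = []
    seen: Set[str] = set()
    for ts, ip in failed_events:
        window = recent[ip]
        window.append(ts)
        # Discard failures that have fallen out of the sliding window.
        while window and ts - window[0] > window_s:
            window.popleft()
        if len(window) > max_failures and ip not in seen:
            seen.add(ip)
            flagged.append(ip)
    return flagged
```

The returned list preserves the order in which sources crossed the limit, which can help when preserving logs and reconstructing the attack timeline.
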
12.4 Certificate Lifecycle Operations
Certificate lifecycle management is one of the most operationally intensive aspects of running an identity authentication system. The following table defines the operational procedures for each stage of the certificate lifecycle, from enrollment through revocation and archival. Automation should be implemented for all stages where possible to reduce operational burden and the risk of human error.
| Lifecycle Stage | Trigger | Automated? | Procedure | Verification |
|---|---|---|---|---|
| Initial Enrollment | New device/user onboarding | Yes (SCEP/EST/NDES) | MDM/SCEP pushes cert request to CA; CA issues cert; cert deployed to endpoint | Verify cert in endpoint cert store; test auth |
| Renewal (scheduled) | Cert within renewal window (e.g., 30 days before expiry) | Yes (auto-renewal) | SCEP/EST renewal request sent automatically; new cert issued; old cert replaced | Monitor renewal success rate in CA logs; alert on failures |
| Revocation (user offboarding) | HR offboarding trigger or IT request | Partial (manual approval) | IT submits revocation request; CA revokes cert; CRL/OCSP updated within 5 min | Verify OCSP returns revoked; test auth with revoked cert fails |
| Revocation (device lost/stolen) | Security incident report | No (manual; immediate) | Security team submits emergency revocation; CA revokes immediately; OCSP updated | OCSP returns revoked within 5 min; verify auth fails |
| CA Certificate Renewal | Sub-CA cert within 1 year of expiry | No (manual; planned) | Plan renewal 6 months in advance; issue new sub-CA cert; update trust stores on all RADIUS servers and endpoints | All RADIUS servers trust new sub-CA; test auth with new-CA-issued cert |
| Archival | Cert expired and no longer needed | Yes (automated cleanup) | Expired certs archived to cold storage after 90 days; removed from active database | Verify archived certs accessible for audit; verify active DB clean |
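
The time-driven stages in the table above (scheduled renewal and archival) reduce to a comparison between the current time and a certificate's expiry time. The sketch below uses the 30-day renewal window given as an example in the table and the 90-day archival delay; the function name and return labels are hypothetical, and revocation is deliberately left out because it is event-driven rather than time-driven.

```python
from datetime import datetime, timedelta, timezone

RENEWAL_WINDOW = timedelta(days=30)  # example window from the table
ARCHIVE_DELAY = timedelta(days=90)   # archive this long after expiry


def lifecycle_action(not_after: datetime, now: datetime) -> str:
    """Return the time-driven action for a cert: none, renew, expired, archive."""
    if now >= not_after:
        # Already expired: archive once the retention delay has passed.
        if now - not_after >= ARCHIVE_DELAY:
            return "archive"
        return "expired"
    if not_after - now <= RENEWAL_WINDOW:
        return "renew"
    return "none"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(lifecycle_action(now + timedelta(days=12), now))  # inside the window
```

Both arguments should be timezone-aware (or both naive) so that the subtraction is valid.
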
12.5 Capacity Management and Growth Planning
Capacity management ensures that the identity authentication system continues to meet performance requirements as the organization grows. Capacity reviews should be conducted quarterly, and capacity upgrades should be planned at least 6 months in advance to allow for procurement and testing lead times. The following table provides the capacity growth triggers and recommended actions.
| Resource | Current Capacity Metric | Warning Threshold | Critical Threshold | Recommended Action |
|---|---|---|---|---|
| RADIUS throughput | Peak auth/s vs. rated capacity | > 60% utilization | > 80% utilization | Add RADIUS node to cluster; re-balance load |
| CA certificate database | Issued certs vs. CA license/capacity | > 70% of capacity | > 85% of capacity | Archive expired certs; expand CA license or add sub-CA |
| LDAP/AD query load | LDAP queries/s vs. DC capacity | > 50% DC CPU | > 70% DC CPU | Add read-only DC; implement RADIUS LDAP caching |
| SIEM log storage | Daily log volume vs. storage capacity | > 70% storage used | > 85% storage used | Expand storage; implement log tiering (hot/warm/cold) |
| Network bandwidth (RADIUS) | RADIUS traffic vs. link capacity | > 40% link utilization | > 60% link utilization | Upgrade link; implement RADIUS proxy to reduce WAN traffic |
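
The warning and critical thresholds above can be evaluated uniformly once each resource is reduced to a used/capacity ratio. The following is a minimal sketch under that assumption; the resource keys and function name are hypothetical, and the threshold values are taken from the table.

```python
# (warning, critical) utilization thresholds from the capacity table.
THRESHOLDS = {
    "radius_throughput": (0.60, 0.80),
    "ca_database": (0.70, 0.85),
    "dc_cpu": (0.50, 0.70),
    "siem_storage": (0.70, 0.85),
    "radius_link": (0.40, 0.60),
}


def capacity_status(resource: str, used: float, capacity: float) -> str:
    """Classify a resource as ok, warning, or critical by utilization."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    utilization = used / capacity
    warning, critical = THRESHOLDS[resource]
    if utilization > critical:
        return "critical"
    if utilization > warning:
        return "warning"
    return "ok"
```

For example, a cluster rated for 1,000 auth/s that peaks at 650 auth/s gives `capacity_status("radius_throughput", 650, 1000)`, which returns `"warning"` and would prompt planning for an additional RADIUS node.
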