1. Design Principles
1.1 Guiding Principles
| Principle | Description |
| --- | --- |
| On-Premises First | All workloads run in the company's own data center. No public-cloud hosting for applications or data. |
| VLAN Segmentation | Strict network segmentation by function. Each zone gets its own VLAN with dedicated firewall rules. |
| Defense in Depth | Layered security architecture: perimeter firewall → VLAN firewall → Kubernetes NetworkPolicies → application authentication. |
| Infrastructure as Code | The entire infrastructure is managed declaratively: Terraform for vSphere VMs, Helm/Kustomize for K8s, Ansible for base configuration. |
| Observability | Full observability across all layers: metrics (Prometheus), logs (Loki), traces (Jaeger), dashboards (Grafana). |
1.2 Architecture Overview
HAFS Group Data Center – Overview
+-------------------------------------------------------------------------+
| HAUCK AUFHAEUSER DATA CENTER |
| |
| +--------------+ +--------------+ +--------------+ +------------+ |
| | VLAN 10 | | VLAN 20 | | VLAN 30 | | VLAN 40 | |
| | Management | | Kubernetes | | Database | | Identity | |
| | | | | | & Storage | | | |
| | vCenter | | RKE2 CP x3 | | PostgreSQL | | AD DCs | |
| | Rancher | | Workers | | MongoDB | | Entra ID | |
| | Bastion | | (System/ | | MinIO | | Connect | |
| | | | App/AI/ | | Redis | | | |
| | | | Data) | | Elastic | | | |
| | | | | | Vault | | | |
| +------+-------+ +------+-------+ +------+-------+ +-----+------+ |
| | | | | |
| =======|=================|=================|=================|======== |
| | Core Switches (L3) | | |
| =======|=================|=================|=================|======== |
| | | | | |
| +------+-------+ +------+-------+ |
| | VLAN 50 | | VLAN 60 | |
| | DMZ | | Monitoring | |
| | | | | |
| | NGINX | | Prometheus | |
| | Ingress | | Grafana | |
| | WAF | | Loki | |
| | | | Jaeger | |
| +------+-------+ +--------------+ |
| | |
| +------+-------+ |
| | Perimeter |---- Internet (M365 Graph API + Anthropic API only)
| | Firewall | |
| +--------------+ |
+-------------------------------------------------------------------------+
2. VMware vSphere Infrastructure
2.1 ESXi Host Design
| Parameter | Value |
| --- | --- |
| Hypervisor | VMware ESXi 8.0 U2+ |
| Number of ESXi hosts | Minimum 4 (N+1 redundancy) |
| CPU per host | 2x Intel Xeon Gold 6348 (28 cores) |
| RAM per host | 512 GB DDR4 ECC |
| Local storage | 2x 960 GB NVMe SSD (ESXi boot + cache) |
| Network | 4x 25 GbE (2x Fabric A, 2x Fabric B) |
2.2 vCenter Server
| Parameter | Value |
| --- | --- |
| vCenter version | vCenter Server 8.0 U2+ |
| Deployment | vCenter Server Appliance (VCSA) |
| Placement | VLAN 10 – Management |
| HA | vCenter HA (Active/Passive/Witness) |
| License | vSphere Enterprise Plus |
2.3 Cluster Design
vSphere Cluster: HAFS-PROD
+-------------------------------------------------+
| vSphere Cluster: HAFS-PROD |
| |
| +-----------+ +-----------+ +-----------+ |
| | ESXi-01 | | ESXi-02 | | ESXi-03 | |
| | 56C/512GB | | 56C/512GB | | 56C/512GB | |
| +-----------+ +-----------+ +-----------+ |
| |
| +-----------+ |
| | ESXi-04 | <-- N+1 Reserve |
| | 56C/512GB | |
| +-----------+ |
| |
| Features: DRS, HA, vMotion, FT (selective) |
| Admission Control: 25% Reserve |
+-------------------------------------------------+
2.4 Resource Pools
| Resource Pool | CPU Share | RAM Share | Purpose |
| --- | --- | --- | --- |
| RP-K8s-ControlPlane | 12 vCPU | 48 GB | RKE2 control plane nodes |
| RP-K8s-System | 16 vCPU | 64 GB | System workers (ingress, monitoring) |
| RP-K8s-Application | 32 vCPU | 128 GB | Application workers |
| RP-K8s-AI | 24 vCPU | 96 GB | AI/ML workers |
| RP-K8s-Data | 16 vCPU | 64 GB | Data workers (Elasticsearch, analytics) |
| RP-Database | 24 vCPU | 128 GB | PostgreSQL, MongoDB, Redis |
| RP-Infrastructure | 16 vCPU | 64 GB | vCenter, Rancher, bastion, AD |
2.5 Storage Architecture
| Storage Tier | Technology | Capacity | Purpose |
| --- | --- | --- | --- |
| Tier 1 – Performance | SAN (FC 32 Gbit/s), all-flash | 10 TB | Databases, etcd, critical PVs |
| Tier 2 – Standard | SAN (FC 16 Gbit/s), hybrid | 20 TB | Application data, container images |
| Tier 3 – Capacity | NFS (10 GbE) | 50 TB | Backups, logs, archiving, MinIO |
Storage Topology & Kubernetes StorageClasses
+--------------------------------------------+
| Storage Topology |
| |
| ESXi Hosts --FC 32G--> SAN Array (Tier 1) |
| --FC 16G--> SAN Array (Tier 2) |
| --10GbE---> NFS Filer (Tier 3) |
| |
| K8s StorageClasses: |
| - hafs-fast -> Tier 1 (SAN) |
| - hafs-standard -> Tier 2 (SAN) |
| - hafs-capacity -> Tier 3 (NFS) |
+--------------------------------------------+
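The StorageClass-to-tier mapping above can be sketched as a Kubernetes manifest. This is a minimal sketch assuming the vSphere CSI driver is installed and a vCenter storage policy exists for Tier 1; the policy name `HAFS-Tier1-AllFlash` is an assumption, not taken from this document.

```yaml
# Sketch: "hafs-fast" StorageClass backed by a Tier-1 vSphere storage policy.
# The storagepolicyname value is an assumed example, not a documented name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hafs-fast
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "HAFS-Tier1-AllFlash"   # vCenter SPBM policy for the all-flash SAN
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain                        # keep critical PVs (databases, etcd) on PVC deletion
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

`hafs-standard` and `hafs-capacity` would follow the same pattern with their respective storage policies.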
3. RKE2 Kubernetes Cluster Design
3.1 Cluster Configuration
| Parameter | Value |
| --- | --- |
| Kubernetes distribution | RKE2 (Rancher Kubernetes Engine 2) |
| Kubernetes version | v1.29.x (LTS channel) |
| Management | Rancher v2.8+ (UI + GitOps) |
| CNI plugin | Calico (NetworkPolicy + BGP-capable) |
| Container runtime | containerd (bundled with RKE2) |
| Ingress controller | NGINX Ingress Controller (exposed via DMZ) |
| Service mesh | Optional: Istio (if mTLS is required) |
| Certificate management | cert-manager (internal CA) |
| Secret management | HashiCorp Vault + External Secrets Operator |
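The table above translates into a small RKE2 server configuration. A minimal sketch of `/etc/rancher/rke2/config.yaml` for the first control plane node, assuming the documented API VIP and Calico choice; the token placeholder and the decision to disable the bundled ingress (because NGINX Ingress runs behind the DMZ VIP) are assumptions.

```yaml
# Sketch: /etc/rancher/rke2/config.yaml for the first server node.
# <shared-cluster-secret> is a placeholder; tls-san values follow section 5.3.
token: <shared-cluster-secret>
tls-san:
  - k8s-api.hafs.internal
  - 10.10.20.100          # kube-vip API VIP
cni: calico
disable:
  - rke2-ingress-nginx    # ingress is deployed separately behind the DMZ VIP
```

Additional servers join with `server: https://k8s-api.hafs.internal:9345` plus the same token.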
3.2 Node Topology
RKE2 Cluster: hafs-prod (~12 VMs)
+-----------------------------------------------------------------+
| RKE2 Cluster: hafs-prod |
| |
| +------------------- Control Plane -------------------+ |
| | cp-01 cp-02 cp-03 | |
| | 4 vCPU 4 vCPU 4 vCPU | |
| | 16 GB RAM 16 GB RAM 16 GB RAM | |
| | 100 GB SSD 100 GB SSD 100 GB SSD | |
| | (etcd+API) (etcd+API) (etcd+API) | |
| +----------------------------------------------------| |
| |
| +---- System Workers ----+ +---- App Workers -----------+ |
| | sys-w01 sys-w02 | | app-w01 app-w02 app-w03 | |
| | 4vCPU 4vCPU | | 8vCPU 8vCPU 8vCPU | |
| | 16GB 16GB | | 32GB 32GB 32GB | |
| +------------------------+ +-----------------------------+ |
| |
| +---- AI Workers --------+ +---- Data Workers ----------+ |
| | ai-w01 ai-w02 | | data-w01 data-w02 | |
| | 8vCPU 8vCPU | | 8vCPU 8vCPU | |
| | 32GB 32GB | | 32GB 32GB | |
| +------------------------+ +-----------------------------+ |
+-----------------------------------------------------------------+
3.3 Node Pools, Labels, and Taints
| Node Pool | Count | Label | Taint | Purpose |
| --- | --- | --- | --- | --- |
| control-plane | 3 | node-role.kubernetes.io/control-plane | node-role.kubernetes.io/control-plane:NoSchedule | etcd, API server, controller manager, scheduler |
| system | 2 | hafs.de/pool=system | hafs.de/pool=system:PreferNoSchedule | Ingress controller, cert-manager, monitoring agents, CoreDNS |
| application | 3 | hafs.de/pool=application | – | Portal, tickets, notifications, automation, knowledge, governance |
| ai | 2 | hafs.de/pool=ai | hafs.de/pool=ai:NoSchedule | Claude API gateway, AI orchestrator, NLP, embeddings |
| data | 2 | hafs.de/pool=data | hafs.de/pool=data:NoSchedule | Elasticsearch, analytics, SIEM add-on |
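A workload targeting a tainted pool needs both a nodeSelector (to land on the pool) and a matching toleration (to be admitted past the taint). A minimal sketch for an AI workload; the deployment name and image path are placeholders, not from this document.

```yaml
# Sketch: scheduling a workload onto the tainted "ai" pool.
# Name and image are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-orchestrator
  namespace: hafs-ai
spec:
  replicas: 2
  selector:
    matchLabels: { app: ai-orchestrator }
  template:
    metadata:
      labels: { app: ai-orchestrator }
    spec:
      nodeSelector:
        hafs.de/pool: ai          # pin to the AI node pool
      tolerations:
        - key: hafs.de/pool
          operator: Equal
          value: ai
          effect: NoSchedule      # tolerate the pool's taint
      containers:
        - name: orchestrator
          image: registry.hafs.internal/hafs/ai-orchestrator:latest
```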
3.4 Namespace Structure
| Namespace | Description | Node Pool |
| --- | --- | --- |
| kube-system | Kubernetes core components | system |
| ingress-nginx | NGINX Ingress Controller | system |
| cert-manager | TLS certificate management (internal CA) | system |
| monitoring | Prometheus, Grafana, Loki, Jaeger, Alertmanager | system |
| hafs-portal | Self-help portal frontend + BFF | application |
| hafs-tickets | Ticket system (CRUD, workflows, SLA) | application |
| hafs-ai | AI services (Claude gateway, orchestrator, NLP) | ai |
| hafs-security | Authentication, authorization, RBAC | application |
| hafs-governance | Compliance, audit trail, data governance | application |
| hafs-automation | Workflow engine, runbooks, scheduled tasks | application |
| hafs-knowledge | Knowledge base, document management | application |
| hafs-analytics | Reporting, dashboards, metrics | data |
| hafs-notifications | E-mail, push, in-app notifications | application |
| hafs-shared | Shared libraries, config, common services | application |
| hafs-siem-addon | SIEM integration (log forwarding, correlation) | data |
| hafs-iam-addon | IAM integration (AD sync, provisioning) | application |
| hafs-monitor-addon | Monitoring add-on (custom exporters, alerts) | system |
3.5 Kubernetes NetworkPolicies (Calico)
NetworkPolicy concept: default deny + whitelist
Namespace: hafs-portal (example)
+-----------------------------------------------------------------+
| 1) Default Deny All |
| podSelector: {} |
| policyTypes: [Ingress, Egress] |
| |
| 2) Allow Ingress from NGINX Ingress |
| from: namespaceSelector: ingress-nginx |
| ports: TCP/8080 |
| |
| 3) Allow Egress to hafs-ai |
| to: namespaceSelector: hafs-ai |
| ports: TCP/8080 |
| |
| 4) Allow Egress to kube-system (DNS) |
| to: namespaceSelector: kube-system |
| ports: UDP/53 |
+-----------------------------------------------------------------+
NetworkPolicy Strategy
Each namespace receives a default-deny-all policy. Permitted traffic is defined explicitly via whitelist rules. Calico as the CNI additionally provides GlobalNetworkPolicies for cluster-wide rules.
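Rules 1) and 2) of the hafs-portal example can be sketched as manifests. This assumes the `ingress-nginx` namespace carries the standard `kubernetes.io/metadata.name` label, which the kubelet sets automatically on modern clusters.

```yaml
# Sketch: default deny plus the ingress whitelist for hafs-portal.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: hafs-portal
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]     # no rules listed -> all traffic denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-nginx
  namespace: hafs-portal
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```

The egress rules to hafs-ai and kube-system (DNS) follow the same pattern with `egress` and a `to:` selector.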
4. VLAN Architecture in Detail
4.1 VLAN Overview
| VLAN ID | Name | Subnet | Gateway | Description |
| --- | --- | --- | --- | --- |
| 10 | Management | 10.10.10.0/24 | 10.10.10.1 | vCenter, Rancher, bastion host, IPMI/iDRAC |
| 20 | Kubernetes | 10.10.20.0/23 | 10.10.20.1 | RKE2 control plane + worker nodes (/23 for scaling headroom) |
| 30 | Database | 10.10.30.0/24 | 10.10.30.1 | PostgreSQL, MongoDB, MinIO, Redis, Elasticsearch, Vault |
| 40 | Identity | 10.10.40.0/24 | 10.10.40.1 | AD domain controllers, Entra ID Connect |
| 50 | DMZ | 10.10.50.0/24 | 10.10.50.1 | NGINX Ingress VIP, WAF, reverse proxy |
| 60 | Monitoring | 10.10.60.0/24 | 10.10.60.1 | Prometheus, Grafana, Loki, Jaeger |
4.2 VLAN Diagram
VLAN topology with perimeter firewall
+---------------+
| Internet |
+-------+-------+
|
+-------+-------+
| Perimeter FW |
| (Palo Alto / |
| FortiGate) |
+-------+-------+
|
+--------------+-------------------+
| Core Switch L3 |
| (Cisco Nexus / Arista) |
+--+----+----+----+----+----+------+
| | | | | |
+----------+ | | | | +----------+
| | | | | |
+------+------+ +-----+----++ +--+---+-----+ +-------+-----+
| VLAN 10 | | VLAN 20 | | VLAN 30 | | VLAN 40 |
| Management | | K8s | | DB/Store | | Identity |
| .0/24 | | .0/23 | | .0/24 | | .0/24 |
+-------------+ +-----------+ +------------+ +-------------+
+-------------+ +-----------+
| VLAN 50 | | VLAN 60 |
| DMZ | | Monitor |
| .0/24 | | .0/24 |
+-------------+ +-----------+
4.3 VLAN 10 – Management
| Host | IP Address | Function |
| --- | --- | --- |
| vcenter.hafs.internal | 10.10.10.10 | vCenter Server Appliance |
| rancher.hafs.internal | 10.10.10.20 | Rancher management server |
| bastion.hafs.internal | 10.10.10.30 | SSH bastion / jump host |
| ipmi-esxi01–04.hafs.internal | 10.10.10.41–.44 | ESXi host IPMI |
Access
Only via the bastion host with MFA. No direct access from other VLANs.
4.4 VLAN 20 – Kubernetes
| Host | IP Address | Role |
| --- | --- | --- |
| rke2-cp-01 | 10.10.20.11 | Control Plane Node 1 |
| rke2-cp-02 | 10.10.20.12 | Control Plane Node 2 |
| rke2-cp-03 | 10.10.20.13 | Control Plane Node 3 |
| rke2-sys-w01 | 10.10.20.21 | System Worker 1 |
| rke2-sys-w02 | 10.10.20.22 | System Worker 2 |
| rke2-app-w01 | 10.10.20.31 | Application Worker 1 |
| rke2-app-w02 | 10.10.20.32 | Application Worker 2 |
| rke2-app-w03 | 10.10.20.33 | Application Worker 3 |
| rke2-ai-w01 | 10.10.20.41 | AI Worker 1 |
| rke2-ai-w02 | 10.10.20.42 | AI Worker 2 |
| rke2-data-w01 | 10.10.20.51 | Data Worker 1 |
| rke2-data-w02 | 10.10.20.52 | Data Worker 2 |
| rke2-api-vip | 10.10.20.100 | K8s API Server VIP (kube-vip) |
4.5 VLAN 30 – Database & Storage
| Host / Service | IP Address | Function |
| --- | --- | --- |
| pg-primary | 10.10.30.11 | PostgreSQL Primary |
| pg-replica-01 | 10.10.30.12 | PostgreSQL Streaming Replica |
| pg-replica-02 | 10.10.30.13 | PostgreSQL Streaming Replica |
| mongo-rs-01–03 | 10.10.30.21–.23 | MongoDB Replica Set (3 Members) |
| minio-01–02 | 10.10.30.31–.32 | MinIO Object Storage |
| redis-master | 10.10.30.41 | Redis Primary (Sentinel-Cluster) |
| redis-replica-01 | 10.10.30.42 | Redis Replica |
| redis-sentinel-01 | 10.10.30.43 | Redis Sentinel |
| elastic-01–03 | 10.10.30.51–.53 | Elasticsearch Cluster (3 Nodes) |
| vault.hafs.internal | 10.10.30.61 | HashiCorp Vault (HA) |
4.6 VLAN 40 – Identity
| Host | IP Address | Function |
| --- | --- | --- |
| dc01.hafs.internal | 10.10.40.11 | Active Directory Domain Controller 1 |
| dc02.hafs.internal | 10.10.40.12 | Active Directory Domain Controller 2 |
| aadconnect.hafs.internal | 10.10.40.21 | Microsoft Entra ID Connect Server |
4.7 VLAN 50 – DMZ
| Host / Service | IP Address | Function |
| --- | --- | --- |
| ingress-vip | 10.10.50.10 | NGINX Ingress Controller VIP |
| waf-01 | 10.10.50.20 | Web Application Firewall |
4.8 VLAN 60 – Monitoring
| Host / Service | IP Address | Function |
| --- | --- | --- |
| prometheus | 10.10.60.11 | Prometheus + Alertmanager |
| grafana | 10.10.60.12 | Grafana Dashboards |
| loki | 10.10.60.13 | Loki Log Aggregation |
| jaeger | 10.10.60.14 | Jaeger Tracing Backend |
4.9 Firewall Rules Between VLANs
| Source (VLAN) | Destination (VLAN) | Protocol/Port | Description |
| --- | --- | --- | --- |
| VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/6443 | Rancher → K8s API server |
| VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/22 | Bastion → node SSH |
| VLAN 10 (Mgmt) | VLAN 30 (DB) | TCP/5432,27017,9200,8200 | Management → DB administration |
| VLAN 10 (Mgmt) | VLAN 60 (Mon) | TCP/3000,9090 | Mgmt → Grafana/Prometheus UI |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/5432 | App pods → PostgreSQL |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/27017 | App pods → MongoDB |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/6379 | App pods → Redis |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9200 | Data pods → Elasticsearch |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9000 | App pods → MinIO S3 API |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/8200 | App pods → Vault |
| VLAN 20 (K8s) | VLAN 40 (ID) | TCP/389,636 | K8s → AD LDAP/LDAPS |
| VLAN 20 (K8s) | VLAN 40 (ID) | TCP/88,464 | K8s → Kerberos |
| VLAN 20 (K8s) | VLAN 60 (Mon) | TCP/9090 | Pods → Prometheus query API (reverse direction) |
| VLAN 50 (DMZ) | VLAN 20 (K8s) | TCP/80,443 | Ingress → K8s service endpoints |
| VLAN 60 (Mon) | VLAN 20 (K8s) | TCP/9100,8080 | Prometheus → node exporter, metrics |
| VLAN 60 (Mon) | VLAN 30 (DB) | TCP/9187,9216,9114 | Prometheus → DB exporters |
| VLAN 40 (ID) | Internet | TCP/443 | Entra ID Connect → Microsoft 365 |
| VLAN 20 (K8s) | Internet | TCP/443 | AI pods → api.anthropic.com (via proxy) |
| VLAN 20 (K8s) | Internet | TCP/443 | App pods → graph.microsoft.com (via proxy) |
| * | * | * | Default: DENY ALL |
5. DNS Architecture
5.1 Overview
DNS architecture (AD DNS + CoreDNS)
+-----------------------------------------------------------+
| DNS Architecture |
| |
| +-----------------+ +-----------------+ |
| | AD DNS (DC01) |<---->| AD DNS (DC02) | |
| | 10.10.40.11 | Repl. | 10.10.40.12 | |
| | | | | |
| | Zone: | | Zone: | |
| | hafs.internal | | hafs.internal | |
| +--------+--------+ +--------+--------+ |
| | | |
| +-----------+-------------+ |
| | |
| v |
| +------------------------+ |
| | CoreDNS (in K8s) | |
| | cluster.local Zone | |
| | | |
| | Forward: hafs.internal| |
| | -> 10.10.40.11/12 | |
| | Forward: Internet | |
| | -> not allowed | |
| +------------------------+ |
+-----------------------------------------------------------+
5.2 DNS Zones
| Zone | Type | Server | Description |
| --- | --- | --- | --- |
| hafs.internal | AD-integrated | DC01, DC02 | Primary internal zone |
| 10.10.in-addr.arpa | AD-integrated | DC01, DC02 | Reverse lookup |
| cluster.local | CoreDNS | K8s CoreDNS pods | Kubernetes-internal service resolution |
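The forwarding split shown in the overview diagram corresponds to a Corefile like the following. A sketch only: the ConfigMap name depends on how CoreDNS was deployed (the RKE2 Helm chart name is an assumption), and the fail-closed handling of general internet names is one possible implementation of "not allowed".

```yaml
# Sketch: CoreDNS config forwarding hafs.internal to the AD DNS servers.
# The ConfigMap name varies by deployment method and is assumed here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rke2-coredns-rke2-coredns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        kubernetes cluster.local in-addr.arpa ip6.arpa
        cache 30
    }
    hafs.internal:53 {
        forward . 10.10.40.11 10.10.40.12   # AD DNS on DC01/DC02
    }
    .:53 {
        errors    # no upstream forwarder: general internet names fail closed
    }
```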
5.3 Key DNS Records (hafs.internal)
| FQDN | Type | Value | Description |
| --- | --- | --- | --- |
| portal.hafs.internal | A | 10.10.50.10 | Self-help portal (via ingress) |
| api.hafs.internal | A | 10.10.50.10 | API gateway (via ingress) |
| rancher.hafs.internal | A | 10.10.10.20 | Rancher UI |
| grafana.hafs.internal | A | 10.10.50.10 | Grafana (via ingress) |
| vault.hafs.internal | A | 10.10.30.61 | HashiCorp Vault |
| vcenter.hafs.internal | A | 10.10.10.10 | vCenter Server |
| k8s-api.hafs.internal | A | 10.10.20.100 | Kubernetes API VIP |
6. Internet Connectivity
6.1 Basic Principle
The entire self-help portal infrastructure runs on-premises. External internet connectivity is required for exactly two purposes:
- Microsoft 365 Graph API – ticket creation, e-mail delivery, user synchronization
- Anthropic Claude API – AI-assisted ticket analysis and chatbot functionality
6.2 Proxy Architecture
Outbound proxy via the perimeter firewall
+------------------------------------------------------------+
| K8s Pod (hafs-ai / hafs-tickets) |
| | |
| | HTTP_PROXY=http://proxy.hafs.internal:3128 |
| | HTTPS_PROXY=http://proxy.hafs.internal:3128 |
| | NO_PROXY=.hafs.internal,.cluster.local,10.10.0.0/16 |
| | |
| +----------+----------------------------------------------+
| |
| v
| +------------------+
| | Forward Proxy | Squid / HAProxy
| | VLAN 20 | with TLS inspection (optional)
| | 10.10.20.200 |
| +----------+-------+
| |
| v
| +------------------+
| | Perimeter FW | allowlist-based
| +----------+-------+
| |
| v
| Internet
+------------------------------------------------------------+
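The proxy variables from the diagram are injected into the pod spec as plain environment variables. A minimal sketch; the container name and image path are placeholders, not from this document.

```yaml
# Sketch: pod-spec fragment wiring the documented proxy settings into a
# container (name/image are illustrative placeholders).
spec:
  containers:
    - name: claude-gateway
      image: registry.hafs.internal/hafs/claude-gateway:latest
      env:
        - name: HTTP_PROXY
          value: "http://proxy.hafs.internal:3128"
        - name: HTTPS_PROXY
          value: "http://proxy.hafs.internal:3128"
        - name: NO_PROXY   # keep cluster-internal and on-prem traffic direct
          value: ".hafs.internal,.cluster.local,10.10.0.0/16"
```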
6.3 Firewall Allowlist (Outbound)
| Destination FQDN | Port | Protocol | Source VLAN | Purpose |
| --- | --- | --- | --- | --- |
| api.anthropic.com | 443 | HTTPS | VLAN 20 | Claude API (chatbot, ticket analysis) |
| graph.microsoft.com | 443 | HTTPS | VLAN 20 | M365 Graph API |
| login.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID authentication |
| *.servicebus.windows.net | 443 | HTTPS | VLAN 40 | Entra ID Connect sync |
| adminwebservice.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID Connect |
Default: DENY ALL outbound
Everything else is blocked. The perimeter firewall's default rule is DENY ALL outbound; only explicitly listed FQDNs are permitted.
6.4 Rate Limiting & Budget Controls
| API | Rate Limit | Monthly Token Budget | Alerting |
| --- | --- | --- | --- |
| Anthropic Claude | 100 req/min | Max. 5,000,000 input tokens | Alert at 80% of budget |
| Microsoft Graph | 1,000 req/min | No token limit | Alert at >500 errors/hour |
7. Monitoring & Observability Stack
7.1 Stack Overview
Observability stack (on-premises)
+--------------------------------------------------------------------+
| Observability Stack (On-Premises) |
| |
| +--------------+ +--------------+ +--------------+ |
| | Fluent Bit | | Fluent Bit | | Fluent Bit | DaemonSet |
| | (Node) | | (Node) | | (Node) | on all |
| +------+-------+ +------+-------+ +------+-------+ nodes |
| | | | |
| +--------+--------+-----------------+ |
| | |
| v |
| +---------------+ |
| | Loki | log aggregation |
| | (VLAN 60) | retention: 90 days |
| +---------------+ |
| |
| +--------------------------------------------------------------+ |
| | Prometheus + Alertmanager | |
| | - K8s metrics (kube-state-metrics, node-exporter) | |
| | - App metrics (/metrics endpoints) | |
| | - DB metrics (postgres_exporter, mongodb_exporter) | |
| | - vSphere metrics (vmware_exporter) | |
| | Retention: 180 days (Tier 2 storage) | |
| +--------------------------------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | OpenTelemetry Collector (Sidecar + Gateway) | |
| | --> Jaeger (Distributed Tracing) | |
| | Retention: 30 days | |
| +--------------------------------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | Grafana | |
| | - Datasources: Prometheus, Loki, Jaeger | |
| | - Auth: LDAP (AD-integrated) | |
| | - Dashboards: K8s, Apps, DBs, vSphere, Business Metrics | |
| +--------------------------------------------------------------+ |
+--------------------------------------------------------------------+
7.2 Prometheus & Alertmanager
| Parameter | Value |
| --- | --- |
| Deployment | StatefulSet in the monitoring namespace |
| Storage | 500 GB (hafs-standard StorageClass) |
| Retention | 180 days |
| Scrape interval | 30s (default), 15s (critical services) |
| Alertmanager replicas | 2 (HA) |
| Alert routes | E-mail, MS Teams webhook, PagerDuty (optional) |
7.3 Key Alert Rules
| Alert | Condition | Severity |
| --- | --- | --- |
| KubePodCrashLooping | Pod restarted >3x in 10 min | critical |
| KubeNodeNotReady | Node NotReady >5 min | critical |
| PostgreSQLReplicationLag | Replica >30s behind primary | warning |
| ClaudeAPIErrorRate | >5% error rate over 5 min | critical |
| DiskSpaceLow | <15% free on PV | warning |
| CertificateExpiringSoon | TLS certificate expires in <14 days | warning |
| HighMemoryUsage | Pod memory >90% of limit | warning |
| IngressLatencyHigh | p99 latency >2s | warning |
| VaultSealedStatus | Vault sealed | critical |
| EtcdHighCommitDuration | etcd commit >250ms | warning |
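One of these rules, expressed as a Prometheus Operator resource, might look as follows. A sketch only: the metric name `hafs_claude_requests_total` and its `status` label are assumed, since the document does not specify the gateway's metric names.

```yaml
# Sketch: the ClaudeAPIErrorRate alert as a PrometheusRule.
# hafs_claude_requests_total and its status label are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hafs-ai-alerts
  namespace: monitoring
spec:
  groups:
    - name: hafs-ai
      rules:
        - alert: ClaudeAPIErrorRate
          expr: |
            sum(rate(hafs_claude_requests_total{status=~"5..|429"}[5m]))
              / sum(rate(hafs_claude_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Claude API error rate above 5% for 5 minutes"
```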
7.4 Grafana Dashboards
| Dashboard | Description |
| --- | --- |
| K8s Cluster Overview | Nodes, pods, deployments, resource utilization |
| HAFS Portal Application | Request rate, latency, error rate, active sessions |
| AI Service Monitoring | Claude API calls, token usage, latency, cost |
| Ticket System | Ticket creation, SLA compliance, backlog |
| Database Performance | PostgreSQL connections, query time, replication lag |
| vSphere Infrastructure | ESXi CPU/RAM/storage, VM performance |
| Security & Compliance | Auth failures, RBAC violations, certificate status |
| Business KPIs | Ticket volume, self-service rate, user satisfaction |
7.5 Loki (Log Aggregation)
| Parameter | Value |
| --- | --- |
| Deployment | StatefulSet (3 replicas, microservices mode) |
| Storage backend | MinIO (S3-compatible, VLAN 30) |
| Retention | 90 days |
| Ingestion rate | Max. 10 MB/s |
| Log labels | namespace, pod, container, app, level |
7.6 OpenTelemetry & Jaeger (Distributed Tracing)
| Parameter | Value |
| --- | --- |
| Collector deployment | DaemonSet (agent) + Deployment (gateway) |
| Sampling rate | 10% (default), 100% (on errors) |
| Trace backend | Jaeger (Elasticsearch storage) |
| Retention | 30 days |
| Instrumentation | OpenTelemetry SDK in all HAFS microservices |
7.7 Fluent Bit Pipeline
Fluent Bit log routing
Container logs --> Fluent Bit (DaemonSet)
|
+--> Loki (long-term storage, 90 days)
+--> Elasticsearch (full-text search, SIEM)
+--> stdout (debug, dev only)
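The fan-out above can be sketched in Fluent Bit's YAML configuration format (supported in recent Fluent Bit releases); classic `.conf` syntax works equivalently. The Loki host and Elasticsearch address follow sections 4.8 and 4.5; the tag and label values are assumptions.

```yaml
# Sketch: Fluent Bit pipeline tailing container logs and fanning out
# to Loki and Elasticsearch. Tag/label values are illustrative.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
  filters:
    - name: kubernetes          # enrich records with namespace/pod/container metadata
      match: kube.*
  outputs:
    - name: loki                # long-term storage (90-day retention on the Loki side)
      match: kube.*
      host: loki.hafs.internal
      port: 3100
      labels: job=fluent-bit
    - name: es                  # full-text search / SIEM feed
      match: kube.*
      host: 10.10.30.51
      port: 9200
```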
8. Disaster Recovery & Business Continuity
8.1 DR Strategy Overview
| Component | RPO (Recovery Point) | RTO (Recovery Time) | Method |
| --- | --- | --- | --- |
| Kubernetes etcd | 1 hour | 30 minutes | etcd snapshot (hourly) + S3 upload |
| PostgreSQL | 5 minutes | 15 minutes | Streaming replication + WAL archiving |
| MongoDB | 5 minutes | 15 minutes | Replica set (automatic failover) |
| Redis | 1 hour | 5 minutes | Sentinel failover + RDB snapshots |
| Elasticsearch | 1 hour | 30 minutes | Snapshot to MinIO (hourly) |
| MinIO | 24 hours | 1 hour | Erasure coding + Veeam backup |
| VMs (all) | 24 hours | 2 hours | Veeam Backup & Replication |
| Vault | 1 hour | 30 minutes | Raft snapshots + offline backup |
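The hourly etcd snapshot with S3 upload maps directly onto RKE2's built-in snapshot options in `/etc/rancher/rke2/config.yaml`. A sketch under assumptions: the MinIO endpoint follows the MinIO hosts in VLAN 30, and the bucket name and credential placeholders are illustrative.

```yaml
# Sketch: hourly etcd snapshots uploaded to MinIO (S3) via RKE2 options.
# Endpoint, bucket, and credential placeholders are assumptions.
etcd-snapshot-schedule-cron: "0 * * * *"   # hourly, matching the 1 h RPO
etcd-snapshot-retention: 48                # 48 hours, matching section 8.3
etcd-s3: true
etcd-s3-endpoint: minio.hafs.internal:9000
etcd-s3-bucket: etcd-snapshots
etcd-s3-access-key: <from Vault>
etcd-s3-secret-key: <from Vault>
```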
8.2 VMware HA & vMotion
VMware HA Cluster: HAFS-PROD
+--------------------------------------------------------+
| VMware HA Cluster: HAFS-PROD |
| |
| +---------+ +---------+ +---------+ +---------+ |
| | ESXi-01 | | ESXi-02 | | ESXi-03 | | ESXi-04 | |
| | Active | | Active | | Active | | Reserve | |
| +---------+ +---------+ +---------+ +---------+ |
| |
| Admission Control: 25% CPU/RAM Reserve |
| Host Monitoring: enabled |
| VM Monitoring: enabled (application level) |
| vMotion: live migration during host maintenance |
| DRS: Fully Automated (Threshold: 3) |
| Proactive HA: Hardware-Health Monitoring |
+--------------------------------------------------------+
| Feature | Configuration |
| --- | --- |
| VMware HA | Enabled; host isolation response: Power Off |
| Admission Control | 25% reserve (tolerates one host failure) |
| vMotion | Dedicated VMkernel port on 25 GbE |
| DRS | Fully Automated, migration threshold 3 |
| Proactive HA | Enabled (IPMI/iLO sensor monitoring) |
8.3 Veeam Backup & Replication
| Backup Job | Schedule | Retention | Storage Tier |
| --- | --- | --- | --- |
| VM full backup | Sunday 02:00 | 4 weeks | Tier 3 (NFS) |
| VM incremental backup | Daily 02:00 | 14 days | Tier 3 (NFS) |
| DB dump (PostgreSQL) | Every 6 hours | 7 days | Tier 3 (NFS) |
| etcd snapshot | Hourly | 48 hours | MinIO (VLAN 30) |
| Vault snapshot | Hourly | 48 hours | MinIO (VLAN 30) |
| Config backup (Git) | On every change | Unlimited (Git) | GitLab on-prem |
8.4 PostgreSQL Backup & Replication
PostgreSQL Streaming Replication + WAL Archiving
+--------------+ Streaming +--------------+
| pg-primary |----Replication---->| pg-replica-01|
| 10.10.30.11 | | 10.10.30.12 |
+------+-------+ +--------------+
| Streaming +--------------+
+--------Replication------->| pg-replica-02|
| | 10.10.30.13 |
| +--------------+
|
v
+--------------+
| WAL Archive |--> MinIO (S3)
| pg_basebackup| retention: 7 days
+--------------+
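On the primary, the topology above corresponds to a handful of `postgresql.conf` settings. A sketch under assumptions: the archive command uses the MinIO client `mc` with an assumed alias `hafs-minio` and bucket `pg-wal`; the exact archiving tool is not specified in this document.

```conf
# Sketch: primary-side postgresql.conf settings for streaming replication
# with WAL archiving to MinIO. The mc alias/bucket are assumptions.
wal_level = replica            # required for streaming replicas
max_wal_senders = 5            # two replicas plus headroom for pg_basebackup
wal_keep_size = '1GB'          # retain WAL so replicas can catch up
archive_mode = on
archive_command = 'mc cp %p hafs-minio/pg-wal/%f'
```

The replicas then point at pg-primary via `primary_conninfo` and run in standby mode.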
8.5 DR Test Plan
| Test | Frequency | Description |
| --- | --- | --- |
| etcd restore | Quarterly | Restore an etcd snapshot onto the staging cluster |
| PostgreSQL failover | Quarterly | Manual failover to a replica, application recovery |
| VM recovery (Veeam) | Semi-annually | Full VM restore of a worker node |
| Full stack recovery | Annually | Complete rebuild from backups + IaC |
| Node failure simulation | Quarterly | Shut down a worker node, verify K8s rescheduling |
9. Cost Estimate
9.1 Monthly Operating Costs (On-Premises)
| Cost Item | Monthly (EUR) | Notes |
| --- | --- | --- |
| Hardware depreciation | | |
| 4x ESXi servers (5-yr depreciation) | 3,500–4,500 | ~210,000 EUR purchase / 60 months |
| SAN storage (5-yr depreciation) | 1,500–2,500 | ~90,000–150,000 EUR / 60 months |
| NFS filer (5-yr depreciation) | 500–800 | ~30,000–48,000 EUR / 60 months |
| Network (switches, firewall, 5 yr) | 800–1,200 | ~48,000–72,000 EUR / 60 months |
| Licenses | | |
| VMware vSphere Enterprise Plus (4 hosts) | 1,500–2,500 | Subscription model (Broadcom) |
| Rancher Prime (optional) | 500–1,000 | Alternative: open source (free) |
| Veeam Backup & Replication | 300–500 | Per-socket license |
| HashiCorp Vault Enterprise (optional) | 0–1,000 | Alternative: Community Edition (free) |
| External APIs | | |
| Anthropic Claude API | 2,000–5,000 | Depends on token volume and model |
| Microsoft 365 E3/E5 licenses | 1,500–3,000 | For Entra ID, Graph API, Exchange Online |
| Data center operations | | |
| Power & cooling | 800–1,500 | 4 servers + storage + network (~8–12 kW) |
| Rack space / colocation | 500–1,000 | If an external data center is used |
| Personnel | | |
| Infrastructure administration (pro rata) | 4,000–6,000 | ~0.5 FTE system administrator |
| Kubernetes operations (pro rata) | 3,000–5,000 | ~0.3–0.5 FTE DevOps engineer |
| Security & compliance (pro rata) | 2,000–3,000 | ~0.2 FTE security engineer |
| Support & maintenance | | |
| Hardware maintenance contracts | 500–1,000 | Next-business-day replacement |
| Software support (VMware, Veeam) | Included in licenses | – |
9.2 Total Monthly Costs
| Category | Min. (EUR) | Max. (EUR) |
| --- | --- | --- |
| Hardware depreciation | 6,300 | 9,000 |
| Licenses | 2,300 | 5,000 |
| External APIs | 3,500 | 8,000 |
| Data center operations | 1,300 | 2,500 |
| Personnel (pro rata) | 9,000 | 14,000 |
| Support & maintenance | 500 | 1,000 |
| Total | 22,900 | 39,500 |
9.3 One-Time Capital Expenditure (CAPEX)
| Item | Cost (EUR) |
| --- | --- |
| 4x ESXi servers | 180,000–240,000 |
| SAN storage (all-flash + hybrid) | 90,000–150,000 |
| NFS filer | 30,000–48,000 |
| Network (switches, firewall, cabling) | 48,000–72,000 |
| Rack, UPS, PDU | 15,000–25,000 |
| Initial setup & professional services | 30,000–50,000 |
| Total CAPEX | 393,000–585,000 |
9.4 Cost Comparison: On-Premises vs. Cloud Hybrid
| Aspect | On-Premises | Cloud Hybrid (Azure) |
| --- | --- | --- |
| Monthly operations | 22,900–39,500 EUR | 15,000–30,000 EUR (variable) |
| Initial investment | 393,000–585,000 EUR | Minimal (pay-as-you-go) |
| Break-even | ~24–36 months | – |
| Data sovereignty | Fully within the company's own data center | Depends on Azure region |
| Scalability | Requires hardware procurement | On demand |
| Compliance (BaFin) | Full control | Shared responsibility |
| Vendor lock-in | Low (standard K8s) | Medium (Azure-specific services) |
Note
The on-premises option was chosen for reasons of data sovereignty, BaFin compliance, and full control over the infrastructure. The higher initial investment amortizes over an operating period of 3–5 years.
Appendix
A. IP Address Plan Summary
| VLAN | Network | Usable IPs | Assigned Hosts |
| --- | --- | --- | --- |
| 10 | 10.10.10.0/24 | 10.10.10.2–.254 | ~10 hosts |
| 20 | 10.10.20.0/23 | 10.10.20.2–10.10.21.254 | ~15 hosts + VIPs |
| 30 | 10.10.30.0/24 | 10.10.30.2–.254 | ~20 hosts |
| 40 | 10.10.40.0/24 | 10.10.40.2–.254 | ~5 hosts |
| 50 | 10.10.50.0/24 | 10.10.50.2–.254 | ~5 hosts |
| 60 | 10.10.60.0/24 | 10.10.60.2–.254 | ~5 hosts |
B. Port Matrix (Summary)
| Service | Port(s) | Protocol | Description |
| --- | --- | --- | --- |
| K8s API | 6443 | TCP/TLS | Kubernetes API server |
| etcd | 2379, 2380 | TCP/TLS | etcd client + peer |
| Kubelet | 10250 | TCP/TLS | Kubelet API |
| Calico BGP | 179 | TCP | BGP peering |
| Calico VXLAN | 4789 | UDP | VXLAN overlay |
| PostgreSQL | 5432 | TCP/TLS | Database |
| MongoDB | 27017 | TCP/TLS | Database |
| Redis | 6379 | TCP/TLS | Cache / message broker |
| Elasticsearch | 9200, 9300 | TCP/TLS | REST API + cluster communication |
| MinIO | 9000, 9001 | TCP/TLS | S3 API + console |
| Vault | 8200 | TCP/TLS | Vault API |
| Prometheus | 9090 | TCP | Metrics query |
| Grafana | 3000 | TCP | Dashboard UI |
| Loki | 3100 | TCP | Log push/query |
| Jaeger | 16686, 14268 | TCP | UI + collector |
| NGINX Ingress | 80, 443 | TCP | HTTP/HTTPS |
| LDAP/LDAPS | 389, 636 | TCP | Active Directory |
| Kerberos | 88, 464 | TCP/UDP | AD authentication |
| DNS | 53 | TCP/UDP | Name resolution |
| NFS | 2049 | TCP | NFS filer |