1. Design Principles

1.1 Guiding Principles

Principle | Description
On-Premises First | All workloads run in the company's own data center. No public-cloud hosting for applications or data.
VLAN Segmentation | Strict network segmentation by function. Each zone gets its own VLAN with dedicated firewall rules.
Defense in Depth | Layered security architecture: perimeter firewall → VLAN firewall → Kubernetes NetworkPolicies → application authentication.
Infrastructure as Code | The entire infrastructure is managed declaratively: Terraform for vSphere VMs, Helm/Kustomize for K8s, Ansible for base configuration.
Observability | Full observability across all layers: metrics (Prometheus), logs (Loki), traces (Jaeger), dashboards (Grafana).

1.2 Architecture Overview

HAFS-Group Data Center – Overview
  +-------------------------------------------------------------------------+
  |                      HAUCK AUFHAEUSER DATA CENTER                       |
  |                                                                         |
  |  +--------------+  +--------------+  +--------------+  +------------+  |
  |  |  VLAN 10     |  |  VLAN 20     |  |  VLAN 30     |  |  VLAN 40   |  |
  |  |  Management  |  |  Kubernetes  |  |  Database    |  |  Identity  |  |
  |  |              |  |              |  |  & Storage   |  |            |  |
  |  |  vCenter     |  |  RKE2 CP x3  |  |  PostgreSQL  |  |  AD DCs    |  |
  |  |  Rancher     |  |  Workers     |  |  MongoDB     |  |  Entra ID  |  |
  |  |  Bastion     |  |  (System/    |  |  MinIO       |  |  Connect   |  |
  |  |              |  |   App/AI/    |  |  Redis       |  |            |  |
  |  |              |  |   Data)      |  |  Elastic     |  |            |  |
  |  |              |  |              |  |  Vault       |  |            |  |
  |  +------+-------+  +------+-------+  +------+-------+  +-----+------+  |
  |         |                 |                 |                 |          |
  |  =======|=================|=================|=================|========  |
  |         |          Core Switches (L3)       |                 |          |
  |  =======|=================|=================|=================|========  |
  |         |                 |                 |                 |          |
  |  +------+-------+  +------+-------+                                    |
  |  |  VLAN 50     |  |  VLAN 60     |                                    |
  |  |  DMZ         |  |  Monitoring  |                                    |
  |  |              |  |              |                                    |
  |  |  NGINX       |  |  Prometheus  |                                    |
  |  |  Ingress     |  |  Grafana     |                                    |
  |  |  WAF         |  |  Loki        |                                    |
  |  |              |  |  Jaeger      |                                    |
  |  +------+-------+  +--------------+                                    |
  |         |                                                               |
  |  +------+-------+                                                      |
  |  |  Perimeter   |---- Internet (only M365 Graph API + Anthropic API)   |
  |  |  Firewall    |                                                      |
  |  +--------------+                                                      |
  +-------------------------------------------------------------------------+

2. VMware vSphere Infrastructure

2.1 ESXi Host Design

Parameter | Value
Hypervisor | VMware ESXi 8.0 U2+
Number of ESXi hosts | Minimum 4 (N+1 redundancy)
CPU per host | 2x Intel Xeon Gold 6348 (28 cores each)
RAM per host | 512 GB DDR4 ECC
Local storage | 2x 960 GB NVMe SSD (ESXi boot + cache)
Network | 4x 25 GbE (2x Fabric A, 2x Fabric B)

2.2 vCenter Server

Parameter | Value
vCenter version | vCenter Server 8.0 U2+
Deployment | vCenter Server Appliance (VCSA)
Placement | VLAN 10 – Management
HA | vCenter HA (Active/Passive/Witness)
License | vSphere Enterprise Plus

2.3 Cluster Design

vSphere Cluster: HAFS-PROD
  +-------------------------------------------------+
  |              vSphere Cluster: HAFS-PROD          |
  |                                                   |
  |  +-----------+ +-----------+ +-----------+       |
  |  | ESXi-01   | | ESXi-02   | | ESXi-03   |       |
  |  | 56C/512GB | | 56C/512GB | | 56C/512GB |       |
  |  +-----------+ +-----------+ +-----------+       |
  |                                                   |
  |  +-----------+                                   |
  |  | ESXi-04   |  <-- N+1 Reserve                  |
  |  | 56C/512GB |                                   |
  |  +-----------+                                   |
  |                                                   |
  |  Features: DRS, HA, vMotion, FT (selektiv)       |
  |  Admission Control: 25% Reserve                  |
  +-------------------------------------------------+

2.4 Resource Pools

Resource Pool | CPU Share | RAM Share | Usage
RP-K8s-ControlPlane | 12 vCPU | 48 GB | RKE2 control plane nodes
RP-K8s-System | 16 vCPU | 64 GB | System workers (Ingress, monitoring)
RP-K8s-Application | 32 vCPU | 128 GB | Application workers
RP-K8s-AI | 24 vCPU | 96 GB | AI/ML workers
RP-K8s-Data | 16 vCPU | 64 GB | Data workers (Elasticsearch, analytics)
RP-Database | 24 vCPU | 128 GB | PostgreSQL, MongoDB, Redis
RP-Infrastructure | 16 vCPU | 64 GB | vCenter, Rancher, Bastion, AD
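
The pool sizes can be sanity-checked against the cluster capacity from sections 2.1 and 2.3. The following is a minimal sketch, not part of the original design: it assumes one vCPU is counted against one physical core (no overcommit, no hyper-threading) and that the 25% admission control reserve applies equally to CPU and RAM. The function name capacity_summary is illustrative.

```python
# Capacity figures taken from sections 2.1, 2.3 and 2.4 of this document.
HOSTS = 4
CORES_PER_HOST = 56          # 2 sockets x 28 cores
RAM_PER_HOST_GB = 512
ADMISSION_RESERVE = 0.25     # HA admission control reserve

# Resource pool -> (vCPU, RAM in GB)
RESOURCE_POOLS = {
    "RP-K8s-ControlPlane": (12, 48),
    "RP-K8s-System": (16, 64),
    "RP-K8s-Application": (32, 128),
    "RP-K8s-AI": (24, 96),
    "RP-K8s-Data": (16, 64),
    "RP-Database": (24, 128),
    "RP-Infrastructure": (16, 64),
}


def capacity_summary() -> dict:
    """Compare the summed pool sizes with the capacity left after the reserve."""
    usable_cores = HOSTS * CORES_PER_HOST * (1 - ADMISSION_RESERVE)
    usable_ram_gb = HOSTS * RAM_PER_HOST_GB * (1 - ADMISSION_RESERVE)
    pool_vcpu = sum(vcpu for vcpu, _ in RESOURCE_POOLS.values())
    pool_ram_gb = sum(ram for _, ram in RESOURCE_POOLS.values())
    return {
        "usable_cores": usable_cores,
        "usable_ram_gb": usable_ram_gb,
        "pool_vcpu": pool_vcpu,
        "pool_ram_gb": pool_ram_gb,
        # Assumption: one vCPU is counted against one physical core.
        "fits": pool_vcpu <= usable_cores and pool_ram_gb <= usable_ram_gb,
    }


if __name__ == "__main__":
    for key, value in capacity_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the pools add up to 140 vCPU and 592 GB, against 168 cores and 1,536 GB that remain after the reserve.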

2.5 Storage Architecture

Storage Tier | Technology | Capacity | Usage
Tier 1 – Performance | SAN (FC 32 Gbit/s), all-flash | 10 TB | Databases, etcd, critical PVs
Tier 2 – Standard | SAN (FC 16 Gbit/s), hybrid | 20 TB | Application data, container images
Tier 3 – Capacity | NFS (10 GbE) | 50 TB | Backups, logs, archiving, MinIO

Storage Topology & Kubernetes StorageClasses
  +--------------------------------------------+
  |            Storage Topology                 |
  |                                            |
  |  ESXi Hosts --FC 32G--> SAN Array (Tier 1) |
  |             --FC 16G--> SAN Array (Tier 2) |
  |             --10GbE---> NFS Filer  (Tier 3) |
  |                                            |
  |  K8s StorageClasses:                       |
  |   - hafs-fast     -> Tier 1 (SAN)          |
  |   - hafs-standard -> Tier 2 (SAN)          |
  |   - hafs-capacity -> Tier 3 (NFS)          |
  +--------------------------------------------+

3. RKE2 Kubernetes Cluster Design

3.1 Cluster Configuration

Parameter | Value
Kubernetes distribution | RKE2 (Rancher Kubernetes Engine 2)
Kubernetes version | v1.29.x (LTS channel)
Management | Rancher v2.8+ (UI + GitOps)
CNI plugin | Calico (NetworkPolicy + BGP-capable)
Container runtime | containerd (bundled with RKE2)
Ingress controller | NGINX Ingress Controller (external via DMZ)
Service mesh | Optional: Istio (if mTLS is required)
Certificate management | cert-manager (internal CA)
Secret management | HashiCorp Vault + External Secrets Operator

3.2 Node Topology

RKE2 Cluster: hafs-prod (~12 VMs)
  +-----------------------------------------------------------------+
  |                    RKE2 Cluster: hafs-prod                       |
  |                                                                   |
  |  +------------------- Control Plane -------------------+         |
  |  |  cp-01          cp-02          cp-03                |         |
  |  |  4 vCPU         4 vCPU         4 vCPU              |         |
  |  |  16 GB RAM      16 GB RAM      16 GB RAM           |         |
  |  |  100 GB SSD     100 GB SSD     100 GB SSD          |         |
  |  |  (etcd+API)     (etcd+API)     (etcd+API)          |         |
  |  +-----------------------------------------------------+         |
  |                                                                   |
  |  +---- System Workers ----+  +---- App Workers -----------+     |
  |  |  sys-w01    sys-w02    |  |  app-w01  app-w02  app-w03 |     |
  |  |  4vCPU      4vCPU      |  |  8vCPU    8vCPU    8vCPU   |     |
  |  |  16GB       16GB       |  |  32GB     32GB     32GB    |     |
  |  +------------------------+  +-----------------------------+     |
  |                                                                   |
  |  +---- AI Workers --------+  +---- Data Workers ----------+     |
  |  |  ai-w01     ai-w02     |  |  data-w01   data-w02       |     |
  |  |  8vCPU      8vCPU      |  |  8vCPU      8vCPU          |     |
  |  |  32GB       32GB       |  |  32GB       32GB           |     |
  |  +------------------------+  +-----------------------------+     |
  +-----------------------------------------------------------------+

3.3 Node Pools, Labels, and Taints

Node Pool | Count | Label | Taint | Usage
control-plane | 3 | node-role.kubernetes.io/master | node-role…/master:NoSchedule | etcd, API server, controller manager, scheduler
system | 2 | hafs.de/pool=system | hafs.de/pool=system:PreferNoSchedule | Ingress controller, cert-manager, monitoring agents, CoreDNS
application | 3 | hafs.de/pool=application | (none) | Portal, Tickets, Notifications, Automation, Knowledge, Governance
ai | 2 | hafs.de/pool=ai | hafs.de/pool=ai:NoSchedule | Claude API gateway, AI orchestrator, NLP, embeddings
data | 2 | hafs.de/pool=data | hafs.de/pool=data:NoSchedule | Elasticsearch, Analytics, SIEM addon
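
To land on a tainted pool, a workload needs both a node selector for the pool label and a matching toleration. The sketch below shows how the scheduling fragment of a pod spec could be derived from the worker pools in the table above. The helper name scheduling_for_pool is hypothetical; the field names (nodeSelector, tolerations, key, operator, value, effect) are the standard Kubernetes pod spec fields.

```python
from typing import Optional

# Worker pool -> taint effect, taken from the node pool table (None = no taint).
POOL_TAINT_EFFECTS: dict[str, Optional[str]] = {
    "system": "PreferNoSchedule",
    "application": None,
    "ai": "NoSchedule",
    "data": "NoSchedule",
}

POOL_LABEL_KEY = "hafs.de/pool"


def scheduling_for_pool(pool: str) -> dict:
    """Return the nodeSelector/tolerations fragment for a worker pool."""
    if pool not in POOL_TAINT_EFFECTS:
        raise ValueError(f"unknown node pool: {pool}")
    fragment: dict = {"nodeSelector": {POOL_LABEL_KEY: pool}}
    effect = POOL_TAINT_EFFECTS[pool]
    if effect is not None:
        # The toleration must match the taint's key, value and effect.
        fragment["tolerations"] = [{
            "key": POOL_LABEL_KEY,
            "operator": "Equal",
            "value": pool,
            "effect": effect,
        }]
    return fragment


if __name__ == "__main__":
    print(scheduling_for_pool("ai"))
```

The fragment would be merged into the pod template of a Deployment or StatefulSet. The application pool needs only the node selector because it carries no taint.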

3.4 Namespace Structure

Namespace | Description | Node Pool
kube-system | Kubernetes core components | system
ingress-nginx | NGINX Ingress Controller | system
cert-manager | TLS certificate management (internal CA) | system
monitoring | Prometheus, Grafana, Loki, Jaeger, Alertmanager | system
hafs-portal | Self-Help Portal frontend + BFF | application
hafs-tickets | Ticket system (CRUD, workflows, SLA) | application
hafs-ai | AI services (Claude gateway, orchestrator, NLP) | ai
hafs-security | Authentication, authorization, RBAC | application
hafs-governance | Compliance, audit trail, data governance | application
hafs-automation | Workflow engine, runbooks, scheduled tasks | application
hafs-knowledge | Knowledge base, document management | application
hafs-analytics | Reporting, dashboards, metrics | data
hafs-notifications | E-mail, push, in-app notifications | application
hafs-shared | Shared libraries, config, common services | application
hafs-siem-addon | SIEM integration (log forwarding, correlation) | data
hafs-iam-addon | IAM integration (AD sync, provisioning) | application
hafs-monitor-addon | Monitoring addon (custom exporters, alerts) | system

3.5 Kubernetes NetworkPolicies (Calico)

NetworkPolicy Concept: Default Deny + Allowlist
  Namespace: hafs-portal (example)
  +-----------------------------------------------------------------+
  | 1) Default Deny All                                              |
  |    podSelector: {}                                               |
  |    policyTypes: [Ingress, Egress]                                |
  |                                                                   |
  | 2) Allow Ingress from NGINX Ingress                              |
  |    from: namespaceSelector: ingress-nginx                        |
  |    ports: TCP/8080                                               |
  |                                                                   |
  | 3) Allow Egress to hafs-ai                                      |
  |    to: namespaceSelector: hafs-ai                                |
  |    ports: TCP/8080                                               |
  |                                                                   |
  | 4) Allow Egress to kube-system (DNS)                             |
  |    to: namespaceSelector: kube-system                            |
  |    ports: UDP/53, TCP/53                                         |
  +-----------------------------------------------------------------+
NetworkPolicy Strategy

Every namespace gets a default-deny-all policy. Permitted traffic is defined explicitly through allowlist rules. As the CNI, Calico additionally supports GlobalNetworkPolicies for cluster-wide rules.
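
The four rules from the hafs-portal example can be written out as NetworkPolicy objects. The sketch below builds them as plain Python dicts and prints them as JSON. The policy names are hypothetical. The namespace selectors use the kubernetes.io/metadata.name label that Kubernetes sets on every namespace. TCP/53 is included next to UDP/53 as an assumption, since DNS falls back to TCP for large responses.

```python
import json

NS_LABEL = "kubernetes.io/metadata.name"  # set automatically on every namespace


def _policy(name, namespace, spec):
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def _namespace_peer(namespace):
    return {"namespaceSelector": {"matchLabels": {NS_LABEL: namespace}}}


def _ports(*pairs):
    return [{"protocol": proto, "port": port} for proto, port in pairs]


def portal_policies(namespace="hafs-portal"):
    """Return the four example policies as plain dicts (names are hypothetical)."""
    return [
        # 1) Default deny for all pods, both directions.
        _policy("default-deny-all", namespace,
                {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}),
        # 2) Allow ingress from the NGINX Ingress namespace.
        _policy("allow-from-ingress-nginx", namespace, {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [_namespace_peer("ingress-nginx")],
                         "ports": _ports(("TCP", 8080))}],
        }),
        # 3) Allow egress to the AI services namespace.
        _policy("allow-to-hafs-ai", namespace, {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{"to": [_namespace_peer("hafs-ai")],
                        "ports": _ports(("TCP", 8080))}],
        }),
        # 4) Allow DNS lookups against kube-system (TCP/53 is an assumption).
        _policy("allow-dns", namespace, {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{"to": [_namespace_peer("kube-system")],
                        "ports": _ports(("UDP", 53), ("TCP", 53))}],
        }),
    ]


if __name__ == "__main__":
    print(json.dumps(portal_policies(), indent=2))
```

Generating the policies from one function keeps the default-deny rule and its exceptions together, so the same pattern could be repeated per namespace with different peers.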

4. VLAN Architecture in Detail

4.1 VLAN Overview

VLAN ID | Name | Subnet | Gateway | Description
10 | Management | 10.10.10.0/24 | 10.10.10.1 | vCenter, Rancher, bastion host, IPMI/iDRAC
20 | Kubernetes | 10.10.20.0/23 | 10.10.20.1 | RKE2 control plane + worker nodes (/23 for scaling)
30 | Database | 10.10.30.0/24 | 10.10.30.1 | PostgreSQL, MongoDB, MinIO, Redis, Elasticsearch, Vault
40 | Identity | 10.10.40.0/24 | 10.10.40.1 | AD domain controllers, Entra ID Connect
50 | DMZ | 10.10.50.0/24 | 10.10.50.1 | NGINX Ingress VIP, WAF, reverse proxy
60 | Monitoring | 10.10.60.0/24 | 10.10.60.1 | Prometheus, Grafana, Loki, Jaeger
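
An address plan like this can be checked mechanically before it goes into Terraform or Ansible variables. The following sketch uses Python's standard ipaddress module to confirm that no two subnets overlap and that every gateway lies inside its own subnet. The function names validate and vlan_for are illustrative.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (subnet, gateway), as listed in the VLAN overview table.
VLANS = {
    10: ("10.10.10.0/24", "10.10.10.1"),
    20: ("10.10.20.0/23", "10.10.20.1"),
    30: ("10.10.30.0/24", "10.10.30.1"),
    40: ("10.10.40.0/24", "10.10.40.1"),
    50: ("10.10.50.0/24", "10.10.50.1"),
    60: ("10.10.60.0/24", "10.10.60.1"),
}


def validate(vlans: dict) -> list:
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    problems = []
    networks = {vid: ipaddress.ip_network(cidr) for vid, (cidr, _) in vlans.items()}
    for vid, (_, gateway) in vlans.items():
        if ipaddress.ip_address(gateway) not in networks[vid]:
            problems.append(f"VLAN {vid}: gateway {gateway} is outside {networks[vid]}")
    for (vid_a, net_a), (vid_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"VLAN {vid_a} ({net_a}) overlaps VLAN {vid_b} ({net_b})")
    return problems


def vlan_for(address: str, vlans: dict = VLANS):
    """Return the VLAN ID that contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for vid, (cidr, _) in vlans.items():
        if ip in ipaddress.ip_network(cidr):
            return vid
    return None


if __name__ == "__main__":
    print(validate(VLANS) or "address plan is consistent")
    print(vlan_for("10.10.21.7"))
```

The /23 for VLAN 20 spans 10.10.20.0 through 10.10.21.255, which is why an address such as 10.10.21.7 still resolves to the Kubernetes VLAN.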

4.2 VLAN Diagram

VLAN Topology with Perimeter Firewall
                             +---------------+
                             |   Internet     |
                             +-------+-------+
                                     |
                             +-------+-------+
                             |  Perimeter FW  |
                             |  (Palo Alto /  |
                             |   FortiGate)   |
                             +-------+-------+
                                     |
                      +--------------+-------------------+
                      |        Core Switch L3             |
                      |     (Cisco Nexus / Arista)        |
                      +--+----+----+----+----+----+------+
                         |    |    |    |    |    |
              +----------+    |    |    |    |    +----------+
              |               |    |    |    |               |
       +------+------+ +-----+----++ +--+---+-----+ +-------+-----+
       |  VLAN 10    | |  VLAN 20  | |  VLAN 30   | |  VLAN 40    |
       |  Management | |  K8s      | |  DB/Store  | |  Identity   |
       |  .0/24      | |  .0/23    | |  .0/24     | |  .0/24      |
       +-------------+ +-----------+ +------------+ +-------------+

       +-------------+ +-----------+
       |  VLAN 50    | |  VLAN 60  |
       |  DMZ        | |  Monitor  |
       |  .0/24      | |  .0/24    |
       +-------------+ +-----------+

4.3 VLAN 10 – Management

Host | IP Address | Function
vcenter.hafs.internal | 10.10.10.10 | vCenter Server Appliance
rancher.hafs.internal | 10.10.10.20 | Rancher management server
bastion.hafs.internal | 10.10.10.30 | SSH bastion / jump host
ipmi-esxi01–04.hafs.internal | 10.10.10.41–.44 | ESXi host IPMI

Access

Only via the bastion host with MFA. No direct access from other VLANs.

4.4 VLAN 20 – Kubernetes

Host | IP Address | Role
rke2-cp-01 | 10.10.20.11 | Control Plane Node 1
rke2-cp-02 | 10.10.20.12 | Control Plane Node 2
rke2-cp-03 | 10.10.20.13 | Control Plane Node 3
rke2-sys-w01 | 10.10.20.21 | System Worker 1
rke2-sys-w02 | 10.10.20.22 | System Worker 2
rke2-app-w01 | 10.10.20.31 | Application Worker 1
rke2-app-w02 | 10.10.20.32 | Application Worker 2
rke2-app-w03 | 10.10.20.33 | Application Worker 3
rke2-ai-w01 | 10.10.20.41 | AI Worker 1
rke2-ai-w02 | 10.10.20.42 | AI Worker 2
rke2-data-w01 | 10.10.20.51 | Data Worker 1
rke2-data-w02 | 10.10.20.52 | Data Worker 2
rke2-api-vip | 10.10.20.100 | K8s API server VIP (kube-vip)

4.5 VLAN 30 – Database & Storage

Host / Service | IP Address | Function
pg-primary | 10.10.30.11 | PostgreSQL primary
pg-replica-01 | 10.10.30.12 | PostgreSQL streaming replica
pg-replica-02 | 10.10.30.13 | PostgreSQL streaming replica
mongo-rs-01–03 | 10.10.30.21–.23 | MongoDB replica set (3 members)
minio-01–02 | 10.10.30.31–.32 | MinIO object storage
redis-master | 10.10.30.41 | Redis primary (Sentinel cluster)
redis-replica-01 | 10.10.30.42 | Redis replica
redis-sentinel-01 | 10.10.30.43 | Redis Sentinel
elastic-01–03 | 10.10.30.51–.53 | Elasticsearch cluster (3 nodes)
vault.hafs.internal | 10.10.30.61 | HashiCorp Vault (HA)

4.6 VLAN 40 – Identity

Host | IP Address | Function
dc01.hafs.internal | 10.10.40.11 | Active Directory domain controller 1
dc02.hafs.internal | 10.10.40.12 | Active Directory domain controller 2
aadconnect.hafs.internal | 10.10.40.21 | Microsoft Entra ID Connect server

4.7 VLAN 50 – DMZ

Host / Service | IP Address | Function
ingress-vip | 10.10.50.10 | NGINX Ingress Controller VIP
waf-01 | 10.10.50.20 | Web application firewall

4.8 VLAN 60 – Monitoring

Host / Service | IP Address | Function
prometheus | 10.10.60.11 | Prometheus + Alertmanager
grafana | 10.10.60.12 | Grafana dashboards
loki | 10.10.60.13 | Loki log aggregation
jaeger | 10.10.60.14 | Jaeger tracing backend

4.9 Firewall Rules Between VLANs

Source (VLAN) | Destination (VLAN) | Protocol/Port | Description
VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/6443 | Rancher → K8s API server
VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/22 | Bastion → node SSH
VLAN 10 (Mgmt) | VLAN 30 (DB) | TCP/5432,27017,9200,8200 | Management → DB administration
VLAN 10 (Mgmt) | VLAN 60 (Mon) | TCP/3000,9090 | Mgmt → Grafana/Prometheus UI
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/5432 | App pods → PostgreSQL
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/27017 | App pods → MongoDB
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/6379 | App pods → Redis
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9200 | Data pods → Elasticsearch
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9000 | App pods → MinIO S3 API
VLAN 20 (K8s) | VLAN 30 (DB) | TCP/8200 | App pods → Vault
VLAN 20 (K8s) | VLAN 40 (ID) | TCP/389,636 | K8s → AD LDAP/LDAPS
VLAN 20 (K8s) | VLAN 40 (ID) | TCP/88,464 | K8s → Kerberos
VLAN 20 (K8s) | VLAN 60 (Mon) | TCP/9090 | Prometheus scraping (reverse)
VLAN 50 (DMZ) | VLAN 20 (K8s) | TCP/80,443 | Ingress → K8s service endpoints
VLAN 60 (Mon) | VLAN 20 (K8s) | TCP/9100,8080 | Prometheus → node exporter, metrics
VLAN 60 (Mon) | VLAN 30 (DB) | TCP/9187,9216,9114 | Prometheus → DB exporters
VLAN 40 (ID) | Internet | TCP/443 | Entra ID Connect → Microsoft 365
VLAN 20 (K8s) | Internet | TCP/443 | AI pods → api.anthropic.com (via proxy)
VLAN 20 (K8s) | Internet | TCP/443 | App pods → graph.microsoft.com (via proxy)
* | * | * | Default: DENY ALL
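
The rule table follows an allowlist model: a flow is permitted only if a rule matches, and everything else hits the final deny rule. The sketch below encodes the table as data and evaluates a flow against it. It illustrates the matching logic only and is not a firewall configuration: zone names are shortened, rules for the same zone pair are merged, and the FQDN restrictions on the Internet rules are not modeled.

```python
# (source zone, destination zone, protocol, allowed ports), merged per zone pair
# from the inter-VLAN rule table. Zone names are shortened for illustration.
RULES = [
    ("VLAN10", "VLAN20", "tcp", {6443, 22}),
    ("VLAN10", "VLAN30", "tcp", {5432, 27017, 9200, 8200}),
    ("VLAN10", "VLAN60", "tcp", {3000, 9090}),
    ("VLAN20", "VLAN30", "tcp", {5432, 27017, 6379, 9200, 9000, 8200}),
    ("VLAN20", "VLAN40", "tcp", {389, 636, 88, 464}),
    ("VLAN20", "VLAN60", "tcp", {9090}),
    ("VLAN50", "VLAN20", "tcp", {80, 443}),
    ("VLAN60", "VLAN20", "tcp", {9100, 8080}),
    ("VLAN60", "VLAN30", "tcp", {9187, 9216, 9114}),
    ("VLAN40", "Internet", "tcp", {443}),
    ("VLAN20", "Internet", "tcp", {443}),
]


def is_allowed(src: str, dst: str, proto: str, port: int) -> bool:
    """Return True if an explicit rule matches; otherwise the default deny applies."""
    return any(
        src == rule_src and dst == rule_dst and proto == rule_proto and port in ports
        for rule_src, rule_dst, rule_proto, ports in RULES
    )


if __name__ == "__main__":
    print(is_allowed("VLAN20", "VLAN30", "tcp", 5432))   # app pods -> PostgreSQL
    print(is_allowed("VLAN50", "VLAN30", "tcp", 5432))   # DMZ -> database, no rule
```

Rules are directional: the database VLAN cannot open connections back into the Kubernetes VLAN, because no rule lists VLAN 30 as a source.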

5. DNS Architecture

5.1 Overview

DNS Architecture (AD DNS + CoreDNS)
  +-----------------------------------------------------------+
  |                    DNS Architecture                         |
  |                                                            |
  |  +-----------------+       +-----------------+            |
  |  |  AD DNS (DC01)  |<---->|  AD DNS (DC02)  |            |
  |  |  10.10.40.11    | Repl. |  10.10.40.12    |            |
  |  |                 |       |                 |            |
  |  |  Zone:          |       |  Zone:          |            |
  |  |  hafs.internal  |       |  hafs.internal  |            |
  |  +--------+--------+       +--------+--------+            |
  |           |                         |                     |
  |           +-----------+-------------+                     |
  |                       |                                   |
  |                       v                                   |
  |           +------------------------+                     |
  |           |  CoreDNS (in K8s)     |                     |
  |           |  cluster.local Zone   |                     |
  |           |                        |                     |
  |           |  Forward: hafs.internal|                     |
  |           |    -> 10.10.40.11/12  |                     |
  |           |  Forward: Internet     |                     |
  |           |    -> Not allowed      |                     |
  |           +------------------------+                     |
  +-----------------------------------------------------------+

5.2 DNS Zones

Zone | Type | Server | Description
hafs.internal | AD-integrated | DC01, DC02 | Primary internal zone
10.10.in-addr.arpa | AD-integrated | DC01, DC02 | Reverse lookup
cluster.local | CoreDNS | K8s CoreDNS pods | Kubernetes-internal service resolution

5.3 Key DNS Records (hafs.internal)

FQDN | Type | Value | Description
portal.hafs.internal | A | 10.10.50.10 | Self-Help Portal (via Ingress)
api.hafs.internal | A | 10.10.50.10 | API gateway (via Ingress)
rancher.hafs.internal | A | 10.10.10.20 | Rancher UI
grafana.hafs.internal | A | 10.10.50.10 | Grafana (via Ingress)
vault.hafs.internal | A | 10.10.30.61 | HashiCorp Vault
vcenter.hafs.internal | A | 10.10.10.10 | vCenter Server
k8s-api.hafs.internal | A | 10.10.20.100 | Kubernetes API VIP

6. Internet Connectivity

6.1 Basic Principle

The entire Self-Help Portal infrastructure runs on-premises. Apart from the Entra ID Connect sync from VLAN 40 (see 6.3), external internet connections are needed for only two purposes:

  1. Microsoft 365 Graph API – ticket creation, e-mail delivery, user synchronization
  2. Anthropic Claude API – AI-assisted ticket analysis and chatbot functionality

6.2 Proxy Architecture

Outbound Proxy via Perimeter Firewall
  +------------------------------------------------------------+
  |  K8s Pod (hafs-ai / hafs-tickets)                          |
  |  |                                                          |
  |  |  HTTP_PROXY=http://proxy.hafs.internal:3128             |
  |  |  HTTPS_PROXY=http://proxy.hafs.internal:3128            |
  |  |  NO_PROXY=.hafs.internal,.cluster.local,10.10.0.0/16   |
  |  |                                                          |
  |  +----------+----------------------------------------------+
  |             |
  |             v
  |  +------------------+
  |  |  Forward Proxy   |     Squid / HAProxy
  |  |  VLAN 20         |     with TLS inspection (optional)
  |  |  10.10.20.200    |
  |  +----------+-------+
  |             |
  |             v
  |  +------------------+
  |  |  Perimeter FW    |     Allowlist-based
  |  +----------+-------+
  |             |
  |             v
  |         Internet
  +------------------------------------------------------------+

6.3 Firewall Allowlist (Outbound)

Destination FQDN | Port | Protocol | Source VLAN | Purpose
api.anthropic.com | 443 | HTTPS | VLAN 20 | Claude API (chatbot, ticket analysis)
graph.microsoft.com | 443 | HTTPS | VLAN 20 | M365 Graph API
login.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID authentication
*.servicebus.windows.net | 443 | HTTPS | VLAN 40 | Entra ID Connect sync
adminwebservice.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID Connect

Default: DENY ALL Outbound

Everything else is blocked. The perimeter firewall's default rule is DENY ALL outbound; only explicitly listed FQDNs are allowed.
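
The outbound allowlist can be expressed as a lookup on destination FQDN, port, and source VLAN. The sketch below is an illustration of that decision, not a proxy or firewall configuration. The function name outbound_allowed is hypothetical, and wildcard entries are matched with Python's fnmatch, where * also matches across dots; a real firewall may treat wildcards differently.

```python
from fnmatch import fnmatchcase

# (FQDN pattern, port, source VLAN) from the outbound allowlist table.
ALLOWLIST = [
    ("api.anthropic.com", 443, "VLAN20"),
    ("graph.microsoft.com", 443, "VLAN20"),
    ("login.microsoftonline.com", 443, "VLAN40"),
    ("*.servicebus.windows.net", 443, "VLAN40"),
    ("adminwebservice.microsoftonline.com", 443, "VLAN40"),
]


def outbound_allowed(fqdn: str, port: int, source_vlan: str) -> bool:
    """Return True only if an allowlist entry matches; the default is deny."""
    host = fqdn.lower().rstrip(".")  # normalize case and a trailing root dot
    return any(
        fnmatchcase(host, pattern) and port == allowed_port and source_vlan == vlan
        for pattern, allowed_port, vlan in ALLOWLIST
    )


if __name__ == "__main__":
    print(outbound_allowed("api.anthropic.com", 443, "VLAN20"))
    print(outbound_allowed("example.com", 443, "VLAN20"))
```

Because the source VLAN is part of each entry, a host in VLAN 40 cannot reach api.anthropic.com even though the FQDN itself is listed.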

6.4 Rate Limiting & Budget Controls

API | Rate Limit | Monthly Token Budget | Alerting
Anthropic Claude | 100 req/min | Max. 5,000,000 input tokens | Alert at 80% of budget
Microsoft Graph | 1,000 req/min | No token limit | Alert at >500 errors/hour
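
The Claude limits combine two independent controls: a request rate cap and a monthly token budget with an alert threshold. A minimal sketch of both, assuming a sliding 60-second window for the rate limit and a simple three-state budget check; the class and function names, and the "exhausted" state, are illustrative and not taken from the design.

```python
from collections import deque

MONTHLY_INPUT_TOKEN_BUDGET = 5_000_000
ALERT_THRESHOLD_PERCENT = 80


def budget_status(tokens_used: int) -> str:
    """Classify the monthly token usage as 'ok', 'alert' or 'exhausted'."""
    if tokens_used < 0:
        raise ValueError("tokens_used must not be negative")
    if tokens_used >= MONTHLY_INPUT_TOKEN_BUDGET:
        return "exhausted"
    # Integer arithmetic avoids floating-point edge cases at exactly 80%.
    if tokens_used * 100 >= MONTHLY_INPUT_TOKEN_BUDGET * ALERT_THRESHOLD_PERCENT:
        return "alert"
    return "ok"


class SlidingWindowLimiter:
    """Allow at most max_requests within any window_seconds interval."""

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have left the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_requests:
            self._timestamps.append(now)
            return True
        return False


if __name__ == "__main__":
    claude_limiter = SlidingWindowLimiter(max_requests=100)  # 100 req/min
    print(claude_limiter.allow(now=0.0))
    print(budget_status(4_200_000))
```

Passing the current time into allow keeps the limiter deterministic and easy to test; a gateway would typically pass time.monotonic().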

7. Monitoring & Observability Stack

7.1 Stack Overview

Observability Stack (On-Premises)
  +--------------------------------------------------------------------+
  |                    Observability Stack (On-Premises)                |
  |                                                                    |
  |  +--------------+  +--------------+  +--------------+              |
  |  |  Fluent Bit  |  |  Fluent Bit  |  |  Fluent Bit  |  DaemonSet  |
  |  |  (Node)      |  |  (Node)      |  |  (Node)      |  on all     |
  |  +------+-------+  +------+-------+  +------+-------+  nodes      |
  |         |                 |                 |                      |
  |         +--------+--------+-----------------+                     |
  |                  |                                                 |
  |                  v                                                 |
  |          +---------------+                                        |
  |          |     Loki      |  Log-Aggregation                       |
  |          |  (VLAN 60)    |  Retention: 90 days                    |
  |          +---------------+                                        |
  |                                                                    |
  |  +--------------------------------------------------------------+ |
  |  |  Prometheus + Alertmanager                                    | |
  |  |  - K8s metrics (kube-state-metrics, node-exporter)           | |
  |  |  - App metrics (/metrics endpoints)                           | |
  |  |  - DB metrics (postgres_exporter, mongodb_exporter)           | |
  |  |  - vSphere metrics (vmware_exporter)                          | |
  |  |  Retention: 180 days (Tier 2 storage)                         | |
  |  +--------------------------------------------------------------+ |
  |                                                                    |
  |  +--------------------------------------------------------------+ |
  |  |  OpenTelemetry Collector (Agent + Gateway)                    | |
  |  |  --> Jaeger (Distributed Tracing)                             | |
  |  |  Retention: 30 days                                           | |
  |  +--------------------------------------------------------------+ |
  |                                                                    |
  |  +--------------------------------------------------------------+ |
  |  |  Grafana                                                      | |
  |  |  - Datasources: Prometheus, Loki, Jaeger                     | |
  |  |  - Auth: LDAP (AD-integrated)                                 | |
  |  |  - Dashboards: K8s, Apps, DBs, vSphere, Business Metrics     | |
  |  +--------------------------------------------------------------+ |
  +--------------------------------------------------------------------+

7.2 Prometheus & Alertmanager

Parameter | Value
Deployment | StatefulSet in the monitoring namespace
Storage | 500 GB (hafs-standard StorageClass)
Retention | 180 days
Scrape interval | 30s (default), 15s (critical services)
Alertmanager replicas | 2 (HA)
Alert routes | E-mail, MS Teams webhook, PagerDuty (optional)

7.3 Key Alert Rules

Alert | Condition | Severity
KubePodCrashLooping | Pod restarted >3 times in 10 min | critical
KubeNodeNotReady | Node NotReady for >5 min | critical
PostgreSQLReplicationLag | Replication >30s behind primary | warning
ClaudeAPIErrorRate | >5% error rate over 5 min | critical
DiskSpaceLow | <15% free on PV | warning
CertificateExpiringSoon | TLS certificate expires in <14 days | warning
HighMemoryUsage | Pod memory >90% of limit | warning
IngressLatencyHigh | p99 latency >2s | warning
VaultSealedStatus | Vault sealed | critical
EtcdHighCommitDuration | etcd commit >250ms | warning

7.4 Grafana Dashboards

Dashboard | Description
K8s Cluster Overview | Nodes, pods, deployments, resource utilization
HAFS Portal Application | Request rate, latency, error rate, active sessions
AI Service Monitoring | Claude API calls, token usage, latency, cost
Ticket System | Ticket creation, SLA compliance, backlog
Database Performance | PostgreSQL connections, query time, replication lag
vSphere Infrastructure | ESXi CPU/RAM/storage, VM performance
Security & Compliance | Auth failures, RBAC violations, certificate status
Business KPIs | Ticket volume, self-service rate, user satisfaction

7.5 Loki (Log Aggregation)

Parameter | Value
Deployment | StatefulSet (3 replicas, microservice mode)
Storage backend | MinIO (S3-compatible, VLAN 30)
Retention | 90 days
Ingestion rate | Max. 10 MB/s
Log labels | namespace, pod, container, app, level

7.6 OpenTelemetry & Jaeger (Distributed Tracing)

Parameter | Value
Collector deployment | DaemonSet (agent) + Deployment (gateway)
Sampling rate | 10% (default), 100% (on error)
Trace backend | Jaeger (Elasticsearch storage)
Retention | 30 days
Instrumentation | OpenTelemetry SDK in all HAFS microservices
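
The sampling rule (10% by default, 100% on error) amounts to a per-trace keep/drop decision. The sketch below shows only that decision logic and is not OpenTelemetry Collector configuration. It assumes the decision is made once the error status of a trace is known, which in practice points to a tail-based setup at the gateway, and it derives the 10% bucket from the hexadecimal trace ID so that every component reaches the same decision. The function name keep_trace is illustrative.

```python
def keep_trace(trace_id: str, has_error: bool, base_percent: int = 10) -> bool:
    """Decide whether a finished trace is kept.

    trace_id is expected as a hexadecimal string (for example the 32-character
    W3C trace ID). Traces with an error are always kept; all others are kept
    for roughly base_percent percent of trace IDs.
    """
    if has_error:
        return True
    # Map the trace ID to a bucket 0..99; the same ID always lands in the same bucket.
    bucket = int(trace_id, 16) % 100
    return bucket < base_percent


if __name__ == "__main__":
    print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", has_error=True))
    print(keep_trace("0" * 30 + "63", has_error=False))
```

Deriving the decision from the trace ID rather than from a random draw keeps all spans of one trace together, whichever collector instance sees them.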

7.7 Fluent Bit Pipeline

Fluent Bit Log Routing
  Container Logs --> Fluent Bit (DaemonSet)
      |
      +--> Loki            (long-term storage, 90 days)
      +--> Elasticsearch   (full-text search, SIEM)
      +--> stdout          (debug, dev only)

8. Disaster Recovery & Business Continuity

8.1 DR Strategy Overview

Component | RPO (Recovery Point) | RTO (Recovery Time) | Method
Kubernetes etcd | 1 hour | 30 minutes | etcd snapshot (hourly) + S3 upload
PostgreSQL | 5 minutes | 15 minutes | Streaming replication + WAL archiving
MongoDB | 5 minutes | 15 minutes | Replica set (automatic failover)
Redis | 1 hour | 5 minutes | Sentinel failover + RDB snapshots
Elasticsearch | 1 hour | 30 minutes | Snapshot to MinIO (hourly)
MinIO | 24 hours | 1 hour | Erasure coding + Veeam backup
VMs (overall) | 24 hours | 2 hours | Veeam Backup & Replication
Vault | 1 hour | 30 minutes | Raft snapshots + offline backup
8.2 VMware HA & vMotion

VMware HA Cluster: HAFS-PROD
  +--------------------------------------------------------+
  |             VMware HA Cluster: HAFS-PROD                |
  |                                                         |
  |  +---------+  +---------+  +---------+  +---------+   |
  |  | ESXi-01 |  | ESXi-02 |  | ESXi-03 |  | ESXi-04 |   |
  |  |  Active  |  |  Active  |  |  Active  |  | Reserve |   |
  |  +---------+  +---------+  +---------+  +---------+   |
  |                                                         |
  |  Admission Control: 25% CPU/RAM Reserve                |
  |  Host Monitoring: Enabled                               |
  |  VM Monitoring: Enabled (application-level)            |
  |  vMotion: live migration during host maintenance       |
  |  DRS: Fully Automated (Threshold: 3)                   |
  |  Proactive HA: Hardware-Health Monitoring              |
  +--------------------------------------------------------+

Feature | Configuration
VMware HA | Enabled, host isolation response: Power Off
Admission Control | 25% reserve (equivalent to the failure of 1 host)
vMotion | Dedicated VMkernel on 25 GbE
DRS | Fully Automated, migration threshold 3
Proactive HA | Enabled (IPMI/iLO sensor monitoring)

8.3 Veeam Backup & Replication

Backup Job | Schedule | Retention | Storage Tier
VM Full Backup | Sunday 02:00 | 4 weeks | Tier 3 (NFS)
VM Incremental Backup | Daily 02:00 | 14 days | Tier 3 (NFS)
DB Dump (PostgreSQL) | Every 6 hours | 7 days | Tier 3 (NFS)
etcd Snapshot | Hourly | 48 hours | MinIO (VLAN 30)
Vault Snapshot | Hourly | 48 hours | MinIO (VLAN 30)
Config Backup (Git) | On every change | Unlimited (Git) | GitLab On-Prem

8.4 PostgreSQL Backup & Replication

PostgreSQL Streaming Replication + WAL Archiving
  +--------------+     Streaming      +--------------+
  |  pg-primary  |----Replication---->|  pg-replica-01|
  |  10.10.30.11 |                    |  10.10.30.12 |
  +------+-------+                    +--------------+
         |            Streaming      +--------------+
         +--------Replication------->|  pg-replica-02|
         |                           |  10.10.30.13 |
         |                           +--------------+
         |
         v
  +--------------+
  |  WAL Archive |--> MinIO (S3)
  |  pg_basebackup|   Retention: 7 days
  +--------------+

8.5 DR Test Plan

Test | Frequency | Description
etcd Restore | Quarterly | Restore of an etcd snapshot onto the staging cluster
PostgreSQL Failover | Quarterly | Manual failover to a replica, application recovery
VM Recovery (Veeam) | Semi-annually | Complete VM restore of a worker node
Full Stack Recovery | Annually | Complete rebuild from backups + IaC
Node Failure Simulation | Quarterly | Shutdown of a worker node, K8s rescheduling

9. Cost Estimate

9.1 Monthly Operating Costs (On-Premises)

Cost Item | Monthly (EUR) | Note
Hardware amortization
4x ESXi servers (5-year depreciation) | 3,500–4,500 | ~210,000 EUR purchase / 60 months
SAN storage (5-year depreciation) | 1,500–2,500 | ~90,000–150,000 EUR / 60 months
NFS filer (5-year depreciation) | 500–800 | ~30,000–48,000 EUR / 60 months
Network (switches, firewall, 5-year depreciation) | 800–1,200 | ~48,000–72,000 EUR / 60 months
Licenses
VMware vSphere Enterprise Plus (4 hosts) | 1,500–2,500 | Subscription model (Broadcom)
Rancher Prime (optional) | 500–1,000 | Alternative: open source (free of charge)
Veeam Backup & Replication | 300–500 | Per-socket license
HashiCorp Vault Enterprise (optional) | 0–1,000 | Alternative: Community Edition (free of charge)
External APIs
Anthropic Claude API | 2,000–5,000 | Depends on token volume and model
Microsoft 365 E3/E5 licenses | 1,500–3,000 | For Entra ID, Graph API, Exchange Online
Data center operations
Power & cooling | 800–1,500 | 4 servers + storage + network (~8–12 kW)
Rack space / colocation | 500–1,000 | If an external data center is used
Personnel
Infrastructure administration (pro rata) | 4,000–6,000 | ~0.5 FTE system administrator
Kubernetes operations (pro rata) | 3,000–5,000 | ~0.3–0.5 FTE DevOps engineer
Security & compliance (pro rata) | 2,000–3,000 | ~0.2 FTE security engineer
Support & maintenance
Hardware maintenance contracts | 500–1,000 | Next-business-day replacement
Software support (VMware, Veeam) | Included in licenses

9.2 Total Monthly Costs

Category | Min. (EUR) | Max. (EUR)
Hardware amortization | 6,300 | 9,000
Licenses | 2,300 | 5,000
External APIs | 3,500 | 8,000
Data center operations | 1,300 | 2,500
Personnel (pro rata) | 9,000 | 14,000
Support & maintenance | 500 | 1,000
Total | 22,900 | 39,500
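
The category totals follow directly from the line items in 9.1. As a cross-check, the sketch below sums the monthly minimum and maximum per category and overall. The data structure and function names are illustrative; the figures are the ones listed in the operating cost table.

```python
# Category -> list of (min EUR, max EUR) per month, from the operating cost table.
MONTHLY_COSTS = {
    "Hardware amortization": [(3500, 4500), (1500, 2500), (500, 800), (800, 1200)],
    "Licenses": [(1500, 2500), (500, 1000), (300, 500), (0, 1000)],
    "External APIs": [(2000, 5000), (1500, 3000)],
    "Data center operations": [(800, 1500), (500, 1000)],
    "Personnel": [(4000, 6000), (3000, 5000), (2000, 3000)],
    "Support & maintenance": [(500, 1000)],
}


def category_totals(costs: dict) -> dict:
    """Sum the minimum and maximum monthly cost for each category."""
    return {
        name: (sum(low for low, _ in items), sum(high for _, high in items))
        for name, items in costs.items()
    }


def grand_total(costs: dict) -> tuple:
    """Sum the category totals into an overall (min, max) pair."""
    totals = category_totals(costs).values()
    return (sum(low for low, _ in totals), sum(high for _, high in totals))


if __name__ == "__main__":
    for name, (low, high) in category_totals(MONTHLY_COSTS).items():
        print(f"{name}: {low:,}-{high:,} EUR")
    low, high = grand_total(MONTHLY_COSTS)
    print(f"Total: {low:,}-{high:,} EUR")
```

The overall range evaluates to 22,900–39,500 EUR per month, matching the total row above.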

9.3 One-Time Investment Costs (CAPEX)

Item | Cost (EUR)
4x ESXi servers | 180,000–240,000
SAN storage (all-flash + hybrid) | 90,000–150,000
NFS filer | 30,000–48,000
Network (switches, firewall, cabling) | 48,000–72,000
Rack, UPS, PDU | 15,000–25,000
Initial setup & professional services | 30,000–50,000
Total CAPEX | 393,000–585,000

9.4 Cost Comparison: On-Premises vs. Cloud-Hybrid

Aspect | On-Premises | Cloud-Hybrid (Azure)
Monthly operations | 22,900–39,500 EUR (incl. hardware amortization) | 15,000–30,000 EUR (variable)
Initial investment | 393,000–585,000 EUR | Minimal (pay-as-you-go)
Break-even | ~24–36 months
Data sovereignty | Fully within the company's own data center | Depends on the Azure region
Scalability | Requires hardware procurement | On demand
Compliance (BaFin) | Full control | Shared responsibility
Vendor lock-in | Low (standard K8s) | Medium (Azure-specific services)

Note

The on-premises variant was chosen for reasons of data sovereignty, BaFin compliance, and full control over the infrastructure. The higher initial investment is amortized over an operating period of 3–5 years.

Appendix

A. IP Address Plan Summary

VLAN | Network | Usable IPs | Assigned Hosts
10 | 10.10.10.0/24 | 10.10.10.2–.254 | ~10 hosts
20 | 10.10.20.0/23 | 10.10.20.2–10.10.21.254 | ~15 hosts + VIPs
30 | 10.10.30.0/24 | 10.10.30.2–.254 | ~20 hosts
40 | 10.10.40.0/24 | 10.10.40.2–.254 | ~5 hosts
50 | 10.10.50.0/24 | 10.10.50.2–.254 | ~5 hosts
60 | 10.10.60.0/24 | 10.10.60.2–.254 | ~5 hosts

B. Port Matrix (Short Version)

Service | Port(s) | Protocol | Description
K8s API | 6443 | TCP/TLS | Kubernetes API server
etcd | 2379, 2380 | TCP/TLS | etcd client + peer
Kubelet | 10250 | TCP/TLS | Kubelet API
Calico BGP | 179 | TCP | BGP peering
Calico VXLAN | 4789 | UDP | VXLAN overlay
PostgreSQL | 5432 | TCP/TLS | Database
MongoDB | 27017 | TCP/TLS | Database
Redis | 6379 | TCP/TLS | Cache / message broker
Elasticsearch | 9200, 9300 | TCP/TLS | REST API + cluster communication
MinIO | 9000, 9001 | TCP/TLS | S3 API + console
Vault | 8200 | TCP/TLS | Vault API
Prometheus | 9090 | TCP | Metrics query
Grafana | 3000 | TCP | Dashboard UI
Loki | 3100 | TCP | Log push/query
Jaeger | 16686, 14268 | TCP | UI + collector
NGINX Ingress | 80, 443 | TCP | HTTP/HTTPS
LDAP/LDAPS | 389, 636 | TCP | Active Directory
Kerberos | 88, 464 | TCP/UDP | AD authentication
DNS | 53 | TCP/UDP | Name resolution
NFS | 2049 | TCP | NFS filer