1. Design Principles
1.1 Guiding Principles
| Principle | Description |
| --- | --- |
| On-Premises First | All workloads run in the company's own data center. No public-cloud hosting for applications or data. |
| VLAN Segmentation | Strict network segmentation by function. Each zone gets its own VLAN with dedicated firewall rules. |
| Defense in Depth | Layered security architecture: perimeter firewall → VLAN firewall → Kubernetes NetworkPolicies → application authentication. |
| Infrastructure as Code | The entire infrastructure is managed declaratively: Terraform for vSphere VMs, Helm/Kustomize for K8s, Ansible for base configuration. |
| Observability | Full observability across all layers: metrics (Prometheus), logs (Loki), traces (Jaeger), dashboards (Grafana). |
1.2 Architecture Overview
HAFS Group Data Center – Overview
+-------------------------------------------------------------------------+
| HAUCK AUFHAEUSER DATA CENTER |
| |
| +--------------+ +--------------+ +--------------+ +------------+ |
| | VLAN 10 | | VLAN 20 | | VLAN 30 | | VLAN 40 | |
| | Management | | Kubernetes | | Database | | Identity | |
| | | | | | & Storage | | | |
| | vCenter | | RKE2 CP x3 | | PostgreSQL | | AD DCs | |
| | Rancher | | Workers | | MongoDB | | Entra ID | |
| | Bastion | | (System/ | | MinIO | | Connect | |
| | | | App/AI/ | | Redis | | | |
| | | | Data) | | Elastic | | | |
| | | | | | Vault | | | |
| +------+-------+ +------+-------+ +------+-------+ +-----+------+ |
| | | | | |
| =======|=================|=================|=================|======== |
| | Core Switches (L3) | | |
| =======|=================|=================|=================|======== |
| | | | | |
| +------+-------+ +------+-------+ |
| | VLAN 50 | | VLAN 60 | |
| | DMZ | | Monitoring | |
| | | | | |
| | NGINX | | Prometheus | |
| | Ingress | | Grafana | |
| | WAF | | Loki | |
| | | | Jaeger | |
| +------+-------+ +--------------+ |
| | |
| +------+-------+ |
| | Perimeter |---- Internet (M365 Graph API + Anthropic API only)
| | Firewall | |
| +--------------+ |
+-------------------------------------------------------------------------+
2. VMware vSphere Infrastructure
2.1 ESXi Host Design
| Parameter | Value |
| --- | --- |
| Hypervisor | VMware ESXi 8.0 U2+ |
| Number of ESXi hosts | Minimum 4 (N+1 redundancy) |
| CPU per host | 2x Intel Xeon Gold 6348 (28 cores) |
| RAM per host | 512 GB DDR4 ECC |
| Local storage | 2x 960 GB NVMe SSD (ESXi boot + cache) |
| Network | 4x 25 GbE (2x Fabric A, 2x Fabric B) |
2.2 vCenter Server
| Parameter | Value |
| --- | --- |
| vCenter version | vCenter Server 8.0 U2+ |
| Deployment | vCenter Server Appliance (VCSA) |
| Placement | VLAN 10 – Management |
| HA | vCenter HA (Active/Passive/Witness) |
| License | vSphere Enterprise Plus |
2.3 Cluster Design
vSphere Cluster: HAFS-PROD
+-------------------------------------------------+
| vSphere Cluster: HAFS-PROD |
| |
| +-----------+ +-----------+ +-----------+ |
| | ESXi-01 | | ESXi-02 | | ESXi-03 | |
| | 56C/512GB | | 56C/512GB | | 56C/512GB | |
| +-----------+ +-----------+ +-----------+ |
| |
| +-----------+ |
| | ESXi-04 | <-- N+1 Reserve |
| | 56C/512GB | |
| +-----------+ |
| |
| Features: DRS, HA, vMotion, FT (selective) |
| Admission Control: 25% Reserve |
+-------------------------------------------------+
2.4 Resource Pools
| Resource Pool | CPU Share | RAM Share | Purpose |
| --- | --- | --- | --- |
| RP-K8s-ControlPlane | 12 vCPU | 48 GB | RKE2 control plane nodes |
| RP-K8s-System | 16 vCPU | 64 GB | System workers (ingress, monitoring) |
| RP-K8s-Application | 32 vCPU | 128 GB | Application workers |
| RP-K8s-AI | 24 vCPU | 96 GB | AI/ML workers |
| RP-K8s-Data | 16 vCPU | 64 GB | Data workers (Elasticsearch, analytics) |
| RP-Database | 24 vCPU | 128 GB | PostgreSQL, MongoDB, Redis |
| RP-Infrastructure | 16 vCPU | 64 GB | vCenter, Rancher, bastion, AD |
2.5 Storage Architecture
| Storage Tier | Technology | Capacity | Purpose |
| --- | --- | --- | --- |
| Tier 1 – Performance | SAN (FC 32 Gbit/s), all-flash | 10 TB | Databases, etcd, critical PVs |
| Tier 2 – Standard | SAN (FC 16 Gbit/s), hybrid | 20 TB | Application data, container images |
| Tier 3 – Capacity | NFS (10 GbE) | 50 TB | Backups, logs, archiving, MinIO |
Storage Topology & Kubernetes StorageClasses
+--------------------------------------------+
| Storage Topology |
| |
| ESXi Hosts --FC 32G--> SAN Array (Tier 1) |
| --FC 16G--> SAN Array (Tier 2) |
| --10GbE---> NFS Filer (Tier 3) |
| |
| K8s StorageClasses: |
| - hafs-fast -> Tier 1 (SAN) |
| - hafs-standard -> Tier 2 (SAN) |
| - hafs-capacity -> Tier 3 (NFS) |
+--------------------------------------------+
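The StorageClass-to-tier mapping above can be sketched as a Kubernetes manifest. This is a minimal sketch assuming the vSphere CSI driver is installed and a vCenter storage policy exists for Tier 1; the policy name `HAFS-Tier1-AllFlash` is an assumption, not taken from this document.

```yaml
# Sketch: "hafs-fast" StorageClass backed by a Tier-1 vSphere storage policy.
# The storagepolicyname value is an assumed example, not a documented name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hafs-fast
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "HAFS-Tier1-AllFlash"   # vCenter SPBM policy for the all-flash SAN
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain                        # keep critical PVs (databases, etcd) on PVC deletion
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

`hafs-standard` and `hafs-capacity` would follow the same pattern with their respective storage policies.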
3. RKE2 Kubernetes Cluster Design
3.1 Cluster Configuration
| Parameter | Value |
| --- | --- |
| Kubernetes distribution | RKE2 (Rancher Kubernetes Engine 2) |
| Kubernetes version | v1.29.x (LTS channel) |
| Management | Rancher v2.8+ (UI + GitOps) |
| CNI plugin | Calico (NetworkPolicy + BGP-capable) |
| Container runtime | containerd (bundled with RKE2) |
| Ingress controller | NGINX Ingress Controller (exposed via DMZ) |
| Service mesh | Optional: Istio (if mTLS is required) |
| Certificate management | cert-manager (internal CA) |
| Secret management | HashiCorp Vault + External Secrets Operator |
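The table above translates into a small RKE2 server configuration. A minimal sketch of `/etc/rancher/rke2/config.yaml` for the first control plane node, assuming the documented API VIP and Calico choice; the token placeholder and the decision to disable the bundled ingress (because NGINX Ingress runs behind the DMZ VIP) are assumptions.

```yaml
# Sketch: /etc/rancher/rke2/config.yaml for the first server node.
# <shared-cluster-secret> is a placeholder; tls-san values follow section 5.3.
token: <shared-cluster-secret>
tls-san:
  - k8s-api.hafs.internal
  - 10.10.20.100          # kube-vip API VIP
cni: calico
disable:
  - rke2-ingress-nginx    # ingress is deployed separately behind the DMZ VIP
```

Additional servers join with `server: https://k8s-api.hafs.internal:9345` plus the same token.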
3.2 Node Topology
RKE2 Cluster: hafs-prod (~12 VMs)
+-----------------------------------------------------------------+
| RKE2 Cluster: hafs-prod |
| |
| +------------------- Control Plane -------------------+ |
| | cp-01 cp-02 cp-03 | |
| | 4 vCPU 4 vCPU 4 vCPU | |
| | 16 GB RAM 16 GB RAM 16 GB RAM | |
| | 100 GB SSD 100 GB SSD 100 GB SSD | |
| | (etcd+API) (etcd+API) (etcd+API) | |
| +----------------------------------------------------| |
| |
| +---- System Workers ----+ +---- App Workers -----------+ |
| | sys-w01 sys-w02 | | app-w01 app-w02 app-w03 | |
| | 4vCPU 4vCPU | | 8vCPU 8vCPU 8vCPU | |
| | 16GB 16GB | | 32GB 32GB 32GB | |
| +------------------------+ +-----------------------------+ |
| |
| +---- AI Workers --------+ +---- Data Workers ----------+ |
| | ai-w01 ai-w02 | | data-w01 data-w02 | |
| | 8vCPU 8vCPU | | 8vCPU 8vCPU | |
| | 32GB 32GB | | 32GB 32GB | |
| +------------------------+ +-----------------------------+ |
+-----------------------------------------------------------------+
3.3 Node Pools, Labels, and Taints
| Node Pool | Count | Label | Taint | Purpose |
| --- | --- | --- | --- | --- |
| control-plane | 3 | node-role.kubernetes.io/control-plane | node-role.kubernetes.io/control-plane:NoSchedule | etcd, API server, controller manager, scheduler |
| system | 2 | hafs.de/pool=system | hafs.de/pool=system:PreferNoSchedule | Ingress controller, cert-manager, monitoring agents, CoreDNS |
| application | 3 | hafs.de/pool=application | – | Portal, tickets, notifications, automation, knowledge, governance |
| ai | 2 | hafs.de/pool=ai | hafs.de/pool=ai:NoSchedule | Claude API gateway, AI orchestrator, NLP, embeddings |
| data | 2 | hafs.de/pool=data | hafs.de/pool=data:NoSchedule | Elasticsearch, analytics, SIEM add-on |
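A workload targeting a tainted pool needs both a nodeSelector (to land on the pool) and a matching toleration (to be admitted past the taint). A minimal sketch for an AI workload; the deployment name and image path are placeholders, not from this document.

```yaml
# Sketch: scheduling a workload onto the tainted "ai" pool.
# Name and image are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-orchestrator
  namespace: hafs-ai
spec:
  replicas: 2
  selector:
    matchLabels: { app: ai-orchestrator }
  template:
    metadata:
      labels: { app: ai-orchestrator }
    spec:
      nodeSelector:
        hafs.de/pool: ai          # pin to the AI node pool
      tolerations:
        - key: hafs.de/pool
          operator: Equal
          value: ai
          effect: NoSchedule      # tolerate the pool's taint
      containers:
        - name: orchestrator
          image: registry.hafs.internal/hafs/ai-orchestrator:latest
```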
3.4 Namespace Structure
| Namespace | Description | Node Pool |
| --- | --- | --- |
| kube-system | Kubernetes core components | system |
| ingress-nginx | NGINX Ingress Controller | system |
| cert-manager | TLS certificate management (internal CA) | system |
| monitoring | Prometheus, Grafana, Loki, Jaeger, Alertmanager | system |
| hafs-portal | Self-help portal frontend + BFF | application |
| hafs-tickets | Ticket system (CRUD, workflows, SLA) | application |
| hafs-ai | AI services (Claude gateway, orchestrator, NLP) | ai |
| hafs-security | Authentication, authorization, RBAC | application |
| hafs-governance | Compliance, audit trail, data governance | application |
| hafs-automation | Workflow engine, runbooks, scheduled tasks | application |
| hafs-knowledge | Knowledge base, document management | application |
| hafs-analytics | Reporting, dashboards, metrics | data |
| hafs-notifications | E-mail, push, in-app notifications | application |
| hafs-shared | Shared libraries, config, common services | application |
| hafs-siem-addon | SIEM integration (log forwarding, correlation) | data |
| hafs-iam-addon | IAM integration (AD sync, provisioning) | application |
| hafs-monitor-addon | Monitoring add-on (custom exporters, alerts) | system |
3.5 Kubernetes NetworkPolicies (Calico)
NetworkPolicy concept: default deny + whitelist
Namespace: hafs-portal (example)
+-----------------------------------------------------------------+
| 1) Default Deny All |
| podSelector: {} |
| policyTypes: [Ingress, Egress] |
| |
| 2) Allow Ingress from NGINX Ingress |
| from: namespaceSelector: ingress-nginx |
| ports: TCP/8080 |
| |
| 3) Allow Egress to hafs-ai |
| to: namespaceSelector: hafs-ai |
| ports: TCP/8080 |
| |
| 4) Allow Egress to kube-system (DNS) |
| to: namespaceSelector: kube-system |
| ports: UDP/53 |
+-----------------------------------------------------------------+
NetworkPolicy Strategy
Each namespace receives a default-deny-all policy. Permitted traffic is defined explicitly via whitelist rules. Calico as the CNI additionally provides GlobalNetworkPolicies for cluster-wide rules.
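Rules 1) and 2) of the hafs-portal example can be sketched as manifests. This assumes the `ingress-nginx` namespace carries the standard `kubernetes.io/metadata.name` label, which the kubelet sets automatically on modern clusters.

```yaml
# Sketch: default deny plus the ingress whitelist for hafs-portal.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: hafs-portal
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]     # no rules listed -> all traffic denied
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-nginx
  namespace: hafs-portal
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```

The egress rules to hafs-ai and kube-system (DNS) follow the same pattern with `egress` and a `to:` selector.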
4. VLAN Architecture in Detail
4.1 VLAN Overview
| VLAN ID | Name | Subnet | Gateway | Description |
| --- | --- | --- | --- | --- |
| 10 | Management | 10.10.10.0/24 | 10.10.10.1 | vCenter, Rancher, bastion host, IPMI/iDRAC |
| 20 | Kubernetes | 10.10.20.0/23 | 10.10.20.1 | RKE2 control plane + worker nodes (/23 for scaling headroom) |
| 30 | Database | 10.10.30.0/24 | 10.10.30.1 | PostgreSQL, MongoDB, MinIO, Redis, Elasticsearch, Vault |
| 40 | Identity | 10.10.40.0/24 | 10.10.40.1 | AD domain controllers, Entra ID Connect |
| 50 | DMZ | 10.10.50.0/24 | 10.10.50.1 | NGINX Ingress VIP, WAF, reverse proxy |
| 60 | Monitoring | 10.10.60.0/24 | 10.10.60.1 | Prometheus, Grafana, Loki, Jaeger |
4.2 VLAN Diagram
VLAN topology with perimeter firewall
+---------------+
| Internet |
+-------+-------+
|
+-------+-------+
| Perimeter FW |
| (Palo Alto / |
| FortiGate) |
+-------+-------+
|
+--------------+-------------------+
| Core Switch L3 |
| (Cisco Nexus / Arista) |
+--+----+----+----+----+----+------+
| | | | | |
+----------+ | | | | +----------+
| | | | | |
+------+------+ +-----+----++ +--+---+-----+ +-------+-----+
| VLAN 10 | | VLAN 20 | | VLAN 30 | | VLAN 40 |
| Management | | K8s | | DB/Store | | Identity |
| .0/24 | | .0/23 | | .0/24 | | .0/24 |
+-------------+ +-----------+ +------------+ +-------------+
+-------------+ +-----------+
| VLAN 50 | | VLAN 60 |
| DMZ | | Monitor |
| .0/24 | | .0/24 |
+-------------+ +-----------+
4.3 VLAN 10 – Management
| Host | IP Address | Function |
| --- | --- | --- |
| vcenter.hafs.internal | 10.10.10.10 | vCenter Server Appliance |
| rancher.hafs.internal | 10.10.10.20 | Rancher management server |
| bastion.hafs.internal | 10.10.10.30 | SSH bastion / jump host |
| ipmi-esxi01–04.hafs.internal | 10.10.10.41–.44 | ESXi host IPMI |
Access
Only via the bastion host with MFA. No direct access from other VLANs.
4.4 VLAN 20 – Kubernetes
| Host | IP Address | Role |
| --- | --- | --- |
| rke2-cp-01 | 10.10.20.11 | Control Plane Node 1 |
| rke2-cp-02 | 10.10.20.12 | Control Plane Node 2 |
| rke2-cp-03 | 10.10.20.13 | Control Plane Node 3 |
| rke2-sys-w01 | 10.10.20.21 | System Worker 1 |
| rke2-sys-w02 | 10.10.20.22 | System Worker 2 |
| rke2-app-w01 | 10.10.20.31 | Application Worker 1 |
| rke2-app-w02 | 10.10.20.32 | Application Worker 2 |
| rke2-app-w03 | 10.10.20.33 | Application Worker 3 |
| rke2-ai-w01 | 10.10.20.41 | AI Worker 1 |
| rke2-ai-w02 | 10.10.20.42 | AI Worker 2 |
| rke2-data-w01 | 10.10.20.51 | Data Worker 1 |
| rke2-data-w02 | 10.10.20.52 | Data Worker 2 |
| rke2-api-vip | 10.10.20.100 | K8s API Server VIP (kube-vip) |
4.5 VLAN 30 – Database & Storage
| Host / Service | IP Address | Function |
| --- | --- | --- |
| pg-primary | 10.10.30.11 | PostgreSQL Primary |
| pg-replica-01 | 10.10.30.12 | PostgreSQL Streaming Replica |
| pg-replica-02 | 10.10.30.13 | PostgreSQL Streaming Replica |
| mongo-rs-01–03 | 10.10.30.21–.23 | MongoDB Replica Set (3 Members) |
| minio-01–02 | 10.10.30.31–.32 | MinIO Object Storage |
| redis-master | 10.10.30.41 | Redis Primary (Sentinel-Cluster) |
| redis-replica-01 | 10.10.30.42 | Redis Replica |
| redis-sentinel-01 | 10.10.30.43 | Redis Sentinel |
| elastic-01–03 | 10.10.30.51–.53 | Elasticsearch Cluster (3 Nodes) |
| vault.hafs.internal | 10.10.30.61 | HashiCorp Vault (HA) |
4.6 VLAN 40 – Identity
| Host | IP Address | Function |
| --- | --- | --- |
| dc01.hafs.internal | 10.10.40.11 | Active Directory Domain Controller 1 |
| dc02.hafs.internal | 10.10.40.12 | Active Directory Domain Controller 2 |
| aadconnect.hafs.internal | 10.10.40.21 | Microsoft Entra ID Connect Server |
4.7 VLAN 50 – DMZ
| Host / Service | IP Address | Function |
| --- | --- | --- |
| ingress-vip | 10.10.50.10 | NGINX Ingress Controller VIP |
| waf-01 | 10.10.50.20 | Web Application Firewall |
4.8 VLAN 60 – Monitoring
| Host / Service | IP Address | Function |
| --- | --- | --- |
| prometheus | 10.10.60.11 | Prometheus + Alertmanager |
| grafana | 10.10.60.12 | Grafana Dashboards |
| loki | 10.10.60.13 | Loki Log Aggregation |
| jaeger | 10.10.60.14 | Jaeger Tracing Backend |
4.9 Firewall Rules Between VLANs
| Source (VLAN) | Destination (VLAN) | Protocol/Port | Description |
| --- | --- | --- | --- |
| VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/6443 | Rancher → K8s API server |
| VLAN 10 (Mgmt) | VLAN 20 (K8s) | TCP/22 | Bastion → node SSH |
| VLAN 10 (Mgmt) | VLAN 30 (DB) | TCP/5432,27017,9200,8200 | Management → DB administration |
| VLAN 10 (Mgmt) | VLAN 60 (Mon) | TCP/3000,9090 | Mgmt → Grafana/Prometheus UI |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/5432 | App pods → PostgreSQL |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/27017 | App pods → MongoDB |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/6379 | App pods → Redis |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9200 | Data pods → Elasticsearch |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/9000 | App pods → MinIO S3 API |
| VLAN 20 (K8s) | VLAN 30 (DB) | TCP/8200 | App pods → Vault |
| VLAN 20 (K8s) | VLAN 40 (ID) | TCP/389,636 | K8s → AD LDAP/LDAPS |
| VLAN 20 (K8s) | VLAN 40 (ID) | TCP/88,464 | K8s → Kerberos |
| VLAN 20 (K8s) | VLAN 60 (Mon) | TCP/9090 | Pods → Prometheus query API (reverse direction) |
| VLAN 50 (DMZ) | VLAN 20 (K8s) | TCP/80,443 | Ingress → K8s service endpoints |
| VLAN 60 (Mon) | VLAN 20 (K8s) | TCP/9100,8080 | Prometheus → node exporter, metrics |
| VLAN 60 (Mon) | VLAN 30 (DB) | TCP/9187,9216,9114 | Prometheus → DB exporters |
| VLAN 40 (ID) | Internet | TCP/443 | Entra ID Connect → Microsoft 365 |
| VLAN 20 (K8s) | Internet | TCP/443 | AI pods → api.anthropic.com (via proxy) |
| VLAN 20 (K8s) | Internet | TCP/443 | App pods → graph.microsoft.com (via proxy) |
| * | * | * | Default: DENY ALL |
5. DNS Architecture
5.1 Overview
DNS architecture (AD DNS + CoreDNS)
+-----------------------------------------------------------+
| DNS Architecture |
| |
| +-----------------+ +-----------------+ |
| | AD DNS (DC01) |<---->| AD DNS (DC02) | |
| | 10.10.40.11 | Repl. | 10.10.40.12 | |
| | | | | |
| | Zone: | | Zone: | |
| | hafs.internal | | hafs.internal | |
| +--------+--------+ +--------+--------+ |
| | | |
| +-----------+-------------+ |
| | |
| v |
| +------------------------+ |
| | CoreDNS (in K8s) | |
| | cluster.local Zone | |
| | | |
| | Forward: hafs.internal| |
| | -> 10.10.40.11/12 | |
| | Forward: Internet | |
| | -> not allowed | |
| +------------------------+ |
+-----------------------------------------------------------+
5.2 DNS Zones
| Zone | Type | Server | Description |
| --- | --- | --- | --- |
| hafs.internal | AD-integrated | DC01, DC02 | Primary internal zone |
| 10.10.in-addr.arpa | AD-integrated | DC01, DC02 | Reverse lookup |
| cluster.local | CoreDNS | K8s CoreDNS pods | Kubernetes-internal service resolution |
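The forwarding split shown in the overview diagram corresponds to a Corefile like the following. A sketch only: the ConfigMap name depends on how CoreDNS was deployed (the RKE2 Helm chart name is an assumption), and the fail-closed handling of general internet names is one possible implementation of "not allowed".

```yaml
# Sketch: CoreDNS config forwarding hafs.internal to the AD DNS servers.
# The ConfigMap name varies by deployment method and is assumed here.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rke2-coredns-rke2-coredns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        kubernetes cluster.local in-addr.arpa ip6.arpa
        cache 30
    }
    hafs.internal:53 {
        forward . 10.10.40.11 10.10.40.12   # AD DNS on DC01/DC02
    }
    .:53 {
        errors    # no upstream forwarder: general internet names fail closed
    }
```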
5.3 Key DNS Records (hafs.internal)
| FQDN | Type | Value | Description |
| --- | --- | --- | --- |
| portal.hafs.internal | A | 10.10.50.10 | Self-help portal (via ingress) |
| api.hafs.internal | A | 10.10.50.10 | API gateway (via ingress) |
| rancher.hafs.internal | A | 10.10.10.20 | Rancher UI |
| grafana.hafs.internal | A | 10.10.50.10 | Grafana (via ingress) |
| vault.hafs.internal | A | 10.10.30.61 | HashiCorp Vault |
| vcenter.hafs.internal | A | 10.10.10.10 | vCenter Server |
| k8s-api.hafs.internal | A | 10.10.20.100 | Kubernetes API VIP |
6. Internet Connectivity
6.1 Basic Principle
The entire self-help portal infrastructure runs on-premises. External internet connectivity is required for exactly two purposes:
- Microsoft 365 Graph API – ticket creation, e-mail delivery, user synchronization
- Anthropic Claude API – AI-assisted ticket analysis and chatbot functionality
6.2 Proxy Architecture
Outbound proxy via the perimeter firewall
+------------------------------------------------------------+
| K8s Pod (hafs-ai / hafs-tickets) |
| | |
| | HTTP_PROXY=http://proxy.hafs.internal:3128 |
| | HTTPS_PROXY=http://proxy.hafs.internal:3128 |
| | NO_PROXY=.hafs.internal,.cluster.local,10.10.0.0/16 |
| | |
| +----------+----------------------------------------------+
| |
| v
| +------------------+
| | Forward Proxy | Squid / HAProxy
| | VLAN 20 | with TLS inspection (optional)
| | 10.10.20.200 |
| +----------+-------+
| |
| v
| +------------------+
| | Perimeter FW | allowlist-based
| +----------+-------+
| |
| v
| Internet
+------------------------------------------------------------+
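The proxy variables from the diagram are injected into the pod spec as plain environment variables. A minimal sketch; the container name and image path are placeholders, not from this document.

```yaml
# Sketch: pod-spec fragment wiring the documented proxy settings into a
# container (name/image are illustrative placeholders).
spec:
  containers:
    - name: claude-gateway
      image: registry.hafs.internal/hafs/claude-gateway:latest
      env:
        - name: HTTP_PROXY
          value: "http://proxy.hafs.internal:3128"
        - name: HTTPS_PROXY
          value: "http://proxy.hafs.internal:3128"
        - name: NO_PROXY   # keep cluster-internal and on-prem traffic direct
          value: ".hafs.internal,.cluster.local,10.10.0.0/16"
```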
6.3 Firewall Allowlist (Outbound)
| Destination FQDN | Port | Protocol | Source VLAN | Purpose |
| --- | --- | --- | --- | --- |
| api.anthropic.com | 443 | HTTPS | VLAN 20 | Claude API (chatbot, ticket analysis) |
| graph.microsoft.com | 443 | HTTPS | VLAN 20 | M365 Graph API |
| login.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID authentication |
| *.servicebus.windows.net | 443 | HTTPS | VLAN 40 | Entra ID Connect sync |
| adminwebservice.microsoftonline.com | 443 | HTTPS | VLAN 40 | Entra ID Connect |
Default: DENY ALL outbound
Everything else is blocked. The perimeter firewall's default rule is DENY ALL outbound; only explicitly listed FQDNs are permitted.
6.4 Rate Limiting & Budget Controls
| API | Rate Limit | Monthly Token Budget | Alerting |
| --- | --- | --- | --- |
| Anthropic Claude | 100 req/min | Max. 5,000,000 input tokens | Alert at 80% of budget |
| Microsoft Graph | 1,000 req/min | No token limit | Alert at >500 errors/hour |
7. Monitoring & Observability Stack
7.1 Stack Overview
Observability stack (on-premises)
+--------------------------------------------------------------------+
| Observability Stack (On-Premises) |
| |
| +--------------+ +--------------+ +--------------+ |
| | Fluent Bit | | Fluent Bit | | Fluent Bit | DaemonSet |
| | (Node) | | (Node) | | (Node) | on all |
| +------+-------+ +------+-------+ +------+-------+ nodes |
| | | | |
| +--------+--------+-----------------+ |
| | |
| v |
| +---------------+ |
| | Loki | log aggregation |
| | (VLAN 60) | retention: 90 days |
| +---------------+ |
| |
| +--------------------------------------------------------------+ |
| | Prometheus + Alertmanager | |
| | - K8s metrics (kube-state-metrics, node-exporter) | |
| | - App metrics (/metrics endpoints) | |
| | - DB metrics (postgres_exporter, mongodb_exporter) | |
| | - vSphere metrics (vmware_exporter) | |
| | Retention: 180 days (Tier 2 storage) | |
| +--------------------------------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | OpenTelemetry Collector (Sidecar + Gateway) | |
| | --> Jaeger (Distributed Tracing) | |
| | Retention: 30 days | |
| +--------------------------------------------------------------+ |
| |
| +--------------------------------------------------------------+ |
| | Grafana | |
| | - Datasources: Prometheus, Loki, Jaeger | |
| | - Auth: LDAP (AD-integrated) | |
| | - Dashboards: K8s, Apps, DBs, vSphere, Business Metrics | |
| +--------------------------------------------------------------+ |
+--------------------------------------------------------------------+
7.2 Prometheus & Alertmanager
| Parameter | Value |
| --- | --- |
| Deployment | StatefulSet in the monitoring namespace |
| Storage | 500 GB (hafs-standard StorageClass) |
| Retention | 180 days |
| Scrape interval | 30s (default), 15s (critical services) |
| Alertmanager replicas | 2 (HA) |
| Alert routes | E-mail, MS Teams webhook, PagerDuty (optional) |
7.3 Key Alert Rules
| Alert | Condition | Severity |
| --- | --- | --- |
| KubePodCrashLooping | Pod restarted >3x in 10 min | critical |
| KubeNodeNotReady | Node NotReady >5 min | critical |
| PostgreSQLReplicationLag | Replica >30s behind primary | warning |
| ClaudeAPIErrorRate | >5% error rate over 5 min | critical |
| DiskSpaceLow | <15% free on PV | warning |
| CertificateExpiringSoon | TLS certificate expires in <14 days | warning |
| HighMemoryUsage | Pod memory >90% of limit | warning |
| IngressLatencyHigh | p99 latency >2s | warning |
| VaultSealedStatus | Vault sealed | critical |
| EtcdHighCommitDuration | etcd commit >250ms | warning |
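One of these rules, expressed as a Prometheus Operator resource, might look as follows. A sketch only: the metric name `hafs_claude_requests_total` and its `status` label are assumed, since the document does not specify the gateway's metric names.

```yaml
# Sketch: the ClaudeAPIErrorRate alert as a PrometheusRule.
# hafs_claude_requests_total and its status label are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hafs-ai-alerts
  namespace: monitoring
spec:
  groups:
    - name: hafs-ai
      rules:
        - alert: ClaudeAPIErrorRate
          expr: |
            sum(rate(hafs_claude_requests_total{status=~"5..|429"}[5m]))
              / sum(rate(hafs_claude_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Claude API error rate above 5% for 5 minutes"
```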
7.4 Grafana Dashboards
| Dashboard | Description |
| --- | --- |
| K8s Cluster Overview | Nodes, pods, deployments, resource utilization |
| HAFS Portal Application | Request rate, latency, error rate, active sessions |
| AI Service Monitoring | Claude API calls, token usage, latency, cost |
| Ticket System | Ticket creation, SLA compliance, backlog |
| Database Performance | PostgreSQL connections, query time, replication lag |
| vSphere Infrastructure | ESXi CPU/RAM/storage, VM performance |
| Security & Compliance | Auth failures, RBAC violations, certificate status |
| Business KPIs | Ticket volume, self-service rate, user satisfaction |
7.5 Loki (Log Aggregation)
| Parameter | Value |
| --- | --- |
| Deployment | StatefulSet (3 replicas, microservices mode) |
| Storage backend | MinIO (S3-compatible, VLAN 30) |
| Retention | 90 days |
| Ingestion rate | Max. 10 MB/s |
| Log labels | namespace, pod, container, app, level |
7.6 OpenTelemetry & Jaeger (Distributed Tracing)
| Parameter | Value |
| --- | --- |
| Collector deployment | DaemonSet (agent) + Deployment (gateway) |
| Sampling rate | 10% (default), 100% (on errors) |
| Trace backend | Jaeger (Elasticsearch storage) |
| Retention | 30 days |
| Instrumentation | OpenTelemetry SDK in all HAFS microservices |
7.7 Fluent Bit Pipeline
Fluent Bit log routing
Container logs --> Fluent Bit (DaemonSet)
|
+--> Loki (long-term storage, 90 days)
+--> Elasticsearch (full-text search, SIEM)
+--> stdout (debug, dev only)
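The fan-out above can be sketched in Fluent Bit's YAML configuration format (supported in recent Fluent Bit releases); classic `.conf` syntax works equivalently. The Loki host and Elasticsearch address follow sections 4.8 and 4.5; the tag and label values are assumptions.

```yaml
# Sketch: Fluent Bit pipeline tailing container logs and fanning out
# to Loki and Elasticsearch. Tag/label values are illustrative.
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
  filters:
    - name: kubernetes          # enrich records with namespace/pod/container metadata
      match: kube.*
  outputs:
    - name: loki                # long-term storage (90-day retention on the Loki side)
      match: kube.*
      host: loki.hafs.internal
      port: 3100
      labels: job=fluent-bit
    - name: es                  # full-text search / SIEM feed
      match: kube.*
      host: 10.10.30.51
      port: 9200
```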
8. Disaster Recovery & Business Continuity
8.1 DR Strategy Overview
| Component | RPO (Recovery Point) | RTO (Recovery Time) | Method |
| --- | --- | --- | --- |
| Kubernetes etcd | 1 hour | 30 minutes | etcd snapshot (hourly) + S3 upload |
| PostgreSQL | 5 minutes | 15 minutes | Streaming replication + WAL archiving |
| MongoDB | 5 minutes | 15 minutes | Replica set (automatic failover) |
| Redis | 1 hour | 5 minutes | Sentinel failover + RDB snapshots |
| Elasticsearch | 1 hour | 30 minutes | Snapshot to MinIO (hourly) |
| MinIO | 24 hours | 1 hour | Erasure coding + Veeam backup |
| VMs (all) | 24 hours | 2 hours | Veeam Backup & Replication |
| Vault | 1 hour | 30 minutes | Raft snapshots + offline backup |
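The hourly etcd snapshot with S3 upload maps directly onto RKE2's built-in snapshot options in `/etc/rancher/rke2/config.yaml`. A sketch under assumptions: the MinIO endpoint follows the MinIO hosts in VLAN 30, and the bucket name and credential placeholders are illustrative.

```yaml
# Sketch: hourly etcd snapshots uploaded to MinIO (S3) via RKE2 options.
# Endpoint, bucket, and credential placeholders are assumptions.
etcd-snapshot-schedule-cron: "0 * * * *"   # hourly, matching the 1 h RPO
etcd-snapshot-retention: 48                # 48 hours, matching section 8.3
etcd-s3: true
etcd-s3-endpoint: minio.hafs.internal:9000
etcd-s3-bucket: etcd-snapshots
etcd-s3-access-key: <from Vault>
etcd-s3-secret-key: <from Vault>
```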
8.2 VMware HA & vMotion
VMware HA Cluster: HAFS-PROD
+--------------------------------------------------------+
| VMware HA Cluster: HAFS-PROD |
| |
| +---------+ +---------+ +---------+ +---------+ |
| | ESXi-01 | | ESXi-02 | | ESXi-03 | | ESXi-04 | |
| | Active | | Active | | Active | | Reserve | |
| +---------+ +---------+ +---------+ +---------+ |
| |
| Admission Control: 25% CPU/RAM Reserve |
| Host Monitoring: enabled |
| VM Monitoring: enabled (application level) |
| vMotion: live migration during host maintenance |
| DRS: Fully Automated (Threshold: 3) |
| Proactive HA: Hardware-Health Monitoring |
+--------------------------------------------------------+
| Feature | Configuration |
| --- | --- |
| VMware HA | Enabled; host isolation response: Power Off |
| Admission Control | 25% reserve (tolerates one host failure) |
| vMotion | Dedicated VMkernel port on 25 GbE |
| DRS | Fully Automated, migration threshold 3 |
| Proactive HA | Enabled (IPMI/iLO sensor monitoring) |
8.3 Veeam Backup & Replication
| Backup Job | Schedule | Retention | Storage Tier |
| --- | --- | --- | --- |
| VM full backup | Sunday 02:00 | 4 weeks | Tier 3 (NFS) |
| VM incremental backup | Daily 02:00 | 14 days | Tier 3 (NFS) |
| DB dump (PostgreSQL) | Every 6 hours | 7 days | Tier 3 (NFS) |
| etcd snapshot | Hourly | 48 hours | MinIO (VLAN 30) |
| Vault snapshot | Hourly | 48 hours | MinIO (VLAN 30) |
| Config backup (Git) | On every change | Unlimited (Git) | GitLab on-prem |
8.4 PostgreSQL Backup & Replication
PostgreSQL Streaming Replication + WAL Archiving
+--------------+ Streaming +--------------+
| pg-primary |----Replication---->| pg-replica-01|
| 10.10.30.11 | | 10.10.30.12 |
+------+-------+ +--------------+
| Streaming +--------------+
+--------Replication------->| pg-replica-02|
| | 10.10.30.13 |
| +--------------+
|
v
+--------------+
| WAL Archive |--> MinIO (S3)
| pg_basebackup| retention: 7 days
+--------------+
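On the primary, the topology above corresponds to a handful of `postgresql.conf` settings. A sketch under assumptions: the archive command uses the MinIO client `mc` with an assumed alias `hafs-minio` and bucket `pg-wal`; the exact archiving tool is not specified in this document.

```conf
# Sketch: primary-side postgresql.conf settings for streaming replication
# with WAL archiving to MinIO. The mc alias/bucket are assumptions.
wal_level = replica            # required for streaming replicas
max_wal_senders = 5            # two replicas plus headroom for pg_basebackup
wal_keep_size = '1GB'          # retain WAL so replicas can catch up
archive_mode = on
archive_command = 'mc cp %p hafs-minio/pg-wal/%f'
```

The replicas then point at pg-primary via `primary_conninfo` and run in standby mode.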
8.5 DR Test Plan
| Test | Frequency | Description |
| --- | --- | --- |
| etcd restore | Quarterly | Restore an etcd snapshot onto the staging cluster |
| PostgreSQL failover | Quarterly | Manual failover to a replica, application recovery |
| VM recovery (Veeam) | Semi-annually | Full VM restore of a worker node |
| Full stack recovery | Annually | Complete rebuild from backups + IaC |
| Node failure simulation | Quarterly | Shut down a worker node, verify K8s rescheduling |
9. Cost Estimate
9.1 Monthly Operating Costs (On-Premises)
| Cost Item | Monthly (EUR) | Notes |
| --- | --- | --- |
| Hardware depreciation | | |
| 4x ESXi servers (5-yr depreciation) | 3,500–4,500 | ~210,000 EUR purchase / 60 months |
| SAN storage (5-yr depreciation) | 1,500–2,500 | ~90,000–150,000 EUR / 60 months |
| NFS filer (5-yr depreciation) | 500–800 | ~30,000–48,000 EUR / 60 months |
| Network (switches, firewall, 5 yr) | 800–1,200 | ~48,000–72,000 EUR / 60 months |
| Licenses | | |
| VMware vSphere Enterprise Plus (4 hosts) | 1,500–2,500 | Subscription model (Broadcom) |
| Rancher Prime (optional) | 500–1,000 | Alternative: open source (free) |
| Veeam Backup & Replication | 300–500 | Per-socket license |
| HashiCorp Vault Enterprise (optional) | 0–1,000 | Alternative: Community Edition (free) |
| External APIs | | |
| Anthropic Claude API | 2,000–5,000 | Depends on token volume and model |
| Microsoft 365 E3/E5 licenses | 1,500–3,000 | For Entra ID, Graph API, Exchange Online |
| Data center operations | | |
| Power & cooling | 800–1,500 | 4 servers + storage + network (~8–12 kW) |
| Rack space / colocation | 500–1,000 | If an external data center is used |
| Personnel | | |
| Infrastructure administration (pro rata) | 4,000–6,000 | ~0.5 FTE system administrator |
| Kubernetes operations (pro rata) | 3,000–5,000 | ~0.3–0.5 FTE DevOps engineer |
| Security & compliance (pro rata) | 2,000–3,000 | ~0.2 FTE security engineer |
| Support & maintenance | | |
| Hardware maintenance contracts | 500–1,000 | Next-business-day replacement |
| Software support (VMware, Veeam) | Included in licenses | – |
9.2 Total Monthly Costs
| Category | Min. (EUR) | Max. (EUR) |
| --- | --- | --- |
| Hardware depreciation | 6,300 | 9,000 |
| Licenses | 2,300 | 5,000 |
| External APIs | 3,500 | 8,000 |
| Data center operations | 1,300 | 2,500 |
| Personnel (pro rata) | 9,000 | 14,000 |
| Support & maintenance | 500 | 1,000 |
| Total | 22,900 | 39,500 |
9.3 One-Time Capital Expenditure (CAPEX)
| Item | Cost (EUR) |
| --- | --- |
| 4x ESXi servers | 180,000–240,000 |
| SAN storage (all-flash + hybrid) | 90,000–150,000 |
| NFS filer | 30,000–48,000 |
| Network (switches, firewall, cabling) | 48,000–72,000 |
| Rack, UPS, PDU | 15,000–25,000 |
| Initial setup & professional services | 30,000–50,000 |
| Total CAPEX | 393,000–585,000 |
9.4 Cost Comparison: On-Premises vs. Cloud Hybrid
| Aspect | On-Premises | Cloud Hybrid (Azure) |
| --- | --- | --- |
| Monthly operations | 22,900–39,500 EUR | 15,000–30,000 EUR (variable) |
| Initial investment | 393,000–585,000 EUR | Minimal (pay-as-you-go) |
| Break-even | ~24–36 months | – |
| Data sovereignty | Fully within the company's own data center | Depends on Azure region |
| Scalability | Requires hardware procurement | On demand |
| Compliance (BaFin) | Full control | Shared responsibility |
| Vendor lock-in | Low (standard K8s) | Medium (Azure-specific services) |
Note
The on-premises option was chosen for reasons of data sovereignty, BaFin compliance, and full control over the infrastructure. The higher initial investment amortizes over an operating period of 3–5 years.
Appendix
A. IP Address Plan Summary
| VLAN | Network | Usable IPs | Assigned Hosts |
| --- | --- | --- | --- |
| 10 | 10.10.10.0/24 | 10.10.10.2–.254 | ~10 hosts |
| 20 | 10.10.20.0/23 | 10.10.20.2–10.10.21.254 | ~15 hosts + VIPs |
| 30 | 10.10.30.0/24 | 10.10.30.2–.254 | ~20 hosts |
| 40 | 10.10.40.0/24 | 10.10.40.2–.254 | ~5 hosts |
| 50 | 10.10.50.0/24 | 10.10.50.2–.254 | ~5 hosts |
| 60 | 10.10.60.0/24 | 10.10.60.2–.254 | ~5 hosts |
B. Port Matrix (Summary)
| Service | Port(s) | Protocol | Description |
| --- | --- | --- | --- |
| K8s API | 6443 | TCP/TLS | Kubernetes API server |
| etcd | 2379, 2380 | TCP/TLS | etcd client + peer |
| Kubelet | 10250 | TCP/TLS | Kubelet API |
| Calico BGP | 179 | TCP | BGP peering |
| Calico VXLAN | 4789 | UDP | VXLAN overlay |
| PostgreSQL | 5432 | TCP/TLS | Database |
| MongoDB | 27017 | TCP/TLS | Database |
| Redis | 6379 | TCP/TLS | Cache / message broker |
| Elasticsearch | 9200, 9300 | TCP/TLS | REST API + cluster communication |
| MinIO | 9000, 9001 | TCP/TLS | S3 API + console |
| Vault | 8200 | TCP/TLS | Vault API |
| Prometheus | 9090 | TCP | Metrics query |
| Grafana | 3000 | TCP | Dashboard UI |
| Loki | 3100 | TCP | Log push/query |
| Jaeger | 16686, 14268 | TCP | UI + collector |
| NGINX Ingress | 80, 443 | TCP | HTTP/HTTPS |
| LDAP/LDAPS | 389, 636 | TCP | Active Directory |
| Kerberos | 88, 464 | TCP/UDP | AD authentication |
| DNS | 53 | TCP/UDP | Name resolution |
| NFS | 2049 | TCP | NFS filer |