01 — Senior DevOps / Platform Engineer · 11+ yrs production

Loay Ali

I build, harden, and operate production cloud platforms the way they're supposed to work: quiet, observable, recoverable, and cheap enough to defend in a FinOps review.

Based

Munich, Germany · CET

Working with

Linux · AWS · OCI · Kubernetes · Terraform

loayali.de@gmail.com

linkedin.com/in/loayali

Open to work

11+

Years operating
production infrastructure

2M+

Domains under management
at a country-code TLD registry

1M+

Users on a VPN platform
I built solo from zero

99.9%

Sustained uptime SLA
across multi-year tenures

02 — about

A working biography, briefly.

// Eleven years of being the engineer responsible when the pager goes off, the certificate expires, the cluster autoscaler misbehaves, or the CFO asks why the AWS bill spiked.

Senior infrastructure engineer with a craftsman's bias toward simple, observable, recoverable systems.

I'm Loay Ali, a Senior DevOps / Platform Engineer based in Munich. I've spent the last eleven years operating production Linux at every scale that matters: from a regulated ccTLD registry to a global Anycast DNS network across 60+ POPs, from enterprise email on AWS EKS to a VPN platform I built from zero to 1M+ users as the sole engineer.

My background is unusual in one specific way. I sit comfortably between the classic stack (BIND, BGP, iptables, FreeBSD, VMware vSphere, hardened bare metal) and the modern stack (Kubernetes on EKS/OKE, Terraform, Helm, GitOps, ELK, Prometheus). That bridge matters when you're migrating an old platform to a new cloud without dropping a query.

I work like a craftsman, not from a checklist. I write infrastructure-as-code by default, run quarterly FinOps reviews that have delivered 20–25% cost reductions, document everything so the next person can sleep, and treat security as a feature, not an afterthought. I've held the pager 24/7 as the only engineer for an ISP serving 35% of a country's customers. That experience teaches you to design things that don't break.

Currently open to senior DevOps, Platform, SRE, and Cloud Architect roles — remote, hybrid, or on-site. Munich-based with full EU work authorization.

Strongest at

Operating things that must not fail

Regulated environments, sole-engineer ownership, 24/7 on-call without drama.

Building toward

Platform Engineering & MLOps

Self-service platforms, golden paths, GenAI workloads on Kubernetes.

Languages spoken

English · Arabic · German

Fluent · Native · A1 (actively learning)

Operates from

Munich, Germany

CET timezone · full EU work authorization · open to relocation

03 — how I work

Six principles, learned the hard way.

// Each one of these cost me at least one outage to learn. They're cheaper to inherit than to discover.

01 / DESIGN

Boring infrastructure ships on Fridays.

If a system is exciting, something's wrong. Predictable beats clever. Recoverable beats fast. Documented beats novel. The best platform is the one nobody talks about because it just works.

02 / SECURITY

Hardening is day-one, not month-six.

Least-privilege IAM, centralized secrets (Vault, KMS, SSM), TLS/PKI automation, hardened images, and audit-readiness all live in the original Terraform module — not in a panicked retro after the first compliance review.

03 / OBSERVABILITY

If you can't see it, you can't operate it.

Prometheus, Grafana, ELK, kube-prometheus-stack, structured logging, SLO-driven alerting. Dashboards before deployment. Alerts that mean something. Runbooks linked from every alert.
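To make SLO-driven alerting concrete, here is a minimal Python sketch of the error-budget arithmetic behind it. The function names, the 30-day window, and the sample numbers are illustrative assumptions, not values taken from any platform described on this page.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` while still meeting `slo` (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return window * (1.0 - slo)


def burn_rate(bad_minutes: float, elapsed: timedelta, slo: float) -> float:
    """Budget consumption speed: 1.0 means spending exactly on budget."""
    allowed_minutes = error_budget(slo, elapsed).total_seconds() / 60.0
    if allowed_minutes == 0:
        return float("inf")  # a 100% SLO leaves no budget at all
    return bad_minutes / allowed_minutes


if __name__ == "__main__":
    budget = error_budget(0.999, timedelta(days=30))
    print(f"30-day budget at 99.9%: {budget.total_seconds() / 60:.1f} minutes")
    print(f"burn rate: {burn_rate(4.32, timedelta(days=1), 0.999):.1f}x")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget. Alerting on burn rate rather than on raw error counts is one common way to keep alerts meaningful: a page fires only when the budget is being spent faster than the target can absorb.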

04 / COST

A platform defensible in a FinOps review.

Right-sized nodes. Reserved instances where stable. Spot where tolerable. Quarterly reviews with measurable savings (20-25% delivered). Cost is an architectural concern, not someone else's problem.

05 / OPERATIONS

ITIL without the bureaucracy.

Structured incident response. Post-mortems that produce action items, not blame. Change windows planned with stakeholders. BCP/DR drills that are actually rehearsed. The boring parts of ops, treated with respect.

06 / HANDOFF

If only I can run it, I haven't finished.

Architecture docs, deployment runbooks, scaling guides, DR procedures, on-call rotations. The job ends when another engineer can operate the platform independently. Not before.

04 — selected work

A handful of platforms that didn't fail.

// Build · Operate · Hand off. Each entry below was owned end-to-end, from architecture through production go-live and team enablement.

CASE / 01

Recent

Dec 2025 — Jul 2026 · Sole infrastructure engineer
Reconhece · Brazil (Remote)

[01] Reconhece, a complete platform built from zero on OCI.

Designed, built, and launched a complete cloud platform from scratch for a SaaS startup running NestJS microservices and a NextJS frontend. Owned every decision from VCN topology through go-live, then handed it off to the in-house engineering team to operate independently.

100%

Greenfield IaC

4

CI/CD pipelines

Manual deploys

25%+

Cost savings at launch

Cloud foundation — modular Terraform: VCN, public/private subnets, NAT/IGW/Service gateways, NSGs, Bastion, managed PostgreSQL, Redis Cache, OCI Vault (KMS/AES-256).
Production Kubernetes (OKE) with Cluster Autoscaler + HPA, RBAC, resource quotas, topology spread, pod anti-affinity.
CI/CD on Azure Pipelines — four multi-stage flows covering build, registry push, rolling deploy, health checks, and auto-rollback, with a blue-green strategy.
TLS automation with Nginx Ingress + Cert-Manager + Let's Encrypt across all endpoints.
Observability — kube-prometheus-stack + ELK across all microservices; SLA alerting rules.
Handoff — runbooks, scaling guides, DR procedures so the in-house team owns it now.
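The HPA mentioned above follows a documented ratio rule: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). A minimal Python sketch of that rule follows. The function name, parameter names, and replica bounds are illustrative assumptions; the real controller also handles stabilization windows, missing metrics, and unready pods, which are left out here.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA scaling rule, clamped to the configured replica bounds."""
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = metric / target
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: do not scale
    else:
        wanted = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target
    print(desired_replicas(current=4, metric=90, target=60))
```

The tolerance band is what keeps a cluster from flapping when load hovers near the target, which is why it is worth knowing before tuning targets.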

Terraform · OCI / OKE · Kubernetes · Helm · Azure Pipelines · Nginx Ingress · Cert-Manager · Prometheus · Grafana · ELK · PostgreSQL · Redis · RabbitMQ · Vault / KMS

CASE / 02

Enterprise email

2024 — 2025 · Infrastructure lead
Dynamic IT · Dubai

[02] Axigen on AWS EKS, enterprise email at production scale.

Email-on-Kubernetes is rare. Doing it under enterprise hardening discipline is rarer. Built a production EKS cluster for the Axigen enterprise email platform from scratch with Terraform, with three dedicated node groups, Network Load Balancers, and full email authentication and observability baked in.

3

Dedicated node groups

SMTP + WebMail

HPA-scaled frontends

SSM

Secrets management

IRSA

Identity hardening

VPC + node groups — system / frontend / backend with taints, tolerations, and topology-spread.
Storage + identity — EBS CSI persistent storage, IRSA for least-privilege IAM, Cert-Manager wildcard TLS.
Email authentication — DKIM, SPF, DMARC via Route53, validated end-to-end.
Observability — Prometheus / Grafana with mail-flow dashboards, SLO-driven alerting.
Hardening — SSM-backed secrets, audit-ready posture, regulated-environment runbooks.
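As a small illustration of what checking a DMARC policy involves, the sketch below parses a DMARC TXT record into its tags and tests whether the policy is enforcing. The sample record and the example.com address are made up, and the helper names are hypothetical. A real check would first fetch the TXT record published at `_dmarc.<domain>`.

```python
def parse_dmarc(txt: str) -> dict:
    """Split a DMARC TXT record ("v=DMARC1; p=reject; ...") into a tag dict."""
    tags = {}
    for part in txt.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


def is_enforcing(tags: dict) -> bool:
    """True when the policy quarantines or rejects failing mail."""
    return tags.get("p", "none").lower() in {"quarantine", "reject"}


if __name__ == "__main__":
    # Hypothetical record, for illustration only
    record = "v=DMARC1; p=reject; rua=mailto:dmarc@example.com"
    print(is_enforcing(parse_dmarc(record)))
```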

AWS EKS · Terraform · VPC · IRSA · EBS CSI · Cert-Manager · NLB · Route53 · DKIM / SPF / DMARC · SSM · Prometheus · Grafana

CASE / 03

Global DNS

2022 — 2025 · Operator + migration lead
Dynamic IT · Dubai

[03] CDNS, Anycast DNS across 60+ POPs migrated without dropping a query.

Operated a global Anycast DNS network across 60+ locations on five continents using BGP routing and AXFR distribution. Then executed a full operating-system migration (Slackware 11 to modern Linux) across every node in production with zero service disruption. The end users never knew.

60+

Anycast POPs

5

Continents covered

2006 → 2024

OS migration span

0

Lost queries

BGP + Anycast routing for resilient, low-latency DNS resolution at the edge.
AXFR-based zone distribution with strict consistency monitoring.
Migration playbook — withdraw BGP, drain queries, reinstall OS, restore config from version control, re-announce, monitor. Per POP. Repeated 60+ times.
Bridged classic + modern — BIND configs translated from Slackware 11 era to the modern toolchain without breaking the abstraction.
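For illustration only, the per-POP loop in the playbook above can be sketched in Python as follows. This is not the tooling used in the migration: the step names, the `run_step` and `healthy` callbacks, and the POP identifiers are hypothetical placeholders for whatever actually performs each action.

```python
from typing import Callable, Iterable, List

# Ordered steps applied to one POP before moving to the next (names are illustrative).
STEPS = ("withdraw_bgp", "drain_queries", "reinstall_os",
         "restore_config", "announce_bgp")


def migrate_pops(pops: Iterable[str],
                 run_step: Callable[[str, str], None],
                 healthy: Callable[[str], bool]) -> List[str]:
    """Migrate POPs strictly one at a time; stop at the first unhealthy one."""
    done: List[str] = []
    for pop in pops:
        for step in STEPS:
            run_step(pop, step)
        # Monitor after re-announcing; never touch the next POP on failure.
        if not healthy(pop):
            raise RuntimeError(f"{pop} unhealthy after re-announce; halting rollout")
        done.append(pop)
    return done


if __name__ == "__main__":
    migrated = migrate_pops(
        ["pop-a", "pop-b"],
        run_step=lambda pop, step: print(f"{pop}: {step}"),
        healthy=lambda pop: True,
    )
    print(migrated)
```

Working serially is the point of the design: while one POP has its route withdrawn, Anycast sends its queries to the remaining POPs, so a failed step costs capacity at one site rather than availability for users.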

BGP · Anycast DNS · BIND · AXFR · FreeBSD · Slackware → modern Linux · iptables · Nagios

CASE / 04

Regulated

2022 — 2025 · Registry operator
Dynamic IT · Dubai

[04] .TM ccTLD Registry, country-code TLD run under strict compliance.

Operated and upgraded the full registry stack for the .TM country-code TLD: EPP, WHOIS, RDAP. A hardened, regulated enterprise environment managing roughly 2 million domains under strict compliance, audit, and uptime requirements where change windows are scheduled and post-mortems are formal.

2M+

Domains under management

EPP · WHOIS · RDAP

Registry protocols

24/7

Regulated compliance posture

Full stack ownership — EPP server, WHOIS/RDAP, supporting databases, monitoring, backups.
Hardened posture — patch management, least-privilege IAM, DDoS resilience, structured incident response.
Audit-ready — documented runbooks, change-control discipline, compliance-grade logging.
Stakeholder coordination — change windows aligned with registrar partners and oversight bodies.

EPP · WHOIS / RDAP · BIND · PostgreSQL · Linux hardening · Veeam DR · ITIL change mgmt

CASE / 05

Multi-tenant

2022 — 2025 · Architect
Dynamic IT · Dubai

[05] Multi-tenant SaaS on AWS, 100K+ daily users, zero downtime.

Architected a multi-tenant SaaS platform on AWS using Terraform, EKS, Helm, Nginx Ingress, ALB + ACM wildcard TLS, and Route53. Achieved zero-downtime blue-green deployments across multiple environments. Owned tenant isolation, scaling policy, and the operational discipline that kept it boring.

100K+

Daily users

Blue / Green

Deployment model

20-25%

FinOps savings delivered

VPC-native networking with ALB + ACM wildcard TLS and Route53 traffic management.
Containerized services on EKS via Helm, with HPA, Cluster Autoscaler, and PodDisruptionBudgets.
CI/CD through GitHub Actions and GitLab CI, with structured rollback paths.
Quarterly FinOps reviews delivering 20-25% infrastructure cost reduction through right-sizing and reserved instances.

AWS EKS · Terraform · Helm · ALB + ACM · Route53 · Nginx Ingress · GitHub Actions · GitLab CI · RDS · FinOps

CASE / 06

Sole engineer

2016 — 2020 · Sole infrastructure owner
I2VPN · Berlin (Remote)

[06] I2VPN, from zero to 1M+ global users, alone.

Built and operated the entire production stack for a startup VPN platform from scratch as the only infrastructure engineer. Scaled it to 1,000,000+ active global users with 99.9% uptime and zero-downtime architecture. Wrote my own DDoS mitigation when the off-the-shelf options didn't fit.

1M+

Active users

12

Node HA MySQL cluster

90%

DDoS attack reduction

24/7

On-call, alone

Multi-protocol VPN — OpenVPN, WireGuard, SOCKS5, V2Ray, Squid.
12-node HA MySQL cluster — 2 controllers + 4 storage + 6 SQL nodes for load-balanced traffic.
Custom DDoS mitigation — iptables scripts I authored personally, cutting attack impact by 90%.
Privacy-first DNS infrastructure eliminating leaks for 1M+ users.
Published apps live on App Store + Google Play (i2vpn-secure-vpn-proxy).

OpenVPN · WireGuard · SOCKS5 / V2Ray / Squid · MySQL HA · Docker · Ansible · iptables · Nagios · BIND

CASE / 07

AI / Automation

2024 — 2025 · Designer + operator
Dynamic IT · Dubai

[07] AI Agents in production, three N8N workflows that actually shipped.

Designed and deployed three production AI agents at Dynamic IT, each solving a distinct operational problem. Built on N8N as the orchestration layer, with LLM calls integrated through internal APIs and external services. All three remain in active production use.

3

Production agents

Real-time

Recommendation latency

24/7

Support coverage shifted

N8N + LLM

Orchestration stack

Scheduling and booking agent — automated calendar management, slot proposals, and smart assistance integrated with internal APIs. Replaced manual coordination workflows that previously consumed hours of staff time daily.
E-commerce recommendation engine — captures user browsing behavior on the site (pages visited, products viewed, dwell time), stores it as a session profile, and runs an AI agent that recommends products in real time based on the user's evolving signals. Built end-to-end on N8N with custom event capture and a recommendation workflow.
Customer support agent (domains.tm) — first-line technical support for domain registrar inquiries: WHOIS questions, DNS configuration, registration status, transfer requests. Reduces ticket volume reaching human agents and provides 24/7 response coverage.
Production discipline — each agent treated as a microservice: monitoring, blast-radius limits, validation layers, separate review for destructive actions, cost ceilings per workflow.
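One way to picture the cost-ceiling and destructive-action guardrails above is a small pre-flight check that every agent action passes through. The Python sketch below is an assumption-laden illustration, not the N8N implementation: the `Guardrail` class, the action names, and the dollar figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """Per-workflow gate: enforce a spend ceiling and route risky actions to review."""
    cost_ceiling_usd: float
    destructive_actions: frozenset = frozenset({"delete", "refund", "transfer"})
    spent_usd: float = 0.0

    def check(self, action: str, est_cost_usd: float) -> str:
        # Refuse anything that would push the workflow past its ceiling.
        if self.spent_usd + est_cost_usd > self.cost_ceiling_usd:
            return "deny"
        # Destructive actions are never auto-approved; nothing is charged yet.
        if action in self.destructive_actions:
            return "review"
        self.spent_usd += est_cost_usd
        return "allow"


if __name__ == "__main__":
    gate = Guardrail(cost_ceiling_usd=1.00)
    for action, cost in [("lookup", 0.40), ("delete", 0.10),
                         ("lookup", 0.40), ("lookup", 0.40)]:
        print(action, gate.check(action, cost))
```

Keeping the gate outside the agent's own reasoning is the design choice that matters: the limits hold even when the model's output is wrong.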

N8N · AI Agents · LLM integration · Workflow automation · REST APIs · Webhooks · PostgreSQL · Real-time recommendation

CASE / 08

High stakes

2013 — 2016 · System administrator
Tatweer Co · Damascus

[08] Tatweer ISP, data-center migration during war.

Led a zero-downtime live data-center migration for an ISP serving roughly 35% of private internet customers in Syria, using replication over a four-day cutover. A mission-critical, 24/7 environment where the consequences of a misstep were national, not just operational.

VMware vSphere/vCenter environments — dedicated servers, VPS fleets, virtualized infrastructure.
Veeam-based daily backups with under 1-hour RTO and documented DR procedures.
Replication-based cutover — 4 days of dual-site running, final flip with zero downtime.
Zero security breaches across the tenure despite the operating environment.

VMware vSphere · vCenter · Veeam · Linux / Windows Server · Proxy infrastructure

05 — career log

A timeline, most recent first.

// Five roles. Three continents. One discipline. The pattern: take ownership, ship infrastructure that survives, document it so the next person can operate it.

2025-12 → 2026-07 · 8 months · freelance

Reconhece / Brazil · Remote

Freelance Systems / Platform Engineer

Sole infrastructure engineer engaged to design, build, and launch a complete cloud platform from scratch for a SaaS startup (NestJS microservices, NextJS frontend). Owned full lifecycle through production go-live and team handoff.

2022-02 → 2025-11 · 3 yrs 10 mo

Dynamic IT Consultant / Dubai · UAE

Senior Systems / DevOps Engineer — Infrastructure Lead

Owned full infrastructure lifecycle for multiple enterprise platforms on AWS and on-premises: regulated .TM ccTLD registry, global Anycast DNS network across 60+ POPs, enterprise email (Axigen), and multi-tenant SaaS serving 100K+ daily users. ITIL-aligned operations, FinOps savings of 20-25%, structured incident response. Designed and deployed three production AI agents on N8N.

2020-07 → 2022-02 · 1 yr 8 mo

Tech Studio Technology / Dubai · UAE

Senior System Administrator & Support Engineer

Delivered enterprise system administration and end-user support across dozens of enterprise clients. Resolved 200+ incidents per month with documented RCA and 85%+ customer satisfaction. Owned ticket triage, SLA tracking, escalation, and Ansible/Chef automation. Mentored junior engineers as the Linux escalation point.

2016-06 → 2020-12 · 4 yrs 7 mo · remote

I2VPN / Berlin · Remote

System Administrator (sole engineer)

Built the entire production VPN platform from scratch with full autonomy. Scaled from zero to 1,000,000+ global users with 99.9% uptime. Engineered a 12-node HA MySQL cluster, wrote custom DDoS-mitigation tooling (reduced attack impact by 90%), designed leak-free DNS infrastructure, and handled 24/7 on-call as the only infrastructure engineer.

2013-11 → 2016-01 · 2 yrs 3 mo

Tatweer Co / Damascus · Syria

System Administrator

Managed enterprise infrastructure for a major ISP serving roughly 35% of private internet customers in Syria. VMware vSphere/vCenter virtualization, Veeam-based backups with sub-1-hour RTO. Led a zero-downtime data-center migration over a 4-day cutover. Zero security breaches over the tenure.

06 — stack

Tools, not religions.

// Everything listed below has been in production with my hands on it. Items in amber are in heavy daily use.

Cloud Platforms [03]

AWS (8+ yrs) · OCI (production) · Azure Pipelines
EC2 · VPC · EKS · IAM / IRSA · Lambda · S3 · ECR · ALB / NLB · Route53 · ACM · RDS · SSM · EBS CSI · CloudWatch · CodePipeline · OKE · VCN · Vault / KMS

Containers & Orchestration [10]

Kubernetes · Docker · Helm
HPA · Cluster Autoscaler · RBAC · NetworkPolicy · Cert-Manager · External-Secrets · Nginx Ingress · Rancher

IaC, CI/CD & Automation [10]

Terraform · Ansible
Chef · Bash · Python · GitLab CI · GitHub Actions · Jenkins · Azure Pipelines · AWS CodePipeline

Networking & Security [14]

BGP · Anycast DNS
BIND · Route53 · Cloudflare CDN · Nginx · iptables · OpenVPN · WireGuard · TLS / PKI · SPF / DKIM / DMARC · HashiCorp Vault · OCI KMS · AWS SSM

Linux & Systems [08]

RHEL / CentOS / Rocky · Ubuntu / Debian
FreeBSD · Slackware · systemd · SELinux · Kernel tuning · VMware vSphere/vCenter

Data & Messaging [07]

PostgreSQL · Redis / ElastiCache
MySQL HA · MongoDB · TimescaleDB · RabbitMQ · Veeam

Observability [08]

Prometheus · Grafana
kube-prometheus-stack · ELK / Elasticsearch · Nagios · PRTG · CloudWatch · Rsyslog

Operations & Discipline —

ITIL · Incident response · Post-mortems · RCA · Runbooks · BCP / DR · Capacity planning · FinOps · SLO / SLA design

07 — writing

Field notes from production.

// Occasional writing on running infrastructure that doesn't fall over. Currently published on LinkedIn while the long-form home is under construction.

2026-05

7 Kubernetes settings every production cluster should configure.

Kubernetes · production checklist · ~5 min read


2026-05

Production AI Agents, what nobody tells you after the demo.

AI / MLOps · field report · ~6 min read


2026-05

Migrating an Anycast DNS network from Slackware 11 to modern Linux.

Networking · case study · ~7 min read


2026-05

EKS vs OKE, an honest comparison after running both in production.

Kubernetes / Cloud · opinion · ~6 min read


08 — beyond the stack

Education, credentials, language.

// Operational depth is built on top of a foundational education in communications and information technology engineering.

Education

Master's in Web Science

Syrian Virtual University (SVU)

2015

B.Sc. Communication & Information Technology Engineering

Tishreen University

2008 — 2013

Certifications

Data Loss Prevention (DLP)

INFOWATCH · Cert ID: EISSTM6 012938 2019

2019

Languages

English

Fluent · B2+

Arabic

Native

German

A1 · actively learning

09 — contact

Open to the next platform.

// I'm based in Munich, hold full EU work authorization, and I'm currently open to senior DevOps, Platform, SRE, and Cloud Architect roles. Remote, hybrid, on-site — all viable.

The most reliable way to reach me is email.

I typically reply the same day during CET working hours. Comfortable with on-call and weekend coverage if the role calls for it.