Boring infrastructure ships on Fridays.
If a system is exciting, something's wrong. Predictable beats clever. Recoverable beats fast. Documented beats novel. The best platform is the one nobody talks about because it just works.
I build, harden, and operate production cloud platforms the way they're supposed to work: quiet, observable, recoverable, and cheap enough to defend in a FinOps review.
// Eleven years of being the engineer responsible when the pager goes off, the certificate expires, the cluster autoscaler misbehaves, or the CFO asks why the AWS bill spiked.
Senior infrastructure engineer with a craftsman's bias toward simple, observable, recoverable systems.
I'm Loay Ali, a Senior DevOps / Platform Engineer based in Munich. I've spent the last eleven years operating production Linux at every scale that matters: from a regulated ccTLD registry to a global Anycast DNS network across 60+ POPs, from enterprise email on AWS EKS to a VPN platform I built from zero to 500K+ users as the sole engineer.
My background is unusual in one specific way. I sit comfortably between the classic stack (BIND, BGP, iptables, FreeBSD, VMware vSphere, hardened bare metal) and the modern stack (Kubernetes on EKS/OKE, Terraform, Helm, GitOps, ELK, Prometheus). That bridge matters when you're migrating an old platform to a new cloud without dropping a query.
I work like a craftsman, not from a checklist. I write infrastructure-as-code by default, run quarterly FinOps reviews that have delivered 20–25% cost reductions, document everything so the next person can sleep, and treat security as a feature, not an afterthought. I've held the pager 24/7 as the only engineer for an ISP serving roughly 35% of a country's private internet customers. That experience teaches you to design things that don't break.
Currently open to senior DevOps, Platform, SRE, and Cloud Architect roles — remote, hybrid, or on-site. Munich-based with full EU work authorization.
// Each one of these cost me at least one outage to learn. They're cheaper to inherit than to discover.
Least-privilege IAM, centralized secrets (Vault, KMS, SSM), TLS/PKI automation, hardened images, and audit-readiness all live in the original Terraform module — not in a panicked retro after the first compliance review.
Prometheus, Grafana, ELK, kube-prometheus-stack, structured logging, SLO-driven alerting. Dashboards before deployment. Alerts that mean something. Runbooks linked from every alert.
Right-sized nodes. Reserved instances where stable. Spot where tolerable. Quarterly reviews with measurable savings (20–25% delivered). Cost is an architectural concern, not someone else's problem.
Structured incident response. Post-mortems that produce action items, not blame. Change windows planned with stakeholders. BCP/DR drills that are actually rehearsed. The boring parts of ops, treated with respect.
Architecture docs, deployment runbooks, scaling guides, DR procedures, on-call rotations. The job ends when another engineer can operate the platform independently. Not before.
// Build · Operate · Hand off. Each entry below was owned end-to-end, from architecture through production go-live and team enablement.
Designed, built, and launched a complete cloud platform from scratch for a SaaS startup running NestJS microservices and a Next.js frontend. Owned every decision from VCN topology through go-live, then handed it off to the in-house engineering team to operate independently.
Email-on-Kubernetes is rare. Doing it under enterprise hardening discipline is rarer. Built a production EKS cluster from scratch in Terraform for the Axigen enterprise email platform, with three dedicated node groups, Network Load Balancers, and full email authentication and observability baked in.
Operated a global Anycast DNS network across 60+ locations on five continents using BGP routing and AXFR distribution. Then executed a full operating-system migration (Slackware 11 to modern Linux) across every node in production with zero service disruption. The end users never knew.
Operated and upgraded the full registry stack for the .TM country-code TLD: EPP, WHOIS, RDAP. A hardened, regulated enterprise environment managing roughly 2 million domains under strict compliance, audit, and uptime requirements, where change windows are scheduled and post-mortems are formal.
Architected a multi-tenant SaaS platform on AWS using Terraform, EKS, Helm, Nginx Ingress, ALB + ACM wildcard TLS, and Route53. Achieved zero-downtime blue-green deployments across multiple environments. Owned tenant isolation, scaling policy, and the operational discipline that kept it boring.
Built and operated the entire production stack for a startup VPN platform from scratch as the only infrastructure engineer. Scaled it to 500,000+ active global users with 99.9% uptime and zero-downtime architecture. Wrote my own DDoS mitigation when the off-the-shelf options didn't fit.
Led a zero-downtime live data-center migration for an ISP serving roughly 35% of private internet customers in Syria, using replication over a four-day cutover. A mission-critical, 24/7 environment where the consequences of a misstep were national, not just operational.
// Five roles. Three continents. One discipline. The pattern: take ownership, ship infrastructure that survives, document it so the next person can operate it.
Sole infrastructure engineer engaged to design, build, and launch a complete cloud platform from scratch for a SaaS startup (NestJS microservices, Next.js frontend). Owned the full lifecycle through production go-live and team handoff.
Owned the full infrastructure lifecycle for multiple enterprise platforms on AWS and on-premises: a regulated .TM ccTLD registry, a global Anycast DNS network across 60+ POPs, enterprise email (Axigen), and a multi-tenant SaaS serving 100K+ daily users. ITIL-aligned operations, FinOps savings of 20–25%, structured incident response. Designed and deployed an AI-powered n8n agent for operational automation.
Delivered enterprise system administration and end-user support across dozens of enterprise clients. Resolved 200+ incidents per month with documented RCA and 85%+ customer satisfaction. Owned ticket triage, SLA tracking, escalation, and Ansible/Chef automation. Mentored junior engineers as the Linux escalation point.
Built the entire production VPN platform from scratch with full autonomy. Scaled from zero to 500,000+ global users with 99.9% uptime. Engineered a 12-node HA MySQL cluster, wrote custom DDoS-mitigation tooling (reduced attack impact by 90%), designed leak-free DNS infrastructure, and handled 24/7 on-call as the only infrastructure engineer.
Managed enterprise infrastructure for a major ISP serving roughly 35% of private internet customers in Syria. VMware vSphere/vCenter virtualization, Veeam-based backups with sub-1-hour RTO. Led a zero-downtime data-center migration over a 4-day cutover. Zero security breaches over the tenure.
// Everything listed below has been in production with my hands on it. Items in amber are in heavy daily use.
// Occasional writing on running infrastructure that doesn't fall over. Currently published on LinkedIn while the long-form home is under construction.
// Operational depth is built on top of a fundamentals education in communications and information technology engineering.
// I'm based in Munich, hold full EU work authorization, and I'm currently open to senior DevOps, Platform, SRE, and Cloud Architect roles. Remote, hybrid, on-site — all viable.
I typically reply the same day during CET working hours. Comfortable with on-call and weekend coverage if the role calls for it.