Hendrik von Kiedrowski
Head of Operations & Principal Site Reliability Engineer
Berlin, DE
About
Highly accomplished Head of Operations and Principal Site Reliability Engineer with extensive experience in designing, implementing, and optimizing large-scale cloud-native infrastructures. Expert in Kubernetes, advanced observability stacks, and CI/CD automation, driving significant improvements in system reliability, performance, and cost efficiency for complex multi-cloud environments, including ultra-low-latency video CDNs. Proven leader in mentoring engineering teams and strategically managing infrastructure budgets to support rapid user growth and achieve critical business objectives.
Work
nanocosmos GmbH
|Head of Operations
Berlin, Berlin, Germany
→
Summary
Led and mentored a growing team of DevOps Engineers, driving strategic initiatives to enhance CDN performance and cost efficiency while integrating new technologies.
Highlights
Mentored and guided a growing team of DevOps Engineers, fostering skill development and enhancing operational capabilities to support critical infrastructure.
Created and evaluated new Key Performance Indicators (KPIs) to effectively measure the success of new CDN features and identify opportunities for cost efficiency improvements.
Researched and successfully launched new technologies, including NATS, to optimize low-latency CDN performance and enhance real-time data processing capabilities.
Strategically planned and optimized monthly and yearly infrastructure budgets, leveraging KPIs like user growth to ensure cost-effective resource allocation and scalability.
netgo GmbH
|Senior DevOps Engineer (freelance)
Berlin, Berlin, Germany
→
Summary
Coordinated development and engineering teams, leading the migration to a multi-cluster Kubernetes approach and implementing a globally distributed monitoring solution.
Highlights
Provided leadership and coordination for development and engineering teams, streamlining workflows and ensuring successful project delivery for critical infrastructure initiatives.
Spearheaded the migration from Ansible to Terraform and Rancher RKE2, transitioning to a robust multi-cluster Kubernetes approach that enhanced scalability and automation.
Migrated from Thanos to a globally distributed monitoring solution utilizing the LGTM stack, significantly improving observability and incident response capabilities.
AppConceptionOne GmbH
|Senior DevOps Engineer
Berlin, Berlin, Germany
→
Summary
Coordinated development and engineering efforts, ensuring robust data and access security while optimizing infrastructure budgets based on user growth metrics.
Highlights
Served in a leading position, coordinating cross-functional development and engineering teams to ensure cohesive project execution and operational excellence.
Ensured the rigorous implementation of data and access security protocols in close collaboration with the CISO, significantly strengthening the overall security posture.
Planned and optimized monthly and yearly infrastructure budgets, leveraging key performance indicators (KPIs) such as user growth to achieve cost efficiency and scalability.
nanocosmos GmbH
|Lead SRE, Observability Architecture (Project: Observability of a Large-Scale Video CDN)
Berlin, Berlin, Germany
→
Summary
Architected and implemented a comprehensive observability solution for a large-scale video CDN, collecting and storing metrics and logs from approximately 2000 servers in a multi-cloud environment.
Highlights
Designed and implemented a comprehensive observability architecture for a large-scale video CDN, collecting and storing metrics and logs from approximately 2000 servers.
Standardized processes across a multi-cloud/multi-provider environment, ensuring consistent data collection and providing a powerful backend solution for metrics and logs.
Integrated Grafana Alloy for metric and trace collection, Mimir for long-term metric storage, Loki for unindexed logs, and ELK for business-critical indexed logs, enhancing system visibility.
Developed and implemented infrastructure for observability based on retention, importance, and compliance of data using Loki, ELK, Kubernetes, Prometheus, Thanos, and Rancher.
nanocosmos GmbH
|Site Reliability Engineer
Berlin, Berlin, Germany
→
Summary
Automated development and infrastructure workflows with Terraform and GitLab CI, implemented KPIs for infrastructure usage, and designed a multi-cloud Kubernetes CDN.
Highlights
Automated all development and infrastructure workflows using Terraform and GitLab CI, significantly improving deployment speed and operational efficiency.
Implemented several KPIs with Prometheus to accurately measure deployed infrastructure usage, enabling intelligent autoscaling and resource optimization.
Designed and implemented a single-pane-of-glass solution for a large media CDN, providing centralized visibility and control over complex systems.
Collaborated closely with developers to automate workflows with GitLab CI and Kubernetes, streamlining CI/CD pipelines and accelerating feature delivery.
Planned and implemented a multi-cluster, multi-cloud Kubernetes CDN for media streaming, ensuring high availability and resilience across cloud providers such as AWS, Hetzner, and Vultr.
Built several Prometheus data collectors in Go, enhancing observability and alerting capabilities across the infrastructure.
nanocosmos GmbH
|Principal DevOps Engineer, Kubernetes Migration (Project: Moving an Ultra-Low-Latency Video Streaming Platform to Kubernetes)
Berlin, Berlin, Germany
→
Summary
Led the architectural design and migration of an ultra-low-latency live streaming platform to Kubernetes, maintaining or improving sub-second stream latency.
Highlights
Designed and implemented an architecture for migrating an ultra-low-latency live streaming platform to Kubernetes, successfully maintaining or improving sub-second stream latency.
Leveraged low-level networking technologies like BGP to optimize performance within the Kubernetes environment, addressing challenges posed by diverse legacy technologies.
Collaborated with the team to ensure the new Kubernetes-based services met stringent performance requirements for live video streaming, enhancing platform reliability and scalability.
nanocosmos GmbH
|Cloud Migration Lead (Project: Migrating a Dating Application from Azure to AWS)
Berlin, Berlin, Germany
→
Summary
Orchestrated the complete backend migration of a dating application from Azure to AWS, resolving performance and service reliability issues.
Highlights
Managed the comprehensive migration of a dating application's backend from Azure to AWS, successfully resolving critical performance and service reliability issues.
Migrated essential components, including media storage and a PostgreSQL database, ensuring data integrity and high availability throughout the transition.
Utilized Terraform for infrastructure as code and deployed services on Kubernetes, standardizing the deployment process and enhancing operational efficiency.
nanocosmos GmbH
|Full-Stack Developer
Berlin, Berlin, Germany
→
Summary
Implemented several back-office automations, including full digital payment solutions with Stripe, to enhance operational efficiency.
Highlights
Implemented several critical back-office automations, including full digital payments processing with Stripe, streamlining financial operations and reducing manual effort.
Technische Universität Section Intelligent Networks
|Student Assistant
Berlin, Berlin, Germany
→
Summary
Contributed to Apache Flink data-stream processing and created a test environment for DASH video streaming research.
Highlights
Contributed to Apache Flink data-stream processing, focusing on the visualization of queues and internal processes to improve understanding and debugging.
Developed a test environment for DASH video streaming, supporting research into buffering and quality-switching algorithms to optimize video delivery.
Education
Technische Universität Berlin
→
M.Sc. studies (no degree)
Informatics
Hochschule für Technik und Wirtschaft
→
B.Sc.
Environmental Informatics
Courses
Bachelor Thesis: Efficiency analysis using the example of the IT of a logistics company
Languages
German
English
Skills
Prometheus
Kubernetes
ELK-Stack
LGTM-Stack
Ansible
Terraform
Apache
nginx
OS
Linux (Debian, Ubuntu, Arch), macOS.
Groupware
Microsoft Exchange, Open-Xchange.
SQL
Microsoft SQL Server, MySQL, MongoDB, RDS PostgreSQL.
Go
NoSQL
Rust
React
d3.js
Node.js
Docker
GitLab CI & Auto DevOps
Interests
Hobbies
Playing the piano, bouldering, cooking, CCC events, home networking.