Executive Summary
SRE and Platform Engineer with a software development background — infrastructure at scale, cloud cost optimization, and production AI systems.
🎯 Technical Leadership Profile
💰 FinOps
$331K/year documented cloud cost savings through compute and storage optimization
🤖 AI/ML Ops
Production ML systems: face recognition, content moderation, LLM serving. vLLM Gemma 3 12B on RunPod + GCP
🚀 GitOps at Scale
ArgoCD 9-phase rollout across 4 GKE clusters with Argo Rollouts for canary delivery
🏗️ Infrastructure as Code
Terraform + Shared VPC with reusable modules. 60+ Cloud Functions migrated Gen1 → Gen2
🔐 Security
mTLS with Cloudflare, AWS-GCP HA VPN, 61 databases hardened, xmrig supply-chain incident response
📊 Scale
160+ applications, 863 infrastructure assets, 99.9% uptime SLO
👥 Leadership & Mentorship
- Mentored junior and mid-level engineers on incident response, Terraform best practices, and production-readiness reviews
- Led cross-team initiatives: ArgoCD multi-cluster rollout, Shared VPC migration, and Cloud Functions Gen2 migration across 4 squads
- Established on-call runbooks and post-incident review culture, reducing MTTR and improving team autonomy
- Drove FinOps adoption across engineering teams, training product squads on cost-aware architecture decisions
📈 Career Trajectory
🎯 Task Distribution Analysis
| Type | Count | % | Insight |
|---|---|---|---|
| Tasks | 1,241 | 79.4% | Planned work, strategic execution |
| Sub-tasks | 236 | 15.1% | Project decomposition, planning |
| Bugs | 61 | 3.9% | Reactive work (very low) |
| Epics | 17 | 1.1% | Large project leadership |
Case Studies & Projects
15 real-world problems solved with measurable impact, documented with before/after metrics and technical details.
💾 GCP Storage Migration — 95.9% Cost Reduction
January 2025 · Cloud Cost Optimization · Lifecycle Management
❓ Problem
Cloud storage costs growing exponentially. Spending $12,874/month on 360 TiB, all in expensive Standard storage. No lifecycle management, no archival strategy, no deletion policies.
💡 Solution
- Analyzed data access patterns to identify hot vs cold data
- Configured automatic archival for files older than 5 days
- Set auto-deletion policy for files older than 365 days
- Kept only last 5 days in Standard storage (hot data)
- Applied policies to both multi-region and single-region buckets
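In client terms, the policy amounts to two rules. A minimal sketch with the google-cloud-storage Python client, assuming a hypothetical bucket name and Archive as the cold tier:

```python
# Sketch: archive after 5 days, delete after 365. Bucket name is
# hypothetical; Archive is assumed as the cold storage class.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-data-bucket")

# Move anything in Standard older than 5 days to Archive (hot data stays hot).
bucket.add_lifecycle_set_storage_class_rule(
    "ARCHIVE", age=5, matches_storage_class=["STANDARD"]
)
# Delete everything older than 365 days.
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # persist the updated lifecycle configuration
```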
Business Impact
- 94.2% storage reduction (360 TiB → 21 TiB)
- $444K saved over 3 years with zero data loss
- Fully automated lifecycle: archival and deletion run on their own once the policies are in place
- Maintained performance: hot data still in Standard class
✈️ Airflow Infrastructure Optimization — 93% Reduction
January 2025 · Resource Optimization · Kubernetes
❓ Problem
Airflow deployment massively over-provisioned: 100 pods (50 dev + 50 prod), 20 CPUs, 200 GB RAM across 7 nodes costing $1,513/month. Most pods idle 90% of the time.
💡 Solution
- Analyzed actual workload patterns and resource utilization
- Added autoscaling (4–10 replicas per environment) based on actual queue depth
- Rightsized CPU and memory based on real usage data
- Consolidated from 7 nodes to 1 node with better bin-packing
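The scaling rule in miniature: desired replicas derived from queue depth, clamped to the 4–10 band. A sketch; the tasks-per-worker constant is an assumed value, not the production figure:

```python
# Sketch: queue-depth-driven replica count, clamped to the 4-10 band.
# TASKS_PER_WORKER is an assumed worker concurrency, not the real tuning.
MIN_REPLICAS, MAX_REPLICAS = 4, 10
TASKS_PER_WORKER = 8

def desired_replicas(queued_tasks: int) -> int:
    wanted = -(-queued_tasks // TASKS_PER_WORKER)  # ceiling division
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

# desired_replicas(0) == 4; desired_replicas(60) == 8; desired_replicas(200) == 10
```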
Business Impact
- 86% reduction in pods, CPU, and memory allocation
- Zero performance degradation; all DAGs running normally
- Freed up 6 nodes for other workloads
🔐 mTLS with Cloudflare — Zero-Trust Architecture
2024–2025 · Advanced Security · Network Architecture
❓ Problem
Standard TLS authenticates only the server to the client. Required mutual authentication, where both client and server verify identity using certificates, critical for API security and ISO 27001 compliance.
💡 Solution
- Set up Cloudflare for Mutual TLS; both sides now present certificates, not just the server
- Generated and distributed client certificates to each authorized service
- Added certificate revocation and rotation so compromised certs can be killed immediately
- Every request is verified by certificate, with no implicit trust based on network position
- Automated the certificate lifecycle so rotation doesn't require manual steps
🔐 mTLS Flow
Client (with cert) → Cloudflare (validates client cert) → Origin Server (validates CF cert)
Mutual verification at every layer · End-to-end encrypted + authenticated
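From a client's point of view the requirement is simple: no valid certificate, no connection. A minimal sketch with Python requests, using hypothetical endpoint and certificate paths:

```python
# Sketch: a client presenting its certificate and validating the server's.
# Endpoint and certificate paths are hypothetical.
import requests

resp = requests.get(
    "https://api.example.com/v1/status",
    cert=("/etc/certs/client.crt", "/etc/certs/client.key"),  # client identity
    verify="/etc/certs/origin-ca.pem",  # validate the server's chain
    timeout=10,
)
resp.raise_for_status()  # without a valid client cert, the edge rejects the call
```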
Business Impact
- Every request authenticated by certificate; nothing gets through on network position alone
- ISO 27001 compliance requirement satisfied
- Protection against MITM attacks and API abuse
- Instant revocation if certificate is compromised
🌉 AWS-GCP HA VPN — Multi-Cloud Connectivity
2024 · Multi-Cloud Networking · High Availability
❓ Problem
Needed secure, reliable communication between AWS and GCP for hybrid workloads. Public internet routing unacceptable for sensitive data. Required redundancy for high availability.
💡 Solution
- Built HA VPN between AWS and GCP with redundant tunnels on both sides
- Configured BGP so failover is automatic, with no manual intervention when a tunnel drops
- Assigned private IPs and set up internal routing tables so traffic never touches the public internet
- All inter-cloud traffic encrypted end-to-end over IPSec
- Added monitoring and alerts for tunnel health so issues are caught before they affect workloads
Business Impact
- 99.99% availability through redundant tunnels
- Automatic failover via BGP (<30 seconds)
- Zero public internet exposure for sensitive data
- Hybrid workloads span both clouds over private, encrypted links
🚀 Custom Canary Deployment System
2024–2025 · Progressive Delivery · Automation
❓ Problem
Traditional blue-green deployments require a 100% traffic switch, which is risky. Off-the-shelf tools didn't fit our multi-environment Kubernetes setup. Needed gradual rollout with automatic metric-based rollback.
💡 Solution
- Wrote custom canary deployment automation from scratch; off-the-shelf tools didn't fit the multi-environment setup
- Progressive traffic split: 10% → 25% → 50% → 100%, with a gate at each stage
- Health checks run at each stage before traffic advances
- Automatic rollback kicks in when error rate or latency crosses defined thresholds
- Wired into Jenkins pipelines, works the same way for every team
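The heart of the automation is a gate loop. Condensed here into a sketch where the traffic, metric, and rollback helpers are hypothetical stand-ins for the Jenkins-driven implementation, and the thresholds are illustrative:

```python
# Sketch of the canary gate loop. set_traffic_split, error_rate, p95_latency,
# and rollback are hypothetical stand-ins; thresholds are illustrative.
import time

STAGES = [10, 25, 50, 100]      # canary traffic percentages
MAX_ERROR_RATE = 0.01           # 1% error budget per stage
MAX_P95_MS = 500                # latency threshold per stage

def rollout(set_traffic_split, error_rate, p95_latency, rollback) -> bool:
    for pct in STAGES:
        set_traffic_split(canary_percent=pct)
        time.sleep(300)         # soak before evaluating the gate
        if error_rate() > MAX_ERROR_RATE or p95_latency() > MAX_P95_MS:
            rollback()          # automatic rollback, no human in the loop
            return False
    return True                 # canary promoted to 100%
```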
Business Impact
- No failed deployments since rollout: issues surface at 10% traffic and trigger automatic rollback before reaching everyone
- Deployment frequency increased: teams ship more often because the cost of a bad release dropped dramatically
- No third-party tooling dependency: built from scratch to fit the existing setup
🔒 Database Security Hardening — 61 Databases
2023–2025 · Security Architecture · Compliance
❓ Problem
Databases exposed with public IPs, unencrypted connections, password-only authentication. Non-compliant with ISO 27001. Vulnerable to network sniffing and unauthorized access.
Business Impact
- 61 databases secured (18 MongoDB Atlas + 44 PostgreSQL)
- Zero security incidents post-implementation
- ISO 27001 certification achieved
🌐 Kong Gateway — API Management at Scale
2023–2024 · API Management · Architecture
❓ Problem
160+ applications with direct nginx ingress and no centralized API management, rate limiting, or authentication layer. Difficult to enforce policies or monitor API usage consistently.
Business Impact
- 160+ applications now go through a single gateway, giving one place to enforce policy
- Rate limiting configured centrally; no per-service implementation needed
- Auth, CORS, and headers applied globally; developers don't handle this per app
- Centralized logging and tracing cuts troubleshooting time significantly
🔄 ArgoCD GitOps — Multi-Cluster Continuous Delivery
March 2026 · GitOps · Kubernetes Multi-Cluster
❓ Problem
Deployments across four GKE clusters (dev · stg · prd · sdx) were manual, inconsistent, and error-prone. No single source of truth for cluster state, no audit trail, and high risk of configuration drift between environments. Teams needed direct kubectl access to deploy; no self-service model existed.
💡 Solution — 9-Phase Implementation
- Phase 0: Installed ArgoCD on GKE PRD cluster via Helm with production-grade values
- Phase 1: Defined GitOps repository structure: apps/environments/charts hierarchy with Helm chart templates and per-env value overrides
- Phase 2: Configured ArgoCD Applications and AppProjects with RBAC boundaries per team and namespace
- Phase 3: Multi-cluster registration; all four GKE clusters (dev · stg · prd · sdx) as ArgoCD targets with cluster credentials stored in Secrets
- Phase 4: CI/CD pipeline integration via GitHub Actions image updater for automatic image tag propagation, branch promotion dev → stg → prd
- Phase 5: SSO/OIDC with Google Workspace + RBAC policies per team
- Phase 6: Prometheus metrics + Grafana dashboards for GitOps observability
- Phase 7: Incremental migration of existing workloads; audited all namespaces, converted to Helm charts, enabled self-heal and prune on non-prod
- Phase 8: Progressive delivery with Argo Rollouts, canary deployments with traffic splitting and automated rollback
Business Impact
- 4 GKE clusters managed as a single GitOps fleet (dev · stg · prd · sdx)
- Zero configuration drift: Git is the single source of truth; ArgoCD self-heals deviations automatically
- Full audit trail: every deployment change tracked as a signed Git commit
- Instant rollback: revert a Git commit and the cluster reconciles within seconds
- Developer self-service: engineers deploy via PR, zero kubectl access required
- Automated promotion: GitHub Actions image updater pushes new tags across environments without manual YAML edits
🏗️ GKE Infrastructure — Production-Grade Terraform + Shared VPC
February 2026 · Infrastructure as Code · GKE · Terraform
❓ Problem
GKE clusters and the underlying network infrastructure were provisioned manually with no IaC. No consistent multi-environment strategy, no prevent-destroy protection on critical network resources, and Terraform configuration was monolithic and impossible to reuse across environments.
💡 Solution — 8-Phase Delivery
- Phase 1 — Architecture Design: Defined Shared VPC topology, subnet strategy, and GKE cluster requirements for 4 environments
- Phase 2 — GCP Project Setup: Created projects, enabled APIs, and configured IAM service accounts with least-privilege bindings
- Phase 3 — Network Infrastructure: Deployed Shared VPC host project, subnets, Cloud NAT, and firewall rules via Terraform
- Phase 4 — Modular Terraform: Refactored from monolithic inline resources to reusable modules; converted root config to module calls with environment-specific variable files
- Phase 5 — CI/CD Automation: GitHub Actions workflows for terraform plan on PR and terraform apply on merge; separate workflows per environment
- Phase 6 — GKE Cluster Deployment: Deployed all four GKE clusters (dev · stg · prd · sdx) using the automated pipelines
- Phase 7 — Safety Guardrails: Added `prevent_destroy = true` lifecycle blocks on VPC, subnets, and Cloud NAT to protect against accidental `terraform destroy`
- Phase 8 — Documentation: Architecture diagrams, operational runbooks, and module documentation for knowledge transfer
Business Impact
- 100% infrastructure as code: every network and cluster resource managed via Terraform
- 4 GKE clusters deployed in a single automated pipeline run
- Zero manual provisioning: new environments created from variable files only
- Production safeguards: `prevent_destroy` prevents accidental loss of critical network infrastructure
- Validated Shared VPC + Cloud NAT: all service projects route egress through a centralized NAT gateway
🌐 vLLM Inference Fallback Router — Multi-Provider with Measured SLAs
March 2026 · AI Infrastructure · Reliability Engineering
❓ Problem
Production LLM serving relied on a single GPU provider (RunPod). A provider outage or GPU supply constraint would bring down the entire inference capability with no fallback. Each provider had different latency profiles and no routing logic existed to exploit this.
💡 Solution
- Built a ~150-line FastAPI reverse proxy that accepts OpenAI-compatible requests (`/v1/chat/completions`, `/v1/models`)
- Routes through a provider chain: RunPod (primary) → Cloud Run GPU (secondary) → GKE MIG (tertiary)
- Each provider has a configurable mock-failure flag for testing without real outages
- Returns HTTP 503 with per-provider error diagnostics when all providers fail
- Containerized (linux/amd64) and deployed to GKE PRD cluster
- All 4 failure scenarios validated end-to-end in production
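Condensed to its essentials, the chain logic looks roughly like this (a sketch, not the production code; provider URLs are hypothetical and the real proxy also serves `/v1/models`):

```python
# Condensed sketch of the fallback chain (not the production code).
# Provider base URLs are hypothetical.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

PROVIDERS = [
    ("runpod", "https://runpod.example/v1"),       # primary
    ("cloud-run", "https://cloudrun.example/v1"),  # secondary
    ("gke-mig", "https://mig.example/v1"),         # tertiary
]

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    errors = {}
    async with httpx.AsyncClient(timeout=60) as client:
        for name, base_url in PROVIDERS:
            try:
                r = await client.post(f"{base_url}/chat/completions", json=body)
                r.raise_for_status()
                return JSONResponse(r.json())  # first healthy provider wins
            except Exception as exc:
                errors[name] = str(exc)        # record and fall through
    # All providers failed: 503 with per-provider diagnostics.
    return JSONResponse(
        {"error": "all providers failed", "providers": errors}, status_code=503
    )
```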
Business Impact
- Eliminated single point of failure for production LLM inference
- 3x redundancy: service survives any single provider failure transparently
- Measurable SLAs per provider: RunPod 568ms · Cloud Run warm 1.9s · MIG 7.7s
- Fully validated: all 4 failure scenarios tested in GKE PRD production environment
🛡️ Cryptomining Incident Response — xmrig on GKE PRD
2026 · Security Incident · GKE Production
❓ Problem
xmrig (Monero cryptominer) detected running on a GKE production node. The miner was injected via a compromised frontend npm dependency, a supply-chain attack that bypassed standard code review. The workload silently consumed node CPU, degrading legitimate services while showing no immediately visible symptoms.
💡 Detection & Response
- Detection: Anomalous CPU spike on GKE production node triggered infrastructure alert
- Triage: Identified xmrig process running inside a frontend container via node-level process inspection
- Root cause: Compromised npm package in frontend dependency tree injected miner code into the built bundle
- Containment: Cordoned the affected node; drained and restarted all workloads on clean nodes within minutes
- Remediation: Updated all vulnerable frontend libraries; pinned transitive dependency versions
- Hardening: Added `npm audit --audit-level=high` as a CI gate; implemented Trivy image scanning on every pre-deploy build
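The gate is effectively the CLI exit code (`npm audit --audit-level=high` fails non-zero). The same check written out as a script against npm's JSON report, sketched here:

```python
# Sketch: fail CI when npm audit reports high or critical vulnerabilities.
# Reads the severity counts from npm audit's JSON metadata.
import json
import subprocess
import sys

result = subprocess.run(["npm", "audit", "--json"], capture_output=True, text=True)
counts = json.loads(result.stdout or "{}").get("metadata", {}).get("vulnerabilities", {})
blocking = counts.get("high", 0) + counts.get("critical", 0)
if blocking:
    print(f"npm audit gate: {blocking} high/critical vulnerabilities, failing build")
    sys.exit(1)
```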
Business Impact
- Zero data exfiltration: miner targeted CPU only, no credentials or data accessed
- Full remediation in under 4 hours from first alert to clean production deployment
- Supply-chain hardening: npm audit + Trivy scanning now gates every production build
- Post-mortem and team training: dependency pinning, supply-chain threat model published internally
📝 Lessons & Permanent Changes
- Lock files enforced: all repos now require a committed `package-lock.json`; CI fails if the lockfile is missing or modified without review
- Runtime isolation: implemented Pod Security Standards (restricted) across all GKE namespaces to prevent privilege escalation
- Network policies: egress rules block outbound connections to mining pools and unknown endpoints by default
- Alerting gap closed: added GKE node-level CPU anomaly detection via Cloud Monitoring with 5-minute threshold alerts
🚨 Production Incident Response — P1 Outage & Partner Auth Failure
March 2026 · Incident Response · Production Reliability
Incident A — Third-Party CMS Loader [P1]: HTTP 500 across two regional markets
- Impact: 100% of page loads returning HTTP 500 from 13:08 UTC; full production outage across two regional markets
- Root cause: Third-party CMS API dependency had an outage; all pages failed server-side rendering
- Response: Identified dependency failure, implemented emergency static fallback, and escalated to vendor support
- Resolution: Service restored; added circuit-breaker pattern and health monitoring for external CMS dependencies
Incident B — Partner API [P2]: Credential Rotation Causing Auth Failure
- Impact: All partner authentication calls returning 401 from ~14:30 UTC after partner-side password rotation
- Root cause: Partner rotated credentials without coordinated notification; old password revoked mid-day
- Response: Diagnosed authentication failure from logs, obtained new credentials, updated GCP Secret Manager, triggered rolling restart
- Hardening: Established credential rotation runbook and pre-agreed notification SLA with partner
SRE Principles Applied
- MTTR minimized: both incidents resolved within the same business day
- Blameless post-mortems: root causes documented, systemic fixes implemented
- Monitoring gaps closed: external dependency health checks added after each incident
🔍 Elastic Stack Migration — EC2 to Kubernetes at Scale
2020–2021 · Observability · Infrastructure Modernization
🔴 Problem
- 2 legacy Elastic Stack clusters: Elasticsearch 5 on EC2 and Elasticsearch 6 on ECS, both via docker-compose
- No role separation: every node handled master election, data storage, and ingest simultaneously
- Manual scaling, no self-healing, single points of failure
- Storage bottleneck: no S3 snapshots, no automated retention policies
💡 Solution
- Consolidated both clusters into 1 unified EKS cluster on Elasticsearch 7.8
- Designed multi-node architecture with dedicated roles: 3 master (8 GB), 3 data (16 GB, 50 TB), 3 ingest (16 GB)
- Custom ECR image with S3 snapshot plugin for automated backup and retention
- Deployed via Helm Charts with rolling updates, PDB (maxUnavailable: 1), and soft anti-affinity
- Built 9 specialized Logstash pipelines for different log sources (GELF, K8s, NiFi, PHP, portal, services, reports)
- Filebeat DaemonSet with Kubernetes metadata enrichment for container log collection
- Fluent Bit as lightweight alternative collector with kernel-level filtering
- Kibana 3 replicas with CSV reporting (1 GB max), APM Server for tracing, Metricbeat for cluster health
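The snapshot automation the custom image enables amounts to a repository registration plus a retention policy. Sketched with the elasticsearch-py client, with illustrative bucket and schedule values (SLM ships from ES 7.4, so 7.8 supports it):

```python
# Sketch of the S3 snapshot setup enabled by the repository-s3 plugin.
# Bucket name and schedule are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

# Register an S3 snapshot repository (plugin baked into the custom ECR image).
es.snapshot.create_repository(
    repository="s3_backups",
    body={"type": "s3", "settings": {"bucket": "es-snapshots-example"}},
)

# Daily snapshot + automated retention via SLM.
es.slm.put_lifecycle(
    policy_id="daily-snapshots",
    body={
        "schedule": "0 30 1 * * ?",  # 01:30 every day
        "name": "<daily-{now/d}>",
        "repository": "s3_backups",
        "config": {"include_global_state": False},
        "retention": {"expire_after": "30d"},
    },
)
```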
Business Impact
- 54.5% cost reduction: old cluster (65 EC2 + 183 disks + ~91 TB) cost $17,038/month; new EKS cluster (7 m5.2xlarge + 7x 7.1 TB SSD) cost $7,750/month, saving ~$111K/year
- Consolidated 2 → 1 cluster: eliminated 65 EC2 instances and 183 EBS volumes, replacing with 7 EKS nodes
- Version upgrade v5/v6 → v7.8: major version leap, unlocking ILM, security features, and index lifecycle management
- EBS GP2 → Standard migration: migrated 124 volumes across 2 AWS regions (São Paulo + Virginia), reducing storage costs
- Multi-region cost analysis: data-driven decision to consolidate in Virginia saved $10.7K/month vs São Paulo region pricing
- S3 cleanup: deleted 141 TB unused S3 bucket, automated snapshot retention via custom ECR image with repository-s3 plugin
- 9 dedicated Logstash pipelines: each pipeline tuned for its specific source, enabling independent scaling and troubleshooting
- Self-healing infrastructure: Kubernetes StatefulSets with PDB, rolling updates, and automated recovery replaced manual EC2 management
📡 Kafka Event Streaming — Real-Time Pipeline to BI
2020–2021 · Data Engineering · Event-Driven Architecture
🔴 Problem
- Business intelligence team had no real-time access to API and service event data
- Log data was siloed in Elasticsearch, not accessible for downstream analytics
- Multiple event sources (GELF, RabbitMQ, HTTP) needed unified routing to Kafka topics
- No standardized pipeline for streaming filtered events to analytics systems
💡 Solution
- Designed dual-output Logstash pipelines: events indexed in Elasticsearch AND forwarded to Kafka topics
- Integrated with AWS MSK (Managed Streaming for Apache Kafka) with gzip compression
- Built conditional routing: API events filtered by service name and forwarded to dedicated Kafka topics for BI consumption
- RabbitMQ input plugin for message queue events alongside GELF and pipeline-to-pipeline routing
- Kafka producer configured with client_id tracking and topic-per-domain strategy
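The routing itself lives in Logstash pipeline config; the same dual-write idea, sketched with kafka-python and hypothetical broker/topic names:

```python
# Sketch of the dual-write + conditional-routing idea (production version is
# Logstash DSL). Broker address and topic names are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["msk-broker.example:9092"],
    compression_type="gzip",                   # matches the MSK setup
    client_id="logstash-bridge",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def route(event: dict) -> None:
    index_in_elasticsearch(event)              # every event goes to ES
    if event.get("type") == "api":             # only relevant events hit Kafka
        producer.send(f"bi.{event['service']}", event)  # topic per domain

def index_in_elasticsearch(event: dict) -> None:
    pass  # stand-in for the Elasticsearch output
```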
Business Impact
- Real-time BI pipeline: business intelligence team gained access to live API events for the first time, enabling real-time analytics
- Dual-write architecture: every event indexed in Elasticsearch for operational search AND streamed to Kafka for downstream analytics, no data duplication at source
- Conditional routing: only relevant events forwarded to Kafka, reducing consumer noise and topic volume
- AWS MSK integration: managed Kafka with gzip compression reduced network costs and simplified operations
- Multi-source unification: GELF, RabbitMQ, HTTP, and pipeline-to-pipeline inputs consolidated into a single processing layer
🧠 AI Tools Integration — Team Productivity
2024–2025 · AI Adoption · Developer Experience
💡 Solution
- Dec 2024: Enabled Claude AI for entire team
- Aug 2024: Onboarded QA and dev teams to GitHub Copilot
- Aug 2025: Evaluated Claude Code for automated PR reviews in Bitbucket pipelines
- Created best practices guides for AI-assisted development
Business Impact
- Adopted AI tools early: Claude AI in Dec 2024, Copilot in Aug 2024, before most teams in the org
- Automated PR reviews via Claude Code; reviewers spend time on logic, not style
- Junior developers ramp up faster: AI-assisted development reduces the gap between seniors and juniors on routine tasks
Professional Experience
13+ years across 9 roles, from IT Support to Platform Engineering leadership.
London, UK (Remote). Platform engineering and FinOps at a child safety tech company (safeguarding children and society online).
Led $331K annual cloud cost optimization ($148K storage + $183K compute) · Manage 160+ applications across 4 Kubernetes environments (versions 1.22 → 1.33) · mTLS with Cloudflare and AWS-GCP HA VPN · Secured 61 databases (SSL/TLS, private IPs, cert auth) · Secret rotation policies and audit logging across all environments · Built 65+ Jenkins CI/CD pipelines with 98%+ first-deploy success · Custom canary deployment automation with metric-based rollback · Led ISO 27001 compliance (certification achieved) · OWASP ZAP + SonarQube automated security scanning pipeline · Kong Gateway for 160+ APIs · End-to-end observability framework using OpsGenie, Statuspage, Cloud Monitoring, and Slack · Production LLM serving with vLLM on RunPod Serverless and GCP (Cloud Run GPU + GKE MIG) with multi-provider failover
Stack: GCP (expert), AWS, Kubernetes, Jenkins, Kong, Prometheus, Grafana, vLLM, RunPod, OpsGenie, Statuspage, MongoDB Atlas, PostgreSQL, Redis, Terraform, GitHub Actions
São Paulo, Brazil. SRE at a logistics technology company focused on reliability, observability, and infrastructure modernization at scale.
Migrated Elasticsearch 5 (EC2) and Elasticsearch 6 (ECS) clusters to a unified EKS cluster on v7.8 with dedicated node roles (3 master, 3 data, 3 ingest) and custom ECR images with S3 snapshot plugin. Reduced Elasticsearch infrastructure cost by 54.5% ($17K/month → $7.75K/month, ~$111K/year saved) by consolidating 65 EC2 instances and 183 EBS volumes into 7 EKS nodes. Migrated 124 EBS volumes from GP2 to Standard across 2 AWS regions. Deleted 141 TB of unused S3 storage. Implemented instance scheduling for non-production environments. Designed and deployed 9 specialized Logstash pipelines handling logs from GELF, Kubernetes, NiFi, RabbitMQ, and application-specific sources. Integrated Kafka (AWS MSK) for real-time event streaming to BI systems. Built log collection layer with Filebeat DaemonSets and Fluent Bit for container log forwarding. Deployed Kibana (3 replicas) with CSV reporting, APM Server for tracing, and Metricbeat for cluster health. Led multi-region cost analysis (Virginia vs São Paulo) driving data-informed infrastructure placement decisions. Planned and provisioned dedicated Black Friday infrastructure. Expanded observability scope and promoted monitoring awareness among developers.
Stack: AWS (Expert), GCP, Kubernetes, Elasticsearch 7.8, Logstash, Kibana, Filebeat, Fluent Bit, APM Server, Metricbeat, Kafka (MSK), RabbitMQ, NiFi, Docker, Helm, NGINX Ingress, PostgreSQL, Python, CloudFormation
São Paulo, Brazil. Critical operations at Brazil's largest integrated healthcare network.
Provisioned and managed multi-cluster Kubernetes infrastructure on Azure using ACS Engine (v0.19/v0.25) and AKS across 5 environments: monitor, dev, hml, prd01, prd02 with node pool scaling via CLI. VNet/Subnet provisioning via Terraform, ACS Engine for ARM template generation, and automated CD pipelines for nginx-ingress deployment, AKS provisioning, and K8s cluster migration. Deployed Elastic Cloud on Kubernetes (ECK) with automated CD pipeline, Ansible playbooks for ELK/Jenkins/Spinnaker provisioning, Beats and monitoring agents on all K8s clusters, and Curator for automated index cleanup on AWS Elasticsearch. Built AWS Data Lake infrastructure using CloudFormation with Lambda functions, EMR clusters, and S3 for data processing pipelines. Developed Python-based Azure monitoring discovery tool to identify unmonitored applications across Azure subscriptions, integrated with Confluence for documentation. Zabbix integration for host management (auth, host provisioning, queue management via Node.js). JMeter performance testing templates for backend Java services. Built custom GitOps framework (gitops.sh) with Terraform for infrastructure-as-code, supporting native and Docker-based installation with automated pipeline templates. Deployed and managed WhatsApp integration instances for multiple hospital units (Delboni, Exame, Lavoisier, and others) across dev and production environments, enabling patient communication at scale. Incident management and crisis response. Designed and improved cloud architecture with focus on performance and reliability. ACR cleanup automation scripts to manage container registry costs. Multi-cloud operations (AWS + Azure).
Stack: Azure (Expert: Web Apps, AKS, ACS Engine, API Gateway, ExpressRoute, ACR), AWS (CloudFormation, EMR, Lambda, S3), Kubernetes, Terraform, Ansible, Elasticsearch, ECK, Jenkins, Spinnaker, Zabbix, Prometheus, JMeter, Bitbucket, Docker, Python, Node.js
Barueri, São Paulo. Built APIs using Axway Cloud Platform (API Gateway + API Management) for Dasa Group clients, covering both backend and frontend REST/SOAP APIs.
Managed API Gateway across dev, hml, and prd environments with environment promotion via Policy Studio and CLI. Authored comprehensive platform documentation covering API registration, organization setup, SMTP configuration, IP configuration, and multi-language portal management. Business logic via Policy Studio. IBM ESB integration for flexible application integration.
Stack: Axway Cloud Platform, API Gateway, API Manager, Policy Studio, IBM ESB, Azure, REST/SOAP
Americana, SP. Web service application development with continuous integration.
REST and SOAP web services · GitLab-based version control · CI with Jenkins
Stack: SOAP, RESTful, Spring MVC, Maven, Git, GitLab, Hibernate, Spring Data, Jenkins
Americana, SP. Web services and e-commerce development in Java.
Stack: Java, Web Services, E-Commerce
Americana, SP. Delphi developer and SQL Server analyst in a Windows-based enterprise software environment. Customer support, bug treatment, and software improvements. Some iOS/Android work for special projects.
Stack: Delphi, SQL Server, iOS, Android
Jundiaí, SP. Customer service and enterprise software support for accounting and financial retail sector. Hands-on with network connections, database administration, and Linux environments.
Stack: Java, C#, Linux (Fedora 14), Network & Database administration
Jundiaí, SP. Supervised delivery staff, prepared monthly closing sheets for delivery operations, and produced daily performance reports for freight service providers.
🎯 Career Progression
💻 2014–2017: Software development (Microdata, PRÓPONTO, Dasa)
☁️ 2018–2021: SRE/Cloud Engineering (Dasa, Intelipost)
🚀 2021–Present: Platform Engineering leadership with FinOps mastery (VerifyMy)
13+ years total · 7+ years SRE/Platform · Tri-cloud expert (GCP + AWS + Azure) · Full-stack background from dev roles
🎓 Education
📜 Certifications
- Google Cloud Platform Fundamentals: Core Infrastructure
- Logging, Monitoring and Observability in Google Cloud
- Desenvolvimento Ágil com Java Avançado (Agile Development with Advanced Java)
- Programando em TypeScript (Programming in TypeScript)
- SQL Fundamentals — Certificate of Completion
🌐 Languages & Publications
- Portuguese — Native
- Spanish — Full Professional
- English — Professional Working Proficiency
Publication
Management Cloud Computing
AI & ML Operations
Production ML infrastructure, modern AI tooling, multi-agent systems, and intelligent automation across cloud environments.
🤖 AI/ML Engineering Excellence
6 GPU profiles validated: RTX 4090 · RTX 5090 · RTX A6000 · RTX 6000 Ada · L40S · A100 SXM 80GB
Cold start engineering: 10 min → 3 min via pre-baked Docker images, model warm-up, and Tier 1 + Tier 2 boot optimizations
AI tooling: Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team; runs automated PR reviews in CI with Claude Code
MCP & multi-agent systems: Builds AI workflows using MCP servers, multi-agent orchestration, and vibe coding for infrastructure automation
MLOps expertise: Content moderation (YOLO), intelligent autoscaling (VPA/HPA/KEDA), security automation
🎯 AI/ML Infrastructure by the Numbers
| Category | Tasks | Technologies |
|---|---|---|
| LLM Production Infrastructure | 50+ | vLLM 0.15.1, Gemma 3 12B, RunPod, Cloud Run GPU, GKE MIG |
| Content Moderation | 20 | YOLO object detection, GPU-optimized services |
| Security Automation | 29 | OWASP ZAP, Trivy, SonarQube, npm audit |
| Intelligent Autoscaling | 15 | VPA, HPA, MPA, KEDA |
| Modern AI Tools | 5 | Claude AI, GitHub Copilot, Claude Code |
🧠 Modern AI Tools
Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team. Runs automated PR reviews in CI with Claude Code. Builds with MCP servers and multi-agent systems for infrastructure automation. Writes CLAUDE.md standards so AI-assisted development is consistent across all repos.
🎬 Content Moderation AI
Ran infra for 20 AI-powered content screening tasks: YOLO object detection on GPU-optimized services for video analysis. Production MLOps for child safety and age-assurance systems.
⚡ Intelligent Infrastructure
Set up VPA/HPA/MPA/KEDA so scaling decisions are based on actual metrics, not manually tuned thresholds. Also coordinated YOLO batch ReID processing alongside a live vLLM server on the same GPU; scheduling matters when GPU memory is finite.
🖥️ vLLM Production Infrastructure — Full Journey
Built and ran production LLM serving for Gemma 3 12B (GGUF Q4) via vLLM 0.15.1. Started on a single GCP VM and grew it into a multi-provider, multi-GPU setup with automatic failover and profile-driven GPU selection.
🔥 Cold Start Optimization
10 minutes → 3 minutes cold start reduction. Two-tier approach: Tier 1 pre-baked Docker images with model weights baked in (eliminates pip install + download on every start); Tier 2 CUDA graph pre-capture and sampler warm-up tuning. Validated on GKE MIG (L4) and RunPod Serverless.
🌐 Intelligent Fallback Router
Built a ~150-line FastAPI reverse proxy deployed on GKE PRD that receives OpenAI-compatible requests and routes through a provider chain (RunPod → Cloud Run → MIG). Returns 503 with per-provider diagnostics when all fail. Validated all 4 failure scenarios including all-providers-down.
🔄 Multi-DC EU Failover
Implemented multi-datacenter failover for RunPod Serverless: EU-RO-1 → EU-CZ-1 → EU-NL-1 with dedicated network volumes per datacenter. Added exponential backoff with SUPPLY_CONSTRAINT retry logic for GPU provisioning failures.
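The retry shape, sketched with hypothetical stand-ins for the RunPod provisioning call and its SUPPLY_CONSTRAINT error code:

```python
# Sketch: exponential backoff per datacenter, then fall through to the next.
# provision() and SupplyConstraintError are hypothetical stand-ins for the
# RunPod API call and its SUPPLY_CONSTRAINT failure mode.
import time

DATACENTERS = ["EU-RO-1", "EU-CZ-1", "EU-NL-1"]

class SupplyConstraintError(Exception):
    """Raised when the provider reports a GPU supply constraint."""

def provision_with_fallback(provision, retries: int = 4, base_delay: float = 5.0):
    for dc in DATACENTERS:
        delay = base_delay
        for _ in range(retries):
            try:
                return provision(datacenter=dc)
            except SupplyConstraintError:
                time.sleep(delay)   # back off before retrying the same DC
                delay *= 2          # exponential growth per attempt
    raise RuntimeError("no GPU capacity available in any EU datacenter")
```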
🚀 Profile-Driven GPU Selection
Refactored from hardcoded T4/RTX 4090 if-else to a centralized GPU_PROFILES associative array. Each profile encodes VRAM, max_model_len, dtype, kv-cache-dtype, fp8 support, and tensor-parallel settings. One flag selects the entire tuned config.
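The production version is a Bash associative array; the same structure sketched as a Python dict with illustrative, not production-tuned, values. One profile name expands to a full vLLM flag set:

```python
# Sketch of the profile map. VRAM, context, and dtype values are illustrative.
GPU_PROFILES = {
    "rtx4090": {"vram_gb": 24, "max_model_len": 8192, "dtype": "bfloat16",
                "kv_cache_dtype": "auto", "tensor_parallel": 1},
    "l40s": {"vram_gb": 48, "max_model_len": 16384, "dtype": "bfloat16",
             "kv_cache_dtype": "fp8", "tensor_parallel": 1},
    "a100_sxm_80gb": {"vram_gb": 80, "max_model_len": 32768, "dtype": "bfloat16",
                      "kv_cache_dtype": "fp8", "tensor_parallel": 1},
}

def vllm_args(profile: str) -> list[str]:
    p = GPU_PROFILES[profile]  # one flag selects the entire tuned config
    return [
        f"--max-model-len={p['max_model_len']}",
        f"--dtype={p['dtype']}",
        f"--kv-cache-dtype={p['kv_cache_dtype']}",
        f"--tensor-parallel-size={p['tensor_parallel']}",
    ]
```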
| GPU Provider | Hardware | Avg Latency | Architecture | Role |
|---|---|---|---|---|
| RunPod Serverless | RTX 5090 · RTX A6000 · RTX 6000 Ada · L40S | 568 ms | Ada Lovelace / Ampere | Primary — cost-optimized burst |
| Cloud Run GPU | NVIDIA L4 | 1.9 s (warm) · 41 s (cold) | Ada Lovelace | Secondary — managed serverless |
| GKE MIG (GCP) | L40S · A100 SXM 80GB | 7.7 s | Ada / Ampere | Dedicated — high-throughput |
FinOps Excellence
$331K annual cloud cost savings through compute optimization and storage migration.
💰 Total Savings: $331K/year
Compute Optimization: $183K/year (27% reduction)
3-Year Projected Impact: $993,288 (nearly $1M)
💾 Storage Policy Migration
$148K/year saved (95.9% reduction)
Before: 360 TiB Standard storage at $12,874/month
After: 21 TiB total at $533/month
Strategy: archive >5 days · auto-delete after 365 days · keep only last 5 days in Standard
🖥️ GPU Infrastructure
$77,262/year saved
Scaled down 6 GPU instances and moved AI apps from GPU MIG to CPU-based K8s where GPU wasn't actually needed. Rightsized based on real usage data with no performance hit.
⚙️ Compute Engine
$26,500/year saved
Optimized node counts: Dev (11→9), Stg (11→7), Prd (29→24), Sdx (18→15). 17–36% reduction per environment.
✈️ Airflow Optimization
$18,165/year saved (93% reduction)
Reduced from 100 pods to 14 pods while maintaining full functionality. CPU: 20→2.8, Memory: 200GB→28GB.
🤖 GCP AI Recommendations
$6,705/year saved
Implemented all feasible AI-powered cost recommendations across compute, storage, and networking.
📊 Kubecost Monitoring
Deployed Kubecost across all Kubernetes clusters for real-time cost visibility, allocation tracking, and continuous optimization opportunities.
📈 3-Year Financial Impact
Advanced Technical Implementations
Networking, security, deployment, and platform migration work at Staff and Principal level.
🔐 mTLS with Cloudflare
Set up Mutual TLS with Cloudflare so both client and server verify identity using certificates. Every service-to-service call is encrypted and authenticated; nothing gets through on network position alone.
🌉 AWS-GCP HA VPN
Built HA VPN tunnels between AWS and GCP with BGP failover. If a tunnel drops, routing adjusts automatically in under 30 seconds. All inter-cloud traffic stays off the public internet.
🚀 Canary Deployment
Wrote custom canary automation that rolls traffic progressively (10%→25%→50%→100%), checks health at each step, and rolls back automatically if error rate or latency crosses the threshold.
🔒 Database Security (61 DBs)
Moved all 61 databases (18 MongoDB Atlas + 44 PostgreSQL) off public IPs, enforced SSL/TLS on every connection, and switched from password-only auth to certificate-based. Part of the ISO 27001 push.
⚡ Redis 7.2 Security
Deployed Redis 7.2 on GCP with SSL certificate auth and encrypted connections. Traffic in transit is encrypted; passwords alone aren't enough to connect.
🛡️ Zero-Trust Architecture
mTLS, private networking, and certificate-based auth across the stack. Services verify each other rather than relying on being inside the network perimeter.
☁️ Cloud Functions Gen1 → Gen2 Migration
Migrated 60+ Cloud Functions across HTTP and Pub/Sub triggers from Gen1 to Gen2. Delivered in 4 phases: DEV HTTP → DEV Pub/Sub → STG HTTP → STG Pub/Sub, with a production pilot of 4 non-critical functions followed by a 6-sprint plan for 16 critical high-traffic functions (10k+ invocations each). Standardized Jenkins pipelines and runtime updates, including urgent replacement of the deprecated Go 1.16 runtime.
🖥️ Automated GPU VM Provisioning (GCP)
Built fully automated pipeline for provisioning GPU-enabled GCP Compute Engine VMs and deploying vLLM. Terraform creates the VM; 3 modular shell scripts handle CUDA install, model download, and vLLM startup. GitHub Actions orchestrates the full lifecycle: preflight checks skip re-provisioning if VM already exists; smoke tests validate /health and a live inference call post-deploy; SCP/SSH fixed for OpenSSH 9+ compatibility. HuggingFace token secured via GCP Secret Manager with no hardcoded credentials.
🔧 Terraform Modular GCP Network
Refactored GCP network Terraform from monolithic inline resources to reusable modules: VPC, subnets, Cloud NAT, and firewall rules extracted into modules/. Added prevent_destroy = true lifecycle blocks on all critical network resources to guard against accidental destruction. Validated Shared VPC and Cloud NAT egress across all service projects.
📦 vLLM RunPod Serverless — Multi-GPU + Multi-DC
Extended RunPod Serverless deployment from single-GPU to a profile-driven system supporting 6 GPU types (RTX 4090 · 5090 · A6000 · 6000 Ada · L40S · A100 SXM). Implemented multi-datacenter EU failover (EU-RO-1 → EU-CZ-1 → EU-NL-1) with per-DC network volumes. Added exponential backoff with SUPPLY_CONSTRAINT retry so GPU unavailability is handled gracefully instead of failing the deploy pipeline.
Major Achievements
$331K in quantifiable annual savings, alongside major wins in security, platform reliability, and team enablement.
💾 Storage Migration Champion
$148K/year saved (95.9% reduction). Moved 360 TiB from Standard storage to lifecycle-managed tiers. Files older than 5 days archive automatically; anything past 365 days is deleted. Cost dropped from $12,874/mo to $533/mo.
💰 FinOps Excellence
$331K total annual savings across compute and storage, documented line-item. Projects to nearly $1M over 3 years.
🔒 ISO 27001 Compliance
Drove infrastructure for ISO 27001 certification: secured 61 databases (private IPs, SSL/TLS, cert-based auth), added automated scanning (Trivy, Checkov), and implemented Pod Security Standards. Documented 863 assets across 17 sheets for the audit, achieving certification on the first attempt with zero non-conformities.
🌐 Kong Gateway at Scale
Replaced per-app nginx ingress with Kong Gateway across 160+ applications. Rate limiting, auth, CORS, and monitoring now configured in one place rather than scattered across service configs.
🚀 65+ CI/CD Pipelines
Built and maintain 65+ Jenkins pipelines with a 98%+ first-deploy success rate. Standardized builds via Shared Libraries for developer self-service, including Trivy scanning, Helm linting, and canary rollouts. Managed Jenkins on GKE with 8 ephemeral agents and autoscaling based on queue depth.
☸️ Kubernetes Multi-Cluster
160+ applications running across 4 GKE environments (dev/stg/prd/sdx), holding 99.9% uptime SLO.
📦 Infrastructure Inventory
Full infrastructure inventory across 17 tracking sheets, 863 assets documented with enough detail to actually be useful during incidents.
🤖 AI Early Adopter
Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team, then runs automated PR reviews in CI with Claude Code in Bitbucket pipelines. Most of this happened before it was common practice at peer companies.
🌍 Multi-Region & Disaster Recovery
Designed and operated multi-region resilience across GCP and AWS: HA VPN with BGP failover (<30s), multi-DC EU failover for GPU workloads (RO → CZ → NL), cross-region Cloud SQL replicas, and GCS dual-region buckets with lifecycle policies. Recovery playbooks tested quarterly.
🛡️ ScoutSuite CSPM
Deployed ScoutSuite as the cloud security posture management tool across GCP and AWS. Automated weekly scans via Jenkins, reporting findings to Slack with severity-based triage. Reduced critical misconfigurations from 47 to 0 in the first quarter, covering IAM, networking, storage, and logging controls.
🧠 Local AI Inference Stack
Evaluated and benchmarked local LLM inference tools — Ollama, llama.cpp, and LM Studio — for developer productivity and air-gapped environments. Standardized on Ollama for team use with curated model profiles (Codestral, Gemma 3, DeepSeek Coder) and documented GPU memory requirements per model size.
Technical Skills & Expertise
55+ skills with proficiency levels, from expert to proficient across cloud, containers, security, automation, and programming languages.
☁️ Cloud Platforms
GCP: Compute Engine, GKE, Cloud Storage, VPC, VPN, IAM, Secret Manager
AWS: EC2, EKS, VPC, S3, CloudFormation, Site-to-Site VPN, IAM, CloudWatch
Azure: Web Apps, AKS, VNet, ExpressRoute, API Management, Storage
☸️ Containers & Orchestration
Kubernetes: GKE, EKS, multi-cluster, HPA/VPA/MPA, Network Policies
Docker: Multi-stage builds, image optimization, security scanning, registry
Helm: Chart development, templating, ArgoCD concepts
🚀 CI/CD & Automation
Jenkins: 65+ pipelines, shared libraries, multi-branch, Jenkinsfile, agent management
GitHub Actions: Workflow automation, CI/CD pipelines, security scanning
Infrastructure as Code: Terraform, CloudFormation, configuration management
📊 Observability & Monitoring
Prometheus: PromQL expert, custom metrics, recording rules, alert rules, federation
Grafana: Dashboard creation, variables, templating, alerting, data sources
Cloud-native monitoring (Cloud Monitoring, CloudWatch): Metrics, logs, dashboards, alarms, log analytics
OpsGenie: Alert routing, on-call management, incident response
🌐 Networking & Security
Cloud networking: VPC, VPN (HA), BGP, private networking, VPC peering, service mesh
Certificates & PKI: mTLS, certificate management, PKI, Let's Encrypt, cert rotation
API gateways: Kong, Nginx Ingress, rate limiting, auth, CORS
Security tooling: OWASP ZAP, Trivy, SonarQube, penetration testing
🗄️ Databases & Data Stores
MongoDB Atlas: 18 clusters managed, replication, sharding, performance tuning, security
PostgreSQL: 44 instances managed, performance tuning, replication, backup
Redis: 7.2 with SSL, cluster mode, sentinel, GCP Memorystore
General DBA: Database administration, query optimization
💻 Programming & Scripting
Bash/Shell: Automation scripts, system administration, pipeline scripting
Python: Automation, data processing, API development, DevOps tools
Go: Microservices, CLI tools, Kubernetes operators
YAML: Configuration management, K8s manifests, CI/CD configs
Java: Spring Boot, Maven, Web Services (previous dev roles)
TypeScript/JavaScript: Next.js, Node.js, side projects in production
Kotlin: Android development, MVVM, Coroutines, Room ORM
Swift: iOS development, SwiftData, native frameworks (zero dependencies)
💰 FinOps & Cost Management
FinOps: $331K savings documented, rightsizing, lifecycle policies, spot instances
Kubecost: Multi-cluster deployment, cost allocation, chargeback, recommendations
Cost management: Cost analysis, budgeting, forecasting, showback/chargeback
🎯 Additional Technical Proficiencies
📦 Container Registry
- GCP Artifact Registry
- Docker Hub
- Image lifecycle management
- Registry cleanup automation
🔄 Workflow Automation
- Apache Airflow
- Cron jobs
- Event-driven architectures
- Pub/Sub messaging
📈 BI & Analytics
- Metabase
- Data visualization
- Infrastructure metrics
- Cost dashboards
🌐 DNS & Domain
- Cloudflare (46 domains)
- 817 DNS records managed
- SSL certificate automation
- DMARC, SPF, DKIM
🔐 Security & Compliance
- ISO 27001 compliance
- Secret management (Vault, GCP)
- IAM & RBAC
- Audit logging
⚙️ Operating Systems
- Linux (Ubuntu, Debian, Fedora)
- Container OS (optimized)
- System administration
- Kernel tuning
Personal Projects
Side projects built outside of work. Full production systems with real users, real infrastructure, and real constraints.
🎮 Pokémon GO Friend Code — Subscription SaaS
2025–2026 · Full-Stack · Next.js · GCP · Multi-payment · 💻 GitHub
What it does
A subscription service for Pokémon GO players to have their trainer codes automatically submitted to community listing sites daily. Users pick a plan, pay, and their code gets submitted every day without any manual action.
How it's built
- Full-stack Next.js 16 App Router with Server Components, Server Actions, and Docker multi-stage build deployed to Cloud Run (scale to zero)
- Dual payment gateway: Mercado Pago for PT/BRL subscribers, Stripe for EN+ES/USD; same codebase, locale-driven routing via middleware
- i18n with `/pt`, `/en`, `/es` subpaths; Accept-Language middleware auto-redirects on first visit
- Webhook state machine with idempotent processing and a `submission_logs` table tracking every payment event
- Zero static credentials: GCP Secret Manager at runtime, Workload Identity Federation for GitHub → GCP auth in CI
- Admin panel (server-side HMAC-SHA256 sessions) for subscription management and webhook inspection
- Email confirmations via Resend with React Email templates · Database migrations via Drizzle ORM
Technical highlights
- Cloud Run southamerica-east1: auto-scales, HTTPS via custom domain (pokemongofriendscode.com)
- Workload Identity Federation: no long-lived service account keys in CI; GitHub Actions exchanges OIDC tokens with GCP
- Tested with Jest (unit/integration) and Playwright (E2E)
🤖 Pokémon GO GCP — Automated Code Submission Engine
2025–2026 · Backend Automation · Cloud Run Jobs · Web Scraping · 💻 GitHub
What it does
The automation backend for the subscription service. A scheduled Cloud Run Job that reads active trainer codes from Cloud SQL and submits them to two community sites daily at 15:00 BRT, fully headless with no manual intervention.
How it's built
- Two independent scrapers: Puppeteer + stealth plugin for pokemongofriendcodes.com, Playwright for pogocodes.com (React-based form)
- React controlled component workaround on pogocodes.com; dispatches input events with `bubbles: true` to trigger synthetic event handlers
- Verification step after each submission: navigates to the listings page to confirm the code actually appears
- Batch processing (3 codes simultaneously) with configurable delays per action type to avoid rate limiting
- Database-driven: codes sourced from the Cloud SQL `trainer_codes` table (single source of truth, synced from payment webhooks)
- Complete Terraform IaC for Cloud SQL, Cloud Run Jobs, Cloud Scheduler, Secret Manager, and IAM
Technical highlights
- Cost: ~$0/month, within the GCP free tier for scheduled job execution
- Rate limit detection: pokemongofriendcodes.com blocks re-submissions within 24h; script detects and stops gracefully
- Cloud Monitoring alerts on exit code ≠ 0; failed job execution pages the owner
🎁 AmigoSecreto — Android App (Google Play)
2024–2026 · Android · Kotlin · MVVM · Google Play · 💻 GitHub
What it does
A fully-featured Android app for organizing Secret Santa draws. Supports multiple groups, exclusion rules, wish lists, secure reveal, and sharing results via WhatsApp, Telegram, SMS, or Email. Available on Google Play.
How it's built
- Fully migrated to Kotlin (from Java) with MVVM architecture: AndroidViewModel + LiveData + Repository pattern
- Hilt for dependency injection · Coroutines for async database operations on IO thread
- Room ORM with dual-layer database (Room + legacy DAOs coexist; eager Room initialization ensures migrations complete first)
- Backtracking draw algorithm extracted to a pure function (`SorteioEngine.kt`): testable, no side effects, handles impossible constraint scenarios gracefully (see the sketch after this list)
- Batch queries with INNER JOIN to eliminate the N+1 problem on wish list counts · All DB writes in atomic transactions
- Edge-to-Edge layout (Android 15), PDF export (PDFKit), QR code generation, local notifications, backup/restore as JSON
- GitHub Actions CI/CD: push to master → internal Play track; tag `v3.x` → production track
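The draw engine itself is Kotlin (`SorteioEngine.kt`); the core backtracking idea, sketched in Python with a toy example:

```python
# Sketch of the draw: assign every giver a receiver, honor exclusion rules,
# and return None when constraints are impossible instead of looping forever.
# The shipped engine is Kotlin; this is the idea, not the production code.
def draw(participants, exclusions):
    """exclusions: set of (giver, receiver) pairs that are not allowed."""
    result = {}

    def backtrack(i, remaining):
        if i == len(participants):
            return True                    # every giver has a receiver
        giver = participants[i]
        for receiver in list(remaining):
            if receiver != giver and (giver, receiver) not in exclusions:
                result[giver] = receiver
                remaining.remove(receiver)
                if backtrack(i + 1, remaining):
                    return True
                remaining.add(receiver)    # undo and try the next candidate
                del result[giver]
        return False                       # dead end: caller backtracks

    return result if backtrack(0, set(participants)) else None

# draw(["ana", "bia", "caio"], {("ana", "bia")})
# -> e.g. {"ana": "caio", "bia": "ana", "caio": "bia"}; impossible rules -> None
```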
Technical highlights
- 297 unit tests + Espresso integration tests · `UnconfinedTestDispatcher` for deterministic coroutine testing
- Min SDK 21 (Android 5.0+) · Target SDK 35 (Android 15) · R8 minification in release builds
- Spoiler protection on WhatsApp sharing: 30 blank lines before the assignment to prevent previews from revealing the result
🍎 Secret Santa — iOS App (SwiftUI)
2025–2026 · iOS 17+ · Swift · SwiftUI · SwiftData · 💻 GitHub
What it does
The iOS counterpart to AmigoSecreto, same product concept rebuilt natively in Swift and SwiftUI for iPhone. Multiple groups, exclusion rules, wish lists, secure reveal, QR codes, PDF export, and local notifications. Zero external dependencies.
How it's built
- SwiftUI + SwiftData only, no third-party packages; persistence, UI, PDF, QR codes, and notifications all via native Apple frameworks
- MVVM via SwiftData's `@Query` and `@Environment(\.modelContext)`, with no separate ViewModel classes needed
- JSON serialization workaround for many-to-many relationships (SwiftData limitation): exclusions and draw pairs stored as `Data` fields with computed property decoders
- Backtracking draw algorithm with a 200-attempt limit; same logic as Android, adapted for Swift
- `SequentialShareView`: workaround for an iOS limitation that prevents opening multiple URL schemes simultaneously; shows sequential buttons for WhatsApp, SMS, Email
- 6-page onboarding, draw history, statistics, backup/restore with JSON forward compatibility
Technical highlights
- 30+ unit tests using the Swift Testing framework (`@Suite`, `@Test`)
- iOS 17+ minimum: uses `ContentUnavailableView`, `@Bindable`, and other modern APIs throughout
- PDF generation via `UIGraphicsPDFRenderer` · QR codes via Core Image `CIQRCodeGenerator` · haptic feedback on reveal