Executive Summary
SRE and Platform Engineer with a software development background — infrastructure at scale, cloud cost optimization, and production AI systems.
🎯 Technical Leadership Profile
💰 FinOps
$331K/year documented cloud cost savings through compute and storage optimization
🤖 AI/ML Ops
Production ML systems: face recognition, content moderation, LLM serving. vLLM Gemma 3 12B on RunPod + GCP
🚀 GitOps at Scale
ArgoCD 9-phase rollout across 4 GKE clusters with Argo Rollouts for canary delivery
🏗️ Infrastructure as Code
Terraform + Shared VPC with reusable modules. 60+ Cloud Functions migrated Gen1 → Gen2
🔐 Security
mTLS with Cloudflare, AWS-GCP HA VPN, 61 databases hardened, xmrig supply-chain incident response
📊 Scale
160+ applications, 863 infrastructure assets, 99.9% uptime SLO
👥 Leadership & Mentorship
- Mentored junior and mid-level engineers on incident response, Terraform best practices, and production-readiness reviews
- Led cross-team initiatives: ArgoCD multi-cluster rollout, Shared VPC migration, and Cloud Functions Gen2 migration across 4 squads
- Established on-call runbooks and post-incident review culture, reducing MTTR and improving team autonomy
- Drove FinOps adoption across engineering teams, training product squads on cost-aware architecture decisions
📈 Career Trajectory
🎯 Task Distribution Analysis
| Type | Count | % | Insight |
|---|---|---|---|
| Tasks | 1,241 | 79.4% | Planned work, strategic execution |
| Sub-tasks | 236 | 15.1% | Project decomposition, planning |
| Bugs | 61 | 3.9% | Reactive work (very low) |
| Epics | 17 | 1.1% | Large project leadership |
Case Studies & Projects
15 real-world problems solved with measurable impact, documented with before/after metrics and technical details.
💾 GCP Storage Migration — 95.9% Cost Reduction
January 2025 · Cloud Cost Optimization · Lifecycle Management
❓ Problem
Cloud storage costs growing exponentially. Spending $12,874/month on 360 TiB, all in expensive Standard storage. No lifecycle management, no archival strategy, no deletion policies.
💡 Solution
- Analyzed data access patterns to identify hot vs cold data
- Configured automatic archival for files older than 5 days
- Set auto-deletion policy for files older than 365 days
- Kept only last 5 days in Standard storage (hot data)
- Applied policies to both multi-region and single-region buckets
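In client terms, the policy amounts to two rules. A minimal sketch with the google-cloud-storage Python client, assuming a hypothetical bucket name and Archive as the cold tier:

```python
# Sketch: archive after 5 days, delete after 365. Bucket name is
# hypothetical; Archive is assumed as the cold storage class.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-data-bucket")

# Move anything in Standard older than 5 days to Archive (hot data stays hot).
bucket.add_lifecycle_set_storage_class_rule(
    "ARCHIVE", age=5, matches_storage_class=["STANDARD"]
)
# Delete everything older than 365 days.
bucket.add_lifecycle_delete_rule(age=365)

bucket.patch()  # persist the updated lifecycle configuration
```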
Business Impact
- 94.2% storage reduction (360 TiB → 21 TiB)
- $444K saved over 3 years with zero data loss
- Fully automated lifecycle: archival and deletion run on their own once the policies are in place
- Maintained performance: hot data still in Standard class
✈️ Airflow Infrastructure Optimization — 93% Reduction
January 2025 · Resource Optimization · Kubernetes
❓ Problem
Airflow deployment massively over-provisioned: 100 pods (50 dev + 50 prod), 20 CPUs, 200 GB RAM across 7 nodes costing $1,513/month. Most pods idle 90% of the time.
💡 Solution
- Analyzed actual workload patterns and resource utilization
- Added autoscaling (4–10 replicas per environment) based on actual queue depth
- Rightsized CPU and memory based on real usage data
- Consolidated from 7 nodes to 1 node with better bin-packing
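The scaling rule in miniature: desired replicas derived from queue depth, clamped to the 4–10 band. A sketch; the tasks-per-worker constant is an assumed value, not the production figure:

```python
# Sketch: queue-depth-driven replica count, clamped to the 4-10 band.
# TASKS_PER_WORKER is an assumed worker concurrency, not the real tuning.
MIN_REPLICAS, MAX_REPLICAS = 4, 10
TASKS_PER_WORKER = 8

def desired_replicas(queued_tasks: int) -> int:
    wanted = -(-queued_tasks // TASKS_PER_WORKER)  # ceiling division
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))

# desired_replicas(0) == 4; desired_replicas(60) == 8; desired_replicas(200) == 10
```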
Business Impact
- 86% reduction in pods, CPU, and memory allocation
- Zero performance degradation; all DAGs running normally
- Freed up 6 nodes for other workloads
🔐 mTLS with Cloudflare — Zero-Trust Architecture
2024–2025 · Advanced Security · Network Architecture
❓ Problem
Standard TLS authenticates only the server to the client. Required mutual authentication, where both client and server verify identity using certificates, critical for API security and ISO 27001 compliance.
💡 Solution
- Set up Cloudflare for Mutual TLS; both sides now present certificates, not just the server
- Generated and distributed client certificates to each authorized service
- Added certificate revocation and rotation so compromised certs can be killed immediately
- Every request is verified by certificate, with no implicit trust based on network position
- Automated the certificate lifecycle so rotation doesn't require manual steps
🔐 mTLS Flow
Client (with cert) → Cloudflare (validates client cert) → Origin Server (validates CF cert)
Mutual verification at every layer · End-to-end encrypted + authenticated
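From a client's point of view the requirement is simple: no valid certificate, no connection. A minimal sketch with Python requests, using hypothetical endpoint and certificate paths:

```python
# Sketch: a client presenting its certificate and validating the server's.
# Endpoint and certificate paths are hypothetical.
import requests

resp = requests.get(
    "https://api.example.com/v1/status",
    cert=("/etc/certs/client.crt", "/etc/certs/client.key"),  # client identity
    verify="/etc/certs/origin-ca.pem",  # validate the server's chain
    timeout=10,
)
resp.raise_for_status()  # without a valid client cert, the edge rejects the call
```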
Business Impact
- Every request authenticated by certificate; nothing gets through on network position alone
- ISO 27001 compliance requirement satisfied
- Protection against MITM attacks and API abuse
- Instant revocation if certificate is compromised
🌉 AWS-GCP HA VPN — Multi-Cloud Connectivity
2024 · Multi-Cloud Networking · High Availability
❓ Problem
Needed secure, reliable communication between AWS and GCP for hybrid workloads. Public internet routing unacceptable for sensitive data. Required redundancy for high availability.
💡 Solution
- Built HA VPN between AWS and GCP with redundant tunnels on both sides
- Configured BGP so failover is automatic, with no manual intervention when a tunnel drops
- Assigned private IPs and set up internal routing tables so traffic never touches the public internet
- All inter-cloud traffic encrypted end-to-end over IPSec
- Added monitoring and alerts for tunnel health so issues are caught before they affect workloads
Business Impact
- 99.99% availability through redundant tunnels
- Automatic failover via BGP (<30 seconds)
- Zero public internet exposure for sensitive data
- Hybrid workloads span both clouds over private, encrypted links
🚀 Custom Canary Deployment System
2024–2025 · Progressive Delivery · Automation
❓ Problem
Traditional blue-green deployments require a 100% traffic switch, which is risky. Off-the-shelf tools didn't fit our multi-environment Kubernetes setup. Needed gradual rollout with automatic metric-based rollback.
💡 Solution
- Wrote custom canary deployment automation from scratch; off-the-shelf tools didn't fit the multi-environment setup
- Progressive traffic split: 10% → 25% → 50% → 100%, with a gate at each stage
- Health checks run at each stage before traffic advances
- Automatic rollback kicks in when error rate or latency crosses defined thresholds
- Wired into Jenkins pipelines, works the same way for every team
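The heart of the automation is a gate loop. Condensed here into a sketch where the traffic, metric, and rollback helpers are hypothetical stand-ins for the Jenkins-driven implementation, and the thresholds are illustrative:

```python
# Sketch of the canary gate loop. set_traffic_split, error_rate, p95_latency,
# and rollback are hypothetical stand-ins; thresholds are illustrative.
import time

STAGES = [10, 25, 50, 100]      # canary traffic percentages
MAX_ERROR_RATE = 0.01           # 1% error budget per stage
MAX_P95_MS = 500                # latency threshold per stage

def rollout(set_traffic_split, error_rate, p95_latency, rollback) -> bool:
    for pct in STAGES:
        set_traffic_split(canary_percent=pct)
        time.sleep(300)         # soak before evaluating the gate
        if error_rate() > MAX_ERROR_RATE or p95_latency() > MAX_P95_MS:
            rollback()          # automatic rollback, no human in the loop
            return False
    return True                 # canary promoted to 100%
```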
Business Impact
- No failed deployments since rollout: issues surface at 10% traffic and trigger automatic rollback before reaching everyone
- Deployment frequency increased: teams ship more often because the cost of a bad release dropped dramatically
- No third-party tooling dependency: built from scratch to fit the existing setup
🔒 Database Security Hardening — 61 Databases
2023–2025 · Security Architecture · Compliance
❓ Problem
Databases exposed with public IPs, unencrypted connections, password-only authentication. Non-compliant with ISO 27001. Vulnerable to network sniffing and unauthorized access.
Business Impact
- 61 databases secured (18 MongoDB Atlas + 44 PostgreSQL)
- Zero security incidents post-implementation
- ISO 27001 certification achieved
🌐 Kong Gateway — API Management at Scale
2023–2024 · API Management · Architecture
❓ Problem
160+ applications with direct nginx ingress and no centralized API management, rate limiting, or authentication layer. Difficult to enforce policies or monitor API usage consistently.
Business Impact
- 160+ applications now go through a single gateway, giving one place to enforce policy
- Rate limiting configured centrally; no per-service implementation needed
- Auth, CORS, and headers applied globally; developers don't handle this per app
- Centralized logging and tracing cuts troubleshooting time significantly
🔄 ArgoCD GitOps — Multi-Cluster Continuous Delivery
March 2026 · GitOps · Kubernetes Multi-Cluster
❓ Problem
Deployments across four GKE clusters (dev · stg · prd · sdx) were manual, inconsistent, and error-prone. No single source of truth for cluster state, no audit trail, and high risk of configuration drift between environments. Teams needed direct kubectl access to deploy; no self-service model existed.
💡 Solution — 9-Phase Implementation
- Phase 0: Installed ArgoCD on GKE PRD cluster via Helm with production-grade values
- Phase 1: Defined GitOps repository structure: apps/environments/charts hierarchy with Helm chart templates and per-env value overrides
- Phase 2: Configured ArgoCD Applications and AppProjects with RBAC boundaries per team and namespace
- Phase 3: Multi-cluster registration; all four GKE clusters (dev · stg · prd · sdx) as ArgoCD targets with cluster credentials stored in Secrets
- Phase 4: CI/CD pipeline integration via GitHub Actions image updater for automatic image tag propagation, branch promotion dev → stg → prd
- Phase 5: SSO/OIDC with Google Workspace + RBAC policies per team
- Phase 6: Prometheus metrics + Grafana dashboards for GitOps observability
- Phase 7: Incremental migration of existing workloads; audited all namespaces, converted to Helm charts, enabled self-heal and prune on non-prod
- Phase 8: Progressive delivery with Argo Rollouts, canary deployments with traffic splitting and automated rollback
Business Impact
- 4 GKE clusters managed as a single GitOps fleet (dev · stg · prd · sdx)
- Zero configuration drift: Git is the single source of truth; ArgoCD self-heals deviations automatically
- Full audit trail: every deployment change tracked as a signed Git commit
- Instant rollback: revert a Git commit and the cluster reconciles within seconds
- Developer self-service: engineers deploy via PR, zero kubectl access required
- Automated promotion: GitHub Actions image updater pushes new tags across environments without manual YAML edits
🏗️ GKE Infrastructure — Production-Grade Terraform + Shared VPC
February 2026 · Infrastructure as Code · GKE · Terraform
❓ Problem
GKE clusters and the underlying network infrastructure were provisioned manually with no IaC. No consistent multi-environment strategy, no prevent-destroy protection on critical network resources, and Terraform configuration was monolithic and impossible to reuse across environments.
💡 Solution — 8-Phase Delivery
- Phase 1 — Architecture Design: Defined Shared VPC topology, subnet strategy, and GKE cluster requirements for 4 environments
- Phase 2 — GCP Project Setup: Created projects, enabled APIs, and configured IAM service accounts with least-privilege bindings
- Phase 3 — Network Infrastructure: Deployed Shared VPC host project, subnets, Cloud NAT, and firewall rules via Terraform
- Phase 4 — Modular Terraform: Refactored from monolithic inline resources to reusable modules; converted root config to module calls with environment-specific variable files
- Phase 5 — CI/CD Automation: GitHub Actions workflows for terraform plan on PR and terraform apply on merge; separate workflows per environment
- Phase 6 — GKE Cluster Deployment: Deployed all four GKE clusters (dev · stg · prd · sdx) using the automated pipelines
- Phase 7 — Safety Guardrails: Added `prevent_destroy = true` lifecycle blocks on VPC, subnets, and Cloud NAT to protect against accidental `terraform destroy`
- Phase 8 — Documentation: Architecture diagrams, operational runbooks, and module documentation for knowledge transfer
Business Impact
- 100% infrastructure as code: every network and cluster resource managed via Terraform
- 4 GKE clusters deployed in a single automated pipeline run
- Zero manual provisioning: new environments created from variable files only
- Production safeguards: `prevent_destroy` prevents accidental loss of critical network infrastructure
- Validated Shared VPC + Cloud NAT: all service projects route egress through a centralized NAT gateway
🌐 vLLM Inference Fallback Router — Multi-Provider with Measured SLAs
March 2026 · AI Infrastructure · Reliability Engineering
❓ Problem
Production LLM serving relied on a single GPU provider (RunPod). A provider outage or GPU supply constraint would bring down the entire inference capability with no fallback. Each provider had different latency profiles and no routing logic existed to exploit this.
💡 Solution
- Built a ~150-line FastAPI reverse proxy that accepts OpenAI-compatible requests (`/v1/chat/completions`, `/v1/models`)
- Routes through a provider chain: RunPod (primary) → Cloud Run GPU (secondary) → GKE MIG (tertiary)
- Each provider has a configurable mock-failure flag for testing without real outages
- Returns HTTP 503 with per-provider error diagnostics when all providers fail
- Containerized (linux/amd64) and deployed to GKE PRD cluster
- All 4 failure scenarios validated end-to-end in production
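Condensed to its essentials, the chain logic looks roughly like this (a sketch, not the production code; provider URLs are hypothetical and the real proxy also serves `/v1/models`):

```python
# Condensed sketch of the fallback chain (not the production code).
# Provider base URLs are hypothetical.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

PROVIDERS = [
    ("runpod", "https://runpod.example/v1"),       # primary
    ("cloud-run", "https://cloudrun.example/v1"),  # secondary
    ("gke-mig", "https://mig.example/v1"),         # tertiary
]

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    errors = {}
    async with httpx.AsyncClient(timeout=60) as client:
        for name, base_url in PROVIDERS:
            try:
                r = await client.post(f"{base_url}/chat/completions", json=body)
                r.raise_for_status()
                return JSONResponse(r.json())  # first healthy provider wins
            except Exception as exc:
                errors[name] = str(exc)        # record and fall through
    # All providers failed: 503 with per-provider diagnostics.
    return JSONResponse(
        {"error": "all providers failed", "providers": errors}, status_code=503
    )
```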
Business Impact
- Eliminated single point of failure for production LLM inference
- 3x redundancy: service survives any single provider failure transparently
- Measurable SLAs per provider: RunPod 568ms · Cloud Run warm 1.9s · MIG 7.7s
- Fully validated: all 4 failure scenarios tested in GKE PRD production environment
🛡️ Cryptomining Incident Response — xmrig on GKE PRD
2026 · Security Incident · GKE Production
❓ Problem
xmrig (Monero cryptominer) detected running on a GKE production node. The miner was injected via a compromised frontend npm dependency, a supply-chain attack that bypassed standard code review. The workload silently consumed node CPU, degrading legitimate services while showing no immediately visible symptoms.
💡 Detection & Response
- Detection: Anomalous CPU spike on GKE production node triggered infrastructure alert
- Triage: Identified xmrig process running inside a frontend container via node-level process inspection
- Root cause: Compromised npm package in frontend dependency tree injected miner code into the built bundle
- Containment: Cordoned the affected node; drained and restarted all workloads on clean nodes within minutes
- Remediation: Updated all vulnerable frontend libraries; pinned transitive dependency versions
- Hardening: Added `npm audit --audit-level=high` as a CI gate; implemented Trivy image scanning on every pre-deploy build
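The gate is effectively the CLI exit code (`npm audit --audit-level=high` fails non-zero). The same check written out as a script against npm's JSON report, sketched here:

```python
# Sketch: fail CI when npm audit reports high or critical vulnerabilities.
# Reads the severity counts from npm audit's JSON metadata.
import json
import subprocess
import sys

result = subprocess.run(["npm", "audit", "--json"], capture_output=True, text=True)
counts = json.loads(result.stdout or "{}").get("metadata", {}).get("vulnerabilities", {})
blocking = counts.get("high", 0) + counts.get("critical", 0)
if blocking:
    print(f"npm audit gate: {blocking} high/critical vulnerabilities, failing build")
    sys.exit(1)
```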
Business Impact
- Zero data exfiltration: miner targeted CPU only, no credentials or data accessed
- Full remediation in under 4 hours from first alert to clean production deployment
- Supply-chain hardening: npm audit + Trivy scanning now gates every production build
- Post-mortem and team training: dependency pinning, supply-chain threat model published internally
📝 Lessons & Permanent Changes
- Lock files enforced: all repos now require a committed `package-lock.json`; CI fails if the lockfile is missing or modified without review
- Runtime isolation: implemented Pod Security Standards (restricted) across all GKE namespaces to prevent privilege escalation
- Network policies: egress rules block outbound connections to mining pools and unknown endpoints by default
- Alerting gap closed: added GKE node-level CPU anomaly detection via Cloud Monitoring with 5-minute threshold alerts
🚨 Production Incident Response — P1 Outage & Partner Auth Failure
March 2026 · Incident Response · Production Reliability
Incident A — Third-Party CMS Loader [P1]: HTTP 500 across two regional markets
- Impact: 100% of page loads returning HTTP 500 from 13:08 UTC; full production outage across two regional markets
- Root cause: Third-party CMS API dependency had an outage; all pages failed server-side rendering
- Response: Identified dependency failure, implemented emergency static fallback, and escalated to vendor support
- Resolution: Service restored; added circuit-breaker pattern and health monitoring for external CMS dependencies
Incident B — Partner API [P2]: Credential Rotation Causing Auth Failure
- Impact: All partner authentication calls returning 401 from ~14:30 UTC after partner-side password rotation
- Root cause: Partner rotated credentials without coordinated notification; old password revoked mid-day
- Response: Diagnosed authentication failure from logs, obtained new credentials, updated GCP Secret Manager, triggered rolling restart
- Hardening: Established credential rotation runbook and pre-agreed notification SLA with partner
SRE Principles Applied
- MTTR minimized: both incidents resolved within the same business day
- Blameless post-mortems: root causes documented, systemic fixes implemented
- Monitoring gaps closed: external dependency health checks added after each incident
🔍 Elastic Stack Migration — EC2 to Kubernetes at Scale
2020–2021 · Observability · Infrastructure Modernization
🔴 Problem
- 2 legacy Elastic Stack clusters: Elasticsearch 5 on EC2 and Elasticsearch 6 on ECS, both via docker-compose
- No role separation: every node handled master election, data storage, and ingest simultaneously
- Manual scaling, no self-healing, single points of failure
- Storage bottleneck: no S3 snapshots, no automated retention policies
💡 Solution
- Consolidated both clusters into 1 unified EKS cluster on Elasticsearch 7.8
- Designed multi-node architecture with dedicated roles: 3 master (8 GB), 3 data (16 GB, 50 TB), 3 ingest (16 GB)
- Custom ECR image with S3 snapshot plugin for automated backup and retention
- Deployed via Helm Charts with rolling updates, PDB (maxUnavailable: 1), and soft anti-affinity
- Built 9 specialized Logstash pipelines for different log sources (GELF, K8s, NiFi, PHP, portal, services, reports)
- Filebeat DaemonSet with Kubernetes metadata enrichment for container log collection
- Fluent Bit as lightweight alternative collector with kernel-level filtering
- Kibana 3 replicas with CSV reporting (1 GB max), APM Server for tracing, Metricbeat for cluster health
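The snapshot automation the custom image enables amounts to a repository registration plus a retention policy. Sketched with the elasticsearch-py client, with illustrative bucket and schedule values (SLM ships from ES 7.4, so 7.8 supports it):

```python
# Sketch of the S3 snapshot setup enabled by the repository-s3 plugin.
# Bucket name and schedule are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")

# Register an S3 snapshot repository (plugin baked into the custom ECR image).
es.snapshot.create_repository(
    repository="s3_backups",
    body={"type": "s3", "settings": {"bucket": "es-snapshots-example"}},
)

# Daily snapshot + automated retention via SLM.
es.slm.put_lifecycle(
    policy_id="daily-snapshots",
    body={
        "schedule": "0 30 1 * * ?",  # 01:30 every day
        "name": "<daily-{now/d}>",
        "repository": "s3_backups",
        "config": {"include_global_state": False},
        "retention": {"expire_after": "30d"},
    },
)
```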
Business Impact
- 54.5% cost reduction: old cluster (65 EC2 + 183 disks + ~91 TB) cost $17,038/month; new EKS cluster (7 m5.2xlarge + 7x 7.1 TB SSD) cost $7,750/month, saving ~$111K/year
- Consolidated 2 → 1 cluster: eliminated 65 EC2 instances and 183 EBS volumes, replacing with 7 EKS nodes
- Version upgrade v5/v6 → v7.8: major version leap, unlocking ILM, security features, and index lifecycle management
- EBS GP2 → Standard migration: migrated 124 volumes across 2 AWS regions (São Paulo + Virginia), reducing storage costs
- Multi-region cost analysis: data-driven decision to consolidate in Virginia saved $10.7K/month vs São Paulo region pricing
- S3 cleanup: deleted 141 TB unused S3 bucket, automated snapshot retention via custom ECR image with repository-s3 plugin
- 9 dedicated Logstash pipelines: each pipeline tuned for its specific source, enabling independent scaling and troubleshooting
- Self-healing infrastructure: Kubernetes StatefulSets with PDB, rolling updates, and automated recovery replaced manual EC2 management
📡 Kafka Event Streaming — Real-Time Pipeline to BI
2020–2021 · Data Engineering · Event-Driven Architecture
🔴 Problem
- Business intelligence team had no real-time access to API and service event data
- Log data was siloed in Elasticsearch, not accessible for downstream analytics
- Multiple event sources (GELF, RabbitMQ, HTTP) needed unified routing to Kafka topics
- No standardized pipeline for streaming filtered events to analytics systems
💡 Solution
- Designed dual-output Logstash pipelines: events indexed in Elasticsearch AND forwarded to Kafka topics
- Integrated with AWS MSK (Managed Streaming for Apache Kafka) with gzip compression
- Built conditional routing: API events filtered by service name and forwarded to dedicated Kafka topics for BI consumption
- RabbitMQ input plugin for message queue events alongside GELF and pipeline-to-pipeline routing
- Kafka producer configured with client_id tracking and topic-per-domain strategy
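The routing itself lives in Logstash pipeline config; the same dual-write idea, sketched with kafka-python and hypothetical broker/topic names:

```python
# Sketch of the dual-write + conditional-routing idea (production version is
# Logstash DSL). Broker address and topic names are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["msk-broker.example:9092"],
    compression_type="gzip",                   # matches the MSK setup
    client_id="logstash-bridge",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def route(event: dict) -> None:
    index_in_elasticsearch(event)              # every event goes to ES
    if event.get("type") == "api":             # only relevant events hit Kafka
        producer.send(f"bi.{event['service']}", event)  # topic per domain

def index_in_elasticsearch(event: dict) -> None:
    pass  # stand-in for the Elasticsearch output
```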
Business Impact
- Real-time BI pipeline: business intelligence team gained access to live API events for the first time, enabling real-time analytics
- Dual-write architecture: every event indexed in Elasticsearch for operational search AND streamed to Kafka for downstream analytics, no data duplication at source
- Conditional routing: only relevant events forwarded to Kafka, reducing consumer noise and topic volume
- AWS MSK integration: managed Kafka with gzip compression reduced network costs and simplified operations
- Multi-source unification: GELF, RabbitMQ, HTTP, and pipeline-to-pipeline inputs consolidated into a single processing layer
🧠 AI Tools Integration — Team Productivity
2024–2025 · AI Adoption · Developer Experience
💡 Solution
- Dec 2024: Enabled Claude AI for entire team
- Aug 2024: Onboarded QA and dev teams to GitHub Copilot
- Aug 2025: Evaluated Claude Code for automated PR reviews in Bitbucket pipelines
- Created best practices guides for AI-assisted development
Business Impact
- Adopted AI tools early: Claude AI in Dec 2024, Copilot in Aug 2024, before most teams in the org
- Automated PR reviews via Claude Code; reviewers spend time on logic, not style
- Junior developers ramp up faster: AI-assisted development reduces the gap between seniors and juniors on routine tasks
Professional Experience
13+ years across 9 roles, from IT Support to Platform Engineering leadership.
London, UK (Remote). Platform engineering and FinOps at a child safety tech company (safeguarding children and society online).
Led $331K annual cloud cost optimization ($148K storage + $183K compute) · Manage 160+ applications across 4 Kubernetes environments (versions 1.22 → 1.33) · mTLS with Cloudflare and AWS-GCP HA VPN · Secured 61 databases (SSL/TLS, private IPs, cert auth) · Secret rotation policies and audit logging across all environments · Built 65+ Jenkins CI/CD pipelines with 98%+ first-deploy success · Custom canary deployment automation with metric-based rollback · Led ISO 27001 compliance (certification achieved) · OWASP ZAP + SonarQube automated security scanning pipeline · Kong Gateway for 160+ APIs · End-to-end observability framework using OpsGenie, Statuspage, Cloud Monitoring, and Slack · Production LLM serving with vLLM on RunPod Serverless and GCP (Cloud Run GPU + GKE MIG) with multi-provider failover
Stack: GCP (expert), AWS, Kubernetes, Jenkins, Kong, Prometheus, Grafana, vLLM, RunPod, OpsGenie, Statuspage, MongoDB Atlas, PostgreSQL, Redis, Terraform, GitHub Actions
São Paulo, Brazil. SRE at a logistics technology company focused on reliability, observability, and infrastructure modernization at scale.
Migrated Elasticsearch 5 (EC2) and Elasticsearch 6 (ECS) clusters to a unified EKS cluster on v7.8 with dedicated node roles (3 master, 3 data, 3 ingest) and custom ECR images with S3 snapshot plugin. Reduced Elasticsearch infrastructure cost by 54.5% ($17K/month → $7.75K/month, ~$111K/year saved) by consolidating 65 EC2 instances and 183 EBS volumes into 7 EKS nodes. Migrated 124 EBS volumes from GP2 to Standard across 2 AWS regions. Deleted 141 TB of unused S3 storage. Implemented instance scheduling for non-production environments. Designed and deployed 9 specialized Logstash pipelines handling logs from GELF, Kubernetes, NiFi, RabbitMQ, and application-specific sources. Integrated Kafka (AWS MSK) for real-time event streaming to BI systems. Built log collection layer with Filebeat DaemonSets and Fluent Bit for container log forwarding. Deployed Kibana (3 replicas) with CSV reporting, APM Server for tracing, and Metricbeat for cluster health. Led multi-region cost analysis (Virginia vs São Paulo) driving data-informed infrastructure placement decisions. Planned and provisioned dedicated Black Friday infrastructure. Expanded observability scope and promoted monitoring awareness among developers.
Stack: AWS (Expert), GCP, Kubernetes, Elasticsearch 7.8, Logstash, Kibana, Filebeat, Fluent Bit, APM Server, Metricbeat, Kafka (MSK), RabbitMQ, NiFi, Docker, Helm, NGINX Ingress, PostgreSQL, Python, CloudFormation
São Paulo, Brazil. Critical operations at Brazil's largest integrated healthcare network.
Provisioned and managed multi-cluster Kubernetes infrastructure on Azure using ACS Engine (v0.19/v0.25) and AKS across 5 environments: monitor, dev, hml, prd01, prd02 with node pool scaling via CLI. VNet/Subnet provisioning via Terraform, ACS Engine for ARM template generation, and automated CD pipelines for nginx-ingress deployment, AKS provisioning, and K8s cluster migration. Deployed Elastic Cloud on Kubernetes (ECK) with automated CD pipeline, Ansible playbooks for ELK/Jenkins/Spinnaker provisioning, Beats and monitoring agents on all K8s clusters, and Curator for automated index cleanup on AWS Elasticsearch. Built AWS Data Lake infrastructure using CloudFormation with Lambda functions, EMR clusters, and S3 for data processing pipelines. Developed Python-based Azure monitoring discovery tool to identify unmonitored applications across Azure subscriptions, integrated with Confluence for documentation. Zabbix integration for host management (auth, host provisioning, queue management via Node.js). JMeter performance testing templates for backend Java services. Built custom GitOps framework (gitops.sh) with Terraform for infrastructure-as-code, supporting native and Docker-based installation with automated pipeline templates. Deployed and managed WhatsApp integration instances for multiple hospital units (Delboni, Exame, Lavoisier, and others) across dev and production environments, enabling patient communication at scale. Incident management and crisis response. Designed and improved cloud architecture with focus on performance and reliability. ACR cleanup automation scripts to manage container registry costs. Multi-cloud operations (AWS + Azure).
Stack: Azure (Expert: Web Apps, AKS, ACS Engine, API Gateway, ExpressRoute, ACR), AWS (CloudFormation, EMR, Lambda, S3), Kubernetes, Terraform, Ansible, Elasticsearch, ECK, Jenkins, Spinnaker, Zabbix, Prometheus, JMeter, Bitbucket, Docker, Python, Node.js
Barueri, São Paulo. Built APIs using Axway Cloud Platform (API Gateway + API Management) for Dasa Group clients, covering both backend and frontend REST/SOAP APIs.
Managed API Gateway across dev, hml, and prd environments with environment promotion via Policy Studio and CLI. Authored comprehensive platform documentation covering API registration, organization setup, SMTP configuration, IP configuration, and multi-language portal management. Business logic via Policy Studio. IBM ESB integration for flexible application integration.
Stack: Axway Cloud Platform, API Gateway, API Manager, Policy Studio, IBM ESB, Azure, REST/SOAP
Americana, SP. Web service application development with continuous integration.
REST and SOAP web services · GitLab-based version control · CI with Jenkins
Stack: SOAP, RESTful, Spring MVC, Maven, Git, GitLab, Hibernate, Spring Data, Jenkins
Americana, SP. Web services and e-commerce development in Java.
Stack: Java, Web Services, E-Commerce
Americana, SP. Delphi developer and SQL Server analyst in a Windows-based enterprise software environment. Customer support, bug treatment, and software improvements. Some iOS/Android work for special projects.
Stack: Delphi, SQL Server, iOS, Android
Jundiaí, SP. Customer service and enterprise software support for accounting and financial retail sector. Hands-on with network connections, database administration, and Linux environments.
Stack: Java, C#, Linux (Fedora 14), Network & Database administration
Jundiaí, SP. Supervised delivery staff, prepared monthly closing sheets for delivery operations, and produced daily performance reports for freight service providers.
🎯 Career Progression
💻 2014–2017: Software development (Microdata, PRÓPONTO, Dasa)
☁️ 2018–2021: SRE/Cloud Engineering (Dasa, Intelipost)
🚀 2021–Present: Platform Engineering leadership with FinOps mastery (VerifyMy)
13+ years total · 7+ years SRE/Platform · Tri-cloud expert (GCP + AWS + Azure) · Full-stack background from dev roles
🎓 Education
📜 Certifications
- Google Cloud Platform Fundamentals: Core Infrastructure
- Logging, Monitoring and Observability in Google Cloud
- Desenvolvimento Ágil com Java Avançado (Agile Development with Advanced Java)
- Programando em TypeScript (Programming in TypeScript)
- SQL Fundamentals — Certificate of Completion
🌐 Languages & Publications
- Portuguese — Native
- Spanish — Full Professional
- English — Professional Working Proficiency
Publication
Management Cloud Computing
AI & ML Operations
Production ML infrastructure, modern AI tooling, multi-agent systems, and intelligent automation across cloud environments.
🤖 AI/ML Engineering Excellence
6 GPU profiles validated: RTX 4090 · RTX 5090 · RTX A6000 · RTX 6000 Ada · L40S · A100 SXM 80GB
Cold start engineering: 10 min → 3 min via pre-baked Docker images, model warm-up, and Tier 1 + Tier 2 boot optimizations
AI tooling: Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team; runs automated PR reviews in CI with Claude Code
MCP & multi-agent systems: Builds AI workflows using MCP servers, multi-agent orchestration, and vibe coding for infrastructure automation
MLOps expertise: Content moderation (YOLO), intelligent autoscaling (VPA/HPA/KEDA), security automation
🎯 AI/ML Infrastructure by the Numbers
| Category | Tasks | Technologies |
|---|---|---|
| LLM Production Infrastructure | 50+ | vLLM 0.15.1, Gemma 3 12B, RunPod, Cloud Run GPU, GKE MIG |
| Content Moderation | 20 | YOLO object detection, GPU-optimized services |
| Security Automation | 29 | OWASP ZAP, Trivy, SonarQube, npm audit |
| Intelligent Autoscaling | 15 | VPA, HPA, MPA, KEDA |
| Modern AI Tools | 5 | Claude AI, GitHub Copilot, Claude Code |
🧠 Modern AI Tools
Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team. Runs automated PR reviews in CI with Claude Code. Builds with MCP servers and multi-agent systems for infrastructure automation. Writes CLAUDE.md standards so AI-assisted development is consistent across all repos.
🎬 Content Moderation AI
Ran infra for 20 AI-powered content screening tasks: YOLO object detection on GPU-optimized services for video analysis. Production MLOps for child safety and age-assurance systems.
⚡ Intelligent Infrastructure
Set up VPA/HPA/MPA/KEDA so scaling decisions are based on actual metrics, not manually tuned thresholds. Also coordinated YOLO batch ReID processing alongside a live vLLM server on the same GPU; scheduling matters when GPU memory is finite.
🖥️ vLLM Production Infrastructure — Full Journey
Built and ran production LLM serving for Gemma 3 12B (GGUF Q4) via vLLM 0.15.1. Started on a single GCP VM and grew it into a multi-provider, multi-GPU setup with automatic failover and profile-driven GPU selection.
🔥 Cold Start Optimization
10 minutes → 3 minutes cold start reduction. Two-tier approach: Tier 1 pre-baked Docker images with model weights baked in (eliminates pip install + download on every start); Tier 2 CUDA graph pre-capture and sampler warm-up tuning. Validated on GKE MIG (L4) and RunPod Serverless.
🌐 Intelligent Fallback Router
Built a ~150-line FastAPI reverse proxy deployed on GKE PRD that receives OpenAI-compatible requests and routes through a provider chain (RunPod → Cloud Run → MIG). Returns 503 with per-provider diagnostics when all fail. Validated all 4 failure scenarios including all-providers-down.
🔄 Multi-DC EU Failover
Implemented multi-datacenter failover for RunPod Serverless: EU-RO-1 → EU-CZ-1 → EU-NL-1 with dedicated network volumes per datacenter. Added exponential backoff with SUPPLY_CONSTRAINT retry logic for GPU provisioning failures.
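The retry shape, sketched with hypothetical stand-ins for the RunPod provisioning call and its SUPPLY_CONSTRAINT error code:

```python
# Sketch: exponential backoff per datacenter, then fall through to the next.
# provision() and SupplyConstraintError are hypothetical stand-ins for the
# RunPod API call and its SUPPLY_CONSTRAINT failure mode.
import time

DATACENTERS = ["EU-RO-1", "EU-CZ-1", "EU-NL-1"]

class SupplyConstraintError(Exception):
    """Raised when the provider reports a GPU supply constraint."""

def provision_with_fallback(provision, retries: int = 4, base_delay: float = 5.0):
    for dc in DATACENTERS:
        delay = base_delay
        for _ in range(retries):
            try:
                return provision(datacenter=dc)
            except SupplyConstraintError:
                time.sleep(delay)   # back off before retrying the same DC
                delay *= 2          # exponential growth per attempt
    raise RuntimeError("no GPU capacity available in any EU datacenter")
```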
🚀 Profile-Driven GPU Selection
Refactored from hardcoded T4/RTX 4090 if-else to a centralized GPU_PROFILES associative array. Each profile encodes VRAM, max_model_len, dtype, kv-cache-dtype, fp8 support, and tensor-parallel settings. One flag selects the entire tuned config.
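The production version is a Bash associative array; the same structure sketched as a Python dict with illustrative, not production-tuned, values. One profile name expands to a full vLLM flag set:

```python
# Sketch of the profile map. VRAM, context, and dtype values are illustrative.
GPU_PROFILES = {
    "rtx4090": {"vram_gb": 24, "max_model_len": 8192, "dtype": "bfloat16",
                "kv_cache_dtype": "auto", "tensor_parallel": 1},
    "l40s": {"vram_gb": 48, "max_model_len": 16384, "dtype": "bfloat16",
             "kv_cache_dtype": "fp8", "tensor_parallel": 1},
    "a100_sxm_80gb": {"vram_gb": 80, "max_model_len": 32768, "dtype": "bfloat16",
                      "kv_cache_dtype": "fp8", "tensor_parallel": 1},
}

def vllm_args(profile: str) -> list[str]:
    p = GPU_PROFILES[profile]  # one flag selects the entire tuned config
    return [
        f"--max-model-len={p['max_model_len']}",
        f"--dtype={p['dtype']}",
        f"--kv-cache-dtype={p['kv_cache_dtype']}",
        f"--tensor-parallel-size={p['tensor_parallel']}",
    ]
```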
| GPU Provider | Hardware | Avg Latency | Architecture | Role |
|---|---|---|---|---|
| RunPod Serverless | RTX 5090 · RTX A6000 · RTX 6000 Ada · L40S | 568 ms | Ada Lovelace / Ampere | Primary — cost-optimized burst |
| Cloud Run GPU | NVIDIA L4 | 1.9 s (warm) · 41 s (cold) | Ada Lovelace | Secondary — managed serverless |
| GKE MIG (GCP) | L40S · A100 SXM 80GB | 7.7 s | Ada / Ampere | Dedicated — high-throughput |
FinOps Excellence
$331K annual cloud cost savings through compute optimization and storage migration.
💰 Total Savings: $331K/year
Compute Optimization: $183K/year (27% reduction)
3-Year Projected Impact: $993,288 (nearly $1M)
💾 Storage Policy Migration
$148K/year saved (95.9% reduction)
Before: 360 TiB Standard storage at $12,874/month
After: 21 TiB total at $533/month
Strategy: archive >5 days · auto-delete after 365 days · keep only last 5 days in Standard
🖥️ GPU Infrastructure
$77,262/year saved
Scaled down 6 GPU instances and moved AI apps from GPU MIG to CPU-based K8s where GPU wasn't actually needed. Rightsized based on real usage data with no performance hit.
⚙️ Compute Engine
$26,500/year saved
Optimized node counts: Dev (11→9), Stg (11→7), Prd (29→24), Sdx (18→15). 17–36% reduction per environment.
✈️ Airflow Optimization
$18,165/year saved (93% reduction)
Reduced from 100 pods to 14 pods while maintaining full functionality. CPU: 20→2.8, Memory: 200GB→28GB.
🤖 GCP AI Recommendations
$6,705/year saved
Implemented all feasible AI-powered cost recommendations across compute, storage, and networking.
📊 Kubecost Monitoring
Deployed Kubecost across all Kubernetes clusters for real-time cost visibility, allocation tracking, and continuous optimization opportunities.
📈 3-Year Financial Impact
Advanced Technical Implementations
Networking, security, deployment, and platform migration work at Staff and Principal level.
🔐 mTLS with Cloudflare
Set up Mutual TLS with Cloudflare so both client and server verify identity using certificates. Every service-to-service call is encrypted and authenticated; nothing gets through on network position alone.
🌉 AWS-GCP HA VPN
Built HA VPN tunnels between AWS and GCP with BGP failover. If a tunnel drops, routing adjusts automatically in under 30 seconds. All inter-cloud traffic stays off the public internet.
🚀 Canary Deployment
Wrote custom canary automation that rolls traffic progressively (10%→25%→50%→100%), checks health at each step, and rolls back automatically if error rate or latency crosses the threshold.
🔒 Database Security (61 DBs)
Moved all 61 databases (18 MongoDB Atlas + 44 PostgreSQL) off public IPs, enforced SSL/TLS on every connection, and switched from password-only auth to certificate-based. Part of the ISO 27001 push.
⚡ Redis 7.2 Security
Deployed Redis 7.2 on GCP with SSL certificate auth and encrypted connections. Traffic in transit is encrypted; passwords alone aren't enough to connect.
🛡️ Zero-Trust Architecture
mTLS, private networking, and certificate-based auth across the stack. Services verify each other rather than relying on being inside the network perimeter.
☁️ Cloud Functions Gen1 → Gen2 Migration
Migrated 60+ Cloud Functions across HTTP and Pub/Sub triggers from Gen1 to Gen2. Delivered in 4 phases: DEV HTTP → DEV Pub/Sub → STG HTTP → STG Pub/Sub, with a production pilot of 4 non-critical functions followed by a 6-sprint plan for 16 critical high-traffic functions (10k+ invocations each). Standardized Jenkins pipelines and runtime updates, including urgent replacement of the deprecated Go 1.16 runtime.
🖥️ Automated GPU VM Provisioning (GCP)
Built fully automated pipeline for provisioning GPU-enabled GCP Compute Engine VMs and deploying vLLM. Terraform creates the VM; 3 modular shell scripts handle CUDA install, model download, and vLLM startup. GitHub Actions orchestrates the full lifecycle: preflight checks skip re-provisioning if VM already exists; smoke tests validate /health and a live inference call post-deploy; SCP/SSH fixed for OpenSSH 9+ compatibility. HuggingFace token secured via GCP Secret Manager with no hardcoded credentials.
🔧 Terraform Modular GCP Network
Refactored GCP network Terraform from monolithic inline resources to reusable modules: VPC, subnets, Cloud NAT, and firewall rules extracted into modules/. Added prevent_destroy = true lifecycle blocks on all critical network resources to guard against accidental destruction. Validated Shared VPC and Cloud NAT egress across all service projects.
📦 vLLM RunPod Serverless — Multi-GPU + Multi-DC
Extended RunPod Serverless deployment from single-GPU to a profile-driven system supporting 6 GPU types (RTX 4090 · 5090 · A6000 · 6000 Ada · L40S · A100 SXM). Implemented multi-datacenter EU failover (EU-RO-1 → EU-CZ-1 → EU-NL-1) with per-DC network volumes. Added exponential backoff with SUPPLY_CONSTRAINT retry so GPU unavailability is handled gracefully instead of failing the deploy pipeline.
Major Achievements
$331K in quantifiable annual savings, alongside major wins in security, platform reliability, and team enablement.
💾 Storage Migration Champion
$148K/year saved (95.9% reduction). Moved 360 TiB from Standard storage to lifecycle-managed tiers. Files older than 5 days archive automatically; anything past 365 days is deleted. Cost dropped from $12,874/mo to $533/mo.
💰 FinOps Excellence
$331K total annual savings across compute and storage, documented line-item. Projects to nearly $1M over 3 years.
🔒 ISO 27001 Compliance
Drove infrastructure for ISO 27001 certification: secured 61 databases (private IPs, SSL/TLS, cert-based auth), added automated scanning (Trivy, Checkov), and implemented Pod Security Standards. Documented 863 assets across 17 sheets for the audit, achieving certification on the first attempt with zero non-conformities.
🌐 Kong Gateway at Scale
Replaced per-app nginx ingress with Kong Gateway across 160+ applications. Rate limiting, auth, CORS, and monitoring now configured in one place rather than scattered across service configs.
🚀 65+ CI/CD Pipelines
Built and maintain 65+ Jenkins pipelines with a 98%+ first-deploy success rate. Standardized builds via Shared Libraries for developer self-service, including Trivy scanning, Helm linting, and canary rollouts. Managed Jenkins on GKE with 8 ephemeral agents and autoscaling based on queue depth.
☸️ Kubernetes Multi-Cluster
160+ applications running across 4 GKE environments (dev/stg/prd/sdx), holding 99.9% uptime SLO.
📦 Infrastructure Inventory
Full infrastructure inventory across 17 tracking sheets, 863 assets documented with enough detail to actually be useful during incidents.
🤖 AI Early Adopter
Rolled out Claude AI (Dec 2024) and GitHub Copilot (Aug 2024) to the team, then runs automated PR reviews in CI with Claude Code in Bitbucket pipelines. Most of this happened before it was common practice at peer companies.
🌍 Multi-Region & Disaster Recovery
Designed and operated multi-region resilience across GCP and AWS: HA VPN with BGP failover (<30s), multi-DC EU failover for GPU workloads (RO → CZ → NL), cross-region Cloud SQL replicas, and GCS dual-region buckets with lifecycle policies. Recovery playbooks tested quarterly.
🛡️ ScoutSuite CSPM
Deployed ScoutSuite as the cloud security posture management tool across GCP and AWS. Automated weekly scans via Jenkins, reporting findings to Slack with severity-based triage. Reduced critical misconfigurations from 47 to 0 in the first quarter, covering IAM, networking, storage, and logging controls.
🧠 Local AI Inference Stack
Evaluated and benchmarked local LLM inference tools — Ollama, llama.cpp, and LM Studio — for developer productivity and air-gapped environments. Standardized on Ollama for team use with curated model profiles (Codestral, Gemma 3, DeepSeek Coder) and documented GPU memory requirements per model size.
Technical Skills & Expertise
55+ skills with proficiency levels, from expert to proficient across cloud, containers, security, automation, and programming languages.
☁️ Cloud Platforms
GCP: Compute Engine, GKE, Cloud Storage, VPC, VPN, IAM, Secret Manager
AWS: EC2, EKS, VPC, S3, CloudFormation, Site-to-Site VPN, IAM, CloudWatch
Azure: Web Apps, AKS, VNet, ExpressRoute, API Management, Storage
☸️ Containers & Orchestration
Kubernetes: GKE, EKS, multi-cluster, HPA/VPA/MPA, Network Policies
Docker: Multi-stage builds, image optimization, security scanning, registry
Helm: Chart development, templating, ArgoCD concepts
🚀 CI/CD & Automation
Jenkins: 65+ pipelines, shared libraries, multi-branch, Jenkinsfile, agent management
GitHub Actions: Workflow automation, CI/CD pipelines, security scanning
Infrastructure as Code: Terraform, CloudFormation, configuration management
📊 Observability & Monitoring
Prometheus: PromQL expert, custom metrics, recording rules, alert rules, federation
Grafana: Dashboard creation, variables, templating, alerting, data sources
Cloud-native monitoring (Cloud Monitoring, CloudWatch): Metrics, logs, dashboards, alarms, log analytics
OpsGenie: Alert routing, on-call management, incident response
🌐 Networking & Security
Cloud networking: VPC, VPN (HA), BGP, private networking, VPC peering, service mesh
Certificates & PKI: mTLS, certificate management, PKI, Let's Encrypt, cert rotation
API gateways: Kong, Nginx Ingress, rate limiting, auth, CORS
Security tooling: OWASP ZAP, Trivy, SonarQube, penetration testing
🗄️ Databases & Data Stores
MongoDB Atlas: 18 clusters managed, replication, sharding, performance tuning, security
PostgreSQL: 44 instances managed, performance tuning, replication, backup
Redis: 7.2 with SSL, cluster mode, sentinel, GCP Memorystore
General DBA: Database administration, query optimization
💻 Programming & Scripting
Bash/Shell: Automation scripts, system administration, pipeline scripting
Python: Automation, data processing, API development, DevOps tools
Go: Microservices, CLI tools, Kubernetes operators
YAML: Configuration management, K8s manifests, CI/CD configs
Java: Spring Boot, Maven, Web Services (previous dev roles)
TypeScript/JavaScript: Next.js, Node.js, side projects in production
Kotlin: Android development, MVVM, Coroutines, Room ORM
Swift: iOS development, SwiftData, native frameworks (zero dependencies)
💰 FinOps & Cost Management
FinOps: $331K savings documented, rightsizing, lifecycle policies, spot instances
Kubecost: Multi-cluster deployment, cost allocation, chargeback, recommendations
Cost management: Cost analysis, budgeting, forecasting, showback/chargeback
🎯 Additional Technical Proficiencies
📦 Container Registry
- GCP Artifact Registry
- Docker Hub
- Image lifecycle management
- Registry cleanup automation
🔄 Workflow Automation
- Apache Airflow
- Cron jobs
- Event-driven architectures
- Pub/Sub messaging
📈 BI & Analytics
- Metabase
- Data visualization
- Infrastructure metrics
- Cost dashboards
🌐 DNS & Domain
- Cloudflare (46 domains)
- 817 DNS records managed
- SSL certificate automation
- DMARC, SPF, DKIM
🔐 Security & Compliance
- ISO 27001 compliance
- Secret management (Vault, GCP)
- IAM & RBAC
- Audit logging
⚙️ Operating Systems
- Linux (Ubuntu, Debian, Fedora)
- Container OS (optimized)
- System administration
- Kernel tuning
Personal Projects
Side projects built outside of work. Full production systems with real users, real infrastructure, and real constraints.
🎮 Pokémon GO Friend Code — Subscription SaaS
2025–2026 · Full-Stack · Next.js · GCP · Multi-payment · 💻 GitHub
What it does
A subscription service for Pokémon GO players to have their trainer codes automatically submitted to community listing sites daily. Users pick a plan, pay, and their code gets submitted every day without any manual action.
How it's built
- Full-stack Next.js 16 App Router with Server Components, Server Actions, and Docker multi-stage build deployed to Cloud Run (scale to zero)
- Dual payment gateway: Mercado Pago for PT/BRL subscribers, Stripe for EN+ES/USD; same codebase, locale-driven routing via middleware
- i18n with `/pt`, `/en`, `/es` subpaths; Accept-Language middleware auto-redirects on first visit
- Webhook state machine with idempotent processing and a `submission_logs` table tracking every payment event
- Zero static credentials: GCP Secret Manager at runtime, Workload Identity Federation for GitHub → GCP auth in CI
- Admin panel (server-side HMAC-SHA256 sessions) for subscription management and webhook inspection
- Email confirmations via Resend with React Email templates · Database migrations via Drizzle ORM
Technical highlights
- Cloud Run southamerica-east1: auto-scales, HTTPS via custom domain (pokemongofriendscode.com)
- Workload Identity Federation: no long-lived service account keys in CI; GitHub Actions exchanges OIDC tokens with GCP
- Tested with Jest (unit/integration) and Playwright (E2E)
🤖 Pokémon GO GCP — Automated Code Submission Engine
2025–2026 · Backend Automation · Cloud Run Jobs · Web Scraping · 💻 GitHub
What it does
The automation backend for the subscription service. A scheduled Cloud Run Job that reads active trainer codes from Cloud SQL and submits them to two community sites daily at 15:00 BRT, fully headless with no manual intervention.
How it's built
- Two independent scrapers: Puppeteer + stealth plugin for pokemongofriendcodes.com, Playwright for pogocodes.com (React-based form)
- React controlled component workaround on pogocodes.com; dispatches input events with `bubbles: true` to trigger synthetic event handlers
- Verification step after each submission: navigates to the listings page to confirm the code actually appears
- Batch processing (3 codes simultaneously) with configurable delays per action type to avoid rate limiting
- Database-driven: codes sourced from the Cloud SQL `trainer_codes` table (single source of truth, synced from payment webhooks)
- Complete Terraform IaC for Cloud SQL, Cloud Run Jobs, Cloud Scheduler, Secret Manager, and IAM
Technical highlights
- Cost: ~$0/month, within the GCP free tier for scheduled job execution
- Rate limit detection: pokemongofriendcodes.com blocks re-submissions within 24h; script detects and stops gracefully
- Cloud Monitoring alerts on exit code ≠ 0; failed job execution pages the owner
🎁 AmigoSecreto — Android App (Google Play)
2024–2026 · Android · Kotlin · MVVM · Google Play · 💻 GitHub
What it does
A fully-featured Android app for organizing Secret Santa draws. Supports multiple groups, exclusion rules, wish lists, secure reveal, and sharing results via WhatsApp, Telegram, SMS, or Email. Available on Google Play.
How it's built
- Fully migrated to Kotlin (from Java) with MVVM architecture: AndroidViewModel + LiveData + Repository pattern
- Hilt for dependency injection · Coroutines for async database operations on IO thread
- Room ORM with dual-layer database (Room + legacy DAOs coexist; eager Room initialization ensures migrations complete first)
- Backtracking draw algorithm extracted to a pure function (`SorteioEngine.kt`): testable, no side effects, handles impossible constraint scenarios gracefully (see the sketch after this list)
- Batch queries with INNER JOIN to eliminate the N+1 problem on wish list counts · All DB writes in atomic transactions
- Edge-to-Edge layout (Android 15), PDF export (PDFKit), QR code generation, local notifications, backup/restore as JSON
- GitHub Actions CI/CD: push to master → internal Play track; tag `v3.x` → production track
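The draw engine itself is Kotlin (`SorteioEngine.kt`); the core backtracking idea, sketched in Python with a toy example:

```python
# Sketch of the draw: assign every giver a receiver, honor exclusion rules,
# and return None when constraints are impossible instead of looping forever.
# The shipped engine is Kotlin; this is the idea, not the production code.
def draw(participants, exclusions):
    """exclusions: set of (giver, receiver) pairs that are not allowed."""
    result = {}

    def backtrack(i, remaining):
        if i == len(participants):
            return True                    # every giver has a receiver
        giver = participants[i]
        for receiver in list(remaining):
            if receiver != giver and (giver, receiver) not in exclusions:
                result[giver] = receiver
                remaining.remove(receiver)
                if backtrack(i + 1, remaining):
                    return True
                remaining.add(receiver)    # undo and try the next candidate
                del result[giver]
        return False                       # dead end: caller backtracks

    return result if backtrack(0, set(participants)) else None

# draw(["ana", "bia", "caio"], {("ana", "bia")})
# -> e.g. {"ana": "caio", "bia": "ana", "caio": "bia"}; impossible rules -> None
```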
Technical highlights
- 297 unit tests + Espresso integration tests · `UnconfinedTestDispatcher` for deterministic coroutine testing
- Min SDK 21 (Android 5.0+) · Target SDK 35 (Android 15) · R8 minification in release builds
- Spoiler protection on WhatsApp sharing: 30 blank lines before the assignment to prevent previews from revealing the result
🍎 Secret Santa — iOS App (SwiftUI)
2025–2026 · iOS 17+ · Swift · SwiftUI · SwiftData · 💻 GitHub
What it does
The iOS counterpart to AmigoSecreto, same product concept rebuilt natively in Swift and SwiftUI for iPhone. Multiple groups, exclusion rules, wish lists, secure reveal, QR codes, PDF export, and local notifications. Zero external dependencies.
How it's built
- SwiftUI + SwiftData only, no third-party packages; persistence, UI, PDF, QR codes, and notifications all via native Apple frameworks
- MVVM via SwiftData's `@Query` and `@Environment(\.modelContext)`, with no separate ViewModel classes needed
- JSON serialization workaround for many-to-many relationships (SwiftData limitation): exclusions and draw pairs stored as `Data` fields with computed property decoders
- Backtracking draw algorithm with a 200-attempt limit; same logic as Android, adapted for Swift
- `SequentialShareView`: workaround for an iOS limitation that prevents opening multiple URL schemes simultaneously; shows sequential buttons for WhatsApp, SMS, Email
- 6-page onboarding, draw history, statistics, backup/restore with JSON forward compatibility
Technical highlights
- 30+ unit tests using the Swift Testing framework (`@Suite`, `@Test`)
- iOS 17+ minimum: uses `ContentUnavailableView`, `@Bindable`, and other modern APIs throughout
- PDF generation via `UIGraphicsPDFRenderer` · QR codes via Core Image `CIQRCodeGenerator` · haptic feedback on reveal