Join a cross-functional SRE community embedded in a product-oriented Data Platform team of 40–50 engineers, analysts, and data scientists across France and Spain. You'll drive the reliability and scalability of a next-generation Lakehouse platform, anchored on Trino, Iceberg, and on-prem object storage, while leading the transition from public cloud to a resilient hybrid/on-prem architecture.
- Own reliability of core data services: Trino, Iceberg, S3 / Ceph, Kafka, Kafka Connect, Schema Registry
- Define and enforce SLIs/SLOs, error budgets, and on-call runbooks — solid SRE foundations are non-negotiable
- Build full-stack observability with Prometheus and Grafana: metrics, dashboards, alerting pipelines, and anomaly detection
- Manage and harden PostgreSQL clusters via Patroni for high-availability control-plane services
- Operate and scale Kafka Connect clusters: connector lifecycle, offset management, dead-letter queues, and task rebalancing
- Maintain the Schema Registry as the single source of truth for Avro/Protobuf/JSON schemas — enforce compatibility rules and schema evolution policies
- Monitor consumer lag, connector throughput, and broker health via Prometheus JMX exporters and Grafana dashboards
- Ensure end-to-end data contract integrity between producers and Iceberg/S3 consumers
- Operate production Kubernetes clusters (GKE/EKS + on-prem) — capacity planning, upgrades, PodDisruptionBudgets, resource quotas
- Architect and manage Kube-in-Kube topologies to provide strong tenant isolation for data platform workloads — each team gets a dedicated virtual cluster without the overhead of a full physical cluster
- Automate infrastructure and resource provisioning with Crossplane: define composite resources (XRDs) so data teams can self-serve Kafka topics, Trino namespaces, and S3 buckets through Kubernetes-native APIs
- Maintain GitOps pipelines for platform deployment and configuration drift detection
- Migrate from a public cloud data warehouse to the VeepeeCloud Iceberg-based lakehouse, managing coexistence, schema evolution, and time travel
- Architect resilient ingestion, transformation, and serving layers around Trino + S3
- Optimize Trino query performance: memory limits, spilling, cost-based optimizer tuning
- Build agentic self-service tooling so data teams can provision Trino/Iceberg resources and Kafka Connect pipelines autonomously via Crossplane — reducing toil and ops bottlenecks
- Develop FinOps dashboards (compute, storage, query cost) with Grafana and Prometheus-based cost exporters
- Write clear technical documentation, runbooks, and internal ADRs
- Design and implement multi-datacenter strategies across FR1 / NL1 — active-active and active-passive topologies
- Leverage Fast Erasure Coding on object storage (Ceph/S3) to maximize durability with minimal replication overhead
- Ensure data replication consistency across sites for Iceberg table metadata, Trino catalogs, and Schema Registry subjects
- Lead DRP exercises: failover playbooks, RTO/RPO validation, postmortems
- Strong experience with Kubernetes in production environments
- Experience with Kube-in-Kube technologies (vCluster or similar)
- Solid understanding of SRE principles (SLIs/SLOs, error budgets)
- Experience with Prometheus and Grafana
- Experience with Infrastructure as Code (Terraform or similar)
- Experience with Crossplane
- Familiarity with GitOps workflows
- Experience with S3 and object storage technologies
- Experience with PostgreSQL and Patroni
- Experience with Kafka, Kafka Connect, and Schema Registry
- Fluent in English
- Experience with multi-datacenter architectures (FR1/NL1)
- Experience designing disaster recovery plans and failover playbooks
- Experience with Fast Erasure Coding (Ceph/S3)
- Experience with Trino, Iceberg, and Lakehouse technologies
- Experience with Airflow
- Experience building agentic self-service platforms
- Knowledge of FinOps and cost optimization practices
- Programming experience in Python, Java, or Go
- Variable bonus
- E-learning platform (self-education courses)
- Meetups & conferences (local and international)
- Flexible office — up to 2 days remote
- International teams (France & Spain)
RECRUITMENT PROCESS
1️⃣ 30-minute HR Screen with a Veepeeᵀᵉᶜʰ Recruiter
2️⃣ General Technical exchange
3️⃣ Technical exchange with the manager
4️⃣ Team Interview
We are convinced that it is up to you to define how you work, how you develop, and how you progress.
At Veepee we guarantee that you can just be yourself!
In support of diversity and inclusion, Veepee is committed to reviewing all applications received on an equal basis.
COMPANY
For more information about our ecosystem: https://careers.veepee.com/en/home-page-en/
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses and identifying potential inconsistencies or verification signals in application materials based on available information. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.