Company Description
Ubisoft is a global leader in gaming, with teams across the world creating original and memorable gaming experiences, from Assassin’s Creed and Rainbow Six to Just Dance and more. We believe diverse perspectives help both players and teams thrive. If you’re passionate about innovation and pushing entertainment boundaries, join our journey and help us create the unknown!
Job Description
At Ubisoft, we are reshaping how teams work with information through AI. We build and operate applications for document-rich, interactive, and intelligent use cases, including hybrid search, question answering, document generation, conversational assistants, agentic workflows, and coding assistants.
You will join the team responsible for the reliability, scalability, and operational excellence of the AI platform powering these experiences. This includes the AI Gateway, MCP servers, LLM serving infrastructure, retrieval services, and the surrounding platform components that help teams bring AI applications to production safely.
This role sits at the intersection of fast-moving AI engineering and disciplined site reliability engineering. You will help the platform evolve quickly while ensuring it remains observable, resilient, cost-efficient, and dependable at scale.
Responsibilities
- Operate and improve the reliability, scalability, and performance of the AI platform, including the AI Gateway, MCP servers, LLM serving, retrieval services, and related platform components.
- Define and maintain SLOs, SLIs, and error budgets to guide reliability decisions and balance delivery speed with platform stability.
- Build strong observability across infrastructure, services, and AI-specific signals, such as latency, token throughput, model/provider errors, cost, saturation, and quality indicators.
- Improve the performance, efficiency, and cost profile of model and API serving across cloud, containerized, and high-throughput environments.
- Design graceful degradation, fallback, and failover strategies, including model, provider, and region-level resilience patterns.
- Maintain safe, repeatable deployment practices using GitLab CI/CD, infrastructure as code, automated testing, and progressive delivery where appropriate.
- Lead or contribute to incident response, post-incident reviews, capacity planning, and reliability improvements.
- Partner closely with software engineers, data scientists, ML engineers, security teams, and product stakeholders to productionize AI services responsibly.
- Help establish operational standards, runbooks, dashboards, and best practices for AI application reliability.
Qualifications
Must-Have
- Proven experience in SRE, platform engineering, backend engineering, or infrastructure roles in compute-intensive, data-intensive, or distributed environments.
- Strong understanding of reliability engineering practices, including SLOs, error budgets, observability, capacity planning, incident management, and post-incident analysis.
- Strong programming skills in Python and practical experience with Rust, or a willingness to work deeply with Rust-based services.
- Hands-on experience with cloud platforms, Docker, Kubernetes, and production-grade service operations.
- Experience building and maintaining CI/CD pipelines, preferably with GitLab, and working with infrastructure as code.
- Solid understanding of distributed systems, microservices, APIs, networking fundamentals, and production debugging.
- Ability to move quickly in a fast-changing AI ecosystem while applying the rigor, discipline, and risk awareness expected from an SRE.
- Clear communication skills and the ability to collaborate across engineering, data, ML, security, and product teams.
Nice-to-Have
- Experience operating AI infrastructure, such as AI gateways, MCP servers, LLM serving stacks, model routers, or provider abstraction layers.
- Familiarity with RAG and agentic architectures, including embeddings, vector databases, hybrid search, query processing, tool use, and agent orchestration.
- Experience with GPU-backed inference, high-throughput serving, batching, caching, autoscaling, or model performance optimization.
- Experience with serverless, event-driven systems, queues, streaming platforms, or asynchronous processing.
- Familiarity with AI safety, governance, auditability, data privacy, or secure-by-design platform practices.
- Open-source contributions, technical writing, talks, or publications in relevant engineering, reliability, or AI infrastructure fields.
Additional Information
You will work on projects that shape the future of AI at Ubisoft. You will stay close to emerging AI infrastructure and reliability practices, influence how we operate AI platforms at scale, and help teams deliver useful, safe, and dependable AI-powered applications.
Ubisoft's perks
Profit sharing and a yearly company savings plan. 25 days of paid time off plus 12 additional paid days off. The company pays 50% of your transportation pass and provides lunch vouchers (9€/day), healthcare for you and your family, and many additional Ubisoft perks.
Maternity leave of 20 weeks; paternity/co-parental leave of 7 weeks.
Our office is located in Saint Mandé (Metro line 1, Saint Mandé station). A gym is available in the building.
Information about Ubisoft
Ubisoft offers the same job opportunities to all, without distinction of gender, ethnicity, religion, sexual orientation, social status, disability, or age. Ubisoft is committed to developing an inclusive work environment that mirrors the diversity of our gaming community.