At Guidewire, we make software that offers Property and Casualty (P&C) insurance companies the tools to take care of their customers when they need it the most, whether that’s a time of crisis, a natural disaster, an accident, or exposure to cyber risks. We build the core applications that insurance companies use to sell and underwrite policies, settle claims, and bill their customers. We also have a portfolio of innovative products serving the needs of P&C insurance companies in areas such as data management, digital online portals, and predictive analytics. We run these products on the Guidewire Cloud Platform, and we help hundreds of insurance providers all over the world to handle billions of dollars of business.
We are proud to be voted a Top Cloud Employer on Glassdoor by our own employees and positioned as a market leader by industry experts like Gartner. We have a fun work environment and a culture that lives by our core values of integrity, rationality, and collegiality.
About the Job
Site Reliability Engineering (SRE) brings together software and systems engineering to design and operate large-scale, highly distributed, and fault-tolerant systems. As a Site Reliability Engineer (SRE) at Guidewire, you will join a team focused on automating tasks to enhance system efficiency and reliability. This role involves ensuring the stable operation of Guidewire's cloud platform (GWCP) and InsuranceSuite products, collaborating with developers to meet various requirements. Your work will support numerous customers and transactions daily.
To learn more about GWCP and its tenancy model, see: https://medium.com/guidewire-engineering-blog/guidewire-cloud-why-hybrid-tenancy-is-the-right-choice-part-2-of-2-ba22c9888bb8
What You’ll Do
- Collaborate with engineering teams to provide feedback and contribute code where needed, enhancing product functionality and resilience.
- Participate in on-call rotations to ensure 24x7 availability of services.
- Design and develop tools to support 24x7 follow-the-sun operations for critical production systems.
- Automate deployment tasks for core products and infrastructure, maintaining a robust automation framework.
- Monitor and optimize the performance of applications on the Guidewire Cloud Platform, ensuring reliability and efficiency.
- Develop and maintain observability tools, metrics, and dashboards, including self-healing mechanisms for increased reliability.
- Foster a culture of reliability by promoting blameless postmortems, SLO tracking, and continuous learning from incidents.
- Proactively identify and address infrastructure issues to minimize business impact.
- Develop system documentation and training materials to empower and educate team members.
Who You Are
- Skilled in programming with Python or Go for building internal tools, CLIs, and APIs; familiarity with Java and Spring Boot is a plus.
- Exceptional troubleshooting skills, with a proactive, critical approach to solving complex issues.
- Proficient in containerization technologies, with hands-on expertise in Docker, Helm, Kubernetes (EKS), CNI, and Ingress networking.
- Strong knowledge of Kubernetes concepts (Pods, Deployments, Services, StatefulSets, Ingress, etc.) and the Operator pattern.
- Experienced with Terraform, including developing and testing complex modules.
- Advanced experience with AWS, including custom tool development using the AWS SDK.
- Solid understanding of Single Sign-On (SSO), SAML, and OAuth protocols; experience with Okta is a bonus.
- Skilled in using observability tools such as Prometheus, OpenTelemetry, or Datadog for proactive monitoring.
- Background in supporting production systems at scale in a heavily microservice-based environment.
- Familiar with agile methodologies, including Scrum and Kanban, to enhance software development processes.
- Excellent communication skills, with the ability to explain complex technical concepts to diverse audiences.
Other Requirements
- Bachelor’s Degree in Computer Science or a related field.
- Ability to read, write, and speak English.
- We provide 24x7 support to our customers, so we expect you to take turns with your teammates being on-call for weekend production emergencies or to provide rotating weekend operational support.
- Travel – Expect occasional travel (less than 5%) to other Guidewire offices for training and team meetings.
Bonus Points
- Kubernetes or AWS certifications
- Contributions to open source projects
- Familiarity with KubeVela (OAM) or Crossplane for Kubernetes-native infrastructure management
- Experience in managing large-scale Aurora PostgreSQL clusters and Aurora Serverless
- Experience with TeamCity CI or GitHub Actions