We are looking for a seasoned Site Reliability Engineer (SRE) Architect with 8-10 years of experience to join our team. The ideal candidate will play a critical role in designing, implementing, and managing scalable, resilient, and secure systems, driving observability, incident management, and automation practices.
Key Responsibilities:
Observability and Monitoring:
Architect and implement observability solutions using tools like Grafana, Prometheus, and the ELK Stack (Elasticsearch, Logstash, Kibana).
Define and set up actionable alerts, dashboards, and reporting systems.
Drive proactive monitoring strategies to detect and resolve issues before they impact customers.
DevOps & Infrastructure Automation:
Lead infrastructure as code (IaC) implementation using Helm and Terraform.
Manage containerized environments with Docker and Kubernetes in Azure or hybrid cloud setups.
Design, build, and optimize CI/CD pipelines with tools like Git, Jenkins, Argo CD, and Flux CD.
Incident and Problem Management:
Define and establish SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs (Service Level Agreements).
Lead the design and implementation of Incident Management and Problem Management processes.
Use ITSM tools such as BMC Remedy to manage incidents effectively.
On-Call Management:
Set up robust on-call processes using tools like Opsgenie, PagerDuty, or equivalent.
Coordinate and manage escalations during major incidents.
Team Leadership and Collaboration:
Collaborate effectively with development teams, product managers, and other stakeholders.
Mentor and manage teams, fostering a culture of ownership and reliability engineering.
Collaborate with cross-functional and cross-regional teams to ensure alignment of operational goals with business objectives.
Resilience and Scalability:
Architect resilient and scalable systems with a focus on automation, performance optimization, and security.
Conduct regular chaos engineering experiments and postmortem analyses to improve system reliability.
Key Qualifications:
Experience:
8-10 years of hands-on experience in SRE, DevOps, or related roles.
Proven track record of designing and managing highly scalable, available, and secure systems.
Technical Skills:
Expertise in Grafana, Prometheus, and the ELK Stack (Elasticsearch, Logstash, Kibana) for observability.
Hands-on experience with Helm, Terraform, Azure, Docker, and Kubernetes.
Proficient in CI/CD processes and tools such as Git, Jenkins, Argo CD, and Flux CD.
Process & Tools:
Strong knowledge of ITSM and on-call tools such as BMC Remedy, Opsgenie, or PagerDuty.
Experience defining SLIs, SLOs, and SLAs, and managing incident and problem resolution processes.
Soft Skills:
Strong leadership and team management abilities.
Excellent problem-solving, communication, and collaboration skills.
Culture of caring. At GlobalLogic, we prioritize a culture of caring. Across every region and department, at every level, we consistently put people first. From day one, you’ll experience an inclusive culture of acceptance and belonging, where you’ll have the chance to build meaningful connections with collaborative teammates, supportive managers, and compassionate leaders.
Learning and development. We are committed to your continuous learning and development. You’ll learn and grow daily in an environment with many opportunities to try new things, sharpen your skills, and advance your career at GlobalLogic. With our Career Navigator tool as just one example, GlobalLogic offers a rich array of programs, training curricula, and hands-on opportunities to grow personally and professionally.
Interesting & meaningful work. GlobalLogic is known for engineering impact for and with clients around the world. As part of our team, you’ll have the chance to work on projects that matter. Each is a unique opportunity to engage your curiosity and creative problem-solving skills as you help clients reimagine what’s possible and bring new solutions to market. In the process, you’ll have the privilege of working on some of the most cutting-edge and impactful solutions shaping the world today.
Balance and flexibility. We believe in the importance of balance and flexibility. With many functional career areas, roles, and work arrangements, you can explore ways of achieving the perfect balance between your work and life. Your life extends beyond the office, and we always do our best to help you integrate and balance the best of work and life, having fun along the way!
High-trust organization. We are a high-trust organization where integrity is key. By joining GlobalLogic, you’re placing your trust in a safe, reliable, and ethical global company. Integrity and trust are a cornerstone of our value proposition to our employees and clients. You will find truthfulness, candor, and integrity in everything we do.
GlobalLogic, a Hitachi Group Company, is a trusted digital engineering partner to the world’s largest and most forward-thinking companies. Since 2000, we’ve been at the forefront of the digital revolution – helping create some of the most innovative and widely used digital products and experiences. Today we continue to collaborate with clients in transforming businesses and redefining industries through intelligent products, platforms, and services.