SRE - AI Infrastructure

Primary skills: Containers, Kubernetes, DevOps, Python, Golang, TDD, Linux
Years of experience: 5 – 7+ Years

The Service Operations team at “Product Platform” Systems is responsible for building and operating the platform and infrastructure that enables us to deliver our groundbreaking capabilities to enterprise customers.
As a site reliability engineer on this team, you will lead key system engineering and automation functions, enhancing our capabilities to provide a reliable and scalable service for customers, in a hybrid deployment pattern.

How you will make an impact:

Assume broad responsibility for the successful delivery of our “Product Platform” services in a hybrid model, including but not limited to deployment, configuration, integrations, and ongoing operations.
Administer systems and applications for multiple customer-facing production environments (hosted and on-premises), with a continued focus on improving efficiency, availability, and supportability.
Take ownership of ongoing updates, upgrades, and patches on customer environments.
Lead efforts to triage, debug and fix issues related to network, storage, scheduling, applications, and systems, for proactive and reactive incident resolution and root cause analysis.
Augment ongoing efforts to design and develop automation for deployments, updates and upgrades of the entire “Product Platform” software stack.
Build the systems and tools for centralized command and control of distributed environments.
Partner and collaborate with product and engineering teams to improve the security posture and operational readiness of our systems with the flexibility to integrate into unique customer environments.
Participate in on-call rotation responsibilities.

Basic Qualifications:

Bachelor's and/or Master's degree in CS/EE or a related field.
5+ years of hands-on experience as an SRE with focus on systems and infrastructure for cloud/SaaS production requirements.
Extensive experience building, configuring, securing, and administering Linux systems in large-scale production environments.
Strong scripting/programming skills (Python preferred) and experience with automated deployment systems such as Ansible and Terraform.
Systematic problem-solving approach to troubleshooting, and the desire to solve the root cause of common problems in 24×7 environments.

Preferred Qualifications:

Deep understanding of DNS, DHCP, LDAP, NFS, Kerberos, PAM, PXE, SNMP, SSH, HTTP/S, and NTP, and experience troubleshooting network performance issues.
Knowledge of software development processes and methods, CI/CD pipelines and experience with common version control software.
Knowledge of virtualization, multiple hypervisor technologies, and Kubernetes cluster administration and management.
Experience deploying applications and managing infrastructure in public clouds (AWS, Azure, GCP).
Experience with monitoring and logging systems and the ability to identify new technologies as appropriate.
Configuration and maintenance of web servers, load balancers, databases, storage systems and messaging systems.
A passion to design for high availability and scale, with the discipline and desire for extensive automation.
Strong communication skills, with the ability and willingness to work with diverse teams and customers across multiple time zones.
