Yale School of Management

Site Reliability Engineer

Detailed Job Listing:

Job Family: Managerial and Professional

STARS Requisition: 64394BR 
University Job Title: Software Engineer, Site Reliability SOM 
Department Job Title: Site Reliability Engineer

Grade: P5

Position Focus: 

Reporting to the Yale School of Management (Yale SOM) Manager of Infrastructure Services, the Site Reliability Engineer is responsible for the design, implementation, and management of infrastructure-related systems and services. These technologies encompass High Performance Computing, Virtualization, and Cloud Solutions. Drawing on a deep technical understanding of modern scheduling, performance optimization, parallel file systems, and networking technologies, the engineer scripts and builds infrastructure services through code. Working closely with faculty and staff, the engineer helps define and implement solutions to complex research problems and deploys services that are repeatable and rapid to deliver.

Essential Duties: 

  1. Work as a member of the operations team to maintain and enhance existing compute and infrastructure services.
  2. Identify and implement performance and stability improvements based on data and usage patterns.
  3. Establish strategy and processes consistent with the goals of the school, in its academic and research programs.
  4. Engage with customers to develop a keen understanding of their goals, requirements, and technical needs – and help to define and deliver high-value solutions that meet these needs.
  5. Build automation using software and system expertise. This automation will cover the full solution stack – from creating and managing infrastructure to scaling data processing for massive amounts of research data.
  6. Explore cutting-edge, cloud-native solutions to provide services to the institution cost-effectively.
  7. Generate and analyze weekly, monthly, quarterly, and annual operational reports to ensure achievement of metric-driven goals and operational efficiency.
  8. Manage relationships with all constituencies within the school, in other university departments, outside vendors, and consultants.
  9. Provide consultation and make recommendations for allocation of resources, product selection, and priorities to Yale SOM IT senior staff as well as the Yale SOM community.
  10. Ensure the highest level of service delivery for current and evolving technologies in a complex, multi-vendor, multi-platform environment.
  11. Develop relationships with Technology Partners and help them innovate, paving the way for the next generation of technology solutions.
  12. Assist in troubleshooting user-related issues, including job submission, package management, permissions, and service management.
  13. Other projects and duties as assigned.

Required Education & Experience: 

Bachelor’s degree and four years of relevant experience in service delivery or an equivalent combination of education and experience.

Required Skills & Abilities: 

  1. Experience with Red Hat Satellite, CloudForms, and Ansible Tower, or similar services.
  2. Working knowledge of configuration management and infrastructure as code (IaC) – Ansible, Terraform, etc.
  3. Experience with DevOps tools and practices: Git, continuous integration, and continuous deployment.
  4. Experience with at least one job scheduler such as LSF, SLURM, xCAT, OpenShift, Kubernetes, Docker Swarm or similar.
  5. Experience with large storage systems and parallel file systems – GPFS, Lustre, BeeGFS.
  6. Excellent interpersonal skills and superior customer service orientation to manage the needs of faculty, staff, and students.
  7. Self-driven, with a positive can-do attitude.
  8. Team player with the ability to work collegially with peers, colleagues, and school leadership.
  9. Proven excellent oral and written communication and presentation skills.
  10. Ability to manage operations in a fast-paced and changing environment.
  11. Demonstrated project management skills on large, complex projects.
  12. Demonstrated ability to effectively lead service delivery teams.
  13. Ability to work under pressure and manage projects and priorities in a highly complex and dynamic environment.
  14. Ability to represent the school well in working collegially with peers and colleagues within and outside the university.


Preferred Education, Experience and Skills:

Bachelor’s degree in a scientific or computational field plus two years of academic or scientific industry experience. Experience with system administration of Linux servers – RHEL. Experience with at least one Cloud Platform – Azure or AWS. Global orientation; experience working across countries and regions, and fluency in more than one language.


  • Customer Service Focus – Listen carefully to and understand customers’ needs, and respond proactively to those needs in a consistent and timely manner.
  • Teamwork/Communication – Work cooperatively to achieve common goals. Support cooperation, collaboration, and the sharing of information.
  • Product Excellence – Provide the best quality product available and continuously upgrade standards to maintain quality.
  • Leadership – Provide direction and motivation to others through communication, modeling appropriate behavior, optimism, and high achievement.
  • Innovation – Be open to new ideas and their implementation. React and adapt appropriately to changing situations.
  • Strategic Thinking – Recognize opportunities, identify critical, high pay-off activities and prioritize them to attain goals.
