Job Information

Ellie Mae Sr Engineer, Site Reliability Engineering 210276 in Provo, Utah

Provo, UT / Technology & Operations – Cloud Operations / Full time

Ellie Mae is the leading cloud-based platform provider for the mortgage finance industry. Ellie Mae’s technology solutions enable lenders to originate more loans, reduce origination costs, and reduce the time to close, all while ensuring the highest levels of compliance, quality, and efficiency. Visit EllieMae.com to learn more.

We are looking for a Sr. Site Reliability Engineer (SRE) to join our growing AIQ Cloud Operations team. This is a hands-on reliability engineering role to scale AIQ Cloud Operations for our growing customer base and to deliver high performance and availability. We have an unwavering zeal to make our customers successful. The ideal candidate takes pride in restoring service during production incidents and preventing production issues, is a true believer in automation, practices observability, and has internalized the concepts of SLI, SLO, MTTR, and error budgets. The candidate will leverage prior experience and learnings to solve complex problems associated with running production environments at massive scale in multi-tenant environments.

The role directly influences production Quality of Service and metrics such as availability, scalability, and performance. This is a fantastic opportunity to collaborate closely with our Tech Support, Software Engineering, Architecture, Infrastructure, DBA, SRE, DevOps, and Cloud Platform teams at ICE Mortgage Technology, and to partner with our Site Reliability Engineers (SREs), who are responsible for ensuring Ellie Mae services are highly available, reliable, secure, and scalable.

Responsibilities & Objectives

  • Employ deep troubleshooting and scripting skills to improve the availability, performance, and security of Ellie Mae Services.

  • Take a central role as Incident Manager during critical incidents, focusing on minimizing MTTR and MTTD

  • Implement proactive monitoring, alerting, trend analysis, and self-healing systems; internalize the three pillars of observability (logs, metrics, and traces)

  • Participate in on-call rotations, driving restoration and repair of service-impacting issues

  • Conduct Root Cause Analysis and drive Problem Records through to closure to prevent recurrence, including, but not limited to, resolving product/service defects and making design, infrastructure, or operational changes

  • Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems

  • Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams

  • Author system support documents and update production application service run books where needed

  • Contribute to product development / engineering as needed to ensure Quality of Service of Highly Available services

  • Identify problems, detect toil, and proactively move toward automation

  • Ensure services are designed with 24/7 availability and operational readiness and rigor

  • Apply hands-on technical knowledge to function as a technology leader

  • Handle operations escalations and troubleshoot production issues

Requirements

  • 5+ years of Systems Engineering in 24x7 Production Reliability environments

  • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience

  • Experience with Datacenter Technologies including Public Cloud (AWS/Azure)

  • Familiarity with containerization technologies such as Docker, and with PaaS services on AWS

  • Excellent scripting skills and experience using infrastructure automation tools such as Terraform

  • Experience with elastic scalability, fault tolerance, and other cloud architecture patterns

  • Excellent troubleshooting skills, using a systematic problem-solving approach spanning code, systems, and network theory and protocols (TCP/IP, UDP, ICMP), including the ability to read a packet capture (e.g., tcpdump output)

  • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems, as well as Windows Server and/or Linux systems internals (system libraries, file systems, client-server protocols)

  • Experience with databases, VMware, Docker, microservices, and forensic analysis is a big plus

  • Demonstrated strength in SaaS and experience in massive-scale web operations

  • Strong interpersonal skills and a demonstrated ability to effectively work with other SRE and Engineering resources in offshore locations

  • Ability to travel to other office locations, both domestic and international, as needed for meetings and planning sessions (after pandemic restrictions end)

Ellie Mae is an equal opportunity and affirmative action employer. Women, minorities, people with disabilities, and veterans are encouraged to apply.

We do not accept resumes from headhunters, placement agencies, or other suppliers that have not signed a formal agreement with us.
