
Job Information
Ellie Mae Sr Engineer, Site Reliability Engineering 210275 in Provo, Utah
Sr Engineer, Site Reliability Engineering 210275
Provo, UT
Technology & Operations – Cloud Operations
Full time
Ellie Mae is the leading cloud-based platform provider for the mortgage finance industry. Ellie Mae’s technology solutions enable lenders to originate more loans, reduce origination costs, and reduce the time to close, all while ensuring the highest levels of compliance, quality and efficiency. Visit EllieMae.com to learn more.
We are looking for a Sr. Site Reliability Engineer to join our growing AIQ Cloud Operations team. This is a hands-on reliability engineering role to scale AIQ Cloud Operations for our growing customer base and to deliver high performance and availability. We have an unwavering zeal to make our customers successful. The ideal candidate takes pride in resolving production incidents and preventing production issues, is a true believer in automation and a practitioner of observability, and has internalized the concepts of SLI, SLO, MTTR, and error budgets. The candidate will leverage prior experience and learnings to solve complex problems associated with running production environments at massive scale in multi-tenant environments.
The role has direct influence on improving production Quality of Service and metrics for availability, scalability, performance, etc. This is a fantastic opportunity to collaborate closely with our Tech Support, Software Engineering, Architecture, Infrastructure, DBA, SRE, DevOps, and Cloud Platform teams at ICE Mortgage Technology, and to partner with our Site Reliability Engineers (SREs), who are responsible for ensuring Ellie Mae services are highly available, reliable, secure, and scalable.
Responsibilities & Objectives
Employ deep troubleshooting and scripting skills to improve the availability, performance, and security of Ellie Mae Services.
Take a central role as Incident Manager during critical incidents, focusing on minimizing MTTR and MTTD
Implement proactive monitoring, alerting, trend analysis, and self-healing systems; internalize the three pillars of observability (logs, metrics, and traces)
Participate in on-call rotations, driving restoration and repair of service-impacting issues
Conduct Root Cause Analysis and drive Problem Records through to closure to prevent recurrence, including, but not limited to, resolution of product/service defects, design changes, infrastructure changes, or operational changes
Define non-functional requirements as part of the product lifecycle to influence new designs, standards, and methods for scalable, highly available distributed systems
Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams
Author system support documents and update production application service runbooks where needed
Contribute to product development / engineering as needed to ensure Quality of Service for highly available services
Identify problems, detect toil, and proactively move toward automation
Ensure services are designed for 24/7 availability, operational readiness, and rigor
Apply hands-on technical knowledge to function as a technology leader
Handle operations escalations and troubleshoot production issues
Requirements
5+ years of Systems Engineering in 24x7 Production Reliability environments
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
Experience with Datacenter Technologies including Public Cloud (AWS/Azure)
Familiarity with containerization technologies such as Docker, and with PaaS services on AWS.
Excellent scripting skills and experience using infrastructure automation tools such as Terraform
Experience with elastic scalability, fault tolerance, and other cloud architecture patterns
Excellent troubleshooter, utilizing a systematic problem-solving approach spanning code, systems, and network theory and protocols (TCP/IP, UDP, ICMP), including the ability to read a packet capture (e.g., tcpdump output).
Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems, plus knowledge of Windows Server and/or Linux system internals (system libraries, file systems, client-server protocols)
Experience with databases, VMware, Docker, microservices, or forensic analysis is a big plus
Demonstrated strength in SaaS and experience in massive-scale web operations
Strong interpersonal skills and a demonstrated ability to effectively work with other SRE and Engineering resources in offshore locations
Ability to travel to other office locations, both domestic and international, as needed, for meetings and planning sessions (once pandemic restrictions are lifted)
Ellie Mae is an equal opportunity and affirmative action employer. Women, minorities, people with disabilities, and veterans are encouraged to apply.
We do not accept resumes from headhunters, placement agencies, or other suppliers that have not signed a formal agreement with us.