
Job Information
Willis Towers Watson Site Reliability Engineer Architect - Monitoring and Alerting in South Jordan, Utah
Our engineering team has built the largest private Medicare marketplace in the country. We passionately focus on the continuous improvement of the systems we build and the culture we promote. We build a platform that provides the best possible support to our customers who are shopping for insurance, and where our insurance carriers can be confident that their products are accurately and impartially represented.
The Role
We are looking for an Architect to join our Monitoring and Alerting team. This position is responsible for leading the design of platforms used by two dozen engineering teams to instrument monitoring and alerting for application and infrastructure resources. This team is embedded within our Platform Engineering Group.
We operate in a complex, multi-tenant, hybrid cloud and on-premises infrastructure that spans both Windows and Linux operating systems. We strive for security, reliability, and automation in line with DevOps and Site Reliability Engineering principles. If you are passionate about learning and improvement through metrics and automation, and about engendering that mindset in others, we want to hear from you.
Responsibilities
Communication
Explore new ways of improving communication among Site Reliability Engineers and with other teams
Promote inclusion and collaboration between various functional disciplines
Write and maintain architectural, stakeholder, and policy documentation
Innovation
Encourage and inspire others to innovate
Make improvements to internal processes to reduce lead time and increase deployment frequency
Identify improvements to the quality, security, and performance of our infrastructure
Increase the velocity with which teams deliver, leveraging expertise from various functional disciplines
Identify how to remediate production incidents more quickly and safely while reducing the frequency of outages
Participate in department Communities of Practice, actively engaging with other teams and departments to collaborate on best practices and implementation strategy
Adhere to and advocate for best practices, including Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies
Use intuition, experience and understanding to create SLIs, SLOs, and SLAs
Contribute to capacity planning, and advise teams that are load or stress testing
Keep up with industry innovations, recommending new tools or practices when appropriate
Team Culture
Guide the culture and attitude of the team in an optimistic, proactive, and encouraging direction
Foster an environment where it is safe to fail and to learn from failure
Actively mentor peers, developing their expertise and inspiring others to innovate
Promote inclusion and collaboration between various functional disciplines
Initiative
Take an active role in estimating, planning, and managing your own tasks while aligning your efforts with your team
Provide timely assistance and remediation solutions during critical situations and production incidents
Document and share "lessons learned" from production, including blameless postmortems and root cause analysis
The Requirements
10+ years of hands-on experience with a majority of the following technologies, along with a willingness to become proficient in the remaining areas:
Windows and Linux Servers
Infrastructure monitoring with tools like Zabbix, Sensu, and SolarWinds
Application Performance Monitoring with tools like New Relic and Datadog
Log aggregation tools like Sumo Logic, Splunk, and ELK
Cloud platforms, preferably with Azure
VMware
Secrets management with Consul and Vault or similar systems
Configuration management and Infrastructure as Code tools like Salt, Ansible, and Terraform
Active Directory
Firewalls and load balancers such as F5
Web servers, including IIS and NGINX
Database Server Infrastructure like Microsoft SQL Server and PostgreSQL
Continuous Integration and Continuous Delivery with tools like TeamCity, Octopus Deploy, Concourse, or Azure DevOps
Network theory, protocols such as DNS and DHCP, and components such as proxy servers and firewalls
Security operations with tools for SAST, DAST, RAST, and WAF
Proficiency and a high level of comfort with:
One or more programming languages, such as C#, JavaScript, Python or Go
One or more scripting languages, such as PowerShell or Bash
Command-line tools such as git, netcat, npm, and terraform
Bachelor's degree in computer science, information systems, or related field strongly preferred; high school diploma required
EOE, including disability/vets (https://cdn-static.findly.com/wp-content/uploads/sites/954/2019/09/Equal-Employment-Opportunities-EEO-Policy.pdf)