Site Reliability Architect
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Architect in the United States.
In this role, you will lead the technical strategy for enterprise-scale SaaS environments, ensuring reliability, resiliency, and high availability across complex hybrid cloud and on-premises systems. You will drive architecture decisions that support fault-tolerant, self-healing systems and mentor SRE and software engineering teams to build a proactive reliability culture. This position balances hands-on technical work with strategic oversight, including defining SLIs, SLOs, and error budgets, optimizing cloud infrastructure, and implementing automation-first solutions. You will operate in a fast-paced, high-volume environment where your expertise directly impacts system performance, stability, and user experience. The role encourages innovation, embraces DevSecOps best practices, and provides opportunities to shape the future of SRE practices at scale. This is a fully remote position for candidates within EST or CST time zones.
Accountabilities:
- Architect highly reliable, fault-tolerant, and self-healing systems with a resiliency-by-design approach.
- Lead technical strategy for hybrid cloud and on-premises SaaS environments, including multi-region and multi-cloud workloads.
- Mentor and guide SRE and software engineering teams in monitoring, observability, incident management, and production readiness practices.
- Define SLIs, SLOs, and error budgets to balance system stability with feature velocity.
- Drive automation through Infrastructure as Code using Terraform, CDK, or similar tools to create repeatable, audit-ready environments.
- Promote proactive reliability practices such as chaos engineering, feature flagging, architectural decision records, and production readiness reviews.
- Optimize cloud spend while maintaining high system performance and compliance with regulatory and security frameworks (e.g., GDPR, HIPAA, HITRUST) in a DevSecOps environment.
- Collaborate across teams to implement secure, high-performance networking and compute architectures, including Kubernetes management, load balancing, and Zero Trust security models.
- Evaluate and responsibly adopt AI tools to enhance operational productivity and system innovation.
Requirements:
- Bachelor’s or Master’s degree in Computer Science, Information Systems, or a related field, with relevant professional experience.
- 10+ years in SRE or DevOps roles, with at least 4 years in enterprise SaaS environments.
- 4+ years of software development experience contributing to cloud-hosted SaaS products.
- Proven experience managing multi-cloud or multi-region distributed workloads.
- Deep understanding of cloud networking, including DNS, TCP/IP, load balancing, and Zero Trust security models.
- Strong programming skills in Go, Python, Java, C#, or similar languages for internal tooling and automation.
- Expert-level knowledge of Kubernetes architecture, multi-cluster management, and stateful workloads.
- Experience operating in DevSecOps environments with compliance guardrails.
- Ability to define, monitor, and optimize reliability metrics and incident management processes.
- Excellent communication skills to mentor teams and influence technical decision-making.
Benefits:
- Competitive base salary range of $170,000–$185,000 per year, excluding variable compensation.
- Fully remote role for candidates in EST or CST time zones.
- Comprehensive health, dental, and vision coverage.
- Paid time off and company-paid holidays.
- 401(k) retirement plan with company match.
- Opportunities to work in a high-impact, innovative technology environment.
- Programs supporting professional development and continuous learning.