My client, a global hedge fund, is actively seeking a hands-on, highly skilled, and motivated SRE to join their team. As an SRE, you will play a critical role in driving the adoption of Site Reliability Engineering practices within their organization. The ideal candidate will have a strong technical background and a passion for driving operational efficiency and continuous improvement.
The role:
- Drive the adoption of SRE principles, methodologies, and best practices across the organization.
- Collaborate closely with application development teams to ensure the successful deployment and operation of applications, including early-stage support during development.
- Establish and monitor key metrics, performance indicators, and service level objectives (SLOs) to ensure the reliability and availability of critical systems.
- Identify opportunities to eliminate toil through automation, code improvements, and process optimizations.
- Conduct root cause analyses for system failures and incidents, and implement engineering solutions to prevent future occurrences.
- Lead incident management and resolution efforts, ensuring a timely and effective response, and drive post-incident reviews and process improvements.
- Work closely with cross-functional teams, including infrastructure, networking, and security, to optimize system performance, scalability, and security.
- Collaborate with stakeholders to define and refine service-level agreements (SLAs) and operational requirements.
- Stay abreast of industry trends and emerging technologies in Site Reliability Engineering, and leverage them to drive innovation and enhance operational efficiency.
What you offer:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Extensive experience in Site Reliability Engineering or a related field, with a strong understanding of SRE principles, practices, and tools.
- Strong technical background with expertise in areas such as distributed systems, cloud computing, network architecture, and software development.
- Strong experience with Python and with automation and configuration management tools.
- Solid understanding of monitoring and observability frameworks, incident management, and post-incident analysis.
- Excellent problem-solving and troubleshooting skills, with the ability to analyze complex systems and identify areas for improvement.
- Strong leadership skills, with the ability to inspire and motivate a team, and foster a culture of collaboration, innovation, and continuous improvement.
- Excellent communication and interpersonal skills, with the ability to effectively communicate technical concepts to both technical and non-technical stakeholders.
- Experience in the financial industry or a hedge fund environment is highly preferred.
The sell:
- Extremely competitive compensation and medical benefits
- Brand name that will open doors for your career in the future
- Many opportunities for internal mobility and long-term career growth