(Senior) SRE & Infra Engineer Job at Prime Intellect, San Francisco, CA

  • Prime Intellect
  • San Francisco, CA

Job Description

Building the Future of Decentralized AI Development

At Prime Intellect, we're building the foundation for decentralized AI development at scale. Our platform combines powerful distributed training infrastructure with an intuitive developer experience, enabling researchers and engineers to train state-of-the-art models collaboratively.

We recently raised $15 million in funding ($20 million raised in total), led by Founders Fund, with participation from Menlo Ventures and prominent angels including Andrej Karpathy (Eureka Labs, Tesla, OpenAI), Tri Dao (Chief Scientific Officer of Together AI), Dylan Patel (SemiAnalysis), Clem Delangue (Hugging Face), Emad Mostaque (Stability AI), and many others.

Role Impact

This hybrid role spans platform reliability and infrastructure engineering. You'll be instrumental in:

  • Infrastructure Reliability: Ensuring high availability, fault tolerance, and performance across the GPU clusters used for internal research and by external customers.

  • Cluster Onboarding & Support: Automating GPU cluster onboarding, handling support requests, and troubleshooting operational challenges.

  • Observability, Security & Feature Development: Enhancing monitoring, logging, and security systems, and developing new backend features to boost platform functionality.

Core Technical Responsibilities

Operational Excellence & Support

  • Cluster Onboarding: Develop and automate procedures to integrate internal research clusters and external customer deployments.

  • Incident Management: Lead efforts in incident detection, response, and postmortem analysis to drive continuous improvement.

  • Support Engineering: Address platform support requests by diagnosing and resolving reliability issues promptly.

Infrastructure Automation & Reliability

  • Monitoring & Observability: Design and implement comprehensive observability solutions using tools like Prometheus and Grafana, ensuring proactive detection of issues.

  • Automation & Orchestration: Use tools such as Ansible, Terraform, and Kubernetes to automate and streamline infrastructure management.

Backend & Feature Development

  • New Feature Engineering: Collaborate with the engineering team to design and implement backend features.

  • API and Service Development: Enhance our platform’s REST APIs and backend services to support new capabilities and improve overall performance.

  • System Integration: Ensure seamless integration of new features into our existing infrastructure, maintaining high reliability and security standards.

Technical Requirements

Reliability & SRE Skills

  • Incident & Monitoring Expertise: Proven experience with monitoring tools (e.g., Prometheus, Grafana) and incident management practices.

  • Automation Proficiency: Strong skills in infrastructure automation with Ansible, Terraform, or similar.

  • Observability & Logging: Deep understanding of logging frameworks, alerting systems, and proactive monitoring solutions.

Development & Infrastructure Skills

  • Backend Engineering: Proficiency in Python for developing automation scripts, REST APIs, and backend support tools.

  • Container & Cloud Technologies: Hands-on experience with Kubernetes and cloud platforms (GCP preferred).

Nice to Have

  • Familiarity with GPU computing and AI/ML training infrastructure.

  • Experience contributing to open-source infrastructure projects.

  • Knowledge of high-performance networking and real-time systems.

What We Offer

  • Competitive compensation with significant equity and token incentives

  • Flexible work arrangement (remote or San Francisco office)

  • Full visa sponsorship and relocation support

  • Professional development budget for courses and conferences

  • Regular team off-sites and conference attendance

  • Opportunity to shape the future of decentralized AI development

Growth Opportunity

You'll join a team of experienced engineers and researchers working on cutting-edge problems in AI infrastructure. We believe in open development and encourage team members to contribute to the broader AI community through research and open-source contributions.

We value potential over perfection. If you're passionate about democratizing AI development and have experience in platform or infrastructure development (ideally both), we want to talk to you.

Ready to help shape the future of AI? Apply now and join us in our mission to make powerful AI models accessible to everyone.

Job Tags

Remote job, Work at office, Visa sponsorship, Relocation package, Flexible hours
