About AION
AION is building the next generation of AI cloud platforms, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.
By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises. The platform's innovative Proof of Compute Contribution (PoCC) protocol rewards contributors based on performance, creating a transparent and efficient ecosystem.
Integrated with Tether (USD₮ & USD₮0) for stability and regulatory clarity, AION eliminates volatility, ensuring predictable costs and seamless transactions. With cutting-edge partnerships and a USD-backed economy, AION is pioneering the commoditization of high-performance compute, empowering global innovation and bridging the AI wealth gap.
Led by high-pedigree founders with previous exits, AION is well-funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India.
Who you are
You are a reliability-focused engineer with deep expertise in cloud-native systems and infrastructure automation. You thrive on building robust monitoring solutions and creating self-healing infrastructure. You understand the challenges of maintaining high availability across distributed systems and have experience implementing SRE best practices. You're passionate about creating production-ready environments that can scale efficiently and recover automatically from failures.
Technical Skills & Experience
- 3-8 years of experience in Site Reliability Engineering or DevOps (exceptional candidates with different experience profiles will be considered)
- A Tier 1 college education or previous work experience at FAANG/top startups is preferred but not required
- Cloud Platforms: Deep expertise with AWS, GCP, or Azure infrastructure services
- Kubernetes: Advanced knowledge of Kubernetes operations, cluster management, and troubleshooting
- Infrastructure as Code: Strong experience with Terraform, Pulumi, or similar IaC tools
- Observability: Expertise implementing comprehensive monitoring using Prometheus, Grafana, and the ELK stack
- Service Mesh: Experience with Istio, Linkerd, or similar service mesh technologies
- Networking: Understanding of network architectures, DNS, load balancing, and security groups
- CI/CD: Knowledge of automated deployment pipelines and GitOps workflows
- Scripting: Proficiency in Bash, Python, or Go for automation scripts
- Container Technologies: Deep understanding of Docker, containerd, and OCI specifications
- Security: Knowledge of infrastructure security best practices and compliance requirements
- Incident Management: Experience with incident response, post-mortems, and developing SOP documentation
Key Responsibilities
- Design and implement comprehensive monitoring and alerting systems across all AION platforms.
- Develop automation for infrastructure provisioning, scaling, and recovery using Terraform and Kubernetes.
- Create and maintain runbooks and playbooks for handling common operational scenarios and incidents.
- Implement service mesh solutions for observability, traffic management, and security.
- Design and implement logging systems that provide visibility into complex distributed systems.
- Own capacity planning and resource optimization across cloud environments.
- Implement CI/CD pipelines for reliable and consistent deployments across all environments.
- Design and build self-healing systems that automatically recover from common failure modes.
- Develop infrastructure for both the compute platform and data annotation services with consistent reliability practices.
- Design and implement disaster recovery strategies and testing procedures.
- Create and maintain production, staging, and development environments with appropriate isolation.
- Collaborate with security teams to implement infrastructure security best practices and compliance requirements.
Location
Individuals in this role are expected to relocate to Bangalore, though exceptions can be made. We offer a hybrid working setup with 3 days in the office. Employees have the flexibility to work from anywhere for a few months each year.
Why Join Us
- Be part of a mission-driven team at the intersection of web3 and AI, tackling some of the most exciting challenges in the industry.
- Join the ground floor of an AI startup, with the opportunity to make a significant impact on the company and the industry.
- Collaborate with top-tier talent from the tech industry.
- Competitive salary and benefits package.
- Flexible work environment with opportunities for professional growth and development.
If you are a skilled and motivated Site Reliability Engineer with a passion for building reliable, scalable infrastructure for cutting-edge compute systems, we would love to hear from you.