We are looking for a Distributed Systems / GPU Infrastructure Engineer to help architect and scale the core infrastructure behind the CapaCloud decentralized GPU network.
You will work on GPU orchestration, node infrastructure, distributed computing systems, workload scheduling, performance optimization, and platform reliability.
This is a high-impact engineering role for someone passionate about building the next generation of decentralized AI infrastructure.
Key Responsibilities
- Design and build scalable distributed GPU infrastructure
- Develop systems for node orchestration and workload scheduling
- Optimize GPU utilization and compute performance
- Build fault-tolerant infrastructure for decentralized environments
- Improve network reliability, scalability, and uptime
- Develop deployment automation and infrastructure tooling
- Work with AI and blockchain teams to integrate compute systems
- Monitor infrastructure performance and troubleshoot bottlenecks
- Contribute to backend architecture and cloud-native systems
- Implement secure infrastructure best practices
Required Skills & Experience
- Strong experience with distributed systems and backend infrastructure
- Experience with Kubernetes, Docker, and container orchestration
- Strong Linux systems administration knowledge
- Experience with GPU infrastructure and CUDA environments
- Proficiency in Go, Rust, Python, or similar backend languages
- Experience with cloud infrastructure platforms
- Understanding of networking, virtualization, and load balancing
- Experience building scalable APIs and infrastructure services
- Familiarity with monitoring tools and observability stacks
- Strong debugging and performance optimization skills
Nice To Have
- Experience in decentralized infrastructure or Web3
- Experience with AI/ML infrastructure
- Bare-metal infrastructure experience
- Experience with distributed storage systems
- Knowledge of peer-to-peer networking systems
- Open-source contributions
What Success Looks Like
- Reliable decentralized GPU orchestration system
- High-performance compute scheduling infrastructure
- Reduced latency and improved GPU efficiency
- Stable infrastructure scaling across multiple regions
- Strong uptime and system reliability metrics
Employment Type