Platform Engineer (Machine Learning)
About Us
Orion Labs is a small, fast-growing software development and DevOps consultancy based in Cape Town, South Africa. We build scalable software platforms, cloud infrastructure, machine learning systems, and internal developer tooling for startups and enterprise clients around the world. We are a collaborative, hands-on team that works with fast-evolving technologies. You’ll join a supportive team with real ownership, plenty of room to learn, and clear opportunities to grow your career while doing meaningful work for great clients.
Job Description
We are seeking a strong Platform Engineer at mid-to-senior level to help design and operate cloud platforms, machine learning infrastructure, evaluation systems, and developer environments. The primary focus of this role is building a machine learning competition and evaluation platform where participants submit ML projects that are executed and scored in secure, GPU-enabled runtime environments. This role sits at the intersection of cloud infrastructure, developer platforms, automation, ML/GPU compute, and distributed systems.
You will work on projects such as:
- ML competition and evaluation platforms
- GPU-based compute environments
- Dockerized and sandboxed runtime environments
- Secure multi-tenant cloud infrastructure
- Model evaluation and benchmarking systems
- Infrastructure automation and provisioning
- Internal developer tooling and orchestration systems
- CI/CD and deployment automation
There may also be opportunities to contribute to internal R&D around AI inference and LLM infrastructure.
Responsibilities:
- Design and maintain cloud infrastructure primarily on AWS.
- Build reusable platform tooling and automation systems used by other engineers.
- Provision and manage infrastructure using Terraform/OpenTofu.
- Configure and maintain Docker-based environments and containerized workloads.
- Develop Python automation and orchestration scripts.
- Build and operate GPU-enabled compute environments for machine learning workloads, including sandboxed runtime environments for user-submitted code.
- Build and improve CI/CD workflows for efficient software delivery.
- Monitor, troubleshoot, and optimize infrastructure reliability and performance.
- Improve developer workflows, internal tooling, and platform usability.
- Contribute to architecture and operational decisions across projects.
- Ensure security best practices are followed throughout the platform lifecycle.
- Stay up to date with cloud, ML infrastructure, and platform engineering trends.
Qualifications:
- Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field (or equivalent practical experience).
- 5–7 years of hands-on experience operating production cloud infrastructure (mid to senior level).
- Strong experience with Linux systems administration.
- Strong experience with Docker and containerized environments.
- Strong experience with AWS or similar cloud platforms.
- Strong experience with Terraform or other infrastructure-as-code tooling.
- Proficiency in Python scripting and automation.
- Proficiency with Git and GitHub workflows.
- Hands-on experience with GPU-enabled workloads and CUDA environments.
- Familiarity with machine learning infrastructure and common ML frameworks (PyTorch, TensorFlow, etc.) from a platform/integration perspective.
- Solid understanding of networking and cloud security fundamentals.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration abilities.
- Ability to operate independently and balance speed with operational stability.
Nice to Have:
- Experience building competition, evaluation, or benchmarking platforms.
- Experience with sandboxing and secure multi-tenant execution of user-submitted code.
- Experience with Kubernetes.
- Experience with CI/CD systems such as GitHub Actions.
- Experience building internal platforms or developer tooling.
- Familiarity with model serving frameworks (vLLM, Ollama, etc.) or LLM inference systems.
Technical Stack:
- Cloud: AWS (primary), with exposure to other cloud platforms
- Infrastructure as Code: Terraform / OpenTofu
- Containers: Docker, container runtimes, Kubernetes (nice to have)
- Automation & Scripting: Python, Bash
- CI/CD: GitHub Actions and similar systems
- ML & GPU Infrastructure: GPU workloads, CUDA, sandboxed runtime environments, ML frameworks (PyTorch, TensorFlow)
- Operating Systems: Linux
- Tools: Git, GitHub, Slack, VS Code
Benefits
We offer competitive compensation, flexible hours, a learning and certification budget, and a supportive, low-ego team that values work–life balance and growth. This role is based in Cape Town. Full-time employment is preferred, and contract-to-permanent arrangements are also considered.
Inclusive Hiring
We hire for skill, potential, and values. In South Africa’s diverse context, we welcome applicants from all backgrounds and aim to provide a fair, inclusive process. If you have the skills and drive, we’d love to hear from you.