Job Description
Are you ready to architect the future of Artificial Intelligence?
Nexus Systems is shaping the AI technology landscape heading into 2026. We are looking for a visionary Senior AI Infrastructure Architect to lead our high-performance computing initiatives. In this role, you will be at the forefront of deploying next-generation Large Language Models (LLMs) and advancing how machines learn and reason.
If you are passionate about optimizing neural networks, managing massive GPU clusters, and solving complex scalability challenges, we want to hear from you. Join a team that is not just building software, but defining the digital reality of tomorrow.
Why Join Us?
- Work with cutting-edge AI hardware (H100, H200 GPUs).
- Competitive equity package and benefits.
- Flexible remote-first culture with offices in Austin, TX.
Core Responsibilities:
- Design, deploy, and manage large-scale GPU clusters optimized for training and inference of LLMs.
- Implement advanced MLOps pipelines to automate model training, validation, and deployment cycles.
- Optimize system latency and throughput for real-time AI applications.
- Collaborate with data scientists and researchers to translate theoretical models into efficient production code.
- Ensure high availability, security, and compliance of AI infrastructure across cloud and on-premise environments.
- Drive architectural decisions regarding hardware selection, software stack, and networking protocols.
Qualifications:
- 5+ years of experience in systems engineering, DevOps, or Machine Learning Operations (MLOps).
- Deep expertise in Python, C++, and CUDA programming.
- Strong proficiency with Kubernetes, Docker, and cloud platforms (AWS, GCP, or Azure).
- Experience managing clusters of NVIDIA GPUs and high-bandwidth networking (InfiniBand/RoCE).
- Experience with PyTorch, TensorFlow, or JAX frameworks.
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
Skills: Python, Kubernetes, MLOps, CUDA, PyTorch, AWS, GCP, GPU Clustering, System Architecture, Machine Learning, Docker, High-Performance Computing.
Category: Information Technology