Magic is hiring a

Distributed Compute Engineer

Job Overview

  • Posted 6 days ago
  • Full Time
  • San Francisco, CA, USA
  • 90000

Roles & Responsibilities

About the role: As a distributed systems engineer for compute, you will build the stack and systems that enable 1T+ parameter model training and efficient inference on Magic’s GPU clusters.

What you might work on:

  1. Develop and maintain the software stack to support large-scale, highly available AI training and inference infrastructure
  2. Implement and optimize systems for data processing and inference using technologies like Ray, Redis, message queues (Kafka), distributed communication libraries (gRPC, ZeroMQ), and HPC technologies
  3. Orchestrate fine-grained data movement using Rust, C++, and NCCL or UCX
  4. Design and manage high-performance storage and caching solutions to support data-intensive applications
  5. Build with an eye towards fault tolerance, performance, and observability
  6. Hack on the internals of deep learning frameworks (PyTorch, JAX) in a distributed setting
  7. Troubleshoot and resolve complex issues across GPU resources, networking, OS, drivers, and cloud environments, and automate fault detection and recovery processes

Skills Required

  • Machine Learning
  • Python

AISolvesThat © 2024 All rights reserved