AI Infra Engineer Job at Artech, Morrisville, NC

  • Artech
  • Morrisville, NC

Job Description

Title: AI Infra Engineer
Duration: 10+ Months
Location: Morrisville, NC, 27560

Short Description:
This role combines IT operations, hardware troubleshooting, and AI infrastructure expertise. Expect to handle day-to-day system administration, diagnose and resolve issues, and ensure optimal performance for ML workloads.

Key Responsibilities
  • Hardware Management and Troubleshooting: Monitor and maintain GPU servers/workstations, including diagnosing and resolving hardware failures (e.g., GPU faults, power issues, cooling problems). Coordinate repairs, replacements, or upgrades as needed to ensure system uptime.
  • Software and Driver Management: Install, update, and configure NVIDIA GPU drivers, the CUDA toolkit, Linux operating systems (e.g., Ubuntu or CentOS), and related dependencies. Ensure compatibility across hardware and software stacks for seamless ML operations.
  • Performance Benchmarking: Run and analyze MLPerf benchmarks to evaluate system performance, identify bottlenecks, and optimize configurations for ML training tasks.
  • System Diagnostics and Problem Resolution: Proactively monitor systems for issues, perform root-cause analysis on failures or performance degradation, and implement fixes. This includes debugging kernel errors, network issues, or resource contention during LLM training.
  • General Infrastructure Ops: Implement best practices for security, backups, logging, and monitoring. Handle routine maintenance, such as firmware updates, patch management, and capacity planning for the GPU cluster.

Required Qualifications
  • Proven experience (3+ years) in managing GPU-accelerated servers or high-performance computing (HPC) environments, preferably in AI/ML contexts.
  • Strong knowledge of Linux system administration, including shell scripting, package management, and networking.
  • Hands-on experience with NVIDIA CUDA toolkit, drivers, and GPU hardware (e.g., A100, H100, or similar).
  • Familiarity with ML benchmarking tools like MLPerf and frameworks such as TensorFlow, PyTorch, or Hugging Face for LLM training.
  • Ability to diagnose hardware and software issues using tools like nvidia-smi, dmesg, top/htop, or Prometheus/Grafana for monitoring.
  • Understanding of AI infrastructure ops, including containerization (Docker/Kubernetes) and orchestration for distributed training.
  • Excellent problem-solving skills with a proactive approach to preventing downtime.

Preferred Skills
  • Experience with cluster management tools like Slurm, Kubernetes, or Ray for scaling ML workloads.
  • Knowledge of hardware diagnostics for servers (e.g., IPMI, BIOS configuration, RAID setups).
  • Background in IT operations with AI focus, such as DevOps for ML (MLOps).
  • Certifications like RHCE (Red Hat Certified Engineer), NVIDIA certifications, or similar.
  • Ability to work independently in a remote or on-site setup, with strong communication skills for reporting issues.

Job Tags

Remote work
