Site Reliability Engineer | NVIDIA
NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
Site Reliability Engineer Job Description
Site Reliability Engineering is an engineering discipline that uses a combination of software and systems engineering practices to design, build, and maintain large-scale production systems with high efficiency and availability. This is a highly specialised field that requires knowledge of various systems, networking, coding, databases, capacity management, continuous delivery, and deployment. The Analyst within SRE is responsible for measuring and analysing the entire system, including SRE itself.
The NVIDIA GeForce Now (GFN) team is looking for a Site Reliability Engineer / Analyst (SRE/A). SRE at NVIDIA ensures that our internal and external facing GPU cloud gaming services provide the reliability and uptime promised to users while also allowing developers to make changes to the existing system through careful preparation and planning while keeping capacity, latency, and performance in mind.
SRE is a mindset as well as a set of engineering approaches to running and optimising production systems. Much of our software development is focused on automating manual tasks, improving performance, and increasing the efficiency of production systems. Because SREs are responsible for the big picture of how our systems interact with one another, we use a diverse set of tools and approaches to address a wide range of issues. Limiting time spent on reactive operational work, conducting blameless postmortems, and proactively identifying potential outages all contribute to iterative improvement, which is critical to both product quality and interesting, dynamic day-to-day work.
This person will be in charge of service response, as well as collecting and cleaning data, creating and managing data pipelines, performing statistical analyses, and producing reports and dashboards with the goal of improving system performance and availability. We collaborate with Service Owners to improve service reliability. The GFN Service is a promising new service in the rapidly expanding game streaming industry. This is a once-in-a-lifetime opportunity to work with NVIDIA GPUs on multiple operating systems, cloud platforms, and proprietary hardware.
Responsibilities
- Work with several database systems, including NoSQL systems.
- Help service owners define and monitor health metrics.
- Capacity management including short- and long-term forecasting
- Design and implement mitigations and controls for availability risk management.
- Rapidly debug and triage incidents and user-reported issues.
- Make valuable contributions to the overall health, performance, and reliability of GFN and Infrastructure Services.
- Create data pipelines and dashboards to improve observability of the system and components.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Create statistical models to identify, classify, and measure dynamic systems.
- Perform failure analysis
- Practice balanced incident response and blameless postmortems.
- Be part of an on-call rotation to support production systems.
Qualifications
- Graduate degree in a STEM field or equivalent practical experience.
- 2 or more years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with experience performing analytics to support the SRE mission.
- 2 or more years in a data analytics or data science role.
- Proven experience with the data science life cycle.
- Strong background in probability and statistics.
- Experience with a statistical language such as R.
- SRE mindset, e.g. error budgeting, SLOs, SLAs, and SRE KPIs.
- Clear understanding of SRE observability, and experience building new tools and automation using languages such as Python.
- Clear understanding of incident management, change management, and problem management processes. Ability to detect all service-impacting issues and to carry out accurate triage, partner communication, impact containment, service restoration, and post-incident follow-up.
- Proven strengths in problem-solving and root causing issues, while continuously seeking ways to drive optimization, efficiency and the bottom line.
- Experience with monitoring and/or automation platforms crafting alerts and building service
- Excellent communication, presentation, social, and analytical skills; the ability to communicate complex interaction concepts clearly and persuasively across different audiences and varying levels of the organization.
- Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.
- Background with optimizing Spark jobs.
- Experience using Failure Modes & Effects Analysis and/or Fault Tree Analysis
- Experience creating dashboards with Shiny or equivalent tooling.
- Able to write highly complex SQL queries.
- Background with Kibana, Grafana, Elasticsearch, and Watchers is a plus.
- Experience with StackStorm and GitLab CI/CD is an additional advantage.