SRE-Datadog

We have an exciting opportunity and mission: to help a pre-public technology company in the medical field build its core Site Reliability team and mature its culture around resiliency.

Required Experience:

+ Years
Job Locations:

Remote

Location Restrictions:

Remote

Basic Qualifications and

We are looking to draft top talent who want the chance to dig into continuity-impacting problems alongside industry SMEs, in a greenfield environment with opportunities to improve stability.

This role suits someone who is passionate about fixing what's broken and also about improving the standards for human processes, response, and communication between technology teams and the business areas they support.

Project description:

·        We need someone to come in and own and manage our infrastructure monitoring tool, Datadog.

·        We currently have one other person doing this type of work and they require an additional person to help manage the workload.

·        The primary function of this position is to use Datadog to understand the issues the infrastructure environment is facing, then respond to and remediate them.

·        The bulk of this work will be break-fix and debugging.

·        This resource must be proficient in Python and Linux, and able to read and understand the code, because those are the primary technologies used in the environment.

Technology environment the consultant will be working in:

·        Python, Linux, Datadog, Kubernetes, Terraform, Ansible, GitHub Actions, Snowflake, PostgreSQL.

Responsibilities

Consultant's day-to-day responsibilities:

·        Own, set up, and manage Datadog.

·        Perform break-fix work on issues as they arise.

           o   Examples: failing jobs, data pipeline failures.

·        Work with the product owners in the Infrastructure group to make sure their products and tools are working optimally and uptime is maintained.

           o   A main goal for this task is to avoid surprises in the environment.

·        Debug issues tied to Linux (hosted) and Python.

Required Skills and Experience

Required skills:

·        Prior experience as a Site Reliability Engineer using Datadog

             o   Must have experience performing break-fix work, troubleshooting, and debugging in Linux and Python

·        Must have prior experience on an on-call team focused on maintaining uptime.

 

Desired Skills:

·        Experience configuring and troubleshooting data and services infrastructure on Kubernetes
