Job Description Summary
The Site Reliability Engineering team is responsible for the reliability and performance of tools worldwide. We obsess over availability by building tools and engineering new systems to automate our platform. We are software engineers with full visibility and influence across the entire stack.
We create tooling and deliver and operate customer environments, both on-prem and in the cloud, using cloud-native technologies.
Job Description
Roles and Responsibilities
In this role, you will:
• Develop automated solutions to predict and address potential problems before they result in a service interruption
• Oversee and adapt monitoring and alerting systems
• Collaborate with all GE business units worldwide, providing a bastion of technical expertise
• Identify potential process improvements across the entire engineering organization
• Define and drive architectural enhancements into systems to mitigate potential failure points
• Provide an impact assessment and mitigation plan for changes going into the production environment
• Investigate the root cause of severe and systemic outages and identify corrective actions
• Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria
• Provide technical coaching and direction to more junior teammates
Education Qualification
Bachelor's Degree in Computer Science or "STEM" Majors (Science, Technology, Engineering and Math) with advanced experience.
Desired Characteristics
Technical Expertise:
• Excellent knowledge of Linux system internals
• Excellent knowledge of Kubernetes for cluster management of containers
• Strong analytical and problem solving skills
• Experience with all stages of an agile software development lifecycle (CI/CD)
• Familiar with large-cluster deployment tools (Helm, Kustomize)
• Demonstrated ability to script around repeatable tasks (Go, Ruby, Python, Bash)
• Experience with developing cloud-native applications (High Availability)
• Able to dive into any level of a modern internet service (schedulers, containers, Linux kernel, caching, object storage, distributed filesystems, RDBMS, NoSQL, etc.)
• Comfortable with network troubleshooting (tcpdump, routing, proxies, firewalls, load balancers, etc.)
• Able to troubleshoot and debug applications (C, Java, Go)
• Proficient in configuration management systems (Chef, Terraform, Ansible, Puppet, Salt)
• Experience with configuring, customizing, and extending monitoring tools (Sensu, Grafana, Prometheus, Graphite, Splunk, etc.)
• Experience deploying and managing infrastructure on public clouds (AWS, GCP, or Azure)
• Comfortable using Git on the command line
Leadership:
• Influences through others; builds direct and "behind the scenes" support for ideas. Preemptively sees downstream consequences and effectively tailors influencing strategy to support a positive outcome.
• Able to verbalize what is behind decisions and their downstream implications. Continuously reflects on successes and failures to improve performance and decision-making. Understands and encourages change when needed.
• Proactively identifies and removes project obstacles or barriers on behalf of the team.
• Able to navigate accountability in a matrixed organization.
• Self-starter; communicates and demonstrates a shared sense of purpose. Learns from failure.
Personal Attributes:
• Critical thinker; able to quickly adapt to changing environments
• A hacker or tinkerer at heart
• Risk taker, not afraid to think outside the box or challenge the status quo
• Emotionally intelligent; able to influence up and out and to work independently
• Must be a team player with a strong desire to win
• Passionate about continuously learning
• Highly organized and efficient; able to balance competing priorities and execute accordingly
• Strong oral and written communication skills.
Additional Information
Relocation Assistance Provided: No