Production System Engineer - San Jose

AT TikTok

San Jose, CA

Responsibilities

Unlocking the secrets of ByteDance's global tech empire, the Data Systems Infrastructure (DSI) team stands as the unseen architects behind the scenes. In a thrilling dance of technology and innovation, we propel the company's meteoric rise by constructing and orchestrating colossal data fortresses, taming the life cycle of server fleets, conjuring Cloud solutions, and crafting a symphony of infrastructure services. Our mission is to ensure scalability and unwavering reliability, making sure ByteDance's digital footprint leaves an indelible mark on the world.

Embark on an exciting expedition to explore the rapidly expanding ByteDance domain in the United States, Europe, and Asia. Here, the Data Systems Infrastructure (DSI) team is crafting monumental data citadels that encircle the planet, sheltering legions of hundreds of thousands of servers. As the maestro of our production systems, you will embark on a captivating odyssey, taming the life cycles of these servers. Your adventure will begin with the orchestration of their initial deployment, navigating the intricate terrain of OS installation, summoning services like a digital magician, and maintaining vigilant watch over our inventory. But, like any epic tale, there will be times of challenge when you become a troubleshooter extraordinaire, mending and restoring with unwavering dedication. Eventually, you'll guide them into the sunset, orchestrating their decommissioning and ensuring their rebirth through recycling, all while contributing to the pulsating rhythm of ByteDance's technological evolution.

Responsibilities:
- Operation: As a Production Systems Engineer, your mission is to contribute to enhancing the stability, efficiency, effectiveness, and scalability of our data center and server operations, platform, and service on a worldwide scale.
- Lifecycle Enhancement: Participate in and enhance the entire lifecycle of the server fleet - from system design/introduction consultation to launch reviews, deployment, operation, and retirement.
- Automation: Develop and deploy tools and solutions to enhance the automation, reliability, scalability, and operability of servers in the datacenter.
- Monitoring: Develop and deploy tools and solutions to improve the availability, latency, and overall health of datacenter infrastructure, servers, and networks.
- Disaster Recovery: Troubleshoot and resolve complex technical issues in a high-pressure, time-sensitive environment. Conduct high-level root-cause analysis for service interruptions and establish preventive measures. Practice sustainable incident response and postmortems.
- Cross-team Collaboration: Collaborate with stakeholders such as infrastructure architects, project managers, data center operations engineers, platform developers, supply chain teams, and our internal customers to comprehend overarching business objectives. Additionally, you will have the chance to design and implement innovative solutions for our Core IDCs and CDN/Edge.
- On-call: Engage in our on-call support spanning across regions and incident response teams to address critical issues in the production environment.

Qualifications

Minimum Qualifications:
- Education: Bachelor's degree in Computer Science, Electronic Engineering, relevant technical field, or equivalent practical experience.
- Experience in at least one of the areas below:
- Server Operations: Demonstrated proficiency in Linux system administration tasks, with an in-depth understanding of Linux kernels, drivers, and modules. Capable of scripting in Bash and Python to automate routine system operations, including system configuration, performance tuning, and security management within the Linux environment. In-depth understanding of server hardware and the ability to conduct troubleshooting and diagnostics. Experience participating in the planning, delivery, and operation of large-scale data centers in different countries.
- Tooling Adaptation, Deployment, and Maintenance: Proficient in customizing operation and maintenance tools to satisfy specific demands for new server hardware. Competent in managing the entire software tool lifecycle, ranging from deployment to continuous maintenance. This encompasses tasks associated with facilitating the monitoring of server performance, effectively provisioning resources, timely handling of fault management, and conducting repairs to guarantee the smooth operation of new server hardware. Experience in developing and maintaining hardware, network, or service monitoring software for more than 10,000 servers.
- Communication: Experience in managing and coordinating teams in a global context.

Preferred Qualifications:
- 3 years of work experience in a related field.
- Data Center: An intermediate level of expertise is preferred. We are looking for individuals who are proficient in areas ranging from OS installations and break-fix operations to significant projects such as planning and operations (encompassing the entire infrastructure lifecycle), as well as new design-build or retrofit activities for existing systems.
- Proficiency in the operation and maintenance of GPU servers is strongly preferred.
- Full Stack Software Development: We are actively looking for individuals proficient in full stack software development. Ideal candidates are expected to have the following preferred skills:
- Be capable of creating and integrating RESTful APIs. This encompasses expertise in using Flask for Python-based back-end development to establish robust API endpoints.
- Have a profound understanding of JavaScript and be capable of leveraging it, along with Node.js, for both front-end and back-end development tasks.
- Demonstrate proficiency in SQL for efficient database management, including designing database schemas, composing queries, and ensuring data integrity; be familiar with Redis.
- Possess experience in Ansible Configuration Management, Application Deployment, and Task Execution.

Job Information

[For Pay Transparency] Compensation Description (annually)

The base salary range for this position in the selected city is $87,480 - $221,920 annually.

Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.

Benefits may vary depending on the nature of employment and the country of the work location. Employees have day one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, and wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year, and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).

The Company reserves the right to modify or change these benefits programs at any time, with or without notice.

For Los Angeles County (unincorporated) Candidates:

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:

1. Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;

2. Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and

3. Exercising sound judgment.

Client-provided location(s): San Jose, CA, USA
Job ID: TikTok-6776750658629404942
Employment Type: Other

Perks and Benefits

  • Health and Wellness

    • Health Insurance
    • Dental Insurance
    • Vision Insurance
    • HSA
    • Life Insurance
    • Fitness Subsidies
    • Short-Term Disability
    • Long-Term Disability
    • On-Site Gym
    • Mental Health Benefits
    • Virtual Fitness Classes
  • Parental Benefits

    • Fertility Benefits
    • Adoption Assistance Program
    • Family Support Resources
  • Work Flexibility

    • Flexible Work Hours
    • Hybrid Work Opportunities
  • Office Life and Perks

    • Casual Dress
    • Snacks
    • Pet-friendly Office
    • Happy Hours
    • Some Meals Provided
    • Company Outings
    • On-Site Cafeteria
    • Holiday Events
  • Vacation and Time Off

    • Paid Vacation
    • Paid Holidays
    • Personal/Sick Days
    • Leave of Absence
  • Financial and Retirement

    • 401(K) With Company Matching
    • Performance Bonus
    • Company Equity
  • Professional Development

    • Promote From Within
    • Access to Online Courses
    • Leadership Training Program
    • Associate or Rotational Training Program
    • Mentor Program
  • Diversity and Inclusion

    • Diversity, Equity, and Inclusion Program
    • Employee Resource Groups (ERG)
