At Fitbit, our mission is to help people lead healthier, more active lives by empowering them with data, inspiration and guidance to reach their goals. We started our journey in 2007 as a team of two with one big idea. Today, that idea has become a movement. Fitbit is now a publicly traded company creating award-winning products and services that are available across the globe. We’re transforming the way the world sees health & fitness. In fact, the Fitbit Community has taken enough steps to walk from the Sun to Pluto. Our culture combines the spirit of a startup with the advantages of being public, offering a competitive benefits package and amazing perks. As part of our team, you’ll have the opportunity to grow your career, contribute your ideas to life-changing products and services, and above all have fun doing it.
In our newest Fitbit office in Bucharest, located in the heart of the city, we are planning to build on the foundation laid by the Vector Watch team. We are looking to keep growing and this role will be fundamental to the continued success of Fitbit as we build exciting new products and services. Think you’ve found your fit? See what we’re looking for below and apply today.
About the team:
As Fitbit continues to build microservices and begins its migration to Google Cloud, our system is growing ever more complex. Our use of technologies such as Kafka, ZooKeeper, Cassandra, Elasticsearch, and Finagle/Finatra has been increasing significantly and in some cases has led to incidents as we encounter their rough edges or the limits of the company's knowledge of those applications and frameworks. Additionally, our legacy codebases that utilize Spring and Hibernate often require developers with deep expertise in those frameworks to pitch in for refactoring or handling incidents.
To remain effective, the SRE team needs to expand its skill set to include those technologies from a white-box perspective, so we can better respond to incidents and advise teams using them.
Site Reliability Engineers are responsible for the pulse of the software ecosystem. We monitor and improve the system and suggest improvements for implementation by others. The name of the game is automating our job, because hiring linearly with our traffic growth is unsustainable. We are involved in incident and change management. We also act as consultants for engineers when new code and services are getting ready to launch.
Our application stack is based mostly on Java; however, most of our operations automation is developed in Python. The major components we use daily are:
- OS: Linux
- Frameworks: Hibernate, Spring, Finagle, Finatra, Thrift
- Databases: MySQL, Cassandra
- Messaging: Kafka
- Caching: Memcached, Redis
- Logging and Monitoring: Prometheus, Graphite, StatsD, Nagios, Logstash, Kibana
- Other: Aurora/Mesos, Tomcat, Elasticsearch, Puppet, Ansible, Terraform
Challenges for you:
- Write code in Python and perhaps Java, and not just for classes.
- Dig into the details of how a system, library, or tool works instead of just blindly using it.
- Handle problems in live production systems, both on your own and in collaboration with systems and application engineers.
- Keep the company informed about the status of Fitbit services, the impact of known issues, and the progress of ongoing investigations.
- Design and refactor parts of the Fitbit backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks.
- Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant.
- Be willing to teach and lead others.
- You have 5+ years of experience as a systems/operations engineer or system administrator
- You are comfortable with the Python programming language and ecosystem
- You are very comfortable using and administering Linux servers
- You can work independently with limited supervision
- You can communicate effectively with peers and tailor your communication to your audience
- You are willing to dive in and assist coworkers when incidents arise
- You're willing to participate in the team’s production on-call rotation
- BSc in Computer Science
- Experience working with high-traffic, scalable web applications and services
- Experience building, deploying, and operating your own web service
- Knowledge of the administration and/or performance tuning of MySQL or Cassandra
- Prior experience being part of an on-call rotation and responding to production incidents
- Experience with cloud computing platforms like AWS or Google Cloud Platform
- Familiarity with configuration management tools like Puppet, Chef or Ansible (we use Puppet and Ansible)
- Experience developing and shepherding processes around change and incident management
- Some familiarity with Java and its ecosystem
Fitbit is proud to be an equal opportunity employer. We recruit, hire, train, promote, pay, and administer all personnel actions without regard to race, color, ancestry, national origin, citizenship, religion, age, sex (including pregnancy, childbirth, and medical conditions related to pregnancy, childbirth, or breastfeeding), sex stereotyping (including assumptions about a person’s appearance or behavior, gender roles, gender expression, or gender identity), sexual orientation, gender, gender identity, gender expression, marital status, medical condition, mental or physical disability, military or veteran status, genetic information or other statuses protected by law. We interpret these protected statuses broadly to include both the actual status and any perceptions and assumptions made regarding these statuses.