At Segment, we believe companies should be able to send their data wherever they want, whenever they want, with no fuss. We make this easy with a single platform that collects, stores, and sends data to hundreds of business tools with the flip of a switch. Our goal is to make using data easy, and we’re looking for people to join us on the journey. We are excited about building toward a world where engineers at other companies spend their time working on their core product, rather than spending nights and weekends tweaking their customer data into various formats for third-party tools.
Our infrastructure is mostly written in Go (we’re huge fans!) and runs in Docker containers across our 70 different services. Our small team provides the data infrastructure for thousands of companies, and as a result we’re already processing terabytes of data each day. We’re rapidly scaling our systems to keep up with our dramatic growth, and we’re looking for folks who love Kafka, NSQ, NoSQL databases, and distributed systems of every flavor.
Our customer data hub helps companies achieve data nirvana, the blissful state you enter when all of your customer data is clean, complete, and accessible in your data warehouse and various analytics tools. Integrating with the Segment platform enables our customers and partners to build a new class of analytics models and marketing automation experiences. Though thousands of companies are already building on top of our analytics platform, we’ve penetrated less than 1% of the market. We are building toward a world where all the customer data in the world flows through Segment.
Projects you can dive into:
Segment's API pipeline processes billions of messages per day. The incoming messages are simple JSON objects, which must be tracked, parsed, and stored with a structured schema. Our API layer needs to allow schemas to be completely dynamic: when our customers issue a track call, each JSON object can introduce properties we haven't seen before. Dealing with flexible data can be incredibly challenging, because we must adjust these schemas on the fly, in real time.
Performing this process at scale is extremely challenging because the infrastructure has to absorb sharp spikes in reads and writes: our API promises the ability to process customers' historical data in batches. Even during these spikes, we need to keep query latency under 100ms.
To make things even more interesting, sometimes data contradicts itself. A column might come in for a long time as a string and then start coming in as a number. How should we handle cases where we can't cast from one type to another? How do we propagate type changes in the downstream tools? How do we make sure that the system remains idempotent?
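To make the challenge concrete, here's a rough sketch in Go of schema inference with one possible conflict policy (widening mismatched columns to strings). The type names and merge rule are illustrative assumptions, not our production system:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// inferType returns a coarse column type for a decoded JSON value.
// These type names are illustrative, not an actual schema system.
func inferType(v interface{}) string {
	switch v.(type) {
	case bool:
		return "boolean"
	case float64: // encoding/json decodes all JSON numbers as float64
		return "number"
	case string:
		return "string"
	default:
		return "unknown"
	}
}

// mergeType widens an existing column type when a new value disagrees.
// Falling back to "string" is one possible policy: every JSON scalar
// can be rendered as text, so the cast never fails.
func mergeType(existing, incoming string) string {
	if existing == "" || existing == incoming {
		return incoming
	}
	return "string"
}

func main() {
	schema := map[string]string{}
	events := []string{
		`{"user_id": "u1", "revenue": 9.99}`,
		`{"user_id": "u2", "revenue": "9.99"}`, // same column, new type
	}
	for _, e := range events {
		var obj map[string]interface{}
		if err := json.Unmarshal([]byte(e), &obj); err != nil {
			continue
		}
		for k, v := range obj {
			schema[k] = mergeType(schema[k], inferType(v))
		}
	}
	fmt.Println(schema) // "revenue" widens from number to string
}
```

A real system also has to make this idempotent under re-delivery, which is why the merge is written to converge: replaying the same events yields the same schema.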
Acting as the middleman for billions of events isn’t easy. We essentially have to provide the guarantees of an L4 networking layer, reliably delivering messages from clients in order, while operating at L7 in the network stack. Where things get complicated is that clients expect us to queue rather than back off. Messages get re-ordered, and any integration we’re sending data to can fail at any time.
We want to build a queueing topology to handle all of these cases gracefully, and scalably. If an integration’s endpoint goes down, it shouldn’t affect other destinations for that data. And if a customer suddenly batches a ton of data for an integration, they shouldn’t starve message delivery for their neighbors.
Most queueing systems don’t handle this case well. They’re severely limited in the number of partitions, topics, or whatever logical separation they use to provide isolation. We need a system that scales well, but also provides the same sorts of ordering and delivery guarantees that we’d get from Kafka.
It’s a big, challenging piece of core infrastructure, but we’re feeling the pain more as customers exhaust the pipes for individual integrations.
We take in hundreds of thousands of events every single minute. By agreeing to process events, pageviews, and click data, we’re effectively shouldering the scalability challenges of all of our customers at once.
To date, we’ve scaled by making the system stateless. When needed, we can boot up more routing nodes, and more workers.
But being stateless limits us. We can’t enrich data as it’s passing through our processing pipeline, join user ids together, or perform other types of advanced analysis based upon the user’s prior actions. We want to be smarter, but combining all this data into a single database comes with serious scaling challenges.
We’d like to expose all of that user data, first through an internal query API, and eventually directly to customers. That way our users can build their own custom tooling on top of it.
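One piece of that enrichment, joining user ids together, can be sketched as a tiny union-find that merges anonymous ids and known user ids into one canonical identity. This is a hypothetical illustration, not our actual identity-resolution system:

```go
package main

import "fmt"

// IdentityMap merges aliases (anonymous IDs, user IDs) into one
// canonical identity using union-find with path compression.
type IdentityMap struct {
	parent map[string]string
}

func NewIdentityMap() *IdentityMap {
	return &IdentityMap{parent: make(map[string]string)}
}

// Find returns the canonical ID for any known alias, registering
// unknown IDs as their own root.
func (m *IdentityMap) Find(id string) string {
	p, ok := m.parent[id]
	if !ok {
		m.parent[id] = id
		return id
	}
	if p != id {
		p = m.Find(p) // path compression
		m.parent[id] = p
	}
	return p
}

// Link records that two IDs belong to the same user.
func (m *IdentityMap) Link(a, b string) {
	ra, rb := m.Find(a), m.Find(b)
	if ra != rb {
		m.parent[ra] = rb
	}
}

func main() {
	m := NewIdentityMap()
	m.Link("anon-123", "user-42") // an identify call ties them together
	m.Link("anon-456", "anon-123")
	fmt.Println(m.Find("anon-456") == m.Find("user-42")) // same user
}
```

The scaling challenge in the text is that this map has to live in shared, durable storage and be updated mid-pipeline, which is precisely what a stateless worker fleet can't do on its own.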
Today, Segment hosts 180 different transforms for matching data from our API to our partners’. But most of those integrations are code we maintain and develop.
In recent years, there’s been an explosion of new tools, and there are thousands more we’d like to support.
It’s obvious the current system won’t scale to thousands of tools, or the millions of unique use cases that our customers have. But, we’d like to find a way to connect those tools to our hub, so customers can ‘free’ their data without having to run a complex data pipeline themselves.
That might involve customers submitting a lambda-esque function, or a container for us to run; either way, we’d love to let our customers and partners supply custom transforms. It’s a fairly complex isolation and sandboxing challenge: how do we make sure functions don’t misbehave, or affect one another? In some ways we’d almost be running ‘remote code execution as a service’. We’d love your help in getting us to the point where we can scale to support any customer use case.
- Build infrastructure to process terabytes of data per day and thousands of API calls per second
- Use cutting-edge technologies such as AWS, Go, Docker, and Terraform to continue to scale our infrastructure
- Relentlessly measure and optimize as Segment builds the highest-scale and most advanced analytics platform in the world
- CS Degree or equivalent knowledge of data structures and algorithms
- 2+ years of industry experience building and owning large-scale distributed infrastructure
- Expert knowledge developing and debugging in C/C++, Java, or Go