Inside the Architecture: Decoupled Layers, Multicast, and Fat Topics
In the second episode of our podcast series, George Levin and Alexei Lebedev dig into the architecture of AlgoX2 — why we separate compute, storage, and network, why hardware multicast is non-negotiable, and how we keep performance independent of the number of topics.
Introductions
George: Hi, I'm George Levin, co-founder and Chief Business Officer at AlgoX2, a data streaming operating system. In this podcast series, I'm talking with Alexei, our CEO and the creator of AlgoX2. My goal is to understand what we're building and the thinking behind it. Hi, Alexei.
Alexei: How you doing?
George: I'm good. When I pitch AlgoX2 to clients, one of the topics that always comes up is that we separate compute, storage, and networking. Could you explain why that's important and what's so special about this three-layer architecture?
Three Layers: External, Sequencing, Storage
Alexei: AlgoX2 has three main layers. The first is the external one, where outside connections come in: producers and consumers connect to a layer of gateways — that's your interface to the outside world.
Internally, we have things called sequencers, which turn multitudes of racing streams of messages into ordered rivers. That data eventually ends up on disks across multiple machines — your storage subsystem.
Along every hop of the way — from a producer to a consumer, through gateway, sequencer, commit, gateway, and out — there is networking. Every hop can change a host: you can be going from machine A to machine B. We don't have a traditional vertical architecture where you talk to some server that does things for you. You're talking to a small node in the network that takes your message and just sends it along — usually on multiple paths and onto multiple other servers.
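To make that path concrete, here is a toy sketch of the hop sequence Alexei describes (producer to gateway to sequencer to commit to gateway to consumer). None of these names come from the AlgoX2 API; they are illustrative stand-ins, and the real system fans out over multiple paths rather than this single chain.

```python
# Toy model of the hop pipeline described above. All names are
# illustrative stand-ins, not AlgoX2 code.
from dataclasses import dataclass

@dataclass
class Message:
    topic: str
    payload: bytes

durable_log = []   # stands in for the storage layer
gateway_out = [lambda n, m: print(f"consumer got #{n}: {m.payload!r}")]
_next_seq = 0

def gateway_in(msg: Message) -> None:
    # External layer: accept the producer's message, pass it inward.
    sequence(msg)

def sequence(msg: Message) -> None:
    # Sequencing layer: stamp a global order onto racing input streams.
    global _next_seq
    _next_seq += 1
    commit(_next_seq, msg)

def commit(seq_no: int, msg: Message) -> None:
    # Storage layer: persist, then fan out toward output gateways.
    durable_log.append((seq_no, msg))
    for deliver in gateway_out:
        deliver(seq_no, msg)

gateway_in(Message("trades", b"AAPL 187.32"))
```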
Why Decoupling Matters
Alexei: Because everything is a collection of such nodes, you can independently scale the number of commits, the number of gateways, and the number of sequencers. Why would you scale them independently? They have very different throughputs.
- Sequencers are extremely fast. They do nothing but memcpy.
- Commits can be fast or slow depending on the disks you have, but generally they're slower because they're also doing deep indexing on disk. Disks are not as fast as the network.
- Gateways are kind of fast — but you also have TCP, retransmissions, and many clients connecting to a single gateway.
You don't know ahead of time how many of each component you want. It also depends on the deployment. You could have a few producers, in which case it doesn't make sense to give them hundreds of gateways; or you could have many producers, which is a different story. It's very, very helpful to be able to scale these multiple layers independently. That's what's called a decoupled architecture, where storage and compute are decoupled.
When you run additional processes on the cluster — filtering, transformation, processes that create Iceberg tables or upload data — you get an additional layer that you also have to scale. Depending on how fast each of those processes is, you create more or fewer of them. Being able to scale different layers independently is a kind of holy grail in systems architecture. It's one of our core design principles.
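As a back-of-the-envelope illustration of what independent scaling buys you, here is a hypothetical sizing sketch. The per-process throughput numbers are invented for the example, not measurements or recommendations.

```python
# Hypothetical sizing sketch: each layer scales on its own knob,
# driven by its own bottleneck. All numbers are invented.
workload_gbps = 100

sequencer_gbps = 50   # assumed: near-memcpy speed per sequencer
commit_gbps = 5       # assumed: disk-bound, slower per process
gateway_gbps = 10     # assumed: TCP and many clients per gateway

layers = {
    "sequencers": workload_gbps // sequencer_gbps,  # 2
    "commits": workload_gbps // commit_gbps,        # 20
    "gateways": workload_gbps // gateway_gbps,      # 10
}
print(layers)  # each layer sized independently of the others
```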
Kafka and the Limits of Vertical Scaling
George: Is it right that in Kafka, when you scale, you scale through brokers — and each broker has both compute and storage? So you can't scale them independently. If you just run out of disk, you can't add disk; you have to add another broker that has both.
Alexei: I would call Kafka a scale-up architecture. It's a number of single-host, high-functionality brokers. Each broker is a server in itself — it accepts network connections from producers and consumers. It's the same process doing it, multiple threads, but the same process. And that process is also writing things to disk. So it fits very nicely within a single operating system.
The process also talks to replicas to make sure your data is replicated before it's served out, but conceptually it's very easy to think of Kafka as essentially a single-process system: there's a broker, you send it a message, it saves it, then publishes it. You can view it as vertical — vertical scaling. If you give it a fatter, faster machine, it scales. It's meaningless to talk about a single broker using ten machines; it always sits within one. That's scale-up.
In a scale-out architecture, you don't have anything like that. You have a bag of processes — 200 gateways, 40 sequencers, 200 commits — running across some servers. Usually pretty uniform: maybe eight commits, four gateways, and a couple of sequencers per server. When you talk to a server, you send it a packet, but it's not getting meaningfully processed there. It gets multicast to two other servers, then somewhere else, then maybe to a consumer — if a consumer is connected. You're really talking to a data network more than a single application. That's a very different architecture.
The Multicast Question
George: Let's talk about multicast. It creates a lot of problems for me personally — whenever I mention to clients that we use multicast, it immediately sparks weird questions and concerns. They start saying it's lossy, that they don't want to deal with it. Everyone loves InfiniBand as a concept, but nobody likes to deal with it. How do we address the concern that we use multicast a lot?
Alexei: In the end, we address it by showing performance numbers. My love for multicast goes back at least 20 years — actually more. When I worked in high-frequency trading, you connected to exchanges. Exchanges send out market data, and market data always arrives in the form of multicast.
Say you have a data center in location 1 with multicast market data coming out, and you also have trading in data center 2. If you have a network between those data centers — which you do — you can route the multicast data so that market data goes from data center 1 to data center 2 and is interpreted there by your process.
There's really no faster middleware than a network switch. That's a fundamental fact. A network switch will happily clone any incoming data to ten output ports faster than you can say nanosecond. It is fundamentally a form of middleware. Every other piece of middleware that we mere software engineers build is going to be a reduced version of what a network switch is capable of.
The best you can give your clients is a kind of programmable, lossless network switch in terms of throughput. You can't do faster than that. You can do millions of times slower than that — which a lot of people have achieved. Our goal is to be somewhere in between. We want to give the users of our platform essentially the power of that network switch.
Why Hardware Replication Wins
Alexei: Let me explain why multicast is so important. Suppose you have a data streaming platform on a 100 Gbps network, and you feed it a 100 Gbps stream. With one producer and ten consumers, here's what happens: the broker has to clone that 100 Gbps stream ten times — once for each consumer — and send it out. But because all consumers share the broker's single output line, together they're limited to 100 Gbps total — which means each gets only 10 Gbps. To stay real-time, the producer is now also limited to 10 Gbps.
That's a blocking architecture: a single node cannot sustain the maximum capacity of the system. The share of the load handed to a single node exceeds that node's capacity, and the node starts blocking.
Now, with multicast, you can have the broker broadcast the data so it hits ten different servers, each running an output gateway, and the data gets sent to clients from there. You've achieved 100 Gbps in and 100 Gbps out times ten — a terabit of data coming out. The numbers are theoretical — you're not going to achieve line rate on a 100 Gbps network — but the argument stands. The only way to extract maximum performance is hardware replication.
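The arithmetic behind this argument is easy to check. Here it is spelled out, using the theoretical line rates from the example:

```python
# Bandwidth arithmetic from the unicast-vs-multicast argument above.
LINK_GBPS = 100
CONSUMERS = 10

# Unicast broker: ten copies of the stream share one 100 Gbps output link.
per_consumer = LINK_GBPS / CONSUMERS   # 10 Gbps per consumer
producer_cap = per_consumer            # real-time producer cap: 10 Gbps

# Hardware multicast: the switch clones the stream to ten egress servers,
# each with its own 100 Gbps link out to clients.
total_egress_gbps = LINK_GBPS * CONSUMERS  # 1,000 Gbps, i.e. one terabit
print(per_consumer, producer_cap, total_egress_gbps)
```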
The User Doesn't Have to Care
George: It's important to explain to everyone listening — anyone wondering whether they need to learn InfiniBand or multicast — that this is an internal part of the system. It's not something they need to run themselves.
Alexei: That's the nice thing — that's why we call it a streaming operating system. It should not be your problem. If it's your problem, you're taking the weight of the world upon your shoulders, which you should not do — in the same way you and I don't think about how PCI Express works. It just solves problems.
George: I definitely don't think about it.
Alexei: You should not. In the same way, the users of our data streaming platform should not be concerned with what tricks we used to get their data served faster. From the outside, it's just a familiar protocol. On the inside, we probe the hardware we're running on, and if it's capable of certain things, we have drivers to take advantage of that. That's a property of an operating system: drivers for specialized hardware, taking advantage of hardware accelerators while keeping a user-friendly, unchanged API — and a generic software path always available.
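The driver idea can be sketched as a pattern: probe for a capability, pick the accelerated path if it exists, and keep a generic software path behind the same unchanged interface. Everything in this sketch is invented for illustration; it is not the AlgoX2 driver layer.

```python
# Sketch of the operating-system analogy: same API, different backends.
# All names are invented for illustration.

class SoftwareFanout:
    """Generic path: copy the data once per target, in software."""
    def send(self, data: bytes, targets: list) -> None:
        for t in targets:
            t.append(data)

class HardwareMulticast(SoftwareFanout):
    """Accelerated path; a real driver would hand one packet to the switch."""
    def send(self, data: bytes, targets: list) -> None:
        super().send(data, targets)  # simulated here in software

def nic_supports_multicast() -> bool:
    return False  # stand-in for a real hardware probe

# Callers see one unchanged API regardless of which driver was chosen.
transport = HardwareMulticast() if nic_supports_multicast() else SoftwareFanout()
```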
Fat Topics, Part One: Too Many Files
George: Let's discuss another topic that comes up often, especially with trading firms — the problem of fat topics. We need to explain it a lot. Could you tell me more about it?
Alexei: First, what's the problem of fat topics? Essentially it's a problem when you have more than ten topics.
In a classic, standard, version-one data streaming platform design, every subject published by a producer creates something like four files: it's saved in two locations, and each location holds a messages file plus an index file — two files in each of two places. And that's before partitioning.
If every topic is partitioned into, say, ten subtopics, external data comes in, lands on one of the ten partitions, and each one gets written to four files. So a single topic produces forty open files in a running cluster. With a thousand topics, you're at forty thousand files — and you actually start running into problems much sooner.
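The multiplication is worth writing out. Here is the file-count arithmetic from the classic design above, in a few lines:

```python
# File-count arithmetic for the classic per-topic design described above.
replicas = 2           # each message saved in two locations
files_per_replica = 2  # a messages file plus an index file
partitions = 10        # subtopics per topic
topics = 1000

files_per_topic = partitions * replicas * files_per_replica  # 40
open_files = topics * files_per_topic                        # 40,000
print(files_per_topic, open_files)
```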
When you append to many small files on a given machine, you've essentially turned your disk access pattern into random access. No disk — SSD, spinning, or NVMe — is going to be fast on random access. Under sustained load, performance gets terrible. That's one issue.
The second issue is OS overhead just to maintain that many open files. You hit file-descriptor limits. You have to do a lot of tweaking just to get it to work.
How We Solve It
Alexei: The way we solve it: the number of files written in an X2 cluster is independent of the number of subjects produced by users. We write fat — in a different sense — append-only files, and then produce deep indexing using log-structured merge trees. That approach reduces the number of open files and keeps performance largely independent of the number of topics.
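A minimal sketch of the idea (one shared append-only file for all topics, indexed after the fact) might look like the following. It is purely illustrative and says nothing about AlgoX2's actual on-disk format.

```python
# One fat append-only log shared by every topic, plus a separate index,
# instead of four files per topic. Purely illustrative.
import tempfile

log = tempfile.NamedTemporaryFile(suffix=".dat", delete=False)
memtable = {}  # (topic, seq) -> byte offset; a real LSM tree would
               # flush this to sorted runs and merge them over time

def append(topic: str, seq: int, payload: bytes) -> None:
    offset = log.tell()
    log.write(f"{topic}\t{seq}\t".encode() + payload + b"\n")  # sequential I/O
    memtable[(topic, seq)] = offset

for t in ("trades", "quotes", "orders"):
    append(t, 1, b"payload")  # three topics, still one open data file

print(len(memtable), "indexed records in", log.name)
```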
Fat Topics, Part Two: Filtering at the Edge
George: Does that mean we can do content filtering for our users — create a kind of virtual stream based on the contents of the data we stream?
Alexei: That's a different issue. The first problem is when you have lots of topics. The second problem is when topics contain a lot of information that needs to be filtered out at the edge.
There are several approaches. One thing we allow is to run subprocesses inside an X2 cluster — essentially producers and consumers. They read internal data at nanosecond speeds, because they use the same delivery mechanism the system itself uses, and they can produce filtered outputs, write additional streams, or reformat things for the gateway. You could serve a different protocol — binary versus ASCII, EBCDIC, whatever.
We allow those sidecar processes because they cover a very large and useful class of use cases. There are creative ways to filter so-called fat streams by adding a sidecar to the gateway. So the answer is yes — with some design considerations for the user. It's certainly easier than spawning another cluster that reads data from your cluster and does the filtering downstream.
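For flavor, here is what such a sidecar could look like. `subscribe` and `publish` are stand-ins for whatever in-cluster API the sidecar would use, not a real AlgoX2 interface.

```python
# Hypothetical gateway sidecar: consume a fat stream, re-publish a
# filtered "virtual" stream. subscribe/publish are invented stand-ins.
from typing import Callable, Iterable

def filter_sidecar(
    subscribe: Callable[[str], Iterable[dict]],
    publish: Callable[[str, dict], None],
) -> None:
    for msg in subscribe("market-data"):      # the fat topic
        if msg.get("symbol") == "AAPL":       # content-based filter
            publish("market-data.aapl", msg)  # thin virtual stream
```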
Wrap-Up
George: I think this will be enough for our second episode. Thanks a lot for your time. Subscribe, stay tuned — there will be more episodes.
Alexei: My pleasure. Bye.
Want more context on the platform? Read about the product, or get in touch with us on the contact page.