# Architecture

This document describes the high-level architecture of Vector. It assumes some familiarity with the user-facing concepts and focuses instead on how they're wired together internally. The goal is to provide a starting point for navigating the code and to assist in understanding Vector's behavior and constraints.

## Overview

From a user's perspective, Vector runs a configuration consisting of a directed, acyclic graph of sources, transforms, and sinks. In many ways, this logical representation of the config maps rather directly to the way that it is laid out and run internally.

![logical](https://user-images.githubusercontent.com/333505/152071180-daa5ea73-d588-4471-b34f-79ffb5e6c295.png)

A reasonably accurate mental model of running the above configuration is that Vector spins up each component as a Tokio task and wires them together with channels. Below, we'll go through each type of component in more detail, discussing how it is translated into a running task and connected to the rest of the topology.

## Component construction

After parsing and validating a user's configuration, we are left with a `Config` struct containing (among other things) the collections of `SourceConfig`s, `TransformConfig`s, and `SinkConfig`s corresponding to each of the configured components. Each of those traits has its own `build` method that constructs the component. This building occurs largely in the `src/topology/builder.rs` file, along with some setup for the initial wiring into the topology.

### Sources

When a source config is built, the result is mainly two tasks: the "server" task of the source itself, and a "pump" task that forwards its output on to the rest of the system.

Construction begins by setting up a `SourceSender` for the configured source, which handles the ability to send events to each of the outputs defined by the `SourceConfig::outputs` method.
The outputs are built into the sender in such a way that attempting to send to an unknown output will result in a panic. Along with each of the outputs added to the sender, a corresponding `Fanout` instance is created, which lives in a pump task and handles "fanning out" events to every downstream component that lists this source as an input. Each of these individual pump tasks (one per output) consists purely of forwarding events into the `Fanout`. The final pump task (one per source) simply spawns each of the output-specific pump tasks and drives them to completion.

Finally, the server task itself is built. `SourceConfig::build` takes a `SourceContext` as an argument, one field of which is the `SourceSender` built above. The result of the build function is wrapped in some simple shutdown handling before being inserted into the topology.

![image](https://user-images.githubusercontent.com/333505/156249003-ba6e31de-d296-42da-a9b6-2451e607df80.png)

### Transforms

After sources, transforms are the next to be built. They're a bit simpler than sources and mostly jump straight into `TransformConfig::build`. They can define `outputs` that are translated into `Fanout` instances in much the same way as sources, and also have an option to `enable_concurrency`. How that works exactly depends on the type of transform.

#### Synchronous

The simplest type of transform is a `Function` transform. It runs synchronously on a single event and writes to a single, unnamed output. One step above function transforms is the newer `Synchronous` transform. These are a superset of function transforms, with the additional ability to write to multiple outputs if desired. From the perspective of the topology, both of these are run in exactly the same way.

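
To make the relationship between the two synchronous styles concrete, here is a minimal sketch in plain Rust. It is not Vector's actual code: the trait shapes, the `OutputsBuf` container (a stand-in for `TransformOutputsBuf`), the `AsSync` adapter, the example transforms, and the `run_chunk` driver are all simplified assumptions made for illustration. It shows how a function-style transform can be treated as a synchronous transform that only ever writes to the unnamed output, which is why the topology can drive both the same way.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for Vector's event type.
type Event = String;

/// Simplified stand-in for `TransformOutputsBuf`: one `Vec` per output.
/// The key `None` represents the single unnamed output.
#[derive(Default, Debug)]
struct OutputsBuf {
    outputs: HashMap<Option<&'static str>, Vec<Event>>,
}

impl OutputsBuf {
    fn push(&mut self, output: Option<&'static str>, event: Event) {
        self.outputs.entry(output).or_default().push(event);
    }
}

/// Function-style (assumed shape): one event in, events out on the unnamed output.
trait FunctionTransform {
    fn transform(&mut self, out: &mut Vec<Event>, event: Event);
}

/// Synchronous-style (assumed shape): may write each event to any defined output.
trait SyncTransform {
    fn transform(&mut self, event: Event, outputs: &mut OutputsBuf);
}

/// Hypothetical adapter: a function transform viewed as a synchronous one.
struct AsSync<F>(F);

impl<F: FunctionTransform> SyncTransform for AsSync<F> {
    fn transform(&mut self, event: Event, outputs: &mut OutputsBuf) {
        let mut buf = Vec::new();
        FunctionTransform::transform(&mut self.0, &mut buf, event);
        for e in buf {
            outputs.push(None, e);
        }
    }
}

/// Example function transform: uppercases every event.
struct Uppercase;

impl FunctionTransform for Uppercase {
    fn transform(&mut self, out: &mut Vec<Event>, event: Event) {
        out.push(event.to_uppercase());
    }
}

/// Example synchronous transform: sends events containing "error" to a named output.
struct RouteErrors;

impl SyncTransform for RouteErrors {
    fn transform(&mut self, event: Event, outputs: &mut OutputsBuf) {
        let name = if event.contains("error") { Some("errors") } else { None };
        outputs.push(name, event);
    }
}

/// Topology-side driver: identical for both kinds of transform.
fn run_chunk(t: &mut dyn SyncTransform, chunk: Vec<Event>) -> OutputsBuf {
    let mut outputs = OutputsBuf::default();
    for event in chunk {
        t.transform(event, &mut outputs);
    }
    outputs
}
```

Calling `run_chunk(&mut AsSync(Uppercase), chunk)` and `run_chunk(&mut RouteErrors, chunk)` goes through the same code path; only the set of populated outputs differs.
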
In the simplest case, where the `enable_concurrency` flag is not enabled (this is not configuration, but a static attribute of the type of transform), the resulting transform task is built to pull a chunk of events from its input channel and process those events via the `transform` method into a `TransformOutputsBuf` (essentially a container of `Vec` for each of the transform's defined outputs). Once the whole chunk of events has been processed into outputs, those outputs are drained into the respective `Fanout` instances.

If the `enable_concurrency` flag is enabled, the process is slightly more complicated. For each chunk of input events, instead of being processed inline, a new task is spawned that does the work of processing the events into outputs. Since there is some overhead to spawning tasks, Vector attempts to pull larger chunks of events from the input for transforms running in this mode. The main transform task then tracks the completion of those work tasks in the same order that they were spawned (ensuring that we don't reorder the resulting outputs). When a task completes, the main task receives the resulting `TransformOutputsBuf` and takes care of draining it into the respective fanouts. To ensure that we don't buffer an unbounded number of events within those work tasks, the main task limits the maximum number that can be in flight simultaneously.

Spawning new tasks allows Tokio's work-stealing scheduler to step in and spread the CPU work across multiple threads when there is a need and available capacity to do so.

![image](https://user-images.githubusercontent.com/333505/156249361-9a91f61a-445a-403c-92eb-609f2249b3a9.png)

#### Task

A task-style transform differs from the synchronous variants above in that it has the ability to do arbitrary, asynchronous stream-based transformations of its input. This includes things like emitting outputs after some timeout, independent of incoming events.
From the topology's perspective, they're quite simple because they define most of their structure internally and are applied by basically just passing the input channel into the `transform` method.

To build the full task, the transform itself is built, some common filtering and telemetry are added by wrapping the input stream, and then the input stream is passed to the `transform` method. This results in an output stream, which is then forwarded to the transform's `Fanout` instance (task transforms do not support multiple outputs).

![image](https://user-images.githubusercontent.com/333505/156249430-5f82a1e0-8caa-49fe-88b8-290b6ed06ad7.png)

### Sinks

Sinks have two components that make building them somewhat more complex than either sources or transforms: healthchecks and buffers.

Healthchecks are one-off tasks that run at startup with the goal of discovering any issues that may prevent the sink from running properly (e.g. permissions or connectivity issues) and notifying the user in a nice way. They can be enabled or disabled both individually and at a global level, and the user can choose whether a failing healthcheck should prevent Vector from starting.

Buffers are a configurable mechanism for dealing with backpressure. By default, like the rest of Vector, sinks will buffer some small-ish number of events in memory before propagating backpressure upstream. Buffer configuration allows individual sinks to change that behavior, choosing between memory and disk for where to store the buffered events, setting a maximum size, and deciding what should happen when the buffer is full (backpressure or load shedding).

Disk buffers in particular add some complexity in topology construction due to the fact that they're persistent across both config reloads and process restarts. They are built normally with their corresponding sink most of the time, but they are also stashed to the side for the case when topology construction fails after a buffer has been built.
This allows a subsequent build (likely of the previous configuration during a rollback) to pull from the already-built buffer without worrying about the persistence of the contents.

Once the healthcheck and buffer are built, the sink itself is constructed via `SinkConfig::build`. The surrounding task is defined to first finalize its use of the buffer (removing it from the in-case-of-error stash), filter and wrap the input stream with telemetry, and then pass it to `VectorSink::run`.

![image](https://user-images.githubusercontent.com/333505/156249509-fd1b1ae6-7193-4fda-a33e-bbd128d63c87.png)

## Connecting components

Once component construction is complete, we're left with a collection of yet-to-be-spawned tasks, as well as handles to their inputs and outputs.

More specifically, a component input is the sender side of a channel (or buffer) that acts as the component's input stream. A component output in this case is a `fanout::ControlChannel`, via which Vector can send control messages that add or remove destinations for the component's actual output stream.

Given those definitions, the fundamental process of wiring up a Vector topology is one of adding the appropriate inputs to the appropriate outputs. As a simple example, consider the following config:

```toml
[sources.foo]
type = "stdin"

[sinks.bar]
type = "console"
inputs = ["foo"]
encoding.codec = "json"
```

After the component construction phase, we'll be left with the tasks for each component (which we'll ignore for now) as well as a collection of inputs and outputs. In this case, each of those collections will hold one single item. There will be an output corresponding to the source `foo`, consisting of a control channel connected to the `Fanout` instance within the source's pump task, and there will be an input corresponding to the sink `bar`, consisting of the sender side of the sink's input channel/buffer (for our purposes here they're equivalent).

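
The two kinds of handles described above can be modelled with standard-library channels. The following is a rough sketch, not Vector's implementation: the `Fanout` and `ControlMessage` types here are simplified stand-ins, the `connect` helper is hypothetical, and `std::sync::mpsc` is used in place of the asynchronous channels and buffers a real topology would use. An "input" is the `Sender<Event>` side of a component's channel, and an "output" is the `Sender<ControlMessage>` that talks to the upstream fanout.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical stand-in for Vector's event type.
type Event = String;

/// Simplified control messages: add or remove a destination by component id.
enum ControlMessage {
    Add(String, Sender<Event>),
    Remove(String),
}

/// Simplified fanout: forwards each event to every registered destination.
struct Fanout {
    control: Receiver<ControlMessage>,
    sinks: Vec<(String, Sender<Event>)>,
}

impl Fanout {
    /// Returns the fanout together with its control handle, which plays
    /// the role of the component "output" in this sketch.
    fn new() -> (Self, Sender<ControlMessage>) {
        let (tx, rx) = channel();
        (Fanout { control: rx, sinks: Vec::new() }, tx)
    }

    /// Applies any pending control messages without blocking.
    fn apply_control(&mut self) {
        while let Ok(msg) = self.control.try_recv() {
            match msg {
                ControlMessage::Add(id, tx) => self.sinks.push((id, tx)),
                ControlMessage::Remove(id) => self.sinks.retain(|(name, _)| *name != id),
            }
        }
    }

    /// Sends a copy of the event to every current destination.
    fn send(&mut self, event: Event) {
        self.apply_control();
        for (_, tx) in &self.sinks {
            // A destination whose receiver is gone is ignored in this sketch.
            let _ = tx.send(event.clone());
        }
    }
}

/// Connects a downstream input to an upstream output by sending a clone
/// of the input's sender over the output's control channel.
fn connect(output: &Sender<ControlMessage>, id: &str, input: &Sender<Event>) {
    output
        .send(ControlMessage::Add(id.to_string(), input.clone()))
        .expect("fanout was dropped");
}
```

For the example config, `foo`'s output would be the control sender returned by `Fanout::new()`, `bar`'s input would be the sender half of `channel::<Event>()`, and a single `connect(&foo_output, "bar", &bar_input)` call is all the wiring needed. Routing every change through control messages is what lets the same mechanism serve both initial startup and later reloads.
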
To wire up this topology, the `connect_diff` function on `RunningTopology` will see that `bar` specifies `foo` as an input, take a clone of `bar`'s input, and send that to `foo`'s output control channel. This results in the `Fanout` associated with the source `foo` adding a new sender to its list of destinations to write new events to. And voilà, the sink will receive events from the source on its input stream.

The actual code for this logic (found in `src/topology/running.rs`) is significantly more complex than what's needed for this very simple example, mostly because it is oriented around applying modifications to an existing running topology rather than simply starting a new one up from scratch. While complex, this allows us to have a single path through which all topology actions occur. Without it, we'd need all the same complexity to support reloads, but it would be a secondary path that's less well exercised than the common startup path. Even so, this is an area where we're always looking for ways that we can simplify and make things more robust.

## Spawning the topology

Once everything is properly connected, the final step is to spawn the actual tasks for each component. There is a bit of bookkeeping to build tracing spans for each, set up error handlers, and track shutdown, but the handles to the running tasks are stored and Vector is off to the races.
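
The spawn-and-track step can be illustrated with a small sketch. This is an analogy under stated assumptions, not Vector's code: OS threads from `std::thread` stand in for Tokio tasks, and `ComponentTask`, `spawn_all`, and `shutdown` are hypothetical names. Tracing spans and error handlers are omitted; the sketch only shows storing a handle per component id and using those handles to observe how each task finished.

```rust
use std::collections::HashMap;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a built but not yet spawned component task.
struct ComponentTask {
    id: String,
    work: Box<dyn FnOnce() -> Result<(), String> + Send + 'static>,
}

/// Spawns every task and keeps its handle, keyed by component id.
fn spawn_all(tasks: Vec<ComponentTask>) -> HashMap<String, JoinHandle<Result<(), String>>> {
    tasks
        .into_iter()
        .map(|task| {
            // Naming the thread after the component loosely mirrors
            // attaching per-component context to each running task.
            let handle = thread::Builder::new()
                .name(task.id.clone())
                .spawn(task.work)
                .expect("failed to spawn thread");
            (task.id, handle)
        })
        .collect()
}

/// Waits for every task and returns the sorted ids of those that
/// returned an error or panicked.
fn shutdown(handles: HashMap<String, JoinHandle<Result<(), String>>>) -> Vec<String> {
    let mut failed = Vec::new();
    for (id, handle) in handles {
        match handle.join() {
            Ok(Ok(())) => {}
            _ => failed.push(id),
        }
    }
    failed.sort();
    failed
}
```

Keeping the handles in a map keyed by component id is one simple way to let later operations, such as shutting down a single component, find the task they need.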