Independent technology for modern publishing, memberships, subscriptions and newsletters.

Established Tinybird filter architecture (#25480)

no ref
- Where possible, moved all varieties of filters into a central filter
pipe (session_data). This could get renamed to be more descriptive, but
I struggled to find a good name, and it would introduce more changes to
review.
- Updated tests minimally, because we want to preserve behavior here.
This was largely a matter of removing the tests that depended on the
removed, deprecated pipes/endpoints.
- Added readme describing structure and purpose.

When reviewing the Tinybird endpoints and filter logic, it became clear
there were a number of discrepancies that ultimately led to confusion and
inconsistency. This PR is an attempt to alleviate that. Let's do some
plumbing...

**Overview**
We have several kinds of filters in Tinybird. Let's call them:

- always filters [date filters, site uuid, timezone]
- (page) hit-level filters [pathname, post_uuid, member_status]
- session filters

These need to be treated distinctly.

Always filters are so named because they are always applied - they are
our primary and secondary keys in analytics_events: ENGINE_SORTING_KEY
"site_uuid, timestamp". They optimize queries by reducing the data set.

Page hit data/filters are those that may change on any particular
page_hit. When clicking around, a user is navigating to different
pathnames which have unique post_uuids. If they come to a site and then
log in, their member_status may change (currently).

Session data/filters are those that are wrapped up or attributed to the
session itself. For the various UTM parameters, we set those based on
the first page hit in the session within mv_session_data. Same for
source; these are commonly attribution elements.
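
To make the session-level attribution concrete, here is a minimal
JavaScript sketch of the idea, not the actual mv_session_data
definition. The hit shape (session_id, timestamp, source, utm_source)
and the buildSessionData name are assumptions for illustration only.

```javascript
// Hypothetical hit shape: {session_id, timestamp, pathname, source, utm_source}
// Builds one attribution record per session from that session's earliest hit.
function buildSessionData(hits) {
    const sessions = new Map();
    for (const hit of hits) {
        const current = sessions.get(hit.session_id);
        // Keep the attribution of the earliest hit seen so far for this session
        if (!current || hit.timestamp < current.first_hit_at) {
            sessions.set(hit.session_id, {
                session_id: hit.session_id,
                first_hit_at: hit.timestamp,
                source: hit.source,
                utm_source: hit.utm_source
            });
        }
    }
    return sessions;
}

const exampleHits = [
    {session_id: 's1', timestamp: 2, pathname: '/post/', source: 'google', utm_source: null},
    {session_id: 's1', timestamp: 1, pathname: '/', source: 'newsletter', utm_source: 'weekly'}
];
const exampleSessions = buildSessionData(exampleHits);
```

Later hits in the same session never overwrite the attribution, which
is why these values belong to the session rather than to any one hit.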

**Architecture**
With this in mind, we have to differentiate where and how we filter
by these various categories. We need to make sure when filtering by page
hit data, that we check every hit in a session, but when checking for
session data, that we only check the session (even if we include every
hit from that session in the end results).

This can be accomplished fairly simply: for any query, we filter by page
hit data, then session data, then take an intersection of the two. This
is what filtered_sessions.pipe does.
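
As a rough illustration of that intersection, here is a JavaScript
sketch. It is not the pipe's SQL. The function name, the hit shape, and
the reading that a session passes the hit-level filter when any of its
hits match are assumptions.

```javascript
// Hypothetical hit shape: {session_id, timestamp, pathname, source}
// Returns the ids of sessions that pass both the hit-level and session-level filters.
function filterSessions(hits, hitPredicate, sessionPredicate) {
    // Hit-level: check every hit; a session qualifies if any of its hits match
    const byHit = new Set(hits.filter(hitPredicate).map(hit => hit.session_id));

    // Session-level: check only the session's attribution (taken from its first hit)
    const firstHits = new Map();
    for (const hit of hits) {
        const first = firstHits.get(hit.session_id);
        if (!first || hit.timestamp < first.timestamp) {
            firstHits.set(hit.session_id, hit);
        }
    }
    const bySession = new Set(
        [...firstHits.values()].filter(sessionPredicate).map(hit => hit.session_id)
    );

    // Intersection of the two sets of session ids
    return new Set([...byHit].filter(id => bySession.has(id)));
}

const sampleHits = [
    {session_id: 's1', timestamp: 1, pathname: '/', source: 'google'},
    {session_id: 's1', timestamp: 2, pathname: '/post/', source: 'google'},
    {session_id: 's2', timestamp: 1, pathname: '/post/', source: 'newsletter'},
    {session_id: 's3', timestamp: 1, pathname: '/', source: 'google'}
];
const matchingSessions = filterSessions(
    sampleHits,
    hit => hit.pathname === '/post/',
    session => session.source === 'google'
);
```

In this sample, s2 visits the path but has the wrong source, and s3 has
the right source but never visits the path, so only s1 survives.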

Endpoints themselves only need to specify the return data. And in some
cases, if the filters change the behavior, they may need to be
specified (e.g. api_top_pages needs pathname because otherwise you end
up seeing all session data for that path, which doesn't particularly
make sense in this case).
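
The api_top_pages case can be sketched in JavaScript as follows. This
is only an illustration under assumptions: the topPages function, its
parameters, and the hit shape are hypothetical, and the set of session
ids is assumed to come from the central filter step.

```javascript
// Hypothetical hit shape: {session_id, pathname}
// sessionIds: Set of session ids that already passed the central filters.
// pathname: optional hit-level filter the endpoint re-applies to its own rows.
function topPages(hits, sessionIds, pathname) {
    const counts = new Map();
    for (const hit of hits) {
        if (!sessionIds.has(hit.session_id)) {
            continue;
        }
        // Without this check, every page seen in a matching session would be counted
        if (pathname && hit.pathname !== pathname) {
            continue;
        }
        counts.set(hit.pathname, (counts.get(hit.pathname) || 0) + 1);
    }
    return [...counts.entries()]
        .map(([path, visits]) => ({pathname: path, visits}))
        .sort((a, b) => b.visits - a.visits);
}

const pageHits = [
    {session_id: 's1', pathname: '/'},
    {session_id: 's1', pathname: '/post/'},
    {session_id: 's2', pathname: '/post/'}
];
const withPathname = topPages(pageHits, new Set(['s1']), '/post/');
const withoutPathname = topPages(pageHits, new Set(['s1']));
```

Here withoutPathname also contains the '/' row from session s1, which is
the "all session data for that path" effect described above.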

**Note that site_uuid is unique and kept around for query optimization
across the board within ClickHouse.**

Steve Larson committed 5mo ago

f885a259feaaaf09264ef2421fa13b6568aa94c1

Parent: 56ea3ff

Committed by GitHub <noreply@github.com> on 11/25/2025, 2:00:22 PM