Established Tinybird filter architecture (#25480)
no ref

- Where possible, moved all varieties of filters into a central filter pipe (session_data). This could get renamed to be more descriptive, but I struggled to find a good name, and it would introduce more changes to review.
- Updated tests; minimally, because we want to preserve behavior here. This effort was largely removing the tests that depended on the removed deprecated pipes/endpoints.
- Added readme describing structure and purpose.

When reviewing the Tinybird endpoints and filter logic, it became clear there were a number of discrepancies that ultimately led to confusion and inconsistency. This PR is an attempt to alleviate that.

Let's do some plumbing...

**Overview**

We have several kinds of filters in Tinybird. Let's call them:

- always filters [date filters, site uuid, timezone]
- (page) hit-level filters [pathname, post_uuid, member_status]
- session filters

These need to be treated distinctly.

Always filters are so named because they are always applied - they are our primary and secondary keys in analytics_events: ENGINE_SORTING_KEY "site_uuid, timestamp". They optimize queries by reducing the data set.

Page hit data/filters are those that may change on any particular page_hit. When clicking around, a user is navigating to different pathnames, which have unique post_uuids. If they come to a site and then log in, their member_status may change (currently).

Session data/filters are those that are wrapped up in or attributed to the session itself. For the various UTM parameters, we set those based on the first page hit in the session within mv_session_data. The same goes for source; these are commonly attribution elements.

**Architecture**

With this in mind, we have to differentiate where and how we filter by these various categories. When filtering by page hit data, we need to check every hit in a session; when filtering by session data, we only check the session (even if we include every hit from that session in the end results).
This can be accomplished fairly simply: for any query, we filter by page hit data, then by session data, then take the intersection of the two. This is what filtered_sessions.pipe does.

Endpoints themselves only need to specify the return data. In some cases, if the filters change the behavior, they may need to be specified (e.g. api_top_pages needs pathname because otherwise you end up seeing all session data for that path, which doesn't particularly make sense in this case).

**Note that site_uuid is unique and kept around for query optimization across the board within Clickhouse.**
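
The filter-then-intersect approach described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual pipe (which is Tinybird SQL): the `Hit` record, the `session_attributes` helper, and the `filtered_sessions` function are hypothetical names, and the always filters (site_uuid, date range, timezone) are assumed to have been applied before the hits reach this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical page hit record; field names are illustrative only."""
    session_id: str
    timestamp: int
    pathname: str
    source: str


def session_attributes(hits):
    """Attribute session-level fields from the first hit of each session."""
    first = {}
    for hit in sorted(hits, key=lambda h: h.timestamp):
        first.setdefault(hit.session_id, hit)
    return {sid: {"source": hit.source} for sid, hit in first.items()}


def filtered_sessions(hits, pathname=None, source=None):
    """Return the session ids that pass both hit-level and session-level filters."""
    # Hit-level filter: check every hit; a session qualifies if any hit matches.
    by_hit = {
        h.session_id for h in hits
        if pathname is None or h.pathname == pathname
    }
    # Session-level filter: check only the value attributed to the session.
    by_session = {
        sid for sid, attrs in session_attributes(hits).items()
        if source is None or attrs["source"] == source
    }
    # Intersection of the two sets of sessions.
    return by_hit & by_session


hits = [
    Hit("s1", 1, "/", "google"),
    Hit("s1", 2, "/about/", ""),
    Hit("s2", 1, "/about/", "twitter"),
    Hit("s3", 1, "/", "google"),
]

sessions = filtered_sessions(hits, pathname="/about/", source="google")
# An endpoint would then return data for every hit in the surviving sessions.
result_hits = [h for h in hits if h.session_id in sessions]
```

In this toy data set, only session s1 both visited /about/ and is attributed to google on its first hit, so both of its hits end up in `result_hits`, including the one for `/`.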
Steve Larson committed
f885a259feaaaf09264ef2421fa13b6568aa94c1
Parent: 56ea3ff
Committed by GitHub <noreply@github.com>
on 11/25/2025, 2:00:22 PM