[AISP-312] Add Signals + ML end-to-end tutorial #1256


Closed · wants to merge 59 commits
Commits
0664c70 Initial Signals Docs (Jack-Keene, Apr 7, 2025)
321d5d5 de-marketingify intro (Jack-Keene, Apr 11, 2025)
29af49e add missing image (Jack-Keene, Apr 11, 2025)
9effec5 add note for google colab (Jack-Keene, Apr 11, 2025)
889087a update spelling/grammar/us terms (Jack-Keene, Apr 11, 2025)
a035621 separate sources into batch and stream (Jack-Keene, Apr 11, 2025)
4ea4e8a Add signals docs cli tutorial (#1212) (ilias1111, Apr 11, 2025)
3ed1987 rename filter to criteria (Jack-Keene, Apr 14, 2025)
5810bce remove unused API syntax (Jack-Keene, Apr 14, 2025)
9931fba Add missing batch content (#1218) (agnessnowplow, Apr 25, 2025)
eb235c5 Fix/signals feedback (#1229) (agnessnowplow, May 8, 2025)
535cb5c Apply suggestions from code review (matus-tomlein, May 23, 2025)
7fcc79e Apply suggestions from code review (matus-tomlein, May 23, 2025)
edd31a3 Apply suggestions from code review (matus-tomlein, May 23, 2025)
0505299 Suggestion from review (matus-tomlein, May 23, 2025)
c88ca78 Update with changes in the latest API and SDK version (matus-tomlein, May 23, 2025)
6149798 AISP 446 448 (#1262) (agnessnowplow, May 28, 2025)
5f491c3 Adjust Signals overview for custom entities and interventions (jethron, Jun 10, 2025)
a19364b Bump snowbrige componentVersion to 3.1.1 (#1144) (colmsnowplow, Apr 9, 2025)
294c345 Clarify WebView behaviour (#1203) (mscwilson, Apr 9, 2025)
6ae599a BQ Loader 2.0.1 (#1205) (spenes, Apr 9, 2025)
43854d6 Remove references to Hudi as it’s no longer officially supported (#1204) (stanch, Apr 9, 2025)
a4388bb Add Extensions to features comparison (#1202) (mscwilson, Apr 9, 2025)
a860487 Add docs for GTM Variable Template 1.1.0 (#1163) (greg-el, Apr 9, 2025)
fbc0e61 Lake Loader 0.6.2 (#1211) (istreeter, Apr 11, 2025)
9ff0156 Integration of Shadcn UI and Lucide Icons (#1194) (AH-Avalanche, Apr 14, 2025)
420384d Release Attribution version 0.5.0 (#1215) (github-actions[bot], Apr 16, 2025)
0bb11dc Enrich 5.3.0 (#1173) (spenes, Apr 16, 2025)
2bca02c [create-pull-request] automated change (#1221) (github-actions[bot], Apr 22, 2025)
8a93f15 Upgrade iglu server to 0.14.1 (#1223) (oguzhanunlu, Apr 22, 2025)
89f7bd9 Document JS element tracking plugin (#1224) (jethron, Apr 29, 2025)
07a84c4 Add Bigquery support to the Data Quality Dashboard (#1228) (johnmicahreid, Apr 30, 2025)
2d27ad2 Rename ID service to Cookie Extension service (#1219) (igneel64, May 1, 2025)
1f6d189 Add lake loader 0.6.3 & Mini 0.23.0 (#1231) (oguzhanunlu, May 6, 2025)
fc6360b Add Product Fruits (#1181) (mscwilson, May 7, 2025)
b511ae6 Fix local link references (#1227) (jethron, May 7, 2025)
ffb8aad Document Server Anonymization in PHP tracker (#1216) (jethron, May 7, 2025)
4781fad US spelling (colour > color) (#1234) (jethron, May 8, 2025)
5d97fbb Clarification for Event Specifications plugin (#1235) (igneel64, May 8, 2025)
7accbc8 [create-pull-request] automated change (#1236) (github-actions[bot], May 8, 2025)
ed7edf7 Update index.md (#1230) (radeleye, May 8, 2025)
60973d1 Font System Consolidation, Heading Hierarchy Optimization, and Improv… (AH-Avalanche, May 12, 2025)
60c2f14 [create-pull-request] automated change (#1237) (github-actions[bot], May 12, 2025)
0adfc57 [create-pull-request] automated change (#1238) (github-actions[bot], May 12, 2025)
f903ad2 Shadcn UI Theme Variables Implementation from Design System (#1217) (AH-Avalanche, May 12, 2025)
7cea1be Update custom.css (#1239) (AH-Avalanche, May 13, 2025)
1c8e94c Bug fix for Tutorial Layout (#1241) (AH-Avalanche, May 14, 2025)
b861e3a Fix table name in Web to Unified migration guide (#1207) (mscwilson, May 15, 2025)
8e1fc2e Add Attribution package to trackers table (#1193) (mscwilson, May 15, 2025)
5383204 Add a top-level events section (#1233) (mscwilson, May 15, 2025)
0059c0d Add glossary (#1242) (mscwilson, May 15, 2025)
b2be1b5 Add a page about timestamps (#1247) (mscwilson, May 15, 2025)
68147fb Fix wrong link in Glossary (#1248) (mscwilson, May 19, 2025)
c4766d5 [create-pull-request] automated change (#1251) (github-actions[bot], May 19, 2025)
e4785b5 Updated screenshots to show the new navbar or no nav bar (#1252) (cksnp, May 19, 2025)
578b98e Update dependencies (#1245) (mscwilson, May 20, 2025)
bad4cde Fix mistake in column name (#1255) (mscwilson, May 20, 2025)
0bb9350 Add Signals+ML end-to-end tutorial (pif, May 20, 2025)
1a18feb align with demo notebook (pif, May 20, 2025)
1 change: 1 addition & 0 deletions .github/styles/Snowplow/Acronyms.yml
@@ -10,6 +10,7 @@ second: '(?:\b[A-Z][a-z]+ )+\(([A-Z]{3,5})\)'
exceptions:
- API
- ASP
- CDN
- CLI
- CPU
- CSS
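The `second` pattern in this style flags a run of capitalized words followed by a parenthesized acronym. As an illustrative sketch of what it matches (POSIX ERE has no non-capturing `(?:...)` group, so a plain group stands in for it here; Vale's actual matching semantics may differ):

```shell
# Extract "Capitalized Phrase (ACRO)" occurrences, mirroring the Acronyms.yml pattern
echo "Serve assets via a Content Delivery Network (CDN)." \
  | grep -oE '([A-Z][a-z]+ )+\([A-Z]{3,5}\)'
# prints: Content Delivery Network (CDN)
```

Because `CDN` is listed under `exceptions`, Vale would not raise an alert for this match.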
1 change: 1 addition & 0 deletions .github/styles/Snowplow/Headings.yml
@@ -9,6 +9,7 @@ exceptions:
- CLI
- Cosmos
- Docker
- DOM
- Emmet
- gRPC
- I
2 changes: 2 additions & 0 deletions .github/styles/config/vocabularies/snowplow/accept.txt
@@ -139,3 +139,5 @@ READMEs
[uU]tils
[vV]iewport
[wW]alkthrough

agentic
3 changes: 3 additions & 0 deletions .gitignore
@@ -28,3 +28,6 @@ yarn-error.log*
# Python
__pycache__
manifests

# Local Netlify folder
.netlify
2 changes: 1 addition & 1 deletion README.md
@@ -67,7 +67,7 @@ Vale only checks normal prose. Text that's marked as code—code blocks or in-li

To install the extension, find "[Vale VSCode](https://marketplace.visualstudio.com/items?itemName=ChrisChinchilla.vale-vscode)" in the Extensions Marketplace within VS Code, then click **Install**.

- The Vale extension will automatically check files when they're opened or saved. It underlines the flagged sections in different colours, based on the severity of the alert - red for errors, orange for warnings, and blue for suggestions. Mouse-over the underlined section to see the alert message, or check the VS Code **Problems** tab.
+ The Vale extension will automatically check files when they're opened or saved. It underlines the flagged sections in different colors, based on the severity of the alert - red for errors, orange for warnings, and blue for suggestions. Mouse-over the underlined section to see the alert message, or check the VS Code **Problems** tab.

### Vale command-line interface

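The body of this section is truncated in the diff view. As a hedged sketch of typical Vale CLI usage (paths are illustrative, and Vale must be installed; consult the Vale documentation for the authoritative flag list):

```shell
# Lint one file; Vale discovers .vale.ini by searching upward from the working directory
vale README.md

# Lint the whole docs tree, reporting only errors (warnings and suggestions suppressed)
vale --minAlertLevel=error docs/
```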
2 changes: 1 addition & 1 deletion docs/api-reference/failed-events/index.md
@@ -116,7 +116,7 @@ Adapter failure schema can be found [here](https://github.com/snowplow/iglu-cent

## Tracker protocol violation

- This failure type is produced by the [Enrich](/docs/pipeline/enrichments/index.md) application, when an HTTP request does not conform to our [Snowplow Tracker Protocol](/docs/sources/trackers/snowplow-tracker-protocol/index.md).
+ This failure type is produced by the [Enrich](/docs/pipeline/enrichments/index.md) application, when an HTTP request does not conform to our [Snowplow Tracker Protocol](/docs/events/index.md).

<details>

@@ -19,4 +19,4 @@ To run the loader, mount your config file into the docker image, and then provid
--iglu-config /myconfig/iglu.hocon
`}</CodeBlock>

- Where `loader.hocon` is loader's [configuration file](/docs/api-reference/loaders-storage-targets/bigquery-loader/#configuring-the-loader) and `iglu.hocon` is [iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md) configuration.
+ Where `loader.hocon` is loader's [configuration file](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md#configuring-the-loader) and `iglu.hocon` is [iglu resolver](/docs/api-reference/iglu/iglu-resolver/index.md) configuration.
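The `CodeBlock` above is cut off in this view, showing only the trailing arguments. A full invocation might look like the sketch below; the image name and tag are placeholders, not the real Docker Hub coordinates, and the mount path is illustrative:

```shell
# Mount a local config directory into the container, then point the loader at both files
docker run \
  --mount type=bind,source=/myconfig,destination=/myconfig \
  snowplow/bigquery-loader:latest \
  --config /myconfig/loader.hocon \
  --iglu-config /myconfig/iglu.hocon
```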

This file was deleted.

@@ -9,7 +9,6 @@ import {versions} from '@site/src/componentVersions';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import DeltaConfig from '@site/docs/api-reference/loaders-storage-targets/lake-loader/configuration-reference/_delta_config.md';
- import HudiConfig from '@site/docs/api-reference/loaders-storage-targets/lake-loader/configuration-reference/_hudi_config.md';
import IcebergBigLakeConfig from '@site/docs/api-reference/loaders-storage-targets/lake-loader/configuration-reference/_iceberg_biglake_config.md';
import IcebergGlueConfig from '@site/docs/api-reference/loaders-storage-targets/lake-loader/configuration-reference/_iceberg_glue_config.md';
import PubsubConfig from '@site/docs/api-reference/loaders-storage-targets/lake-loader/configuration-reference/_pubsub_config.md';
@@ -52,23 +51,6 @@ import Admonition from '@theme/Admonition';
</table>
</TabItem>

- <TabItem value="hudi" label="Hudi">
- <Admonition type="note" title="Alternative Docker image">
- To use the Lake Loader with Hudi support, pull the appropriate alternative image from Docker Hub, e.g. <code>snowplow/lake-loader-aws:{`${versions.lakeLoader}`}-hudi</code>.
- </Admonition>
- <table>
- <thead>
- <tr>
- <th>Parameter</th>
- <th>Description</th>
- </tr>
- </thead>
- <tbody>
- <HudiConfig/>
- </tbody>
- </table>
- </TabItem>

</Tabs>

### Streams configuration
@@ -17,7 +17,7 @@ The Lake Loader is an application that loads Snowplow events to a cloud storage

:::info Open Table Formats

- The Lake Loader supports the three major Open Table Formats: [Delta](https://delta.io/), [Iceberg](https://iceberg.apache.org/) and [Hudi](https://hudi.apache.org/).
+ The Lake Loader supports the two major Open Table Formats: [Delta](https://delta.io/) and [Iceberg](https://iceberg.apache.org/).

For Iceberg tables, the loader supports [AWS Glue](https://docs.aws.amazon.com/glue/) as catalog.

@@ -6,10 +6,10 @@ sidebar_position: 0

This is a complete list of the options that can be configured in the postgres loader's HOCON config file. The [example configs in github](https://github.com/snowplow-incubator/snowplow-postgres-loader/tree/master/config) show how to prepare an input file.

<table class="has-fixed-layout"><tbody><tr><td><code>input.type</code></td><td>Required. Can be "Kinesis", "PubSub" or "Local". Configures where input events will be read from.</td></tr><tr><td><code>input.streamName</code></td><td>Required when <code>input.type</code> is Kinesis. Name of the Kinesis stream to read from.</td></tr><tr><td><code>input.region</code></td><td>Required when <code>input.type</code> is Kinesis. AWS region in which the Kinesis stream resides.</td></tr><tr><td><code>input.initialPosition</code></td><td>Optional. Used when <code>input.type</code> is Kinesis. Use "TRIM_HORIZON" (the default) to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Or use "LATEST" to start streaming just after the most recent record in the shard.</td></tr><tr><td><code>input.retrievalMode.type</code></td><td>Optional. When <code>input.type</code> is Kinesis, this sets the polling mode for retrieving records. Can be "FanOut" (the default) or "Polling".</td></tr><tr><td><code>input.retrievalMode.maxRecords</code></td><td>Optional. Used when <code>input.retrievalMode.type</code> is "Polling". Configures how many records are fetched in each poll of the kinesis stream. Default 10000.</td></tr><tr><td><code>input.projectId</code></td><td>Required when <code>input.type</code> is PubSub. The name of your GCP project.</td></tr><tr><td><code>input.subscriptionId</code></td><td>Required when <code>input.type</code> is PubSub. Id of the PubSub subscription to read events from</td></tr><tr><td><code>input.path</code></td><td>Required when <code>input.type</code> is Local. Path for event source. It can be directory or file. If it is directory, all the files under given directory will be read recursively. Also, given path can be both absolute path or relative path w.r.t. executable.</td></tr><tr><td><code>output.good.host</code></td><td>Required. 
Hostname of the postgres database.</td></tr><tr><td><code>output.good.port</code></td><td>Optional. Port number of the postgres database. Default 5432.</td></tr><tr><td><code>output.good.database</code></td><td>Required. Name of the postgres database.</td></tr><tr><td><code>output.good.username</code></td><td>Required. Postgres role name to use when connecting to the database</td></tr><tr><td><code>output.good.password</code></td><td>Required. Password for the postgres user.</td></tr><tr><td><code>output.good.schema</code></td><td>Required. The Postgres schema in which to create tables and write events.</td></tr><tr><td><code>output.good.sslMode</code></td><td>Optional. Configures how the client and server agree on ssl protection. Default "REQUIRE"</td></tr><tr><td><code>output.bad.type</code></td><td>Optional. Can be "Kinesis", "PubSub", "Local" or "Noop". Configures where failed events will be sent. Default is "Noop" which means failed events will be discarded</td></tr><tr><td><code>output.bad.streamName</code></td><td>Required when <code>bad.type</code> is Kinesis. Name of the Kinesis stream to write to.</td></tr><tr><td><code>output.bad.region</code></td><td>Required when <code>bad.type</code> is Kinesis. AWS region in which the Kinesis stream resides.</td></tr><tr><td><code>output.bad.projectId</code></td><td>Required when <code>bad.type</code> is PubSub. The name of your GCP project.</td></tr><tr><td><code>output.bad.topicId</code></td><td>Required when <code>bad.type</code> is PubSub. Id of the PubSub topic to write failed events to</td></tr><tr><td><code>output.bad.path</code></td><td>Required when <code>bad.type</code> is Local. Path of the file to write failed events</td></tr><tr><td><code>purpose</code></td><td>Optional. Set this to "ENRICHED_EVENTS" (the default) when reading the stream of enriched events in tsv format. Set this to "JSON" when reading a stream of self-describing json, e.g. 
snowplow [bad rows](https://github.com/snowplow/iglu-central/tree/master/schemas/com.snowplowanalytics.snowplow.badrows).</td></tr><tr><td><code>monitoring.metrics.cloudWatch</code></td><td>Optional boolean, with default true. For kinesis input, this is used to disable sending metrics to cloudwatch.</td></tr></tbody></table>
<table className="has-fixed-layout"><tbody><tr><td><code>input.type</code></td><td>Required. Can be "Kinesis", "PubSub" or "Local". Configures where input events will be read from.</td></tr><tr><td><code>input.streamName</code></td><td>Required when <code>input.type</code> is Kinesis. Name of the Kinesis stream to read from.</td></tr><tr><td><code>input.region</code></td><td>Required when <code>input.type</code> is Kinesis. AWS region in which the Kinesis stream resides.</td></tr><tr><td><code>input.initialPosition</code></td><td>Optional. Used when <code>input.type</code> is Kinesis. Use "TRIM_HORIZON" (the default) to start streaming at the last untrimmed record in the shard, which is the oldest data record in the shard. Or use "LATEST" to start streaming just after the most recent record in the shard.</td></tr><tr><td><code>input.retrievalMode.type</code></td><td>Optional. When <code>input.type</code> is Kinesis, this sets the polling mode for retrieving records. Can be "FanOut" (the default) or "Polling".</td></tr><tr><td><code>input.retrievalMode.maxRecords</code></td><td>Optional. Used when <code>input.retrievalMode.type</code> is "Polling". Configures how many records are fetched in each poll of the kinesis stream. Default 10000.</td></tr><tr><td><code>input.projectId</code></td><td>Required when <code>input.type</code> is PubSub. The name of your GCP project.</td></tr><tr><td><code>input.subscriptionId</code></td><td>Required when <code>input.type</code> is PubSub. Id of the PubSub subscription to read events from</td></tr><tr><td><code>input.path</code></td><td>Required when <code>input.type</code> is Local. Path for event source. It can be directory or file. If it is directory, all the files under given directory will be read recursively. Also, given path can be both absolute path or relative path w.r.t. executable.</td></tr><tr><td><code>output.good.host</code></td><td>Required. 
Hostname of the postgres database.</td></tr><tr><td><code>output.good.port</code></td><td>Optional. Port number of the postgres database. Default 5432.</td></tr><tr><td><code>output.good.database</code></td><td>Required. Name of the postgres database.</td></tr><tr><td><code>output.good.username</code></td><td>Required. Postgres role name to use when connecting to the database</td></tr><tr><td><code>output.good.password</code></td><td>Required. Password for the postgres user.</td></tr><tr><td><code>output.good.schema</code></td><td>Required. The Postgres schema in which to create tables and write events.</td></tr><tr><td><code>output.good.sslMode</code></td><td>Optional. Configures how the client and server agree on ssl protection. Default "REQUIRE"</td></tr><tr><td><code>output.bad.type</code></td><td>Optional. Can be "Kinesis", "PubSub", "Local" or "Noop". Configures where failed events will be sent. Default is "Noop" which means failed events will be discarded</td></tr><tr><td><code>output.bad.streamName</code></td><td>Required when <code>bad.type</code> is Kinesis. Name of the Kinesis stream to write to.</td></tr><tr><td><code>output.bad.region</code></td><td>Required when <code>bad.type</code> is Kinesis. AWS region in which the Kinesis stream resides.</td></tr><tr><td><code>output.bad.projectId</code></td><td>Required when <code>bad.type</code> is PubSub. The name of your GCP project.</td></tr><tr><td><code>output.bad.topicId</code></td><td>Required when <code>bad.type</code> is PubSub. Id of the PubSub topic to write failed events to</td></tr><tr><td><code>output.bad.path</code></td><td>Required when <code>bad.type</code> is Local. Path of the file to write failed events</td></tr><tr><td><code>purpose</code></td><td>Optional. Set this to "ENRICHED_EVENTS" (the default) when reading the stream of enriched events in tsv format. Set this to "JSON" when reading a stream of self-describing json, e.g. 
snowplow [bad rows](https://github.com/snowplow/iglu-central/tree/master/schemas/com.snowplowanalytics.snowplow.badrows).</td></tr><tr><td><code>monitoring.metrics.cloudWatch</code></td><td>Optional boolean, with default true. For kinesis input, this is used to disable sending metrics to cloudwatch.</td></tr></tbody></table>
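Assembled only from the required options in the table above, a minimal Kinesis-to-Postgres configuration might look like this sketch (stream name, region, host, and credentials are all placeholders):

```hocon
{
  "input": {
    "type": "Kinesis"
    "streamName": "enriched-good"   # placeholder stream name
    "region": "eu-central-1"        # region where the Kinesis stream resides
  }
  "output": {
    "good": {
      "host": "postgres.internal"   # placeholder hostname
      "database": "snowplow"
      "username": "loader"
      "password": "change-me"
      "schema": "atomic"
    }
    # output.bad defaults to "Noop": failed events are discarded
  }
}
```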

#### Advanced options

We believe these advanced options are set to sensible defaults, and hopefully you won't need to ever change them.

<table class="has-fixed-layout"><tbody><tr><td><code>backoffPolicy.minBackoff</code></td><td>If producer (PubSub or Kinesis) fails to send item, it will retry to send it again. This field configures backoff time for first retry. Every retry will double the backoff time of previous one.</td></tr><tr><td><code>backoffPolicy.maxBackoff</code></td><td>Maximum backoff time for retry. After this value is reached, backoff time will no more increase.</td></tr><tr><td><code>input.checkpointSettings.maxBatchSize</code></td><td>Used when <code>input.type</code> is Kinesis. Determines the max number of records to aggregate before checkpointing the records. Default is 1000.</td></tr><tr><td><code>input.checkpointSettings.maxBatchWait</code></td><td>Used when <code>input.type</code> is Kinesis. Determines the max amount of time to wait before checkpointing the records. Default is 10 seconds.</td></tr><tr><td><code>input.checkpointSettings.maxConcurrent</code></td><td>Used when <code>input.type</code> is PubSub. The max number of concurrent evaluation for checkpointer.</td></tr><tr><td><code>output.good.maxConnections</code></td><td>Maximum number of connections database pool is allowed to reach. Default 10</td></tr><tr><td><code>output.good.threadPoolSize</code></td><td>Size of the thread pool for blocking database operations. Default is value of "maxConnections"</td></tr><tr><td><code>output.bad.delayThreshold</code></td><td>Set the delay threshold to use for batching. After this amount of time has elapsed (counting from the first element added), the elements will be wrapped up in a batch and sent. Default 200 milliseconds</td></tr><tr><td><code>output.bad.maxBatchSize</code></td><td>A batch of messages will be emitted when the number of events in batch reaches the given size. Default 500</td></tr><tr><td><code>output.bad.maxBatchBytes</code></td><td>A batch of messages will be emitted when the size of the batch reaches the given size. Default 5 MB</td></tr></tbody></table>
<table className="has-fixed-layout"><tbody><tr><td><code>backoffPolicy.minBackoff</code></td><td>If producer (PubSub or Kinesis) fails to send item, it will retry to send it again. This field configures backoff time for first retry. Every retry will double the backoff time of previous one.</td></tr><tr><td><code>backoffPolicy.maxBackoff</code></td><td>Maximum backoff time for retry. After this value is reached, backoff time will no more increase.</td></tr><tr><td><code>input.checkpointSettings.maxBatchSize</code></td><td>Used when <code>input.type</code> is Kinesis. Determines the max number of records to aggregate before checkpointing the records. Default is 1000.</td></tr><tr><td><code>input.checkpointSettings.maxBatchWait</code></td><td>Used when <code>input.type</code> is Kinesis. Determines the max amount of time to wait before checkpointing the records. Default is 10 seconds.</td></tr><tr><td><code>input.checkpointSettings.maxConcurrent</code></td><td>Used when <code>input.type</code> is PubSub. The max number of concurrent evaluation for checkpointer.</td></tr><tr><td><code>output.good.maxConnections</code></td><td>Maximum number of connections database pool is allowed to reach. Default 10</td></tr><tr><td><code>output.good.threadPoolSize</code></td><td>Size of the thread pool for blocking database operations. Default is value of "maxConnections"</td></tr><tr><td><code>output.bad.delayThreshold</code></td><td>Set the delay threshold to use for batching. After this amount of time has elapsed (counting from the first element added), the elements will be wrapped up in a batch and sent. Default 200 milliseconds</td></tr><tr><td><code>output.bad.maxBatchSize</code></td><td>A batch of messages will be emitted when the number of events in batch reaches the given size. Default 500</td></tr><tr><td><code>output.bad.maxBatchBytes</code></td><td>A batch of messages will be emitted when the size of the batch reaches the given size. Default 5 MB</td></tr></tbody></table>
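The advanced options can be overridden in the same HOCON file. An illustrative sketch (the values shown are placeholders, not recommendations; the table does not document defaults for the backoff fields):

```hocon
"backoffPolicy": {
  "minBackoff": "250 milliseconds"  # first retry delay; doubles on every subsequent retry
  "maxBackoff": "10 seconds"        # ceiling after which the backoff stops growing
}
"output": {
  "good": {
    "maxConnections": 10            # documented default pool size
  }
}
```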