
Commit 6183d6f

Add Scraping a website tutorial (#127)
* feat: add "USE CASES" section to sidebar and introduce new "Analyze Website" documentation
  - Updated astro.config.mjs to include "USE CASES" in navigation
  - Added comprehensive "Analyze Website" guide with step-by-step instructions
  - Created new MDX file with detailed workflow for scraping, AI processing, and database storage
  - Included code snippets, flow definitions, and setup instructions for the new use case
  - Enhanced project documentation to support new feature demonstration and testing

* chore: update documentation, fix formatting, and clarify migration instructions
  - Corrected formatting and indentation in the documentation for better readability
  - Updated CLI version references and environment variable examples
  - Clarified the purpose of separate files and migration files in the workflow
  - Fixed minor typos and improved consistency in code snippets and tips
  - Enhanced troubleshooting section for clarity and completeness
  - Summarized total code size as less than 200 lines for better understanding

* docs: update documentation for analyze-website workflow steps and tips
  - Clarify the purpose of the destination table creation
  - Standardize section headers with hyphens
  - Add a tip about pgflow's advantages
  - Correct formatting of code blocks and step descriptions
  - Improve consistency and clarity across the documentation

* feat: add comprehensive website analysis workflow with modular tasks and DAG definition
  - Introduced new task functions for web scraping, AI summarization, keyword extraction, and database saving
  - Defined a complete flow connecting these tasks with parallel execution of AI steps
  - Added migration and compilation steps for registering the flow in Postgres
  - Updated documentation to guide users through setup, execution, and troubleshooting
  - Enhanced modularity and retry logic for resilient, efficient processing within Postgres environment

* docs: update setup instructions for analyzing websites with enhanced prerequisites and environment variable configuration
  - Clarify required setup steps for local Supabase project and pgflow installation
  - Add links to installation and getting started guides
  - Specify how to add OpenAI API key to environment variables in the project
  - Improve caution note with detailed instructions for setup process

* chore: add database migration for websites analysis table
  Create a new migration file to set up the websites table for storing analysis results, including fields for user ID, URL, AI-generated summary, tags, and timestamps. Apply the migration to update the local database schema.

* docs: update analyze-website workflow documentation to highlight parallel execution and pgflow advantages
  - Added diagram and explanation of parallel AI steps
  - Emphasized pgflow's automatic parallelization and retry logic
  - Clarified benefits for IO-bound operations and execution efficiency

* refactor: update documentation and flow definitions for website analysis workflow
  - Corrected diagram description for clarity
  - Improved wording in database migration instructions
  - Clarified task placement and purpose in code snippets
  - Enhanced flow definition explanation with naming conventions and best practices
  - Broadened comments to improve readability and maintainability of the workflow
  - Minor formatting adjustments for consistency and clarity in the documentation

* docs: Update CLAUDE.md with documentation style guidelines and formatting standards

* docs: update instructions for adding OpenAI API key in analyze-website documentation
  - Clarify the requirement for a valid OpenAI API key for AI analysis steps
  - Add detailed steps to obtain and add the API key to the environment variables
  - Include code block formatting for the environment variable addition
  - Emphasize the importance of the API key to prevent authentication errors

* refactor: organize website analysis tasks into modular functions with documentation
  - Created four specialized task functions for website scraping, AI summarization, tag extraction, and database storage
  - Added detailed inline comments and usage instructions for each function
  - Updated documentation to include function purposes, file locations, and HTML sanitization note
  - Improved code structure to facilitate retries and parallel execution in workflows
  - Minor formatting and clarity improvements across the task files and documentation

* refactor: improve AI response validation and update documentation for HTML sanitization
  - Clarify that regex-based HTML sanitization is simplified and recommend dedicated libraries for production
  - Enhance Zod schema usage with consistent naming (e.g., SummarySchema, TagsSchema)
  - Add detailed comments and tips about OpenAI structured outputs for better validation (this pattern is sketched after this summary)
  - Minor code formatting and documentation improvements for clarity and robustness

* docs: update documentation for task functions, AI analysis, and parallel execution
  Refactors the documentation to clarify the creation of task functions, their purposes, and the use of OpenAI's structured outputs. Highlights the parallel execution feature of pgflow, explaining how it improves efficiency and failure handling. Adds notes on execution flow and updates code snippets for clarity. No functional code changes included.

* chore: update instructions for creating and running the analyze_website worker
  - Added detailed steps for creating the Edge Function worker
  - Included code snippets for worker setup and configuration
  - Modified terminal commands to start the worker process
  - Clarified the process for testing the flow after setup

* chore: update documentation, fix typos, improve clarity, and adjust setup instructions
  - Clarified and corrected wording in the documentation for better readability
  - Fixed typos and improved consistency in the instructions and explanations
  - Updated code snippets and comments for accuracy and clarity
  - Added missing environment variable details and environment setup notes
  - Corrected migration command paths and flow compilation instructions
  - Enhanced notes on production setup and environment configuration
  - Minor formatting adjustments for better flow and comprehension

* refactor: remove user_id from websites table and related code to simplify data model
  - Updated SQL migration to exclude user_id from websites table
  - Removed user_id parameter from saveWebsite function and related types
  - Adjusted flow to no longer pass user_id when saving website data
  - Modified SQL trigger to generate UUIDs without user context
  - Overall, streamlined data structure by eliminating user association from websites records

* docs: update website analysis workflow documentation and related code examples
  Refactor and enhance the documentation for the analyze-website MDX, including improved workflow descriptions, task function details, and flow definition guidelines. Correct version references, clarify setup steps, and add troubleshooting tips. Also, update import statements to reflect the use of JSR packages and modern practices. Minor formatting and consistency improvements across the documentation and code snippets.

* docs: add new naming steps documentation and update flow breakdown with note link
  - Introduced a new MDX file for naming workflow best practices
  - Updated existing flow documentation to include a note linking to naming guidelines
  - Removed detailed in-line design pattern explanation from flow, replacing with reference link
  - Ensured consistency in documentation structure and clarity in referencing naming conventions

* chore: update documentation, rename task file, improve HTML processing, and clarify flow setup
  - Renamed summarizeWithAI.ts to summarize.ts for consistency
  - Enhanced HTML sanitization note to recommend html-to-text for production
  - Updated flow definition to reference the renamed summarize task
  - Improved instructions for running and testing the workflow
  - Clarified environment variable setup and migration steps
  - Minor formatting and wording improvements throughout the documentation

* chore: update environment variables, remove outdated import map instructions, and adjust run commands in documentation
  - Corrected environment variable formatting for OpenAI API key in MDX content
  - Removed deprecated import map example for JSR packages, highlighting current usage
  - Updated code block formatting for start, serve, and curl commands for clarity
  - Modified SQL execution instructions to include proper frame and title annotations

* docs: clarify flow versioning requirement and add Discord contact info
  - Updated documentation to specify that creating a new flow requires a unique slug
  - Added a section with a link to the project's Discord for community support

* docs: clarify flow versioning process and note upcoming development improvements
  Updated documentation to specify the need for creating a new flow with a unique slug when adding or removing steps, and added a note that the process is currently manual but improvements are forthcoming.

* docs: add character usage guidelines and update documentation notes
  - Introduced guidelines on character replacements for documentation
  - Clarified usage of hyphens, straight quotes, ellipsis, and spaces
  - Updated the documentation to emphasize compatibility and best practices
  - Minor formatting adjustments in the documentation file

* feat: add comprehensive website analysis workflow with pgflow
  Implement a 4-step pgflow-based ETL pipeline for web scraping, AI analysis, and database storage. Includes creation of database table, task functions for scraping and AI processing, flow definition, compilation, migration, and setup instructions. Enhances modularity, parallelism, and resilience of the process.

* fix(docs): remove outdated instructions and clarify documentation content
  - Removed deprecated database setup steps
  - Eliminated unnecessary details about API rate-limiting and error handling
  - Cleaned up section on file organization for better clarity

* refactor: update tutorials section and rename use-cases to tutorials
  - Renamed 'use-cases' directory and references to 'tutorials' for clarity
  - Moved analyze-website.md from use-cases to tutorials
  - Added a new tutorials index page with an introductory description and link card

* fix: correct tag schema and response handling in analyze-website tutorial
  - Renamed 'keywords' to 'tags' in the schema for consistency
  - Updated log message to reflect 'tags' instead of 'keywords'
  - Changed model identifier from 'gpt-4o' to 'gpt-4' for accuracy
  - Adjusted response parsing to use 'tags' instead of 'keywords' in output
  - Minor formatting and content updates for clarity and correctness

* fix: update POST request method in scripts and documentation for consistency
  - Changed curl commands from default GET to POST in package.json scripts, documentation, and code snippets
  - Ensured all examples correctly use the -X POST flag for function invocation
  - Minor formatting adjustments for clarity and consistency across multiple files

* docs: update tutorial content, clarify pgflow architecture, and improve instructions
  - Added mention of 4-step ETL process and workflow diagram
  - Simplified explanation of pgflow storing DAGs and execution history
  - Updated database setup instructions for clarity
  - Corrected code block language tags for consistency
  - Enhanced parallel execution explanation and Edge Worker setup steps
  - Improved troubleshooting section with clearer guidance
  - Updated Discord link to reflect community engagement
  - Overall, refined documentation for better readability and accuracy

* chore: add script to replace special characters in files
  Introduce a new script to standardize special characters such as em-dash, curly quotes, curly apostrophes, ellipsis, and non-breaking spaces across files. Update documentation to include usage instructions. This helps improve compatibility and consistency in the codebase. The script is added as a new executable file.

* fix: add environment variable checks for API keys and database URL
  Ensure required environment variables are present before initializing OpenAI and Supabase clients, preventing runtime errors. Also, update documentation references for URL encoding and environment setup, and correct a typo in the compile-time issues section.

* refactor(docs): update tutorial title, description, and related links to reflect AI web scraper workflow
  - Renamed analyze-website.md to build-ai-web-scraper.md
  - Updated tutorial title, description, and content to focus on building an AI-powered web scraper
  - Enhanced tip section to emphasize pgflow's role in managing AI workflows
  - Modified index to link to the new tutorial with updated title and description

* docs: update documentation to clarify that pgflow orchestration is contained within the Supabase project

* refactor(docs): rename tutorial content from build-ai-web-scraper to ai-web-scraper/backend and add new index page
  - Renamed the tutorial MDX file to better reflect its focus on backend implementation
  - Added a new index MDX for the AI Web Scraper tutorial series overview
  - Updated links and descriptions to improve clarity and navigation for users
  - Minor formatting adjustments for consistency and readability

* docs: update tutorial titles and descriptions, and modify index page layout for clarity

* chore: update navigation structure, add tutorials and concepts sections, and revise documentation content
  - Reorganized navigation labels and autogenerate directories for better structure
  - Added new tutorial entry for AI Web Scraper with badge
  - Replaced existing concepts and comparisons sections with new "HOW TO" and "CONCEPTS" sections
  - Updated getting-started guide to include links to tutorials, how-to guides, and concepts
  - Modified tutorial index page to update title, description, and sidebar order
  - Overall improvements to documentation organization and content clarity

* docs: update tutorials metadata and remove deprecated link from getting-started guide
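
The validation pattern referenced in the commits above (Zod schemas plus OpenAI structured outputs) looks roughly like this. It is a hedged sketch, not the tutorial's actual code: the schema name follows the commits (TagsSchema), while the prompt, model choice, and npm: import specifiers (for the Deno-based Edge Function runtime) are assumptions.

```ts
// Hypothetical sketch of the tag-extraction task; names follow the commit
// messages above, but the exact tutorial code is not part of this diff.
import { z } from 'npm:zod';
import OpenAI from 'npm:openai';
import { zodResponseFormat } from 'npm:openai/helpers/zod';

// Schema named per the commits (TagsSchema); "tags" replaced "keywords"
const TagsSchema = z.object({ tags: z.array(z.string()) });

// Environment variable check, as added in the fix commit above
const apiKey = Deno.env.get('OPENAI_API_KEY');
if (!apiKey) throw new Error('OPENAI_API_KEY is not set');

const openai = new OpenAI({ apiKey });

export async function extractTags(content: string): Promise<string[]> {
  const completion = await openai.beta.chat.completions.parse({
    model: 'gpt-4o', // model choice here is illustrative only
    messages: [
      { role: 'system', content: 'Extract 5-10 topical tags from the page content.' },
      { role: 'user', content },
    ],
    // Structured outputs: the response is validated against TagsSchema
    response_format: zodResponseFormat(TagsSchema, 'tags'),
  });

  const parsed = completion.choices[0].message.parsed;
  if (!parsed) throw new Error('Model returned no parsed output');
  return parsed.tags;
}
```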
Parent: 0c1f998 | Commit: 6183d6f

File tree

11 files changed (+693, -12 lines)


CLAUDE.md

Lines changed: 21 additions & 0 deletions
````diff
@@ -15,6 +15,26 @@ Never use:
 
 The only exception is in class names, where "Pgflow" can be used (PascalCase).
 
+## ⚠️ CHARACTER USAGE GUIDELINES ⚠️
+
+**IMPORTANT**: Never use the following characters in documentation or code comments. Always use the alternatives listed below:
+
+- **Em-dash (—)**: Use hyphen (-) instead
+- **Curly quotes (“” ‘’)**: Use straight quotes ("" '') instead
+- **Right single quote/curly apostrophe (’)**: Use straight apostrophe (') instead
+- **Ellipsis character (…)**: Use three periods (...) instead
+- **Non-breaking space**: Use regular space instead
+
+This ensures compatibility across different editors and environments.
+
+### Quick Fix Command
+
+To replace all these characters in a file, use this script:
+
+```bash
+./scripts/replace-special-chars.sh <file_path>
+```
+
 ## ⚠️ MVP STATUS AND DEVELOPMENT PHILOSOPHY ⚠️
 
 **IMPORTANT**: pgflow is currently a Minimum Viable Product (MVP) in very early stages of development. When working on this codebase:
@@ -48,6 +68,7 @@ When suggesting changes or improvements, bias heavily toward solutions that can
 - **Naming**: Use camelCase for variables/functions, PascalCase for classes/types
 - **Error Handling**: Use proper error types and handle errors appropriately
 - **File Structure**: Monorepo structure with packages in pkgs/ directory
+- **Documentation Style**: Use impersonal, factual language. Avoid "we" and "our" when describing technical concepts, flows, or processes. Only use "you" when directly instructing the reader. Focus on what the system does, not who is doing it.
 
 ## Packages
````
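
The shell script itself is added elsewhere in this commit and its body is not shown here. As an illustration only, the same substitutions expressed as a small Node/TypeScript program (a hypothetical equivalent, not the actual script):

```ts
import { readFileSync, writeFileSync } from 'node:fs';
import process from 'node:process';

// Substitutions mirroring the character guidelines above
const replacements: Array<[RegExp, string]> = [
  [/\u2014/g, '-'],         // em-dash -> hyphen
  [/[\u201C\u201D]/g, '"'], // curly double quotes -> straight quotes
  [/[\u2018\u2019]/g, "'"], // curly single quotes / apostrophe -> straight apostrophe
  [/\u2026/g, '...'],       // ellipsis character -> three periods
  [/\u00A0/g, ' '],         // non-breaking space -> regular space
];

const filePath = process.argv[2];
let text = readFileSync(filePath, 'utf8');
for (const [pattern, replacement] of replacements) {
  text = text.replace(pattern, replacement);
}
writeFileSync(filePath, text);
```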

examples/playground/package.json

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@
     "build": "next build",
     "start": "next start",
     "start-functions": "npx supabase@latest functions serve",
-    "start-workers": "for workerIndex in 0 1; do curl http://127.0.0.1:54321/functions/v1/analyze_website_worker_$workerIndex; done",
+    "start-workers": "for workerIndex in 0 1; do curl -X POST http://127.0.0.1:54321/functions/v1/analyze_website_worker_$workerIndex; done",
     "gen-types": "npx supabase@latest gen types --local --schema public --schema pgflow > supabase/functions/database-types.d.ts"
   },
   "dependencies": {
```

pkgs/edge-worker/supabase/call

Lines changed: 1 addition & 2 deletions
```diff
@@ -15,8 +15,7 @@ function_name="$1"
 data="$2"
 
 API_URL=http://localhost:50321
-curl \
-  --request POST \
+curl -X POST \
   "${API_URL}/functions/v1/${function_name}" \
  --header 'Authorization: Bearer '${ANON_KEY}'' \
  --header 'Content-Type: application/json' \
```
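
Both the package.json script and this helper now send POST requests, since the curl default of GET was incorrect for invoking these functions. For reference, the same worker warm-up calls expressed in TypeScript - a sketch assuming the default local Supabase functions port, not code from this commit:

```ts
// Hypothetical warm-up helper mirroring the "start-workers" script above.
const base = 'http://127.0.0.1:54321/functions/v1';

for (const workerIndex of [0, 1]) {
  // POST matches the updated curl -X POST invocation
  const res = await fetch(`${base}/analyze_website_worker_${workerIndex}`, {
    method: 'POST',
  });
  console.log(`analyze_website_worker_${workerIndex}: ${res.status}`);
}
```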

pkgs/website/astro.config.mjs

Lines changed: 24 additions & 6 deletions
```diff
@@ -95,16 +95,34 @@ export default defineConfig({
       autogenerate: { directory: 'getting-started/' },
     },
     {
-      label: 'CONCEPTS',
-      autogenerate: { directory: 'concepts/', collapsed: true },
+      label: 'TUTORIALS',
+      badge: 'NEW!',
+      collapsed: true,
+      items: [
+        {
+          label: 'AI Web Scraper',
+          badge: 'NEW!',
+          autogenerate: {
+            directory: 'tutorials/ai-web-scraper/',
+            collapsed: true,
+          },
+        },
+      ],
     },
     {
-      label: 'COMPARISONS',
-      autogenerate: { directory: 'comparisons/', collapsed: true },
+      label: 'HOW TO',
+      collapsed: true,
+      autogenerate: { directory: 'how-to/' },
     },
     {
-      label: 'HOW TO',
-      autogenerate: { directory: 'how-to/', collapsed: true },
+      label: 'CONCEPTS',
+      collapsed: true,
+      autogenerate: { directory: 'concepts/' },
+    },
+    {
+      label: 'COMPARISONS',
+      collapsed: true,
+      autogenerate: { directory: 'comparisons/' },
     },
     {
       label: 'FAQ - Common Questions',
```

pkgs/website/src/content/docs/edge-worker/getting-started/create-first-worker.mdx

Lines changed: 1 addition & 1 deletion
````diff
@@ -66,7 +66,7 @@ Before starting, please read the [Install Edge Worker](/edge-worker/getting-star
 (replace `<function-name>` with your function name):
 
 ```bash frame="none"
-curl http://localhost:54321/functions/v1/<function-name>
+curl -X POST http://localhost:54321/functions/v1/<function-name>
 ```
 
 This will boot a new instance and start your worker:
````
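
For context, the function being booted above typically wraps a message handler in the Edge Worker. A minimal sketch, assuming the JSR package name referenced elsewhere in this commit; the exact API surface is documented in the guide itself:

```ts
// Hypothetical minimal worker; see the Edge Worker guide for the real API.
import { EdgeWorker } from 'jsr:@pgflow/edge-worker';

EdgeWorker.start(async (payload: { url: string }) => {
  // Process one queued message; a thrown error triggers the worker's retries
  console.log('processing', payload.url);
});
```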

pkgs/website/src/content/docs/getting-started/run-flow.mdx

Lines changed: 22 additions & 2 deletions
````diff
@@ -81,7 +81,7 @@ serving the request with supabase/functions/greet_user_worker
 In a new terminal, send an HTTP request to start your worker:
 
 ```bash frame="none"
-curl http://localhost:54321/functions/v1/greet_user_worker
+curl -X POST http://localhost:54321/functions/v1/greet_user_worker
 ```
 
 You should see output in your Edge Runtime terminal indicating the worker has started:
@@ -195,10 +195,30 @@ This completes the pgflow getting started guide. You now have all the basics nee
 
 ## Next steps
 
+### Tutorials
+
+Put your pgflow knowledge into practice with hands-on tutorials:
+
+<CardGrid>
+  <LinkCard title="Build an AI Web Scraper" href="/tutorials/ai-web-scraper/" description="Create a workflow that scrapes webpages, analyzes content with OpenAI, and stores results in Postgres"/>
+</CardGrid>
+
+### How-to Guides
+
+Learn essential pgflow techniques:
+
 <CardGrid>
   <LinkCard title="Monitor flow execution" href="/how-to/monitor-flow-execution/" description="Learn how to monitor and observe your pgflow workflow execution in detail"/>
   <LinkCard title="Organize Flows code" href="/how-to/organize-flows-code/" description="Learn how to structure your pgflow code for maintainability and reusability"/>
-  <LinkCard title="Change existing Flow options" href="/how-to/update-flow-options/" description="Learn how to safely update configuration options for existing flows"/>
+  <LinkCard title="Create reusable tasks" href="/how-to/create-reusable-tasks/" description="Build modular tasks that can be shared across multiple workflows"/>
   <LinkCard title="Version your Flows" href="/how-to/version-your-flows/" description="Learn how to safely update your flows without breaking existing runs"/>
+</CardGrid>
+
+### Concepts
+
+Deepen your understanding of pgflow:
+
+<CardGrid>
+  <LinkCard title="How pgflow works" href="/concepts/how-pgflow-works/" description="Understand the core architecture and execution model of pgflow"/>
   <LinkCard title="Understand Flow DSL" href="/concepts/flow-dsl/" description="How pgflow's TypeScript DSL works to create type-safe, data-driven workflows"/>
 </CardGrid>
````
Lines changed: 41 additions & 0 deletions (new file)
````diff
@@ -0,0 +1,41 @@
+---
+title: Naming workflow steps effectively
+description: Best practices for naming steps in your pgflow flows
+---
+
+Step naming is an important design decision that affects workflow readability and maintainability. After analyzing multiple pgflow projects, these patterns have proven effective.
+
+## Recommended approach: Hybrid naming
+
+- Use **nouns** for steps that produce data other steps depend on
+- Use **verb-noun** combinations for terminal actions or utility steps
+
+```ts
+// Data-producing steps use nouns
+.step({ slug: "website" }, ...)
+.step({ slug: "summary", dependsOn: ["website"] }, ...)
+
+// Terminal action step uses verb-noun
+.step({ slug: "saveToDb", dependsOn: ["summary"] }, ...)
+```
+
+## Why this works well
+
+1. When accessing data from dependent steps, nouns create more intuitive property access:
+```ts
+// Clean and reads naturally
+({ website }) => summarizeWithAI(website.content)
+```
+
+2. Terminal steps that don't have any dependents benefit from action-oriented naming that clearly describes what they're doing
+
+## Use camelCase for step slugs
+
+```ts
+.step({ slug: "websiteContent" }, ...) // Correct
+.step({ slug: "website_content" }, ...) // Avoid
+```
+
+Step slugs are used as identifiers in TypeScript and must match exactly when referenced in dependency arrays. Following JavaScript conventions with camelCase helps maintain consistency.
+
+While this guide recommends the hybrid pattern, the most important thing is consistency within your project. Document the chosen convention and apply it throughout the codebase.
````
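
To see the convention end to end, here is a sketch of a complete flow using hybrid naming. The @pgflow/dsl import path and the stub task bodies are illustrative assumptions, not code from this commit:

```ts
import { Flow } from '@pgflow/dsl';

// Stub tasks so the sketch is self-contained; real tasks would scrape a page,
// call an AI model, and write to Postgres.
const fetchWebsite = async (url: string) => ({ content: `<html from ${url}>` });
const summarizeWithAI = (content: string) => `summary of: ${content}`;
const saveToDb = async (summary: string) => ({ saved: true, summary });

// Nouns ("website", "summary") for data-producing steps,
// verb-noun ("saveToDb") for the terminal action.
export default new Flow<{ url: string }>({ slug: 'analyze_website' })
  .step({ slug: 'website' }, ({ run }) => fetchWebsite(run.url))
  .step({ slug: 'summary', dependsOn: ['website'] }, ({ website }) =>
    summarizeWithAI(website.content),
  )
  .step({ slug: 'saveToDb', dependsOn: ['summary'] }, ({ summary }) =>
    saveToDb(summary),
  );
```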
