Commit 2bb69f0

AndreaFrancis, severo, and lhoestq authored
Datasets: Adding doc for DuckDB CLI integration (#1297)
* Adding doc for duckdb cli integration
* Apply code review suggestions
* Apply suggestions from code review

Co-authored-by: Sylvain Lesage <sylvain.lesage@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Apply code review suggestions
* Apply suggestions from code review

Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>

* Fix statistics output
* Adding ref for other APIs
* Add more information about when to use read_parquet

---------

Co-authored-by: Sylvain Lesage <sylvain.lesage@huggingface.co>
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
1 parent a70eb7e commit 2bb69f0

7 files changed, +594 -24 lines changed

docs/hub/_toctree.yml

Lines changed: 11 additions & 0 deletions
@@ -165,6 +165,17 @@
       title: Datasets
     - local: datasets-duckdb
       title: DuckDB
+      sections:
+      - local: datasets-duckdb-auth
+        title: Authentication for private and gated datasets
+      - local: datasets-duckdb-select
+        title: Query datasets
+      - local: datasets-duckdb-sql
+        title: Perform SQL operations
+      - local: datasets-duckdb-combine-and-export
+        title: Combine datasets and export
+      - local: datasets-duckdb-vector-similarity-search
+        title: Perform vector similarity search
     - local: datasets-pandas
       title: Pandas
     - local: datasets-webdataset

docs/hub/datasets-duckdb-auth.md

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# Authentication for private and gated datasets

To access private or gated datasets, you need to configure your Hugging Face token in the DuckDB Secrets Manager.

Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token.

DuckDB supports two providers for managing secrets:

- `CONFIG`: Requires the user to pass all configuration information into the `CREATE SECRET` statement.
- `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from `~/.cache/huggingface/token`.

For more information about DuckDB secrets, visit the [Secrets Manager](https://duckdb.org/docs/configuration/secrets_manager.html) guide.

## Creating a secret with `CONFIG` provider

To create a secret using the `CONFIG` provider, use the following command:

```bash
CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token');
```

Replace `your_hf_token` with your actual Hugging Face token.
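
If you generate this statement from a script, the token has to stay inside the SQL string literal. The following is a minimal Python sketch of one way to do that. The helper name `create_secret_sql` is hypothetical and not part of DuckDB or any Hugging Face library; it only builds the statement text shown above.

```python
def create_secret_sql(token: str, name: str = "hf_token") -> str:
    """Build the CREATE SECRET statement shown above for a given token.

    Hypothetical helper for illustration; it only formats text.
    """
    if not name.isidentifier():
        raise ValueError(f"invalid secret name: {name!r}")
    # Double any single quote so the token cannot end the SQL string literal early.
    escaped = token.replace("'", "''")
    return f"CREATE SECRET {name} (TYPE HUGGINGFACE, TOKEN '{escaped}');"


print(create_secret_sql("your_hf_token"))
# prints: CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token');
```

The resulting string can be pasted into the DuckDB CLI. Avoid writing a real token to logs or shell history.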

## Creating a secret with `CREDENTIAL_CHAIN` provider

To create a secret using the `CREDENTIAL_CHAIN` provider, use the following command:

```bash
CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
```

This command automatically retrieves the stored token from `~/.cache/huggingface/token`.

For this to work, you first need to [log in with your Hugging Face account](../huggingface_hub/quick-start#login), for example using:

```bash
huggingface-cli login
```

Alternatively, you can set your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxx"
```
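
Before creating the secret, it can help to check whether a token is present in either of the two places mentioned above. The Python sketch below is a hypothetical diagnostic helper, not DuckDB's actual lookup logic. In particular, checking the environment variable before the file is an assumption made for this example. It reports only where a token was found and never prints the token itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hf_token_source(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> Optional[str]:
    """Return a description of where a Hugging Face token was found, or None.

    Hypothetical helper; the lookup order (env var, then file) is an assumption.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # 1. The HF_TOKEN environment variable.
    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN environment variable"

    # 2. The token file under ~/.cache/huggingface/token.
    token_file = home / ".cache" / "huggingface" / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return str(token_file)

    return None


print(find_hf_token_source() or "no token found")
```

If the helper reports that no token was found, log in or export `HF_TOKEN` as described above before creating the secret.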

For more information on authentication, see the [Hugging Face authentication](https://huggingface.co/docs/huggingface_hub/main/en/quick-start#authentication) documentation.
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# Combine datasets and export

In this section, we'll demonstrate how to combine two datasets and export the result. The first dataset is in CSV format, and the second dataset is in Parquet format. Let's start by examining our datasets:

The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards):

```bash
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;
┌─────────┬──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
│ id │ image_url │ caption │ name │ hp │ set_name │
│ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │
├─────────┼──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
│ pl3-1 │ https://images.pok… │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. It has … │ Absol G │ 70 │ Supreme Victors │
│ ex12-1 │ https://images.pok… │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │ 70 │ Legend Maker │
│ xy5-1 │ https://images.pok… │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle │ 50 │ Primal Clash │
└─────────┴──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘
```

And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions):

```bash
FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3;

┌──────────────────────┬───────────┬──────────┬──────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ image │ name_en │ name_zh │ text_en │ text_zh │
│ struct(bytes blob,… │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼───────────┼──────────┼──────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {'bytes': \x89PNG\… │ abomasnow │ 暴雪王 │ Grass attributes,Blizzard King standing on two feet, with … │ 草属性,双脚站立的暴雪王,全身白色的绒毛,淡紫色的眼睛,几缕长条装的毛皮盖着它的嘴巴 │
│ {'bytes': \x89PNG\… │ abra │ 凯西 │ Super power attributes, the whole body is yellow, the head… │ 超能力属性,通体黄色,头部外形类似狐狸,尖尖鼻子,手和脚上都有三个指头,长尾巴末端带着一个褐色圆环 │
│ {'bytes': \x89PNG\… │ absol │ 阿勃梭鲁 │ Evil attribute, with white hair, blue-gray part without ha… │ 恶属性,有白色毛发,没毛发的部分是蓝灰色,头右边类似弓的角,红色眼睛 │
└──────────────────────┴───────────┴──────────┴──────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

Now, let's combine these two datasets by joining the `name` column of the first dataset with the `name_en` column of the second. The wiki names are lowercase, so the card names are lowercased before the comparison:

```bash
SELECT a.image_url
     , a.caption AS card_caption
     , a.name
     , a.hp
     , b.text_en AS wiki_caption
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
ON LOWER(a.name) = b.name_en
LIMIT 3;

┌──────────────────────┬──────────────────────┬────────────┬───────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ image_url │ card_caption │ name │ hp │ wiki_caption │
│ varchar │ varchar │ varchar │ int64 │ varchar │
├──────────────────────┼──────────────────────┼────────────┼───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ https://images.pok… │ A Stage 1 Pokemon … │ Aerodactyl │ 70 │ A Pokémon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like … │
│ https://images.pok… │ A Basic Pokemon Ca… │ Weedle │ 50 │ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous need… │
│ https://images.pok… │ A Basic Pokemon Ca… │ Caterpie │ 50 │ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two p… │
└──────────────────────┴──────────────────────┴────────────┴───────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```
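
To see why the join condition wraps the card name in `LOWER`, here is a small offline sketch. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without network access, and the rows are toy data invented for this example, not taken from the real datasets. The table names `cards` and `wiki` are likewise hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE cards (name TEXT, hp INTEGER);
    CREATE TABLE wiki (name_en TEXT, text_en TEXT);
    -- Toy rows: card names are capitalized, wiki names are lowercase.
    INSERT INTO cards VALUES ('Weedle', 50), ('Absol G', 70);
    INSERT INTO wiki VALUES ('weedle', 'toy caption for weedle'),
                            ('absol', 'toy caption for absol');
    """
)

# A case-sensitive comparison finds no matching rows.
exact_matches = con.execute(
    "SELECT COUNT(*) FROM cards a JOIN wiki b ON a.name = b.name_en"
).fetchone()[0]

# Lowercasing the card name lets 'Weedle' match 'weedle'.
lowered_rows = con.execute(
    "SELECT a.name, a.hp, b.text_en "
    "FROM cards a JOIN wiki b ON LOWER(a.name) = b.name_en"
).fetchall()

print(exact_matches)  # 0
print(lowered_rows)   # [('Weedle', 50, 'toy caption for weedle')]
```

Note that `'Absol G'` still has no match after lowercasing, because `'absol g'` differs from `'absol'`. An inner join silently drops such rows, so the joined result only contains cards whose lowercased name appears exactly in the wiki dataset.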

We can export the result to a Parquet file using the `COPY` command:

```bash
COPY (SELECT a.image_url
           , a.caption AS card_caption
           , a.name
           , a.hp
           , b.text_en AS wiki_caption
      FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
      JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
      ON LOWER(a.name) = b.name_en)
TO 'output.parquet' (FORMAT PARQUET);
```

Let's validate the new Parquet file:

```bash
SELECT COUNT(*) FROM 'output.parquet';

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         9460 │
└──────────────┘
```

<Tip>

You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export) formats.

</Tip>

Finally, let's push the resulting dataset to the Hub. You can upload your Parquet file with the Hub UI, the `huggingface_hub` client library, or other tools; see the [dataset upload guide](./datasets-adding) for more information.
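
As one possible route, the upload can be scripted with the `huggingface_hub` client library. The sketch below assumes the library is installed (`pip install huggingface_hub`) and that you are already authenticated. The wrapper name `upload_parquet` and the default `path_in_repo` are choices made for this example, not something prescribed by the Hub.

```python
def upload_parquet(
    local_path: str,
    repo_id: str,
    path_in_repo: str = "data/output.parquet",
) -> None:
    """Upload a local Parquet file to a dataset repository on the Hub.

    Hypothetical wrapper for illustration; requires network access and a valid token.
    """
    # Imported here so the module can be loaded without the third-party package.
    from huggingface_hub import HfApi

    api = HfApi()
    # Create the dataset repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Upload the file exported by the COPY command.
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

You would call it as `upload_parquet("output.parquet", "your-username/your-dataset")`, where the repository ID is a placeholder to replace with your own.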

And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.
