
Commit edec6fb

Merge pull request #15 from pinecone-io/dataset-save
Dataset save
2 parents: 9ef8f21 + c0d27f8

17 files changed, 1,118 additions and 3,196 deletions

.github/workflows/ci.yml

Lines changed: 6 additions & 3 deletions

@@ -22,15 +22,18 @@ jobs:
         with:
           version: 1.3.2
       - name: install dependencies
-        run: poetry install --with dev
+        run: poetry install --with dev --all-extras
      - name: Run Black Check
        run: poetry run black --check .
      - name: Run mypy check
        run: poetry run mypy .
      - name: Run pytest
        env:
-          S3_ACCESS_KEY: ${{ secrets.S3_ACCESS_KEY }}
-          S3_SECRET: ${{ secrets.S3_SECRET }}
+          PY_VERSION: ${{ matrix.python-version }}
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
+          PINECONE_ENVIRONMENT: ${{ secrets.PINECONE_ENVIRONMENT }}
        run: poetry run pytest --html=report.html --cov pinecone_datasets
      - name: upload pytest report.html
        uses: actions/upload-artifact@v3
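
For context, a minimal sketch of how a test might consume the environment variables the workflow now exports. The helper below is hypothetical and not part of this commit; it only shows the variable names used above.

```python
# Illustrative only: reads the same variable names the CI workflow exports.
# The helper itself is hypothetical and not part of this commit.
import os


def pinecone_settings_from_env() -> dict:
    # Fail fast with a clear message if a required secret is missing.
    required = ["PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {missing}")
    return {
        "api_key": os.environ["PINECONE_API_KEY"],
        "environment": os.environ["PINECONE_ENVIRONMENT"],
        # AWS credentials are typically picked up implicitly by s3fs/boto3
        # when reading s3:// paths, so they are treated as optional here.
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }
```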

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -7,4 +7,5 @@ dist/
 scratchpad.ipynb
 .pycache/
 .pytest_cache/
-.coverage
+.coverage
+poetry.lock

README.md

Lines changed: 112 additions & 97 deletions

@@ -6,11 +6,13 @@
 pip install pinecone-datasets
 ```
 
-## Usage
+## Usage - Loading
 
-You can use Pinecone Datasets to load our public datasets or with your own dataset.
+You can use Pinecone Datasets to load our public datasets or your own datasets. The library can be used in two main ways: ad-hoc loading of a dataset from a path, or as a catalog loader for datasets.
 
-### Loading Pinecone Public Datasets
+### Loading Pinecone Public Datasets (catalog)
+
+Pinecone hosts a public datasets catalog; you can list and load a dataset by name using the `list_datasets` and `load_dataset` functions. This uses the default catalog endpoint (currently GCS).
 
 ```python
 from pinecone_datasets import list_datasets, load_dataset
@@ -33,159 +35,172 @@ dataset.head()
 # └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
 ```
 
+### Expected dataset structure
 
-### Iterating over a Dataset documents and queries
+Pinecone Datasets can load a dataset from any storage it has access to (using the default access: S3, GCS or local permissions).
 
-Iterating over documents is useful for upserting but also for different updating. Iterating over queries is helpful in benchmarking
+We expect data to be uploaded with the following directory structure:
 
-```python
+├── my-subdir                  # path to where all datasets
+│   ├── my-dataset             # name of dataset
+│   │   ├── metadata.json      # dataset metadata (optional, only for listed)
+│   │   ├── documents          # dataset documents
+│   │   │   ├── file1.parquet
+│   │   │   └── file2.parquet
+│   │   ├── queries            # dataset queries
+│   │   │   ├── file1.parquet
+│   │   │   └── file2.parquet
+└── ...
 
-# List Iterator, where every list of size N Dicts with ("id", "metadata", "values", "sparse_values")
-dataset.iter_documents(batch_size=n)
+The data schema is expected to be as follows:
 
-dataset.iter_queries()
+- The `documents` directory contains parquet files with the following schema:
+    - Mandatory: `id: str`, `values: list[float]`
+    - Optional: `sparse_values: Dict(indices: List[int], values: List[float])`, `metadata: Dict`, `blob: dict`
+        - Note: `blob` is a dict that can contain any data. It is not returned when iterating over the dataset and is intended for storing additional data that is not part of the dataset schema, such as a document's text. In a future version this may become a first-class citizen of the dataset schema.
+- The `queries` directory contains parquet files with the following schema:
+    - Mandatory: `vector: list[float]`, `top_k: int`
+    - Optional: `sparse_vector: Dict(indices: List[int], values: List[float])`, `filter: Dict`
+        - Note: `filter` is a dict containing Pinecone filters; for more information see [here](https://docs.pinecone.io/docs/metadata-filtering)
+
+In addition, a metadata file is expected to be in the dataset directory, for example: `s3://my-bucket/my-dataset/metadata.json`
+
+```python
+from pinecone_datasets.catalog import DatasetMetadata
 
+meta = DatasetMetadata(
+    name="test_dataset",
+    created_at="2023-02-17 14:17:01.481785",
+    documents=2,
+    queries=2,
+    source="manual",
+    bucket="LOCAL",
+    task="unittests",
+    dense_model={"name": "bert", "dimension": 3},
+    sparse_model={"name": "bm25"},
+)
 ```
 
-### upserting to Index
+The full metadata schema can be found in `pinecone_datasets.catalog.DatasetMetadata.schema`.
+
+### Loading your own dataset from catalog
+
+To set your own catalog endpoint, set the environment variable `DATASETS_CATALOG_BASEPATH` to your bucket. Note that Pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).
 
 ```bash
-pip install pinecone-client
+export DATASETS_CATALOG_BASEPATH="s3://my-bucket/my-subdir"
 ```
 
 ```python
-import pinecone
-pinecone.init(api_key="API_KEY", environment="us-west1-gcp")
+from pinecone_datasets import list_datasets, load_dataset
+
+list_datasets()
+
+# ["my-dataset", ... ]
 
-pinecone.create_index(name="my-index", dimension=384, pod_type='s1')
+dataset = load_dataset("my-dataset")
+```
 
-index = pinecone.Index("my-index")
+### Loading your own dataset from path
 
-# you can iterate over documents in batches
-for batch in dataset.iter_documents(batch_size=100):
-    index.upsert(vectors=batch)
+You can load your own dataset from a local path or a remote path (GCS or S3). Note that Pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).
 
-# or upsert the dataset as dataframe
-index.upsert_from_dataframe(dataset.drop(columns=["blob"]))
+```python
+from pinecone_datasets import Dataset
 
-# using gRPC
-index = pinecone.GRPCIndex("my-index")
+dataset = Dataset("s3://my-bucket/my-subdir/my-dataset")
 ```
 
-## Advanced Usage
+### Loading from a pandas dataframe
+
+Pinecone Datasets enables you to load a dataset from a pandas dataframe. This is useful for loading a dataset from a local file and saving it to remote storage.
+The minimal required data is a documents dataset, and the minimal required columns are `id` and `values`. The `id` column is a unique identifier for the document, and the `values` column is a list of floats representing the document vector.
 
-### Working with your own dataset storage
+```python
+import pandas as pd
 
-Datasets is using Pinecone's public datasets bucket on GCS, you can use your own bucket by setting the `DATASETS_CATALOG_BASEPATH` environment variable.
+df = pd.read_parquet("my-dataset.parquet")
 
-```bash
-export PINECONE_DATASETS_ENDPOINT="gs://my-bucket"
+dataset = Dataset.from_pandas(df)
 ```
 
-this will change the default endpoint to your bucket, and upon calling `list_datasets` or `load_dataset` it will scan your bucket and list all datasets.
+Please check the documentation for more information on the expected dataframe schema. There is also a column mapping variable that can be used to map the dataframe columns to the expected schema.
 
-Note that you can also use `s3://` as a prefix to your bucket.
 
-### Authenication to your own bucket
+## Usage - Accessing data
 
-For now, Pinecone Datastes supports only GCS and S3 buckets, and with default authentication as provided by the fsspec implementation, respectively: `gcsfs` and `s3fs`.
+Pinecone Datasets is built on top of pandas. This means that you can use the full pandas API to access the data. In addition, we provide some helper functions to access the data in a more convenient way.
 
-### Using aws key/secret authentication methods
+### Accessing documents and queries dataframes
 
-first, to set a new endpoint, set the environment variable `PINECONE_DATASETS_ENDPOINT` to your bucket.
+Accessing the documents and queries dataframes is done using the `documents` and `queries` properties. These properties are lazy and will only load the data when accessed.
 
-```bash
-export PINECONE_DATASETS_ENDPOINT="s3://my-bucket"
+```python
+document_df: pd.DataFrame = dataset.documents
+
+query_df: pd.DataFrame = dataset.queries
 ```
 
-then, you can use the `key` and `secret` parameters to pass your credentials to the `list_datasets` and `load_dataset` functions.
 
-```python
-st = list_datasets(
-    key=os.environ.get("S3_ACCESS_KEY"),
-    secret=os.environ.get("S3_SECRET"),
-)
-
-ds = load_dataset(
-    "test_dataset",
-    key=os.environ.get("S3_ACCESS_KEY"),
-    secret=os.environ.get("S3_SECRET"),
-)
-```
+## Usage - Iterating
 
-## For developers
+One of the main use cases for Pinecone Datasets is iterating over a dataset. This is useful for upserting a dataset to an index, or for benchmarking. It is also useful for iterating over large datasets; as of today, datasets are not yet lazy, but we are working on it.
 
-This project is using poetry for dependency managemet. supported python version are 3.8+. To start developing, on project root directory run:
 
-```bash
-poetry install --with dev
-```
+```python
 
-To run test locally run
+# List iterator, where every batch is a list of N dicts with ("id", "values", "sparse_values", "metadata")
+dataset.iter_documents(batch_size=n)
+
+# Dict iterator, where every dict has ("vector", "sparse_vector", "filter", "top_k")
+dataset.iter_queries()
 
-```bash
-poetry run pytest --cov pinecone_datasets
 ```
 
-To create a pinecone-public dataset you may need to generate a dataset metadata. For example:
+### The 'blob' column
+
+Pinecone datasets ship with a `blob` column, which is intended for storing additional data that is not part of the dataset schema, for example a document's text. We added a utility function to move data from the blob column to the metadata column. This is useful, for example, when upserting a dataset to an index and you want to use the metadata to store text data.
 
 ```python
-from pinecone_datasets.catalog import DatasetMetadata
+from pinecone_datasets import import_documents_keys_from_blob_to_metadata
 
-meta = DatasetMetadata(
-    name="test_dataset",
-    created_at="2023-02-17 14:17:01.481785",
-    documents=2,
-    queries=2,
-    source="manual",
-    bucket="LOCAL",
-    task="unittests",
-    dense_model={"name": "bert", "dimension": 3},
-    sparse_model={"name": "bm25"},
-)
+new_dataset = import_documents_keys_from_blob_to_metadata(dataset, keys=["text"])
 ```
 
-to see the complete schema you can run:
 
-```python
-meta.schema()
-```
+### Upserting to an Index
+
+When upserting a Dataset to an Index, only the document data will be upserted to the index. The queries data will be ignored.
 
-in order to list a dataset you can save dataset metadata (NOTE: write permission to loacaion is needed)
+TODO: add example for API key and environment variables
 
 ```python
-dataset = Dataset("non-listed-dataset")
-dataset._save_metadata(meta)
-```
+ds = load_dataset("dataset_name")
 
-### Uploading and listing a dataset.
+# If the index exists
+ds.to_index("index_name")
 
-pinecone datasets can load dataset from every storage where it has access (using the default access: s3, gcs or local permissions)
+# If the index does not exist, use create_index=True; this will create the index with the default Pinecone settings and the dimension from the dataset metadata.
+ds.to_index("index_name", create_index=True)
 
-we expect data to be uploaded to the following directory structure:
+```
 
-├── base_path # path to where all datasets
-│   ├── dataset_id # name of dataset
-│   │   ├── metadata.json # dataset metadata (optional, only for listed)
-│   │   ├── documents # datasets documents
-│   │   │   ├── file1.parquet
-│   │   │   └── file2.parquet
-│   │   ├── queries # dataset queries
-│   │   │   ├── file1.parquet
-│   │   │   └── file2.parquet
-└── ...
+The `to_index` function also accepts additional parameters:
 
-a listed dataset is a dataset that is loaded and listed using `load_dataset` and `list_dataset`
-pinecone datasets will scan storage and will list every dataset with metadata file, for example: `s3://my-bucket/my-dataset/metadata.json`
+* `batch_size` and `concurrency` - for controlling the upserting process
+* `kwargs` - for passing additional parameters to the index creation process
 
-### Accessing a non-listed dataset
 
-to access a non listed dataset you can directly load it via:
+## For developers
 
-```python
-from pinecone_datasets import Dataset
+This project uses poetry for dependency management. Supported Python versions are 3.8+. To start developing, run the following in the project root directory:
 
-dataset = Dataset("non-listed-dataset")
+```bash
+poetry install --with dev
 ```
 
+To run tests locally, run:
 
+```bash
+poetry run pytest --cov pinecone_datasets
+```
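
The README above names `batch_size` and `concurrency` as extra `to_index` parameters but does not show them in a call. Below is a minimal sketch of such a call; the dataset name, index name, and parameter values are placeholders, not recommendations.

```python
from pinecone_datasets import load_dataset

ds = load_dataset("dataset_name")  # placeholder dataset name

# Upserts the documents only (queries are ignored). create_index=True builds the
# index from the dataset metadata if it does not exist; batch_size and concurrency
# tune the upsert process. The values below are arbitrary placeholders.
ds.to_index(
    "index_name",
    create_index=True,
    batch_size=300,
    concurrency=10,
)
```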

pinecone_datasets/__init__.py

Lines changed: 3 additions & 2 deletions

@@ -2,8 +2,9 @@
 .. include:: ../README.md
 """
 
+__version__ = "0.5.0"
 
-__version__ = "0.4.0-alpha"
 
-from .dataset import Dataset
+from .dataset import Dataset, DatasetInitializationError
 from .public import list_datasets, load_dataset
+from .catalog import DatasetMetadata, DenseModelMetadata
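
A short sketch exercising the names newly exported here. It assumes `DatasetInitializationError` is what gets raised when a dataset cannot be initialized from a given path; the path itself is a placeholder.

```python
# Sketch only: uses the exports added in this __init__.py. The path is a
# placeholder, and the assumption is that DatasetInitializationError is raised
# when the dataset at that path cannot be initialized.
from pinecone_datasets import Dataset, DatasetInitializationError

try:
    dataset = Dataset("s3://my-bucket/my-subdir/my-dataset")
except DatasetInitializationError as err:
    print(f"could not initialize dataset: {err}")
```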

pinecone_datasets/catalog.py

Lines changed: 13 additions & 0 deletions

@@ -43,6 +43,19 @@ class DatasetMetadata(BaseModel):
     tags: Optional[List[str]]
     args: Optional[Dict[str, Any]]
 
+    @staticmethod
+    def empty() -> "DatasetMetadata":
+        return DatasetMetadata(
+            name="",
+            created_at=get_time_now(),
+            documents=0,
+            queries=0,
+            dense_model=DenseModelMetadata(name="", dimension=0),
+        )
+
+    def is_empty(self) -> bool:
+        return self.name == "" and self.documents == 0 and self.queries == 0
+
 
 class Catalog(BaseModel):
     datasets: List[DatasetMetadata] = []
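
A brief sketch of how the new `empty()` and `is_empty()` helpers might be used as a sentinel value. Only those two methods come from this diff; the populated metadata values mirror the README example and are placeholders.

```python
# Sketch: DatasetMetadata.empty() as a sentinel. Only empty() and is_empty()
# come from this commit; the example values mirror the README and are placeholders.
from pinecone_datasets import DatasetMetadata

placeholder = DatasetMetadata.empty()
assert placeholder.is_empty()

populated = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    dense_model={"name": "bert", "dimension": 3},
)
assert not populated.is_empty()
```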
