
Commit bdbd226

Package overhaul: update dependencies, improve load time, and more (#48)
## Problem

A lot of dependencies are far out of date, preventing this package from being run on modern Jupyter notebooks and creating incompatibilities with updated SDK libraries. Also, this package loads very slowly despite not doing very much.

It wasn't my intention to change everything in one PR, but since the tests weren't runnable I couldn't assess the impact of the dependency changes without updating and adding tests. And once you start doing that, you naturally want to modify other parts of the code.

## Solution

- Upgrade all dependencies in pyproject.toml and actions in .github workflows.
- Tried running tests to assess what broke after upgrading deps, and found myself struggling because the test coverage and organization isn't great. For example, there were tests under "unit tests" which can't run without S3 credentials for a private bucket (which I don't have). Problems like these, and a lack of specific testing for the `load_dataset` and `list_datasets` functions, meant I definitely needed to expand testing.

Added tests:

- Tests for the `load_dataset` and `list_datasets` functions, which are the main way people use this package from the examples repository.
- Added tests for working with local catalogs, since that seemed like an intended behavior of this package but wasn't clearly tested.
- Added tests for uploading datasets to a Google Storage bucket using service account credentials, which seems important since that is how we would need to update or add additional datasets to the public set.

Refactoring (non-breaking):

- I ended up refactoring and adding a lot of tests to scope down the responsibility of the Dataset class by extracting groups of functionality into smaller, more focused classes: `DatasetFSWriter` and `DatasetFSReader`. This seemed faster than trying to reason about everything smashed together into one giant `Dataset` class.
- Use lazy loading for heavy dependencies such as `pandas`, `gcsfs`, and `s3fs` (see the sketch at the end of this description). With these changes, importing the `load_dataset` function is now about 8x faster, with import time dropping from a measured ~1.849s down to 0.230s.
- Incorporated Dave Rigby's suggestion to use the `fs.glob` function when listing datasets in a Catalog; this significantly cuts down the number of network calls needed to build the catalog list.
- When calling `load_dataset`, skip loading all metadata. In the past, the entire catalog metadata was loaded to check whether a dataset exists before trying to build the `Dataset` object, which is unnecessary and just adds a ton of overhead. If you try to load a dataset that doesn't exist, the error message is already pretty clear.

Breaking changes:

- Removed `to_pinecone_index`. Having this in creates a coupling between this package and the SDK package that is going to be a perpetual thorn in our side when it comes to maintaining docs and examples. The SDK will continue marching forward, whereas most of the other logic in here, related to uploading and downloading from buckets into dataframes, should not change much. There's really no need for it in here, so I am removing it.
- `Dataset.to_catalog` and `Dataset.from_catalog` now error and tell you to use `Catalog.save_dataset` and `Catalog.load_dataset`. I'm not aware of anyone using these legacy methods, but it was a change that felt right: the Catalog class is now mostly responsible for "where" things are saved, while the Dataset class is responsible for "what" is saved. Now you can easily download a dataset and save it locally, for example, which wasn't something you could easily reason about when the dataset itself was responsible for writing to a catalog.

Despite all these changes, little should change for most users of the package aside from dependency updates. `load_dataset` and `list_datasets` are still the same (although much more performant), and they are by far the most used.

## Usage

Most people who just want to load a demo dataset will do something like this:

```python
from pinecone import Pinecone
from pinecone_datasets import load_dataset

ds = load_dataset('dataset_name')

pc = Pinecone(api_key='key')
index = pc.Index(host='host')
index.upsert_from_dataframe(df=ds.documents, batch_size=100)
```

## Type of Change

- [x] Bug fix (non-breaking change which fixes an issue)
- [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

## Testing load performance

To investigate load performance, I used Python's `-X importtime` feature and a cool package I added as a dev dependency called `tuna` for visualizing the outputs.

```
poetry run python3 -X importtime -c "from pinecone_datasets import load_dataset" 2> load_times.log
poetry run tuna load_times.log
```

## Testing

Besides the tests you see added here, I did some manual testing in a notebook setting using the `1.0.0.dev3` build that I built from this branch.
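
A minimal sketch of the lazy-loading approach described above, with hypothetical function names rather than the package's actual internals: heavy dependencies are imported inside the functions that need them, so importing the package itself stays cheap.

```python
# Sketch only: names here are illustrative, not the package's real internals.
def _read_documents(parquet_paths):
    # pandas is imported only when a dataset is actually materialized,
    # keeping module import time low.
    import pandas as pd

    return pd.concat((pd.read_parquet(p) for p in parquet_paths), ignore_index=True)


def _get_filesystem(base_path: str):
    # Storage backends are imported only for the scheme actually in use.
    if base_path.startswith("gs://"):
        import gcsfs

        return gcsfs.GCSFileSystem()
    if base_path.startswith("s3://"):
        import s3fs

        return s3fs.S3FileSystem()

    from fsspec.implementations.local import LocalFileSystem

    return LocalFileSystem()
```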
1 parent b775c7a commit bdbd226

31 files changed (+1034 / −1430 lines)

.github/workflows/PR.yml

Lines changed: 0 additions & 42 deletions
This file was deleted.

.github/workflows/cd.yml

Lines changed: 2 additions & 1 deletion
```diff
@@ -3,6 +3,7 @@ name: CD
 on:
   workflow_dispatch:
 
+
 jobs:
 
   release:
@@ -17,7 +18,7 @@ jobs:
       - name: Install Poetry
         uses: snok/install-poetry@v1
         with:
-          version: 1.3.2
+          version: 1.5.0
 
       - name: Set Version
         run: echo "VERSION=$(poetry version -s)" >> $GITHUB_ENV
```

.github/workflows/ci.yml

Lines changed: 61 additions & 13 deletions
```diff
@@ -1,41 +1,89 @@
 name: CI
 
-on: push
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+    branches:
+      - main
 
 jobs:
-  run-tests:
-    name: Run tests
+  linting:
+    name: Run lint and type checking
     runs-on: ubuntu-latest
     strategy:
+      fail-fast: false
       matrix:
-        python-version: [3.8, 3.9, '3.10', 3.11]
-
+        python-version: ['3.10']
     steps:
-      - uses: actions/checkout@v3
+      - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
 
       - name: Install Poetry
         uses: snok/install-poetry@v1
         with:
-          version: 1.3.2
+          version: 1.5.0
       - name: install dependencies
         run: poetry install --with dev --all-extras
+
       - name: Run Black Check
         run: poetry run black --check .
+
       - name: Run mypy check
         run: poetry run mypy .
-      - name: Run pytest
+
+  run-tests:
+    name: Run tests
+    needs: linting
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: [3.9, '3.10', 3.11, 3.12, 3.13]
+
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install Poetry
+        uses: snok/install-poetry@v1
+        with:
+          version: 1.5.0
+      - name: install dependencies
+        run: poetry install --with dev --all-extras
+
+      - name: Run pytest (unit tests)
         env:
           PY_VERSION: ${{ matrix.python-version }}
-          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
-          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          # AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          # AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
         run: poetry run pytest -n 4 --html=report.html --cov pinecone_datasets tests/unit
+
       - name: upload pytest report.html
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
         if: always()
         with:
           name: dataset-pytest-report-py${{ matrix.python-version }}
-          path: report.html
+          path: report.html
+
+      - name: Write google service account credentials to a file
+        id: prepare-google-credentials
+        shell: bash
+        run: |
+          secrets_file="$(mktemp)"
+          echo "$GCS_SERVICE_ACCOUNT_CREDS_BASE64" | base64 -d > $secrets_file
+          echo "google_credentials_file=$secrets_file" >> $GITHUB_OUTPUT
+        env:
+          GCS_SERVICE_ACCOUNT_CREDS_BASE64: '${{ secrets.GCS_SERVICE_ACCOUNT_CREDS_BASE64 }}'
+
+      - name: Run pytest (integration tests)
+        run: poetry run pytest tests/integration
+        env:
+          GOOGLE_APPLICATION_CREDENTIALS: ${{ steps.prepare-google-credentials.outputs.google_credentials_file }}
```

.github/workflows/docs.yml

Lines changed: 3 additions & 3 deletions
```diff
@@ -12,14 +12,14 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
-      - uses: actions/setup-python@v4
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
         with:
           python-version: '3.10'
       - name: Install Poetry
         uses: snok/install-poetry@v1
         with:
-          version: 1.3.2
+          version: 1.5.0
 
       - run: poetry install --with dev --all-extras
       # ADJUST THIS: build your documentation into docs/.
```

MAINTAINERS.md

Lines changed: 172 additions & 0 deletions
# Pinecone Datasets

### Supported storage options

pinecone_datasets can load datasets from Google Cloud Storage, Amazon S3, and local files.

By default, the `load_dataset` and `list_datasets` functions will pull from Pinecone's public GCS bucket at `gs://pinecone-datasets-dev`, but you can interact with catalogs stored in other locations.

```python
from pinecone_datasets import Catalog

# Local catalog
catalog = Catalog(base_path="/path/to/local/catalog")
catalog.list_datasets()

# Google Cloud catalog
catalog = Catalog(base_path="gs://bucket-name")

# S3 catalog
s3_catalog = Catalog(base_path="s3://bucket-name")
```

If you are using Amazon S3 or Google Cloud to access private buckets, you can use environment variables to configure your credentials. For example, if you set a base_path starting with "gs://", the `gcsfs` package will attempt to find credentials by looking in cache locations used by `gcloud auth login` or by reading environment variables such as `GOOGLE_APPLICATION_CREDENTIALS`.
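
For example, credentials can be supplied through these standard environment variables before the catalog is created. This is a sketch with placeholder paths and keys:

```python
import os

from pinecone_datasets import Catalog

# Placeholder values; gcsfs and s3fs pick up these standard variables automatically.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"  # for gs:// paths
os.environ["AWS_ACCESS_KEY_ID"] = "my-access-key-id"                            # for s3:// paths
os.environ["AWS_SECRET_ACCESS_KEY"] = "my-secret-access-key"

catalog = Catalog(base_path="gs://my-private-bucket")
catalog.list_datasets()
```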

## Adding a new dataset to the public datasets repo

Note: Only Pinecone employees with access to the bucket can complete this step.

Prerequisites:

1. Install the Google Cloud CLI
2. Authenticate with `gcloud auth login`

```python
from datetime import datetime

from pinecone_datasets import Catalog, Dataset, DatasetMetadata, DenseModelMetadata

# 1. Prepare pandas dataframes containing your embeddings
documents_df = ...
queries_df = ...

# 2. Create metadata to describe the dataset
metadata = DatasetMetadata(
    name="new-dataset-name",
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
    documents=len(documents_df),
    queries=len(queries_df),
    dense_model=DenseModelMetadata(
        name="ada2",
        dimension=2,
    ),
)

# 3. Take all this, and instantiate a Dataset
ds = Dataset.from_pandas(
    documents=documents_df,
    queries=queries_df,
    metadata=metadata
)

# 4. Save to the catalog (requires the gcloud auth step above)
catalog = Catalog(base_path="gs://pinecone-datasets-dev")
catalog.save_dataset(ds)
```

Afterwards, verify that the new dataset appears in the list function and can be used:

```python
from pinecone_datasets import list_datasets, load_dataset

list_datasets(as_df=True)

ds = load_dataset("new-dataset-name")
ds.documents
ds.head()
```

### Expected dataset structure

The package expects data to be laid out with the following directory structure:

```
├── my-subdir                  # path to where all datasets are stored
│   ├── my-dataset             # name of the dataset
│   │   ├── metadata.json      # dataset metadata (optional, only needed for listing)
│   │   ├── documents          # dataset documents
│   │   │   ├── file1.parquet
│   │   │   └── file2.parquet
│   │   ├── queries            # dataset queries
│   │   │   ├── file1.parquet
│   │   │   └── file2.parquet
└── ...
```

The data schema is expected to be as follows:

- The `documents` directory contains parquet files with the following schema (see the sketch after this list):
    - Mandatory: `id: str, values: list[float]`
    - Optional: `sparse_values: Dict: indices: List[int], values: List[float]`, `metadata: Dict`, `blob: dict`
        - Note: `blob` is a dict that can contain any data. It is not returned when iterating over the dataset and is intended for storing additional data that is not part of the dataset schema. However, it is sometimes useful to store additional data in the dataset, for example, a document's text. In a future version this may become a first-class citizen in the dataset schema.
- The `queries` directory contains parquet files with the following schema:
    - Mandatory: `vector: list[float], top_k: int`
    - Optional: `sparse_vector: Dict: indices: List[int], values: List[float]`, `filter: Dict`
        - Note: `filter` is a dict that contains Pinecone filters; for more information see [here](https://docs.pinecone.io/docs/metadata-filtering)
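
For illustration, a minimal `documents` dataframe matching this schema might look like the following; the ids, vectors, and metadata values are placeholders:

```python
import pandas as pd

# Placeholder rows following the documents schema above.
documents_df = pd.DataFrame(
    [
        {
            "id": "doc-1",
            "values": [0.1, 0.2, 0.3],
            "sparse_values": {"indices": [7, 42], "values": [0.5, 0.25]},
            "metadata": {"category": "example"},
            "blob": {"text": "original document text"},
        },
        {
            "id": "doc-2",
            "values": [0.4, 0.5, 0.6],
            "sparse_values": None,
            "metadata": None,
            "blob": None,
        },
    ]
)

# Write one parquet file into the dataset's documents/ directory (assumed to exist).
documents_df.to_parquet("my-dataset/documents/file1.parquet")
```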

In addition, a metadata file is expected to be in the dataset directory, for example: `s3://my-bucket/my-dataset/metadata.json`

```python
from pinecone_datasets.catalog import DatasetMetadata

meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)
```

The full metadata schema can be found in `pinecone_datasets.dataset_metadata.DatasetMetadata.schema`.

### The 'blob' column

Pinecone datasets ship with a `blob` column, which is intended for storing additional data that is not part of the dataset schema. However, it is sometimes useful to store additional data in the dataset, for example, a document's text. We added a utility function to move data from the blob column to the metadata column. This is useful, for example, when upserting a dataset to an index and you want to use the metadata to store text data.

```python
from pinecone_datasets import import_documents_keys_from_blob_to_metadata

new_dataset = import_documents_keys_from_blob_to_metadata(dataset, keys=["text"])
```

## Usage saving

You can save your dataset to a catalog managed by you, or to a local or remote path (GCS or S3).

### Saving a dataset to a Catalog

To set your own catalog endpoint, set the environment variable `DATASETS_CATALOG_BASEPATH` to your bucket. Note that Pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

After this environment variable is set, you can save your dataset to the catalog:

```python
from pinecone_datasets import Catalog, Dataset, DatasetMetadata

metadata = DatasetMetadata(**{"name": "my-dataset", ...})
dataset = Dataset.from_pandas(documents, queries, metadata)

# Save into a catalog; the base_path can also point at the bucket configured above
catalog = Catalog(base_path="gs://my-bucket")
catalog.save_dataset(dataset)
```

### Saving to Path

You can save your dataset to a local path or a remote path (GCS or S3). Note that Pinecone uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

```python
dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_path("s3://my-bucket/my-subdir/my-dataset")
```

## Running tests

This project uses poetry for dependency management. To start developing, run the following in the project root directory:

```bash
poetry install --with dev
```

To run the unit tests locally, run:

```bash
poetry run pytest tests/unit --cov pinecone_datasets
```
