# Pinecone Datasets

### Supported storage options

pinecone_datasets can load datasets from Google Cloud Storage, Amazon S3, and local files.

By default, the `load_dataset` and `list_datasets` functions will pull from Pinecone's public GCS bucket at `gs://pinecone-datasets-dev`, but you can interact with catalogs stored in other locations.

```python
from pinecone_datasets import Catalog

# Local catalog
catalog = Catalog(base_path="/path/to/local/catalog")
catalog.list_datasets()

# Google Cloud catalog
catalog = Catalog(base_path="gs://bucket-name")

# S3 catalog
s3_catalog = Catalog(base_path="s3://bucket-name")
```

If you are using Amazon S3 or Google Cloud to access private buckets, you can use environment variables to configure your credentials. For example, if you set a base_path starting with "gs://", the `gcsfs` package will attempt to find credentials by looking in cache locations used by `gcloud auth login` or reading environment variables such as `GOOGLE_APPLICATION_CREDENTIALS`.
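
As a concrete sketch, credentials are picked up through the standard environment variables of the underlying filesystem libraries (`gcsfs` via google-auth, `s3fs` via botocore); the key path, key values, and bucket names below are placeholders:

```python
import os

from pinecone_datasets import Catalog

# GCS: google-auth reads this standard variable (path is a placeholder)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

# S3: botocore/s3fs read the standard AWS credential variables (values are placeholders)
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"

gcs_catalog = Catalog(base_path="gs://my-private-bucket")
s3_catalog = Catalog(base_path="s3://my-private-bucket")
```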

## Adding a new dataset to the public datasets repo

Note: Only Pinecone employees with access to the bucket can complete this step.

Prerequisites:

1. Install the Google Cloud CLI
2. Authenticate with `gcloud auth login`

```python
from datetime import datetime

from pinecone_datasets import Catalog, Dataset, DatasetMetadata, DenseModelMetadata

# 1. Prepare pandas dataframes containing your embeddings
documents_df = ...
queries_df = ...

# 2. Create metadata to describe the dataset
metadata = DatasetMetadata(
    name="new-dataset-name",
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
    documents=len(documents_df),
    queries=len(queries_df),
    dense_model=DenseModelMetadata(
        name="ada2",
        dimension=2,
    ),
)

# 3. Take all of this and instantiate a Dataset
ds = Dataset.from_pandas(
    documents=documents_df,
    queries=queries_df,
    metadata=metadata
)

# 4. Save to the catalog (requires the gcloud auth step above)
catalog = Catalog(base_path="gs://pinecone-datasets-dev")
catalog.save_dataset(ds)
```

Afterwards, verify that the new dataset appears in the dataset listing and can be loaded:

```python
from pinecone_datasets import list_datasets, load_dataset

list_datasets(as_df=True)

ds = load_dataset("new-dataset-name")
ds.documents
ds.head()
```

### Expected dataset structure

The package expects data to be laid out with the following directory structure:

    ├── my-subdir                     # path where all datasets are stored
    │   ├── my-dataset                # name of the dataset
    │   │   ├── metadata.json         # dataset metadata (optional, required for listing)
    │   │   ├── documents             # dataset documents
    │   │   │   ├── file1.parquet
    │   │   │   └── file2.parquet
    │   │   ├── queries               # dataset queries
    │   │   │   ├── file1.parquet
    │   │   │   └── file2.parquet
    └── ...

The data schema is expected to be as follows:

- The `documents` directory contains parquet files with the following schema:
  - Mandatory: `id: str, values: list[float]`
  - Optional: `sparse_values: Dict: indices: List[int], values: List[float]`, `metadata: Dict`, `blob: dict`
  - Note: `blob` is a dict that can contain any data. It is not returned when iterating over the dataset and is intended for storing additional data that is not part of the dataset schema, for example a document's text. In a future version this may become a first-class citizen in the dataset schema.
- The `queries` directory contains parquet files with the following schema:
  - Mandatory: `vector: list[float], top_k: int`
  - Optional: `sparse_vector: Dict: indices: List[int], values: List[float]`, `filter: Dict`
  - Note: `filter` is a dict containing Pinecone filters; for more information see [here](https://docs.pinecone.io/docs/metadata-filtering). A minimal example of dataframes matching these schemas follows this list.
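
For illustration, here is a minimal sketch of pandas dataframes that satisfy the mandatory columns and of how they could be written into the layout above. All file and directory names are hypothetical, and writing parquet files requires `pyarrow` (or `fastparquet`) to be installed:

```python
from pathlib import Path

import pandas as pd

# Documents: mandatory `id` and `values`; `blob` shown as an optional extra
documents_df = pd.DataFrame([
    {"id": "doc-1", "values": [0.1, 0.2, 0.3], "blob": {"text": "first document"}},
    {"id": "doc-2", "values": [0.4, 0.5, 0.6], "blob": {"text": "second document"}},
])

# Queries: mandatory `vector` and `top_k`
queries_df = pd.DataFrame([
    {"vector": [0.15, 0.25, 0.35], "top_k": 5},
    {"vector": [0.45, 0.55, 0.65], "top_k": 5},
])

# Write into the expected directory structure (a local path, for illustration)
base = Path("my-subdir/my-dataset")
(base / "documents").mkdir(parents=True, exist_ok=True)
(base / "queries").mkdir(parents=True, exist_ok=True)
documents_df.to_parquet(base / "documents" / "part-0.parquet")
queries_df.to_parquet(base / "queries" / "part-0.parquet")
```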

In addition, a metadata file is expected to be in the dataset directory, for example: `s3://my-bucket/my-dataset/metadata.json`

```python
from pinecone_datasets.catalog import DatasetMetadata

meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-02-17 14:17:01.481785",
    documents=2,
    queries=2,
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)
```

The full metadata schema can be found in `pinecone_datasets.dataset_metadata.DatasetMetadata.schema`.
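
For example, assuming `DatasetMetadata` is a pydantic-style model (as the keyword construction above suggests), the full schema could be inspected with something like the following sketch; the exact accessor may differ between library and pydantic versions:

```python
import json

from pinecone_datasets import DatasetMetadata

# Dump the JSON schema describing all metadata fields
print(json.dumps(DatasetMetadata.schema(), indent=2))
```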

### The 'blob' column

Pinecone datasets ship with a `blob` column, which is intended for storing additional data that is not part of the dataset schema; it is sometimes useful to keep such data with the dataset, for example a document's text. We added a utility function to move data from the blob column to the metadata column. This is useful, for example, when upserting a dataset to an index and you want to use the metadata to store text data.

```python
from pinecone_datasets import import_documents_keys_from_blob_to_metadata

new_dataset = import_documents_keys_from_blob_to_metadata(dataset, keys=["text"])
```

## Usage - saving

You can save your dataset to a catalog managed by you, or to a local or remote path (GCS or S3).

### Saving a dataset to a Catalog

To set your own catalog endpoint, set the environment variable `DATASETS_CATALOG_BASEPATH` to your bucket. Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

After this environment variable is set, you can save your dataset to the catalog using the `save` function:

```python
from pinecone_datasets import Dataset, DatasetMetadata

metadata = DatasetMetadata(**{"name": "my-dataset", ...})
```
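
The snippet above stops at creating the metadata. As a sketch of the rest of the flow, this reuses the `Catalog.save_dataset` pattern shown earlier in this README, reading the bucket from the same environment variable (passed explicitly here for clarity); `documents_df` and `queries_df` stand in for dataframes prepared as in the earlier example:

```python
import os

from pinecone_datasets import Catalog, Dataset

# Continuing from the `metadata` object created in the snippet above
dataset = Dataset.from_pandas(documents_df, queries_df, metadata)

# Save to the catalog configured via DATASETS_CATALOG_BASEPATH
catalog = Catalog(base_path=os.environ["DATASETS_CATALOG_BASEPATH"])
catalog.save_dataset(dataset)
```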

### Saving to Path

You can save your dataset to a local path or a remote path (GCS or S3). Note that Pinecone Datasets uses the default authentication method for the storage type (gcsfs for GCS and s3fs for S3).

```python
dataset = Dataset.from_pandas(documents, queries, metadata)
dataset.to_path("s3://my-bucket/my-subdir/my-dataset")
```

## Running tests

This project uses Poetry for dependency management. To start developing, run the following from the project root directory:

```bash
poetry install --with dev
```

To run the tests locally, run:

```bash
poetry run pytest test/unit --cov pinecone_datasets
```