Commit 528794a

Authors: jovinson-ms, Alexa Thomases, Mike Soennichsen, josiahvinson
Deid GA release (#40850)
* Regenerated for stable API version
* Custom methods
* Customization WIP
* Add client methods
* Overrides for async client
* Update tests for new API spec
* AttributeError: 'DeidentificationClient' object has no attribute 'begin_deidentify_documents'
* Internal operations!
* Still missing client method
* Update tests for job refactor
* Inheritance for customizations
* Model imports
* Add kwargs for maxpagesize
* Add @distributed_trace
* Fix pylint-next errors/warnings
* Update configuration files
* Changelog update
* Updating tests for new API version
* Pagination test, fixes for urls
* work in progress test sanitizing
* Regenerated for stable API version
* Custom methods
* Customization WIP
* Add client methods
* Overrides for async client
* Update tests for new API spec
* AttributeError: 'DeidentificationClient' object has no attribute 'begin_deidentify_documents'
* Internal operations!
* Still missing client method
* Update tests for job refactor
* Inheritance for customizations
* Model imports
* Add kwargs for maxpagesize
* Add @distributed_trace
* Fix pylint-next errors/warnings
* Update configuration files
* Changelog update
* Updating tests for new API version
* Pagination test, fixes for urls
* work in progress test sanitizing
* Tests running against latest TypeSpec
* Update TypeSpec before customizations
* Pull in SDK client name updates
* Update changelog and samples
* update changelog to unreleased
* remove unreleased beta version from changelog
* Updating version to 1.0.0
* Update README, samples
* Update spelling
* Separate samples for each operation
* adding black formatting
* update snippets after black formatting
* Update generated code
* Updating TypeSpec commit

---------

Co-authored-by: Alexa Thomases <athomases@microsoft.com>
Co-authored-by: Mike Soennichsen <msoennichsen@microsoft.com>
Co-authored-by: Josiah Vinson <jovinson@microsoft.com>
1 parent eb12ddd commit 528794a

71 files changed, with 2729 additions and 2895 deletions.

.vscode/cspell.json

Lines changed: 4 additions & 0 deletions
@@ -71,6 +71,7 @@
     "sdk/eventhub/azure-eventhub/**",
     "sdk/easm/azure-defender-easm/azure/defender/easm/**",
     "sdk/graphrbac/azure-graphrbac/**",
+    "sdk/healthdataaiservices/azure-health-deidentification/tests/data/**/*",
     "sdk/healthinsights/azure-healthinsights-cancerprofiling/azure/**",
     "sdk/healthinsights/azure-healthinsights-clinicalmatching/azure/**",
     "sdk/formrecognizer/azure-ai-formrecognizer/samples/sample_forms/**",
@@ -223,6 +224,8 @@
     "dateutil",
     "ddos",
     "decryptor",
+    "deidentification",
+    "deidservice",
     "delenv",
     "dependened",
     "deque",
@@ -428,6 +431,7 @@
     "struct",
     "STRUCT",
     "substringof",
+    "surrogated",
     "systemperf",
     "tenvparallel",
     "Teradata",

sdk/healthdataaiservices/azure-health-deidentification/CHANGELOG.md

Lines changed: 21 additions & 4 deletions
@@ -1,14 +1,31 @@
 # Release History
 
-## 1.0.0b2 (Unreleased)
+## 1.0.0 (Unreleased)
 
 ### Features Added
 
-### Breaking Changes
+- Introduced `DeidentificationCustomizationOptions` and `DeidentificationJobCustomizationOptions` models.
+  - Added `surrogate_locale` field in these models.
+  - Moved `redaction_format` field into these models.
+- Introduced `overwrite` property in `TargetStorageLocation` model, which allows a job to overwrite existing documents in the storage location.
 
-### Bugs Fixed
+### Breaking Changes
 
-### Other Changes
+- Changed method names in `DeidentificationClient` to match functionality:
+  - Changed the `deidentify` method name to `deidentify_text`.
+  - Changed the `begin_create_job` method name to `begin_deidentify_documents`.
+- Renamed the property `DeidentificationContent.operation` to `operation_type`.
+- Deprecated `DocumentDataType`.
+- Changed the model `DeidentificationDocumentDetails`:
+  - Renamed `input` to `input_location`.
+  - Renamed `output` to `output_location`.
+- Changed the model `DeidentificationJob`
+  - Renamed `name` to `job_name`.
+  - Renamed `operation` to `operation_type`.
+- Renamed the model `OperationState` to `OperationStatus`.
+- Changed `path` field to `location` in `SourceStorageLocation` and `TargetStorageLocation`.
+- Changed `outputPrefix` behavior to no longer include `job_name` by default.
+- Deprecated `path` and `location` from `TaggerResult` model.
 
 ## 1.0.0b1 (2024-08-15)
 
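The renames listed under Breaking Changes lend themselves to a quick scan of calling code. The following is a minimal sketch, not part of the SDK: the `RENAMES` table copies three renames from the changelog, and `find_beta_names` is a hypothetical helper name. Attribute renames such as `name` to `job_name` are left out on purpose, because a plain text search for such common words would produce mostly false positives.

```python
import re

# Beta-to-GA renames copied from the changelog entries in this diff.
# The table and the helper below are illustrative; they are not SDK APIs.
RENAMES = {
    "begin_create_job": "begin_deidentify_documents",
    "deidentify": "deidentify_text",
    "OperationState": "OperationStatus",
}

# \b keeps "deidentify" from matching inside "deidentify_text",
# because "_" counts as a word character.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_beta_names(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, old_name, new_name) for each beta identifier found."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((line_number, old, RENAMES[old]))
    return hits


sample = (
    "result = client.deidentify(body)\n"
    "poller = client.begin_create_job(name, job)\n"
    "ok = client.deidentify_text(body)\n"
)
for hit in find_beta_names(sample):
    print(hit)
```

Only the first two sample lines are reported; the third already uses the GA method name.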

sdk/healthdataaiservices/azure-health-deidentification/MANIFEST.in

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ include azure/health/deidentification/py.typed
 recursive-include tests *.py
 recursive-include samples *.py *.md
 include azure/__init__.py
-include azure/health/__init__.py
+include azure/health/__init__.py
Lines changed: 191 additions & 45 deletions
@@ -1,82 +1,220 @@
+# Azure Health Data Services de-identification service client library for Python
 
+This package contains a client library for the de-identification service in Azure Health Data Services which
+enables users to tag, redact, or surrogate health data containing Protected Health Information (PHI).
+For more on service functionality and important usage considerations, see [the de-identification service overview][product_documentation].
 
-# Azure Health Deidentification client library for Python
-Azure.Health.Deidentification is a managed service that enables users to tag, redact, or surrogate health data.
+This library supports API versions `2024-11-15` and earlier.
+
+Use the client library for the de-identification service to:
+- Discover PHI in unstructured text
+- Replace PHI in unstructured text with placeholder values
+- Replace PHI in unstructured text with realistic surrogate values
+- Manage asynchronous jobs to de-identify documents in Azure Storage
+
+[Source code](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/healthdataaiservices/azure-health-deidentification/azure/health/deidentification)
+| [Package (PyPI)](https://pypi.org/project/azure-health-deidentification)
+| [API reference documentation](https://learn.microsoft.com/python/api/overview/azure/health-deidentification)
+| [Product documentation][product_documentation]
+| [Samples][samples]
 
 ## Getting started
 
+### Prerequisites
+
+- Python 3.9 or later is required to use this package.
+- Install [pip][pip].
+- You need an [Azure subscription][azure_sub] to use this package.
+- [Deploy the de-identification service][deid_quickstart].
+- [Configure Azure role-based access control (RBAC)][deid_rbac] for the operations you will perform.
+
 ### Install the package
 
 ```bash
 python -m pip install azure-health-deidentification
 ```
 
-#### Prequisites
+### Authentication
+To authenticate with the de-identification service, install [`azure-identity`][azure_identity_pip]:
 
-- Python 3.8 or later is required to use this package.
-- You need an [Azure subscription][azure_sub] to use this package.
-- An existing Azure Health Deidentification instance.
-#### Create with an Azure Active Directory Credential
-To use an [Azure Active Directory (AAD) token credential][authenticate_with_token],
-provide an instance of the desired credential type obtained from the
-[azure-identity][azure_identity_credentials] library.
+```bash
+python -m pip install azure.identity
+```
+
+You can use [DefaultAzureCredential][default_azure_credential] to automatically find the best credential to use at runtime.
 
-To authenticate with AAD, you must first [pip][pip] install [`azure-identity`][azure_identity_pip]
+You will need a **service URL** to instantiate a client object. You can find the service URL for a particular resource in the [Azure portal][azure_portal], or using the [Azure CLI][azure_cli]:
 
-After setup, you can choose which type of [credential][azure_identity_credentials] from azure.identity to use.
-As an example, [DefaultAzureCredential][default_azure_credential] can be used to authenticate the client:
+```bash
+# Get the service URL for the resource
+az deidservice show --name "<resource-name>" --resource-group "<resource-group-name>" --query "properties.serviceUrl"
+```
 
-Set the values of the client ID, tenant ID, and client secret of the AAD application as environment variables:
-`AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_CLIENT_SECRET`
+Optionally, save the service URL as an environment variable named `AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT` for the sample client initialization code.
 
-Use the returned token credential to authenticate the client:
+Create a client with the endpoint and credential:
+<!-- SNIPPET: examples.create_client -->
 
 ```python
->>> from azure.health.deidentification import DeidentificationClient
->>> from azure.identity import DefaultAzureCredential
->>> client = DeidentificationClient(endpoint='<endpoint>', credential=DefaultAzureCredential())
+endpoint = os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"]
+credential = DefaultAzureCredential()
+client = DeidentificationClient(endpoint, credential)
 ```
 
+<!-- END SNIPPET -->
+
 ## Key concepts
 
-**Operation Modes**
-- Tag: Will return a structure of offset and length with the PHI category of the related text spans.
-- Redact: Will return output text with placeholder stubbed text. ex. `[name]`
-- Surrogate: Will return output text with synthetic replacements.
-  - `My name is John Smith`
-  - `My name is Tom Jones`
+### De-identification operations:
+Given an input text, the de-identification service can perform three main operations:
+- `Tag` returns the category and location within the text of detected PHI entities.
+- `Redact` returns output text where detected PHI entities are replaced with placeholder text. For example, `John` is replaced with `[name]`.
+- `Surrogate` returns output text where detected PHI entities are replaced with realistic replacement values. For example, `My name is John Smith` could become `My name is Tom Jones`.
+
+### Available endpoints
+There are two ways to interact with the de-identification service. You can send text directly, or you can create jobs
+to de-identify documents in Azure Storage.
+
+You can de-identify text directly using the `DeidentificationClient`:
+<!-- SNIPPET: deidentify_text_surrogate.surrogate -->
+
+```python
+body = DeidentificationContent(input_text="Hello, my name is John Smith.")
+result: DeidentificationResult = client.deidentify_text(body)
+print(f'\nOriginal Text: "{body.input_text}"')
+print(f'Surrogated Text: "{result.output_text}"') # Surrogated output: Hello, my name is <synthetic name>.
+```
+
+<!-- END SNIPPET -->
+
+To de-identify documents in Azure Storage, see [Tutorial: Configure Azure Storage to de-identify documents][deid_configure_storage]
+for prerequisites and configuration options.
+
+To run the sample code below, populate the following environment variables:
+- `AZURE_STORAGE_ACCOUNT_LOCATION`: an Azure Storage container endpoint, like `https://<storageaccount>.blob.core.windows.net/<container>`.
+- `INPUT_PREFIX`: the prefix of the input document name(s) in the container. For example, providing `folder1` would create a job that would process documents like `https://<storageaccount>.blob.core.windows.net/<container>/folder1/document1.txt`
+
+The client exposes a `begin_deidentify_documents` method that returns a [LROPoller](https://learn.microsoft.com/python/api/azure-core/azure.core.polling.lropoller) instance. You can get the result of the operation by calling `result()`, optionally passing in a `timeout` value in seconds:
+<!-- SNIPPET: deidentify_documents.sample -->
+
+```python
+endpoint = os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"]
+storage_location = os.environ["AZURE_STORAGE_ACCOUNT_LOCATION"]
+inputPrefix = os.environ["INPUT_PREFIX"]
+outputPrefix = "_output"
+
+credential = DefaultAzureCredential()
+
+client = DeidentificationClient(endpoint, credential)
 
-**Job Integration with Azure Storage**
-Instead of sending text, you can send an Azure Storage Location to the service. We will asynchronously
-process the list of files and output the deidentified files to a location of your choice.
+jobname = f"sample-job-{uuid.uuid4().hex[:8]}"
 
-Limitations:
-- Maximum file count per job: 1000 documents
-- Maximum file size per file: 2 MB
+job = DeidentificationJob(
+    source_location=SourceStorageLocation(
+        location=storage_location,
+        prefix=inputPrefix,
+    ),
+    target_location=TargetStorageLocation(location=storage_location, prefix=outputPrefix, overwrite=True),
+)
+
+finished_job: DeidentificationJob = client.begin_deidentify_documents(jobname, job).result(timeout=60)
+
+print(f"Job Name: {finished_job.job_name}")
+print(f"Job Status: {finished_job.status}")
+print(f"File Count: {finished_job.summary.total_count if finished_job.summary is not None else 0}")
+```
+
+<!-- END SNIPPET -->
 
 ## Examples
+The following sections provide code samples covering some of the most common client use cases, including:
+
+- [Discover PHI in unstructured text](#discover-phi-in-unstructured-text)
+- [Replace PHI in unstructured text with placeholder values](#replace-phi-in-unstructured-text-with-placeholder-values)
+- [Replace PHI in unstructured text with realistic surrogate values](#replace-phi-in-unstructured-text-with-realistic-surrogate-values)
+
+See the [samples][samples] for code files illustrating common patterns, including creating and managing jobs to de-identify documents in Azure Storage.
+
+### Discover PHI in unstructured text
+When you specify the `TAG` operation, the service will return information about the PHI entities it detects. You can use this information to customize your de-identification workflow:
+<!-- SNIPPET: deidentify_text_tag.tag -->
 
 ```python
->>> from azure.health.deidentification import DeidentificationClient
->>> from azure.identity import DefaultAzureCredential
->>> from azure.core.exceptions import HttpResponseError
+body = DeidentificationContent(
+    input_text="Hello, I'm Dr. John Smith.", operation_type=DeidentificationOperationType.TAG
+)
+result: DeidentificationResult = client.deidentify_text(body)
+print(f'\nOriginal Text: "{body.input_text}"')
+
+if result.tagger_result and result.tagger_result.entities:
+    print(f"Tagged Entities:")
+    for entity in result.tagger_result.entities:
+        print(
+            f'\tEntity Text: "{entity.text}", Entity Category: "{entity.category}", Offset: "{entity.offset.code_point}", Length: "{entity.length.code_point}"'
+        )
+else:
+    print("\tNo tagged entities found.")
+```
+
+<!-- END SNIPPET -->
 
->>> client = DeidentificationClient(endpoint='<endpoint>', credential=DefaultAzureCredential())
->>> try:
-    <!-- write test code here -->
-except HttpResponseError as e:
-    print('service responds error: {}'.format(e.response.json()))
+### Replace PHI in unstructured text with placeholder values
+When you specify the `REDACT` operation, the service will replace the PHI entities it detects with placeholder values. You can learn more about [redaction customization][deid_redact].
+<!-- SNIPPET: deidentify_text_redact.redact -->
 
+```python
+body = DeidentificationContent(
+    input_text="It's great to work at Contoso.", operation_type=DeidentificationOperationType.REDACT
+)
+result: DeidentificationResult = client.deidentify_text(body)
+print(f'\nOriginal Text: "{body.input_text}"')
+print(f'Redacted Text: "{result.output_text}"') # Redacted output: "It's great to work at [organization]."
 ```
 
-## Next steps
+<!-- END SNIPPET -->
 
-- Find a bug, or have feedback? Raise an issue with "Health Deidentification" Label.
+### Replace PHI in unstructured text with realistic surrogate values
+The default operation is the `SURROGATE` operation. Using this operation, the service will replace the PHI entities it detects with realistic surrogate values:
+<!-- SNIPPET: deidentify_text_surrogate.surrogate -->
 
+```python
+body = DeidentificationContent(input_text="Hello, my name is John Smith.")
+result: DeidentificationResult = client.deidentify_text(body)
+print(f'\nOriginal Text: "{body.input_text}"')
+print(f'Surrogated Text: "{result.output_text}"') # Surrogated output: Hello, my name is <synthetic name>.
+```
+
+<!-- END SNIPPET -->
+
+### Troubleshooting
+The `DeidentificationClient` raises various `AzureError` [exceptions][azure_error]. For example, if you
+provide an invalid service URL, a `ServiceRequestError` would be raised with a message indicating the failure cause.
+In the following code snippet, the error is handled and displayed:
+<!-- SNIPPET: examples.handle_error -->
+
+```python
+error_client = DeidentificationClient("https://contoso.deid.azure.com", credential)
+body = DeidentificationContent(input_text="Hello, I'm Dr. John Smith.")
+
+try:
+    error_client.deidentify_text(body)
+except AzureError as e:
+    print("\nError: " + e.message)
+```
+
+<!-- END SNIPPET -->
+
+If you encounter an error indicating that the service is unable to access source or target storage in a de-identification job:
+- Ensure you [assign a managed identity][deid_managed_identity] to your de-identification service
+- Ensure you [assign appropriate permissions][deid_rbac] to the managed identity to access the storage account
+
+## Next steps
+
+Find a bug, or have feedback? Raise an issue with the [Health Deidentification][github_issue_label] label.
 
 ## Troubleshooting
 
-- **Unabled to Access Source or Target Storage**
+- **Unable to Access Source or Target Storage**
   - Ensure you create your deid service with a system assigned managed identity
   - Ensure your storage account has given permissions to that managed identity
 
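The `TAG` sample in this README prints `entity.offset.code_point` and `entity.length.code_point`. Python strings are indexed by code point, so those two values can be used directly as slice bounds. The sketch below shows one way to use them for local masking. It is an illustration under assumptions: `Span` is a stand-in dataclass rather than the SDK's `PhiEntity` model, `mask_spans` is a hypothetical helper, and the category `name` and the offsets are hand-written for the sample sentence rather than returned by the service.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Stand-in for a tagged entity; not the SDK's PhiEntity model."""

    category: str
    offset: int  # code-point offset, as in entity.offset.code_point
    length: int  # code-point length, as in entity.length.code_point


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a [category] placeholder.

    Spans are applied right to left so that earlier offsets stay valid
    after each replacement changes the length of the text.
    """
    for span in sorted(spans, key=lambda s: s.offset, reverse=True):
        end = span.offset + span.length
        text = text[: span.offset] + f"[{span.category.lower()}]" + text[end:]
    return text


text = "Hello, I'm Dr. John Smith."
# Hand-written span covering "John Smith"; the category is an assumption.
spans = [Span(category="name", offset=15, length=10)]
print(mask_spans(text, spans))
```

For real redaction, the `REDACT` operation shown in the README remains the supported route; a local pass like this is only useful when the placeholder format has to be decided on the client side.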

@@ -99,10 +237,18 @@ additional questions or comments.
 
 <!-- LINKS -->
 [code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
-[authenticate_with_token]: https://learn.microsoft.com/azure/cognitive-services/authentication?tabs=powershell#authenticate-with-an-authentication-token
-[azure_identity_credentials]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity#credentials
+[product_documentation]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/
 [azure_identity_pip]: https://pypi.org/project/azure-identity/
 [default_azure_credential]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity#defaultazurecredential
 [pip]: https://pypi.org/project/pip/
 [azure_sub]: https://azure.microsoft.com/free/
-
+[deid_quickstart]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/quickstart
+[deid_redact]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/redaction-format
+[deid_rbac]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/manage-access-rbac
+[deid_managed_identity]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/managed-identities
+[deid_configure_storage]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/configure-storage
+[azure_cli]: https://learn.microsoft.com/cli/azure/healthcareapis/deidservice?view=azure-cli-latest
+[azure_portal]: https://ms.portal.azure.com
+[azure_error]: https://learn.microsoft.com/python/api/azure-core/azure.core.exceptions.azureerror
+[samples]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/healthdataaiservices/azure-health-deidentification/samples
+[github_issue_label]: https://github.com/Azure/azure-sdk-for-python/labels/Health%20Deidentification
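The README snippets read three environment variables with `os.environ[...]`, which raises a bare `KeyError` when one is unset. A small pre-flight check can give a clearer message before any client is created. This is a sketch only: the variable names come from the README, while `REQUIRED_VARIABLES` and `missing_variables` are hypothetical names introduced here.

```python
import os
from typing import Mapping

# Variable names taken from the README samples in this diff.
REQUIRED_VARIABLES = (
    "AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT",
    "AZURE_STORAGE_ACCOUNT_LOCATION",
    "INPUT_PREFIX",
)


def missing_variables(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Set these variables before running the samples: " + ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.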
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+{
+  "CrossLanguagePackageId": "HealthDataAIServices.DeidServices",
+  "CrossLanguageDefinitionId": {
+    "azure.health.deidentification.models.DeidentificationContent": "HealthDataAIServices.DeidServices.DeidentificationContent",
+    "azure.health.deidentification.models.DeidentificationCustomizationOptions": "HealthDataAIServices.DeidServices.DeidentificationCustomizationOptions",
+    "azure.health.deidentification.models.DeidentificationDocumentDetails": "HealthDataAIServices.DeidServices.DeidentificationDocumentDetails",
+    "azure.health.deidentification.models.DeidentificationDocumentLocation": "HealthDataAIServices.DeidServices.DeidentificationDocumentLocation",
+    "azure.health.deidentification.models.DeidentificationJob": "HealthDataAIServices.DeidServices.DeidentificationJob",
+    "azure.health.deidentification.models.DeidentificationJobCustomizationOptions": "HealthDataAIServices.DeidServices.DeidentificationJobCustomizationOptions",
+    "azure.health.deidentification.models.DeidentificationJobSummary": "HealthDataAIServices.DeidServices.DeidentificationJobSummary",
+    "azure.health.deidentification.models.DeidentificationResult": "HealthDataAIServices.DeidServices.DeidentificationResult",
+    "azure.health.deidentification.models.PhiEntity": "HealthDataAIServices.DeidServices.PhiEntity",
+    "azure.health.deidentification.models.PhiTaggerResult": "HealthDataAIServices.DeidServices.PhiTaggerResult",
+    "azure.health.deidentification.models.SourceStorageLocation": "HealthDataAIServices.DeidServices.SourceStorageLocation",
+    "azure.health.deidentification.models.StringIndex": "HealthDataAIServices.DeidServices.StringIndex",
+    "azure.health.deidentification.models.TargetStorageLocation": "HealthDataAIServices.DeidServices.TargetStorageLocation",
+    "azure.health.deidentification.models.DeidentificationOperationType": "HealthDataAIServices.DeidServices.DeidentificationOperationType",
+    "azure.health.deidentification.models.OperationStatus": "Azure.Core.Foundations.OperationState",
+    "azure.health.deidentification.models.PhiCategory": "HealthDataAIServices.DeidServices.PhiCategory",
+    "azure.health.deidentification.DeidentificationClient.get_job": "HealthDataAIServices.DeidServices.getJob",
+    "azure.health.deidentification.DeidentificationClient.begin_deidentify_documents": "HealthDataAIServices.DeidServices.deidentifyDocuments",
+    "azure.health.deidentification.DeidentificationClient.cancel_job": "HealthDataAIServices.DeidServices.cancelJob",
+    "azure.health.deidentification.DeidentificationClient.delete_job": "HealthDataAIServices.DeidServices.deleteJob",
+    "azure.health.deidentification.DeidentificationClient.deidentify_text": "HealthDataAIServices.DeidServices.deidentifyText"
+  }
+}
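This mapping pairs each Python name with its cross-language definition ID, which makes Python-specific renames easy to spot. The sketch below works on a trimmed, hand-copied subset of the entries above; `renamed_in_python` is a hypothetical helper, and the comparison rule (ignore case and underscores in the last dotted segment) is an assumption chosen so that `deidentify_text` and `deidentifyText` count as the same name.

```python
import json

# Trimmed copy of the mapping in this diff (four of the entries).
MAPPING_JSON = """
{
  "CrossLanguagePackageId": "HealthDataAIServices.DeidServices",
  "CrossLanguageDefinitionId": {
    "azure.health.deidentification.models.DeidentificationJob": "HealthDataAIServices.DeidServices.DeidentificationJob",
    "azure.health.deidentification.models.OperationStatus": "Azure.Core.Foundations.OperationState",
    "azure.health.deidentification.DeidentificationClient.begin_deidentify_documents": "HealthDataAIServices.DeidServices.deidentifyDocuments",
    "azure.health.deidentification.DeidentificationClient.deidentify_text": "HealthDataAIServices.DeidServices.deidentifyText"
  }
}
"""


def _short_name(qualified: str) -> str:
    """Last dotted segment, lowercased, with underscores removed."""
    return qualified.rsplit(".", 1)[-1].replace("_", "").lower()


def renamed_in_python(mapping: dict[str, str]) -> dict[str, str]:
    """Return entries whose Python short name differs from the cross-language one."""
    return {
        python_name: cross_id
        for python_name, cross_id in mapping.items()
        if _short_name(python_name) != _short_name(cross_id)
    }


definitions = json.loads(MAPPING_JSON)["CrossLanguageDefinitionId"]
for python_name, cross_id in renamed_in_python(definitions).items():
    print(f"{python_name} -> {cross_id}")
```

Two entries are reported: `OperationStatus`, which matches the changelog's rename from `OperationState`, and `begin_deidentify_documents`, which differs only by the `begin_` prefix on the Python method name.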

sdk/healthdataaiservices/azure-health-deidentification/assets.json

Lines changed: 1 addition & 1 deletion
@@ -2,5 +2,5 @@
   "AssetsRepo": "Azure/azure-sdk-assets",
   "AssetsRepoPrefixPath": "python",
   "TagPrefix": "python/healthdataaiservices/azure-health-deidentification",
-  "Tag": "python/healthdataaiservices/azure-health-deidentification_a8eed6d322"
+  "Tag": "python/healthdataaiservices/azure-health-deidentification_a9eda6ed27"
 }
