Skip to content

data explorer: "Copy as Code" comms #8536

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Jul 21, 2025
Merged

data explorer: "Copy as Code" comms #8536

merged 29 commits into from
Jul 21, 2025

Conversation

isabelizimm
Copy link
Contributor

@isabelizimm isabelizimm commented Jul 15, 2025

related issue: #8531
related pr for Ark: posit-dev/ark#873

I'm not exactly sure of the flow of these PRs with Ark, but I believe we'll want to merge that PR, release Ark, and then pick up the new release in this PR before this PR is merged.

This PR implements makes 3 major changes to the comms:

  • add convert_to_code as a supported feature: there are other supported features, such as row filters and column filters. This structure includes if it is supported as well as supported syntaxes, as an array of values.
  • suggest_code_syntax message: for a particular data explorer instance, decide what options a user has for generating code. I would expect the return value of this to be base-r or something to that effect.
  • convert_to_code message: for a particular data explorer instance, this message sends over the filters and sorts applied to the data and the desired syntax. The expectation is that this will return a code snippet string. The front end will (eventually) apply code highlighting and formatting.

Below is a diagram of the general flow through comms. Note that these comm messages will not call each other; they will be triggered by a modal in Positron core.

flowchart TD
    A["Data explorer instance is created"] --> B["Get all potential syntaxes from language pack via the **supported_features**." ]
    B -->  C["Also set suggested syntax with **suggest_code_syntax**"]
    C --> D[Open modal]
    D --> E["Use **dataExplorerInstance.suggestedSyntax**"]
    E --> F["Translate data view (filters, sorts) to code with **_translate_to_code_**"]
    F --> G[Display code for user to copy]
    G -->|user chooses another syntax from modal dropdown| F
Loading

There is an open PR that builds off this branch to use these comms, so you can actually fire them off/interact with them #8537

Release Notes

New Features

  • N/A

Bug Fixes

  • N/A

QA Notes

It doesn't look like we have much testing for comms, but open to ideas! Currently I plan on adding unit tests on each language runtime and then e2e tests for the UI pieces.

Copy link

github-actions bot commented Jul 15, 2025

E2E Tests 🚀
This PR will run tests tagged with: @:critical

readme  valid tags

@isabelizimm isabelizimm requested a review from Copilot July 15, 2025 20:48
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for exporting the current data explorer view as code by fetching available syntaxes and translating applied filters/sorts into user-copyable code.

  • Introduce new RPC methods GetCodeSyntaxes and TranslateToCode in the DuckDB backend, comm layer, and client instances.
  • Add TypeScript interfaces and Python stubs for the new code export functionality.
  • Extend the OpenRPC specification to include the new methods.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/vs/workbench/services/positronDataExplorer/common/positronDataExplorerDuckDBBackend.ts Added backend RPC methods for code syntax retrieval and code translation
src/vs/workbench/services/positronDataExplorer/browser/positronDataExplorerInstance.ts Implemented instance methods and UI logic for code export
src/vs/workbench/services/positronDataExplorer/browser/interfaces/positronDataExplorerInstance.ts Extended the public instance interface with code export methods
src/vs/workbench/services/languageRuntime/common/positronDataExplorerComm.ts Defined new comm interfaces and request types for code export
src/vs/workbench/services/languageRuntime/common/languageRuntimeDataExplorerClient.ts Added client‐side wrappers for the new RPCs
positron/comms/data_explorer-backend-openrpc.json Extended the OpenRPC spec with translate_to_code and get_code_syntaxes
extensions/positron-python/python_files/posit/positron/data_explorer_comm.py Added Python Pydantic models and request types for code export
extensions/positron-python/python_files/posit/positron/data_explorer.py Added abstract stubs and example implementations for code export in Python
Comments suppressed due to low confidence (4)

src/vs/workbench/services/positronDataExplorer/browser/positronDataExplorerInstance.ts:434

  • [nitpick] The JSDoc summary for getCodeSyntaxes is incorrect. Update it to describe fetching available syntaxes rather than exporting filters to clipboard.
	 * Export filters as code.

src/vs/workbench/services/positronDataExplorer/browser/positronDataExplorerInstance.ts:447

  • [nitpick] The JSDoc for translateToCode refers to clipboard copying. It should describe returning generated code for a desired syntax.
	 * Export filters as code.

src/vs/workbench/services/languageRuntime/common/positronDataExplorerComm.ts:1402

  • The supported_features array should list the actual RPC method key (translate_to_code), not export_as_code, to correctly reflect available backend capabilities.
	 */

positron/comms/data_explorer-backend-openrpc.json:262

  • The code_syntax parameter is defined as a string, so it should not include an items schema. Remove the items block here.
						"items": {

@isabelizimm isabelizimm marked this pull request as ready for review July 15, 2025 21:33
Copy link
Member

@jennybc jennybc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got some early comments and questions to help me get better oriented to the surroundings. I am very new to the data explorer neighborhood.

Two issues I raised inline:

  • Should we return the code as a single string or an array of strings?
  • The type of the object being explored is not explicit over on this side. Which is perhaps to be expected (?), but I could use some help understanding this better.

A couple of other high-level thoughts:

  • It feels like the code syntax should be more of an enum than a free string. Maybe namespaced within language? Something like python:polars, r:tidyverse, etc.
  • The current PR treats available code syntaxes as a matter to be determined every time that a user asks for code. But this is really a static capability that the kernel has when it fires up and it never changes during the session. Then the syntaxes available to a specific data explorer instance are always a subset of that.
  • When I imagine the syntaxes being displayed in the modal as a dropdown or similar: it feels like the set of choices should be fixed (e.g. for R, "base", "tidyverse", "data.table") and appear in a fixed order. And then the default selection could be influenced by the type of data in the explorer (best match). And a syntax that is truly impossible/unsupported might be greyed out (*). Instead of the backend returning a set of possibilities (such as ['base-r', 'dplyr']), it feels more like we should pass back a TRUE/FALSE for each syntax (can we do it?) and a default (the syntax that seems like the best match).

(*) That would never really be the case for R, because we have as.whatever() functions to coerce things back and forth. I gather the situation for Python might be different?

],
"properties": {
"code": {
"type": "string",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How sure are we that we want a string? As opposed to an array of strings, i.e. one per line. I can't say exactly if/why this matters, but somehow sending lines of code back to the front end feels more proper than a single collapsed string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I like this idea! It does feel more proper, somehow. An array would probably be easier for formatting anyway; I'll move over to this structure. 👍

{
"name": "get_code_syntaxes",
"summary": "Get code syntaxes supported for code translation",
"description": "Get all available code syntaxes supported for translation for a data view",
Copy link
Member

@jennybc jennybc Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there's some notion that the backend is expected to reply using knowledge of the type of data being viewed. In the R world, i.e. that the backend will know the class(es) of the object, in this sense:

class(palmerpenguins::penguins)
#> [1] "tbl_df"     "tbl"        "data.frame"

which would effect which syntaxes are supported or which syntax is the default.

I don't see this information being sent over. Does the backend obtain that info another way? I do understand that the object in question must already exist in the runtime and, therefore, the backend already knows (or can know) this. And I even have some sense that all of this is unfolding inside a data explorer instance and therefore the target data object is implicitly available. I just don't understand the information flow yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table lives on the RDataExplorer, which is created when the data explorer is opened. We should be able to use that table to get information like class without explicitly passing it over the comm.

@isabelizimm
Copy link
Contributor Author

isabelizimm commented Jul 16, 2025

  • It feels like the code syntax should be more of an enum than a free string. Maybe namespaced within language? Something like python:polars, r:tidyverse, etc.

This could be done on the language pack side, but I do want to stay away from storing any knowledge on the language pack side. Theoretically, if another, non-R/Python language pack wanted to implement this, it would be ideal to not have to make any changes to Positron core.

  • The current PR treats available code syntaxes as a matter to be determined every time that a user asks for code. But this is really a static capability that the kernel has when it fires up and it never changes during the session. Then the syntaxes available to a specific data explorer instance are always a subset of that.

It's true that the full scope of options would be available when the kernel fires up. At the moment when a data explorer instance is created, we could get these options before the user asks for code. That would at least allow us to make one less async call in the modal, which might make it feel a bit snappier.

  • When I imagine the syntaxes being displayed in the modal as a dropdown or similar: it feels like the set of choices should be fixed (e.g. for R, "base", "tidyverse", "data.table") and appear in a fixed order. And then the default selection could be influenced by the type of data in the explorer (best match). And a syntax that is truly impossible/unsupported might be greyed out (*). Instead of the backend returning a set of possibilities (such as ['base-r', 'dplyr']), it feels more like we should pass back a TRUE/FALSE for each syntax (can we do it?) and a default (the syntax that seems like the best match).

With a dropdown, I think it makes the most sense to just omit the options that are not available, rather than greying them out and adding prose that these are the options available for this data type. I'm not sure I understand the benefit of a true/false vs listing what is possible in order. It would add another comm call with essentially the same information, since this would be created per data explorer instance?

Copy link
Member

@jennybc jennybc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a big step forward. I'm definitely learning a lot about this neck of the woods just by reviewing this!

In my inline comments, there are 2 big themes:

  • Vocabulary questions. On one hand, this is an internal matter and is not a hill I will die on. And yet I do really think "translate" and "guess" don't quite have the right vibes.
  • Type choices. It feels like we still need to adjust some of the interfaces. Ideally we'd use an enum for the syntaxes, but I do understand why we can't, if we want that to be completely owned by the backends. The way we're handling "no syntax" doesn't feel quite right yet, i.e. that shouldn't just be another arbitrary syntax string. Hopefully my suggestions will get us moving in a good direction.

Comment on lines 219 to 221
"name": "translate_to_code",
"summary": "Translates the current data view into a code snippet.",
"description": "Translate filters and sort keys as code in different syntaxes like pandas, polars, data.table, dplyr",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "translate" is not a great verb here, because it could have so many meanings. Translate from English to another language? Translate from R to Python? I realize there's a lot of context to help someone clarify this, but I suspect it would be even better to use a different verb.

For example, I think "convert" or "generate" are better.

This choice obviously affects a lot of locations (every instance of "translate", "translation", "translated", etc.), so I won't repeat the comment elsewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had started with generate and moved away from it since I felt it was too AI-y, but I'm on board with convert instead 💯

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear you re: "generate" and "AI-y" in terms of the UI. But internally, I think "generate" is a very viable option. I'm thinking about how nice "generated code" feels versus "converted code".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, both "generate" and "convert" might be handy for actual names and then also for docstrings.

}
},
{
"name": "guess_code_syntax",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "suggest" fits the intent better. I feel like "guess" implies there is a right answer and we're trying our best to guess it. Whereas "suggest" seems to fit the situation better. When I see "guessing", I think about "type guessing". Whereas we're suggesting a syntax based on the type.

I feel less strongly about this than about "translate".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I like this idea. Will change from guess -> suggest

Comment on lines 256 to 258
this.guessCodeSyntax().then(syntax => {
this.guessedSyntax = syntax;
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly paranoid, but should this check that the suggested (guessed, in the current vocabulary) syntax is among the ones that this data explorer instance claims to support? Given that syntax can't be an actual enum on this side, it feels like we should be a bit pedantic re: checking that the backend is internally consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added early erroring/returns if 1) the backend has stated it does not support any syntaxes to convert or 2) if the desired syntax is not one of the options available. This definitely feels a lot safer!

* Guess the code syntax for export.
* @returns A promise that resolves to the available code syntaxes.
*/
async guessCodeSyntax(): Promise<CodeSyntax> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this is a better place to check that the returned syntax is among those that the backend claims to support.

/**
* Best guess of syntax to use for code translation
*/
code_syntax?: string;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this being nullable creates problems elsewhere and is, I'm guessing, why we don't then use CodeSyntax in all of the other places I'd expect later in this file. Something feels slightly off about the current situation, where it feels like 'No available syntaxes' could be an actual syntax as far as the frontend knows. Whereas you'd really like to make undefined be the signal for that situation.

I chatted with copilot about my discomfort and this is an interesting suggestion. At the very least, it starts to get at my concern:

/**
 * Represents a code syntax name supported by the backend
 * This is a string at runtime, but using a distinct type for type safety
 */
export type CodeSyntaxName = string;

/**
 * Best guess of syntax to use for code translation
 */
export interface CodeSyntax {
    /**
     * Best guess of syntax to use for code translation
     */
    code_syntax?: CodeSyntaxName;
}

// Update the SupportedFeatures interface
export interface SupportedFeatures {
    // ...other properties
    /**
     * List of supported code syntax names
     */
    code_syntaxes: Array<CodeSyntaxName>;
}

// Update TranslateToCodeParams
export interface TranslateToCodeParams {
    /**
     * Zero or more column filters to apply
     */
    column_filters: Array<ColumnFilter>;

    /**
     * Zero or more row filters to apply
     */
    row_filters: Array<RowFilter>;

    /**
     * Zero or more sort keys to apply
     */
    sort_keys: Array<ColumnSortKey>;

    /**
     * The code syntax to use for translation
     */
    code_syntax: CodeSyntaxName;
}

// Update the translateToCode method
translateToCode(columnFilters: Array<ColumnFilter>, rowFilters: Array<RowFilter>, sortKeys: Array<ColumnSortKey>, codeSyntax: CodeSyntaxName): Promise<TranslatedCode> {
    return super.performRpc('translate_to_code', ['column_filters', 'row_filters', 'sort_keys', 'code_syntax'], [columnFilters, rowFilters, sortKeys, codeSyntax]);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something feels slightly off about the current situation, where it feels like 'No available syntaxes' could be an actual syntax as far as the frontend knows. Whereas you'd really like to make undefined be the signal for that situation.

Ah, yeah that is not ideal 😬 I think the only time we really want a string to designate that there are no syntaxes is in the actual modal itself, not anywhere in this PR. Updated to reflect this!

* Translates the data explorer view to code in the desired syntax.
* @returns A Promise<string> that resolves when generated code is returned.
*/
async translateToCode(desiredSyntax: string): Promise<string> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placing this here for lack of another concrete location: it feels like somewhere there should be an early exit for this entire feature if the backend state has an empty array of supported syntaxes. AFAICT that would be the expected way for the frontend to learn that the backend does not support "export to code".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a isSyntaxSupported() function, as well as a feature support flag (much like the other supported_features on the comm) to give some flexibility as we build out different syntax support! We can now early exit when we're converting code, or perhaps even earlier when trying to get a suggestion.

/**
* The guessed code syntax for the data explorer instance, if we generate code.
*/
public guessedSyntax: CodeSyntax = { code_syntax: 'No available syntaxes' };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say more about this below, but having a special string indicate that this piece of information is unavailable doesn't feel right.

isabelizimm and others added 5 commits July 17, 2025 16:56
Co-authored-by: Jennifer (Jenny) Bryan <jenny.f.bryan@gmail.com>
Signed-off-by: Isabel Zimmerman <54685329+isabelizimm@users.noreply.github.com>
jennybc
jennybc previously approved these changes Jul 18, 2025
Copy link
Member

@jennybc jennybc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isabelizimm and I just did another round of live review on zoom. I'm going to approve this now, with the expectation that some more tweaks are coming based on that conversation. It would be great to update the main PR description to reflect the current PR state.

This feels really close to an MVP we can merge and then start to refine. It's a good time to get a 3rd set of 👀 on this.

isabelizimm and others added 2 commits July 18, 2025 12:34
Signed-off-by: Isabel Zimmerman <54685329+isabelizimm@users.noreply.github.com>
Copy link
Contributor

@softwarenerd softwarenerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jumping in a little late, and without a ton of "comms" definition experiences, it looks like a great start to me. I have some thoughts on how the UI of this could work, when that time comes.

@isabelizimm isabelizimm merged commit b60b0bd into main Jul 21, 2025
30 checks passed
@isabelizimm isabelizimm deleted the copy-as-code-comms branch July 21, 2025 20:39
@github-actions github-actions bot locked and limited conversation to collaborators Jul 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants