Replies: 5 comments 4 replies
-
A few ideas:
-
In any case, a change in Iceberg Core is required. More specifically, the following method needs to be modified. This method is the "entry point" for table-scoped FileIO instances:

```java
private FileIO tableFileIO(
    SessionContext context,
    Map<String, String> config,
    AuthSession tableSession,
    List<Credential> storageCredentials) {
  if (config.isEmpty() && ioBuilder == null && storageCredentials.isEmpty()) {
    return io; // reuse client and io since config/credentials are the same
  }
  Map<String, String> fullConf = RESTUtil.merge(properties(), config);
  fullConf = RESTUtil.merge(fullConf, tableSession.remoteFileIOProperties());
  return newFileIO(context, fullConf, storageCredentials);
}
```
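To make the precedence implied by the two merge calls concrete, here is a minimal, dependency-free sketch. The `merge` method below is a stand-in for `RESTUtil.merge` (entries from the second map win on key conflicts), and the property names (`io-impl`, `region`) are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergePrecedence {

  // Stand-in for RESTUtil.merge: entries from 'overrides' win on conflicts.
  static Map<String, String> merge(Map<String, String> base, Map<String, String> overrides) {
    Map<String, String> result = new LinkedHashMap<>(base);
    result.putAll(overrides);
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical property sources, from lowest to highest precedence.
    Map<String, String> catalogProps = Map.of("io-impl", "catalog-io", "region", "us-east-1");
    Map<String, String> tableConfig = Map.of("io-impl", "table-io");
    Map<String, String> remoteFileIOProps = Map.of("io-impl", "session-io");

    // Same order as tableFileIO: catalog props, then table config,
    // then the session's remote FileIO properties merged last.
    Map<String, String> fullConf = merge(catalogProps, tableConfig);
    fullConf = merge(fullConf, remoteFileIOProps);

    // Whatever is merged last takes precedence; untouched keys survive.
    System.out.println("io-impl=" + fullConf.get("io-impl"));
    System.out.println("region=" + fullConf.get("region"));
  }
}
```

The point of the sketch is only the ordering: whatever the session contributes is merged last, so it overrides both catalog- and table-level settings.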
-
There might be a 3rd idea that wouldn't require changes in Iceberg Core. The table configuration is created by merging the table-specific properties onto the catalog properties:

```java
Map<String, String> fullConf = RESTUtil.merge(properties(), config); // properties() returns the catalog properties
```

We can notice that session context properties, if present, are not added to the final table configuration. This may have been done on purpose. One could leverage this peculiarity and reserve human-based flows for session contexts exclusively. Example:
With the above configuration, one would need to create a catalog with a non-empty Session Context. This cannot be done with configuration only (e.g. Spark SQL wouldn't work), but I assume it can be easily done in a Spark Shell session or using a Python script. In that case the Spark driver process holding the (to be tested)
-
Another way to enable human-based flows only on the driver node is to leverage environment variables or credentials files on disk, cf. #78. E.g. the driver node could have a credentials file specifying the Authorization Code grant type, while worker nodes would either inherit the default grant type from the catalog properties, or also read environment variables or credentials files, using a different grant type and/or client ID+secret.
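The per-node resolution described above could look something like the following sketch. The file path, env var name, property keys, and default value are all assumptions made for illustration, not actual Iceberg configuration names:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;

public class NodeLocalGrantType {

  // Resolve the OAuth2 grant type for this node: a local credentials file
  // wins, then an environment variable, then the catalog-level default.
  static String resolveGrantType(
      Path credentialsFile, Map<String, String> env, Map<String, String> catalogProps)
      throws IOException {
    if (Files.exists(credentialsFile)) {
      Properties props = new Properties();
      props.load(Files.newBufferedReader(credentialsFile));
      String fromFile = props.getProperty("grant-type");
      if (fromFile != null) {
        return fromFile;
      }
    }
    String fromEnv = env.get("ICEBERG_OAUTH2_GRANT_TYPE"); // hypothetical env var
    if (fromEnv != null) {
      return fromEnv;
    }
    return catalogProps.getOrDefault("oauth2.grant-type", "client_credentials");
  }

  public static void main(String[] args) throws IOException {
    Map<String, String> catalogProps = Map.of("oauth2.grant-type", "client_credentials");

    // Driver node: a local credentials file selects the human-based flow.
    Path driverFile = Files.createTempFile("driver-credentials", ".properties");
    Files.writeString(driverFile, "grant-type=authorization_code\n");
    System.out.println("driver=" + resolveGrantType(driverFile, Map.of(), catalogProps));

    // Worker node: no file and no env var, so the catalog default applies.
    Path missing = driverFile.resolveSibling("does-not-exist.properties");
    System.out.println("worker=" + resolveGrantType(missing, Map.of(), catalogProps));
  }
}
```

With this layering, only nodes that are explicitly provisioned with the file (or env var) ever attempt the human-based flow; everything else silently falls back to the catalog default.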
-
I've been thinking about this some and would like to run some of the ideas by you. This is engine-specific (meaning Spark), which is less than ideal, but still something. It also presupposes that auth managers on executors are somehow already aware that they are running on executor nodes rather than on the driver.
If the goal is to enable refresh on every node, and an initial grant can only ever have one valid refresh token at a time, it makes sense to delegate access token generation to the driver node exclusively. In other words, the refresh token lives only on the driver, and whenever an executor needs its access token refreshed, it should reach out to the driver for it.
Obviously some sort of network communication is bound to break down depending on the deployment model, but fortunately Spark does have a plugin interface that allows executor plugins to send RPC messages (PluginContext.ask) to their driver counterpart and receive responses (DriverPlugin.receive). Plugins are also able to read the Spark conf, so they could be made aware of all the Spark config that is used to configure the auth managers themselves.
The last problem to solve would be setting up communication between the Spark plugin components and their respective auth managers. Persisting tokens to disk (#135) could come in handy here. The ExecutorPlugin (knowing from conf that an auth manager needs a token refreshed every 15 minutes) would start a background thread that sends an RPC to the driver periodically and persists the resulting token to disk. The executor AuthManager then interprets its own 15-minute refresh as a signal to "refresh" its token by reading it from disk once again. Similarly, the DriverPlugin would have no way of knowing how the refresh token it uses for token generation was acquired; it would simply read from disk the refresh token that was put there by the driver AuthManager.
Curious what you think about it.
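The driver-mediated refresh described above can be sketched, Spark-free, as follows. `DriverTokenService` stands in for the DriverPlugin.receive handler and `ExecutorTokenRelay` for the ExecutorPlugin background thread (both names hypothetical); `PluginContext.ask` is replaced by a direct method call, and the token reaches the executor-side auth manager through a file on disk, as proposed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

public class DriverMediatedRefresh {

  // Stands in for the DriverPlugin side: holds the single refresh token
  // and exchanges it for fresh access tokens on request.
  static class DriverTokenService {
    private final String refreshToken = "refresh-token-from-disk"; // put there by the driver AuthManager
    private final AtomicInteger counter = new AtomicInteger();

    // Stands in for DriverPlugin.receive handling an RPC message.
    String receive(String message) {
      if (!"REFRESH_ACCESS_TOKEN".equals(message)) {
        throw new IllegalArgumentException("unknown message: " + message);
      }
      // A real implementation would call the token endpoint here.
      return "access-token-" + counter.incrementAndGet() + "-via-" + refreshToken;
    }
  }

  // Stands in for the ExecutorPlugin background thread: asks the driver for
  // a fresh token and persists it where the executor AuthManager can read it.
  static class ExecutorTokenRelay {
    void refreshOnce(DriverTokenService driver, Path tokenFile) throws IOException {
      String token = driver.receive("REFRESH_ACCESS_TOKEN"); // PluginContext.ask in real Spark
      Files.writeString(tokenFile, token);
    }
  }

  public static void main(String[] args) throws IOException {
    DriverTokenService driver = new DriverTokenService();
    ExecutorTokenRelay relay = new ExecutorTokenRelay();
    Path tokenFile = Files.createTempFile("executor-access-token", ".txt");

    // Two refresh cycles: the executor AuthManager only ever re-reads the file.
    relay.refreshOnce(driver, tokenFile);
    System.out.println("first=" + Files.readString(tokenFile));
    relay.refreshOnce(driver, tokenFile);
    System.out.println("second=" + Files.readString(tokenFile));
  }
}
```

The key property the sketch demonstrates is that the refresh token never leaves the driver: executors only ever see short-lived access tokens, delivered via the driver RPC and handed over through the filesystem.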
-
The driver process is generally responsible for initializing the RESTCatalog instance, and thus interacts the most with the catalog server. Human-based flows generally work well in this setup, as long as the driver process is interacting with a human operator.
Things get more complicated, though, when FileIO instances, created by executors, need to interact with the catalog server. This can happen e.g. when using S3 request signing. Each signer will have its own AuthManager, but this time there won't be any human operator available, so the flow will time out and fail.
As of now, I think human-based flows are completely incompatible with S3 request signing, and also with object storage credentials refreshing. Basically, any interaction between executors and the catalog server would fail.
We should investigate ways to improve this.