Description
If you also assume Gumbel-distributed errors with equal scale parameters for the priors, then I think it's as simple as adding the logs of the priors to the logits:
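Written out, this would read something like the following. The notation is assumed here for illustration and is not from the discussion: $z_i$ is the logit of category $i$, $\pi_i$ its prior probability, and $p_i'$ the adjusted probability.

```latex
p_i' = \operatorname{softmax}(z + \log \pi)_i
     = \frac{\exp(z_i + \log \pi_i)}{\sum_j \exp(z_j + \log \pi_j)}
     = \frac{\pi_i \, e^{z_i}}{\sum_j \pi_j \, e^{z_j}}
```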
Or alternatively:
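One way to read this, assuming $p_i$ is the softmax probability of category $i$ and $\pi_i$ its prior probability (illustrative notation, not from the discussion), is to multiply the probabilities by the priors and renormalize:

```latex
p_i' = \frac{\pi_i \, p_i}{\sum_j \pi_j \, p_j},
\qquad p_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
```

The softmax normalizer $\sum_k e^{z_k}$ cancels between numerator and denominator, which is why this gives the same result as adding $\log \pi_i$ to each logit.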
This only works for the softmax function. The IIA (independence of irrelevant alternatives) property of softmax is also why it's valid to take a subset of the categories, as you are doing for the tokens.
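As a quick numerical check of the points above, here is a minimal sketch in plain Python. The logits, priors, and function names are made up for illustration. It compares adding log-priors to the logits against reweighting the softmax probabilities, and then checks that restricting softmax to a subset of categories matches renormalizing the full distribution over that subset.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def add_log_priors(logits, priors):
    # Shift each logit by the log of its prior.
    return [z + math.log(p) for z, p in zip(logits, priors)]

def reweight(probs, priors):
    # Multiply probabilities by priors, then renormalize.
    weighted = [p * q for p, q in zip(probs, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical example values.
logits = [2.0, 0.5, -1.0, 0.0]
priors = [0.1, 0.4, 0.3, 0.2]

posterior_a = softmax(add_log_priors(logits, priors))
posterior_b = reweight(softmax(logits), priors)

# IIA: softmax over a subset equals the full softmax renormalized on it.
subset = [0, 2]
full = softmax(logits)
subset_mass = sum(full[i] for i in subset)
sub_renorm = [full[i] / subset_mass for i in subset]
sub_direct = softmax([logits[i] for i in subset])
```

Both `posterior_a` and `posterior_b` should agree up to floating-point error, and likewise `sub_direct` and `sub_renorm`.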
You can go even further and allow variable scale parameters for the priors, but it requires numerical integration and is probably too much hassle to be worthwhile.
Another alternative is to convert to a multinomial probit model:
You can easily set up a system of equations to convert the logits (location parameters of the Gumbel distribution) to the location parameters (i.e. means) of a Gaussian distribution with SD=1. There is only one solution to this, and it's easy to find in a few steps of Newton's method.
This would then let you use Gaussian-distributed priors (which are likely much more intuitive to the average user), but again, if the number of classes is more than 2, it will require numerical integration and is probably too much hassle to implement.
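For the two-class case only, the conversion can be sketched without any numerical integration. This is an illustrative special case, not the general system of equations: with two independent Gaussian errors of SD=1, the difference of the errors has SD √2, so the class-1 probability is Φ((μ1 − μ2)/√2), and the logit model gives sigmoid(z1 − z2). The function names below are hypothetical, and Newton's method is started from a gap of 0.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    # Standard normal density.
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def logit_gap_to_probit_gap(logit_gap, tol=1e-12, max_iter=50):
    """Find d = mu1 - mu2 such that Phi(d / sqrt(2)) == sigmoid(logit_gap).

    Two-class special case only (an assumption of this sketch).
    """
    target = 1.0 / (1.0 + math.exp(-logit_gap))
    s = math.sqrt(2.0)
    d = 0.0  # starting guess
    for _ in range(max_iter):
        err = normal_cdf(d / s) - target
        if abs(err) < tol:
            break
        # Newton step: derivative of Phi(d / s) with respect to d is pdf(d / s) / s.
        d -= err / (normal_pdf(d / s) / s)
    return d

# Hypothetical example: logits z1 = 1.0, z2 = 0.0, so the logit gap is 1.0.
gap = logit_gap_to_probit_gap(1.0)
probit_means = [gap, 0.0]  # only the difference is identified; mu2 is pinned to 0
```

Only the difference of the means is identified, so one mean is fixed at 0 here. For moderate logit gaps this converges in a handful of iterations; very large gaps push the target probability toward 0 or 1, where the density in the Newton step becomes tiny and the sketch would need safeguards.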