What model(s) would you suggest for this problem #6839
-
I'm looking for ideas for a problem that is essentially schema-guided / task-oriented dialogue. The task is essentially invoking an internal API. In my case, both the intents and slots are fully defined by the schema at inference time, but the potential values of the slots aren't (all) known at train time. The available slots are also context dependent, i.e. different users manage different sites and contractors, and users will most often mistype or abbreviate the slot values.

I can use conventional joint intent and slot filling, with some sort of normalisation/text distance to map the slot spans to the most likely schema entries, but since the valid slot values are known at inference time it seems useful to make use of this somehow. Does NeMo have a model that does something like this? I.e. I could embed all the valid slot values (list of sites, list of contractors etc. filtered by context), but how do I make use of this at inference time? I've considered first using an intent classifier, then, based on the schema, using something like SentenceTransformer to compare the user text to the embedded slot values.

I'm interested in any ideas/recommendations/publications that are relevant.
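For concreteness, this is roughly what I have in mind for the matching step. It's a minimal sketch using the sentence-transformers library; the model name, threshold and site list are just placeholders for my setup, not anything from NeMo:

```python
# Rough sketch of the "embed the valid slot values" idea. The encoder model,
# threshold and value lists are placeholders; this uses sentence-transformers
# directly, not a NeMo API.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder would do

def resolve_slot_value(span_text: str, valid_values: list[str], threshold: float = 0.5):
    """Map a (possibly mistyped/abbreviated) slot span to the most similar valid value."""
    span_emb = encoder.encode(span_text, convert_to_tensor=True)
    value_embs = encoder.encode(valid_values, convert_to_tensor=True)
    scores = util.cos_sim(span_emb, value_embs)[0]  # similarity to each valid value
    best = int(scores.argmax())
    if scores[best] < threshold:
        return None, float(scores[best])            # no confident match
    return valid_values[best], float(scores[best])

# e.g. valid site names for this user's context (only known at inference time)
sites = ["1 George Street", "25 Martin Place", "100 Harris Street"]
print(resolve_slot_value("1 george st", sites))
```

The best-scoring value (or a "no match" fallback below the threshold) would then become the normalised slot value passed to the internal API.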
-
I would suggest you have a look at https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/nlp/Dialogue.ipynb, particularly Section 1.4, where the problem you're describing is reduced to this: given an utterance, generate an intent as well as the corresponding slot names and slot values. In this formulation, it's not required that slot names are known beforehand during inference. With enough examples and training steps, the GPT-style model can learn to draw the correlation between intent and slot names (without needing a sentence-transformer-style model for approximate matching or a predefined schema).

This dialogue module also implements the traditional BERT-style joint intent and slot models, as well as intent classification based on sentence transformers, but these can be a lot more complex to set up. The effectiveness of the generative model strongly depends on the set of intents and slot names in your training set overlapping with your test set (since the model learns to memorize these).

In terms of publications, this approach was inspired by https://proceedings.neurips.cc/paper/2020/hash/e946209592563be0f01c844ab2170f0c-Abstract.html
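To make the generative formulation concrete, here is a rough sketch of how an example can be linearized into a prompt/target pair for a GPT-style model. The field names and separators below are my own illustration and not necessarily the exact format the tutorial uses:

```python
# Illustrative only: linearize a dialogue example into a (prompt, target) pair
# for a GPT-style model. Separators and field names are arbitrary choices here,
# not the exact format used by the NeMo Dialogue tutorial.

def linearize_example(utterance: str, intent: str, slots: dict[str, str]) -> tuple[str, str]:
    slot_str = ", ".join(f"{name} = {value}" for name, value in slots.items())
    prompt = f"utterance: {utterance} intent:"
    target = f" {intent} slots: {slot_str}"
    return prompt, target

prompt, target = linearize_example(
    "what is the energy consumption for 1 George St",
    intent="energy_consumption",
    slots={"site": "1 George St"},
)
# At inference time the model is given `prompt` and generates `target`,
# from which the intent and slot name/value pairs are parsed back out.
print(prompt + target)
```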
-
Thanks @Zhilin123, I have looked at that example. The BERT model appears similar to my test model that I built from scratch, although probably a lot more robust! I wasn't 100% sure whether the GPT2 version is better or not; the accuracy after 3 epochs looks lower, but the two models produce different reports so it's hard to compare directly.

[BERT (3 epochs) metrics report]
[GPT (3 epochs) metrics report]
Finally, the SGD example only seems to predict intents, not slots? Is that true, and is it possible to configure the model to do both?
-
Hi @david-waterworth
Both the BERT and GPT models predict intents and slots (for BERT, see the unified_slot_precision metric in the table).
The metrics for intents are called by the same names for both GPT and BERT, while for slots the metrics are reported slightly differently because of how each model does the prediction.
In BERT, given the utterance

[CLS] what is the energy consumption for 1 George St

it predicts one label per position, something like

Intent 15 | slot0 | slot0 | slot0 | slot0 | slot0 | slot0 | slot24 | slot24 | slot24
where Intent 15 is "energy consumption", 1 out of, say, 50 intents, and slot24 is the slot name for "site" out of, say, 40 different slots (with slot0 being the label for an empty slot)…
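Concretely, the per-token labels can be turned back into (slot name, span) pairs by grouping contiguous tokens that share a non-empty slot id, along these lines (a simplified sketch with made-up labels, not the exact NeMo implementation):

```python
# Simplified sketch of turning per-token slot ids back into (slot_name, span) pairs.
# Label names/ids are made up for this example; this is not the exact NeMo code.
from itertools import groupby

EMPTY_SLOT = "slot0"  # label used for tokens that belong to no slot

def decode_slots(tokens: list[str], slot_labels: list[str]) -> list[tuple[str, str]]:
    spans = []
    idx = 0
    for label, group in groupby(slot_labels):
        group_len = len(list(group))
        if label != EMPTY_SLOT:
            spans.append((label, " ".join(tokens[idx:idx + group_len])))
        idx += group_len
    return spans

tokens = "what is the energy consumption for 1 George St".split()
labels = ["slot0"] * 6 + ["slot24"] * 3   # slot24 = "site" in the example above
print(decode_slots(tokens, labels))       # [('slot24', '1 George St')]
```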