Ordered approximately by importance. Don't forget to include a baseline, something like a 1-layer unidirectional model without attention.
Important:
- Base architecture (see the sketch after this list):
  - enc+dec
  - enc+dec+attn
  - enc+ctc
  - enc+ctc+dec
  - enc+ctc+dec+attn
- Frame encoding scheme: flattening vs. CNN (jz)
- Video segmentation (i.e. sent-based vs. line-based (non-sent))
- Various dimensions/the basic stuff: grad norm (which seems really important; see the clipping sketch after this list), batch size, LSTM dim, char dim, i.e. varying each keyword argument of the encoder/decoder/training functions (jz)
- Regularization
- Stopping criterion
- josephz: Let's just do 50 epochs?
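
A minimal sketch of how the five variants could share one model, assuming a PyTorch pipeline (an assumption; the repo's actual modules may differ). The class name `Seq2Seq`, the flags `use_attn`/`use_ctc`, and the plain dot-product attention are placeholders, not the project's API:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """enc+dec core; attention and a CTC head are toggled by flags.
    Variants: enc+dec (False, False), enc+dec+attn (True, False),
    enc+ctc (use only ctc_logits), enc+ctc+dec (False, True),
    enc+ctc+dec+attn (True, True)."""

    def __init__(self, feat_dim, hid_dim, vocab, use_attn=False, use_ctc=False):
        super().__init__()
        self.use_attn, self.use_ctc = use_attn, use_ctc
        self.enc = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.embed = nn.Embedding(vocab, hid_dim)
        self.dec = nn.LSTMCell(hid_dim * 2 if use_attn else hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)
        if use_ctc:
            self.ctc_head = nn.Linear(hid_dim, vocab + 1)  # +1 for the CTC blank

    def forward(self, frames, chars):
        enc_out, (h, c) = self.enc(frames)   # frames: (B, T, feat) -> (B, T, H)
        h, c = h[0], c[0]                    # init decoder state from encoder
        logits = []
        for t in range(chars.size(1)):       # teacher-forced decode
            x = self.embed(chars[:, t])
            if self.use_attn:                # plain dot-product attention
                scores = torch.bmm(enc_out, h.unsqueeze(2)).squeeze(2)  # (B, T)
                ctx = torch.bmm(scores.softmax(1).unsqueeze(1), enc_out).squeeze(1)
                x = torch.cat([x, ctx], dim=1)
            h, c = self.dec(x, (h, c))
            logits.append(self.out(h))
        dec_logits = torch.stack(logits, dim=1)                        # (B, L, V)
        ctc_logits = self.ctc_head(enc_out) if self.use_ctc else None  # (B, T, V+1)
        return dec_logits, ctc_logits
```

For the enc+ctc+dec(+attn) variants, the joint loss would be something like `lambda * ctc_loss + (1 - lambda) * ce_loss`, with lambda as yet another knob to sweep.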
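And since grad norm seems so important, a sketch of where clipping sits in the training step. `max_norm=5.0` is just a placeholder value to sweep, not a recommendation:

```python
import torch

def train_step(model, batch, optimizer, criterion, max_norm=5.0):
    frames, chars, targets = batch
    optimizer.zero_grad()
    dec_logits, _ = model(frames, chars)
    # CrossEntropyLoss wants (B, V, L) logits against (B, L) targets.
    loss = criterion(dec_logits.transpose(1, 2), targets)
    loss.backward()
    # Rescale the full gradient vector if its L2 norm exceeds max_norm;
    # this is the knob the grad-norm sweep would vary.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```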
Useful:
- Various attention functions (see Attention Implementation #11) (yli)
- Teacher forcing ratio, and ways to decay it over training (scheduled sampling; see Bengio et al., 2015, and the sketch after this list) (yli)
- Gradient normalization
- Use all 68 facial landmark points vs. a subset (yli)
- Have a global Adam vs. a new Adam every epoch. We currently re-create the optimizer each epoch because it seems to work better, but that shouldn't help: re-instantiating Adam wipes its moment estimates and restarts bias correction. This could be related to the learning rate, which is a separate item below (see the sketch after this list).
- Temperature (in train, eval, and/or inference) (yli)
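
A sketch combining the teacher-forcing and temperature knobs, since they act at the same decode step. The linear decay is one of the schedules from Bengio et al., 2015; the function names are placeholders:

```python
import random
import torch

def next_decoder_input(gold_char, prev_logits, tf_ratio, temperature=1.0):
    """With prob tf_ratio feed the gold char (teacher forcing); otherwise
    sample from the model's own temperature-scaled distribution."""
    if random.random() < tf_ratio:
        return gold_char
    probs = torch.softmax(prev_logits / temperature, dim=-1)  # (B, V)
    return torch.multinomial(probs, 1).squeeze(-1)            # (B,)

def tf_ratio_at(epoch, total_epochs, start=1.0, end=0.0):
    """Linear decay, one of the schedules in Bengio et al., 2015."""
    return start + (end - start) * min(epoch / total_epochs, 1.0)
```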
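On the global-vs-per-epoch Adam point, a sketch of the difference, with the per-epoch variant shown commented out; `nn.Linear` and the dummy batch stand in for the real model and data:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model

# Global optimizer: Adam's per-parameter moment estimates (m, v) and step
# count accumulate over the whole run.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    # The current per-epoch variant would instead do:
    #   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # wiping the moments and restarting bias correction each epoch, i.e. an
    # implicit warm restart; any benefit may really be an LR-schedule effect.
    for x, y in [(torch.randn(8, 4), torch.randn(8, 2))]:  # dummy batch
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
```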
Try if time permits:
- Global/local/input-feeding attention variants (see Luong et al., 2015, and the sketch after this list)
- Optimizers?
- Learning rates & learning rate decay methods
- Char-based vs. word-based
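
For the Luong et al., 2015 item (and the attention-functions item above), a sketch of the three global scoring functions. Local attention and input feeding would additionally predict an attention window and feed the attentional vector into the next decoder input; both are omitted here:

```python
import torch
import torch.nn as nn

class LuongScore(nn.Module):
    """The three global scoring functions from Luong et al., 2015."""

    def __init__(self, hid, kind="dot"):
        super().__init__()
        self.kind = kind
        if kind == "general":                  # h_t^T W h_s
            self.W = nn.Linear(hid, hid, bias=False)
        elif kind == "concat":                 # v^T tanh(W [h_t; h_s])
            self.W = nn.Linear(2 * hid, hid, bias=False)
            self.v = nn.Linear(hid, 1, bias=False)

    def forward(self, dec_h, enc_out):
        # dec_h: (B, H); enc_out: (B, T, H) -> unnormalized scores: (B, T)
        if self.kind == "dot":
            return torch.bmm(enc_out, dec_h.unsqueeze(2)).squeeze(2)
        if self.kind == "general":
            return torch.bmm(self.W(enc_out), dec_h.unsqueeze(2)).squeeze(2)
        expanded = dec_h.unsqueeze(1).expand_as(enc_out)       # (B, T, H)
        return self.v(torch.tanh(self.W(torch.cat([expanded, enc_out], 2)))).squeeze(2)
```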