Significance testing for different portions of gold

Current significance testing only allows us to compare two outputs on the same gold. It would be good to have a bootstrap resampling version that, if I understand correctly, can tell us whether a system is significantly better on one dataset than on another (i.e., we could evaluate whether the sample of system scores on the politics domain seems to be drawn from the same population as the sample of system scores on the sports domain).
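As a rough illustration of the idea, here is a minimal sketch of an unpaired percentile bootstrap over two disjoint portions of the gold. It is not taken from the existing code: the function name `bootstrap_diff`, its parameters, and the per-document scores are all hypothetical, and it assumes the system score can be summarised as a mean of per-document scores. A corpus-level metric such as micro-averaged F1 would instead need to be recomputed on each resampled set of documents.

```python
import random
from statistics import mean


def bootstrap_diff(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Unpaired percentile bootstrap for the difference in mean score.

    Hypothetical helper: returns the observed difference (a minus b) and the
    bounds of a (1 - alpha) confidence interval around it.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each portion independently, with replacement, at its own size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    # Read the interval bounds off the sorted bootstrap differences.
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = mean(scores_a) - mean(scores_b)
    return observed, lower, upper


# Made-up per-document scores, for illustration only.
politics = [0.62, 0.71, 0.58, 0.66, 0.69, 0.74, 0.60, 0.65]
sports = [0.81, 0.77, 0.85, 0.79, 0.83, 0.76, 0.88, 0.80]

observed, lower, upper = bootstrap_diff(politics, sports)
# If the interval excludes zero, the two portions plausibly differ.
differs = not (lower <= 0.0 <= upper)
```

Because the two portions contain different documents, the resampling here is unpaired, unlike the existing same-gold comparison of two outputs, where documents can be resampled jointly.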


Significance testing for different portions of gold #49
