CapitolHillChartMap

DavidMoeljadi edited this page Jan 16, 2017 · 8 revisions

Discussion: ChartMap

Led by: Francis Bond

[scribed by David]

FCB: ...the very complete goal, trying to make the reduplication that we have semi-working with the chart mapping work, where we expect to have a bit of guidance from Dan and Glenn. So let me just recap the current state. So let's just parse something with reduplication in Indonesian. Before we jump into reduplication, a few words from our local host.

Glenn: My impression is that you can understand what the chart mapping machinery does, but there is not a whole lot we can do as a group to help you do it. It is a very cool piece of machinery. It is not as formally precise, not as well understood, I think, and it is not reversible. It also sits quite early in the pipeline: immediately after REPP, well before the LKB morphology. I guess I just want a lot more clarification about what you are trying to achieve here.

FCB: What we want to do is solve the concrete problem. I want to get feedback. Dan has done the most with chart mapping in his grammar. I am hoping Dan can tell us whether it is possible or not. I think we have at least three grammars in this room that would like to have reduplication working.

Dan: I could do a two-minute talk. The intent of that pipeline is to normalize ordinary running text so that the parser deals with a simpler problem. For example, various Unicode characters can be used to mark an apostrophe in English writing. It is annoying to have all of those apostrophes inside the lexicon. It is convenient to have a normalizing step which says: whenever you see any of those apostrophes, turn it into a canonical one, and use that one throughout your lexicon. It is also a convenient tool for dealing with unknown words, because you can use POS taggers to give you a guess about the tag and then use the chart mapping to convert that into particular ... The larger intent of this preprocessor is to turn real text into something a little bit simpler, more regular, more predictable, so that your manually constructed lexical entries and rules do not have to work so hard. There are games we play with the chart mapping for English involving words with spaces, sort of putting in a space or not, pulling out ('s)s or possessives, but not for anything else. We also use the token mapping for expressions that involve numbers or other open-ended sources of tokens where you just can't list them all. You can pretty well list all of the dates in English, but it is annoying to do it: 31 days of the month and the variations, the European encoding of dates, etc. The same goes for currency and units of measure; there are a lot of real things that happen that are annoying to deal with inside a grammar. Chart mapping squeezes these into a more predictable, regular form. Now for the reduplication case, I think you want to make the machine do some further kind of tokenization, where there is no help from the conventional regex. So there is a bigger thing here: you want to see what the root is, and you need an engine which can run over that sequence of characters, find the pattern or sets of patterns there, and abstract away from the sequence of characters.
It does not seem like an impossible request for this engine, because chart mapping is supposed to be able to manipulate and discover the regularity hidden inside the sequences of surface characters.
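
[Editorial illustration, not part of the discussion.] The two ideas Dan describes, folding apostrophe variants into one canonical character and spotting a repeated root in a surface string, can be sketched in plain Python. This is only a rough sketch under stated assumptions: it is not chart mapping rule syntax, the set of apostrophe variants is an arbitrary choice, the function names are hypothetical, and it only handles full reduplication written with a hyphen (for example Indonesian "buku-buku"). The pattern relies on a back-reference, which goes beyond a strictly regular expression.

```python
import re

# Assumption: an illustrative set of apostrophe-like characters;
# a real grammar may fold a different set.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u0060\u00b4"
CANONICAL_APOSTROPHE = "'"

# Translation table mapping every variant to the canonical apostrophe.
_APOSTROPHE_TABLE = str.maketrans(
    {ch: CANONICAL_APOSTROPHE for ch in APOSTROPHE_VARIANTS}
)


def normalize_apostrophes(text: str) -> str:
    """Replace each apostrophe variant with the canonical one."""
    return text.translate(_APOSTROPHE_TABLE)


# Hypothetical pattern: the same base twice, joined by a hyphen.
# (?P=base) is a back-reference to whatever the first group matched.
FULL_REDUPLICATION = re.compile(r"^(?P<base>\w+)-(?P=base)$")


def split_reduplication(token: str) -> tuple[str, bool]:
    """Return (base, True) for a fully reduplicated token, else (token, False).

    Matching is case-sensitive, so "Buku-buku" is not recognized here.
    """
    match = FULL_REDUPLICATION.match(token)
    if match:
        return match.group("base"), True
    return token, False
```

For instance, `split_reduplication("buku-buku")` yields the base "buku" with the flag set, while a partial or sound-changing form such as "sayur-mayur" is returned unchanged. Forms like that are where a simple surface pattern stops helping.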

Glenn: You are working on surface strings, not tree structures, so this is before the syntax starts or morphology is dealt with. The grammar should be the first priority, and not be polluted or made impure with all of this.
