On the collapsing of duplicate types in unions #748
-
I am a little confused by your example, because I do not think that the type you use to represent the different iterators is actually a sum type. IIUC, you need one additional property that is only incidentally provided by the position of each alternative in the union. I would rather use a collection of iterators in your situation, or at least define a data structure that conceptually operates as a collection.
This is a very reasonable feature and it can definitely be discussed. As of now, however, I am not yet entirely convinced it has its place in Val. Given that "naming" individual elements of a sum type can be done by simply defining other types, I'm still on the lookout for use cases that would demonstrate how a fully-fledged feature would significantly improve usability and/or expressiveness and/or performance. For an example of what you can do now, look at the program below:
-
Yes, what I am saying is that this information is, indeed, important and should not be discarded. Your description of the alternatives as being unordered is a fine way of looking at a subset of use cases of discriminated unions, but not all of them. A discriminated union that allows duplicates can be used to implement your existing facility trivially: the user, or an alternative language facility, can remove duplicates, and that duplicate-free facility can trivially be implemented on top of the one that otherwise allows duplicates. The workarounds to effectively get duplicates when working with a facility that disallows them add complexity (for the user and for the implementation), with the most common solution being to have the user wrap the duplicate types in a dummy struct with a single member, so that the "duplicate" types become different. The user then further unwraps the type upon access (see the sketch below). For some perspective, in C++ the leading sum-type facility (boost::variant) did not allow duplicates. Over the years that it was used, the limitations of this became clear. It is not incidental that the standard facility and many other implementations that followed allow duplicates.
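As a rough C++ sketch of that workaround, assuming a facility with a no-duplicates rule (the wrapper name `alt` is purely illustrative, not from any library):

```cpp
#include <cstddef>
#include <variant>
#include <vector>

// A distinct wrapper type per index, all carrying the same payload.
template <std::size_t I, class T>
struct alt { T value; };

// Under a no-duplicates rule, the two vector iterators must be wrapped
// so that they no longer collide, and unwrapped again on every access:
using wrapped =
    std::variant<alt<0, std::vector<int>::iterator>,
                 alt<1, std::vector<int>::iterator>>;

// With duplicates allowed (std::variant permits this), no wrappers are
// needed; the discriminator alone distinguishes the two alternatives:
using direct =
    std::variant<std::vector<int>::iterator,
                 std::vector<int>::iterator>;
```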
By collection, do you mean a struct/product type of the iterators? If so, that does not accurately represent the situation here. My use of the term "iterator" here is in reference to a C++ iterator, in other words, a locator of a single element of a range. If you concatenate 10 ranges, for example, your iterator type ends up containing a sum of the 10 subrange iterator types, and the discriminator is what records which of the subranges the iterator currently points into (see the sketch below).
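To make that concrete, a minimal generic sketch in C++ (`concat_iter` is an illustrative name, not an existing library facility); because `std::variant` permits duplicate alternatives, the same alias works whether the subrange iterator types differ or coincide:

```cpp
#include <list>
#include <variant>
#include <vector>

// The alternatives of a concatenated view's iterator, one per subrange.
template <class... Ranges>
using concat_iter = std::variant<typename Ranges::iterator...>;

using mixed = concat_iter<std::vector<int>, std::list<int>>;   // distinct types
using same  = concat_iter<std::vector<int>, std::vector<int>>; // duplicates: fine
```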
Right, that was the workaround we used for years in C++ until accepting discriminated unions with duplicate types. At the moment you are requiring the user to create additional types solely to abide by the restriction that duplicates are disallowed. Those new types serve no other purpose and are effectively just acting as names or indices anyway. So in that sense, what exactly is being accomplished by the restriction at all? The user can still do what they need to do by artificially creating types, the implementation is still going through the process of checking for duplicates to collapse them out, the user needs to unwrap the types, and the implementation now has more datatypes to deal with overall as well. It ends up being more work for the users and for the implementation.

On the other hand, what is lost to the user/implementation when duplicates are simply allowed? In such a case, the user does not need to create intermediate wrapper types/tuples, and the implementation does less work (it does not need to trim duplicate types, and it is not dealing with these little dummy types). The only thing the user loses is the ability to index unambiguously by type, but if they want that, such a facility can be built on top of a discriminated union that allows duplicates (apply the restriction in a higher-level facility).

The situation for the user of a discriminated union that disallows duplicate types gets further complicated when a couple of other situations naturally happen: (1) the alternative types are parameters of generic code, so whether they coincide is out of the author's control, and (2) the set of alternatives is produced variadically, so names cannot be written out by hand.
Refer to the concatenation example for why both of these are problematic. In the case of (1), consider that you are writing that generic "concatenate" operation. It should ideally work perfectly fine whether it is passed a `vector<int>` and a `list<int>` or two `vector<int>`s. In the case of (2), there isn't anything particularly more complicated, except for the fact that names alone do not save you unless you have a way to programmatically create names. For instance, if your concatenate is variadic and you concatenate 10 ranges, the iterator type of the result contains a discriminated union of the iterator types of the subranges. To avoid the chance of "duplicates" being collapsed, the writer of the generic code would need to preemptively wrap each of those types in a unique type.

Having worked for a long time in the C++ world before duplicates were allowed, there is a more subtle issue that comes up when writing generic code. If you allow duplicates, everything tends to work by default. If you disallow duplicates, the problems can be latent until a user just so happens to be dealing with types that would lead the implementation to create a discriminated union with duplicates. In other words, even when workarounds are known, stuff "half works" when the workarounds are not used.

I suspect we are talking past each other because you are starting from the view that a discriminated union "should" not have duplicates. This property is not a given to argue away from; rather, it is a restriction that needs a practical reason for being. The uses for discriminated unions with duplicates do, indeed, exist and arise naturally. It is possible to make workarounds for such a restriction, but the simpler solution is not to have the restriction to begin with. A discriminated union without duplicates can be easily and efficiently built from one that does allow duplicates.
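As a rough illustration of that last point, a duplicate-free union can be layered over one that allows duplicates with a small type-level function. A minimal sketch on top of `std::variant` (the name `unique_variant` is purely illustrative):

```cpp
#include <type_traits>
#include <variant>

// Fold each type into the variant only if it is not already present.
template <class V, class... Ts>
struct unique_variant_impl { using type = V; };

template <class... Us, class T, class... Ts>
struct unique_variant_impl<std::variant<Us...>, T, Ts...>
    : unique_variant_impl<
          std::conditional_t<(std::is_same_v<T, Us> || ...),
                             std::variant<Us...>,       // duplicate: skip it
                             std::variant<Us..., T>>,   // new type: append it
          Ts...> {};

template <class... Ts>
using unique_variant =
    typename unique_variant_impl<std::variant<>, Ts...>::type;

static_assert(std::is_same_v<unique_variant<int, int, char>,
                             std::variant<int, char>>);
```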
-
This part, I find quite compelling. I think we may be discovering that union types with subtype relationships, which are non-nominal and collapse duplicates, are simply a different beast from tagged union types, which are usually nominal and allow duplicate payload types but don't support subtype relationships. I don't believe the latter are simply a notional tuple of optionals; there's an additional invariant that exactly one of the tuple elements is non-nil, and that leads to fundamental efficiency differences. That's no less fundamental than the difference between a linked list and an array. It may simply be a question of how often the need for each arises, and how painful it is to build one in terms of the other. I note that since we probably need enums for C++ interop anyway, it might not be stupid to think about extending them to support payloads, if the answers to the above questions don't look favorable.
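The efficiency difference is easy to observe concretely in C++; a minimal sketch (the exact sizes are implementation-dependent; the values in the comments are typical of a 64-bit ABI):

```cpp
#include <cstdio>
#include <optional>
#include <tuple>
#include <variant>

int main() {
    // A tuple of optionals carries one engaged/empty flag per element...
    std::printf("%zu\n", sizeof(std::tuple<std::optional<double>,
                                           std::optional<double>>)); // typically 32
    // ...while a tagged union carries a single discriminator, because
    // exactly one alternative is ever engaged.
    std::printf("%zu\n", sizeof(std::variant<double, double>));      // typically 16
}
```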
-
I’m pretty sure neither of these is more fundamental, at least if you have generics. Either one can be implemented in terms of the other.
On Sep 29, 2022, at 12:58 PM, Matt Calabrese ***@***.***> wrote:
I would ask to separate out the functionality and still have a way to make the more fundamental, simple discriminated union.
-
On Sep 29, 2022, at 6:44 PM, Dimitri Racordon ***@***.***> wrote:
Well, experience has shown me that Swift-like enums lack subtyping, which I believe to be a fundamental property.
Yes, I thought of that right after I posted. And IIRC we worked through all this and that’s why we ended up with sum types instead of tagged variants, as long as we were going to have just one feature. But the universe is now telling us that premise may have been wrong.
-
I'm just now starting to follow Val and am very pleased with the overall direction of the language. I have not used it yet in any capacity and am still at the level of just reading through the docs. One thing that jumps out at me that I disagree with, and would like to start a discussion on, is this line from the docs, which says that duplicate types in a union are collapsed:
I do not know how open for discussion this design decision is. What I take issue with is the first part of that quote, and I would like to push against the de-duplication of types. Instead, I suggest that duplicate types should never implicitly collapse. This suggestion primarily comes from experience as a long-standing user of the (surprisingly many) algebraic datatype libraries in C++. The basis of this view is that different alternatives that happen to have the same type can still have different meanings, semantically, so implicitly collapsing duplicate types can make working with these kinds of situations difficult (a concrete example appears later in this post). I suggest instead that type lists never be collapsed to remove duplicates, and I also encourage the ability to give alternatives names, similar to how Val allows tuple elements to have names. If a user wishes to collapse duplicates, they certainly still can, which is most commonly desired when the particular union is being used for something like type erasure over a fixed set of possible types. There can certainly be language facilities to make such de-duplication easy; however, the creation of sum types that have duplicate alternative types should be directly supported, and I'd argue should be the default.
A quick practical example of where you can naturally end up with duplicate types in a sum type, and where de-duplication causes problems, is something akin to the concatenation of ranges in C++ (a view of N ranges that each have the same value type, presented as if they were a single range, one after the other in order). Forgive me for the C++ example; it is just what I am most used to (a talk describing the exact case I refer to can be seen here, though I'll describe the relevant points in this post). In this case, the iterator type of the "concatenated" range ends up containing a sum type of the iterator types of the source ranges. For instance, if you concatenate a `std::vector<int>` and a `std::list<int>`, then the iterator type of the result contains a sum type of `std::vector<int>::iterator` and `std::list<int>::iterator`. This is because any given iterator may be referring either to an element of the `vector` or to an element of the `list`. Similarly, when you increment an iterator of the concatenated range, if you happen to be referring to a `vector` element, you need to check against the `end` iterator of that `vector` after incrementing the underlying iterator, and if you reach the `end` element, you update the sum type to now contain the `begin` element of the `list`.

In this particular situation, there happen to be no duplicate types and there are no problems. However, consider instead that you are concatenating a `vector<int>` and another `vector<int>`. There is nothing logically wrong with this situation, and in some sense it even appears simpler. Similar to the earlier example, what you naturally end up with inside the `iterator` type of the concatenated view is a sum type of `vector<int>::iterator` and `vector<int>::iterator`. Even though these are the same type, their "position" (or index) in the sum type imbues additional semantics, beyond the datatype, that are necessary in order to implement the new iterator. They cannot be collapsed down to a sum type with a single alternative without pushing additional complexity onto the user, since the differing discriminator value is precisely how you know whether the iterator refers to an element of the first underlying range or an element of the second underlying range.

If the exact problem that arises is not clear, consider again what an increment of the `iterator` into the concatenated view looks like. When that `iterator` is incremented, we first increment the underlying `iterator`, and if that iterator then happens to be referring to the `end` of the first range, we need to set the sum type to now contain the `begin` of the second range. Again, we know whether the current underlying `iterator` refers to an element of the first range or the second range based on the discriminator of the sum type (whether it is the 0th or 1st alternative). If the sum type instead implicitly collapses away duplicates, this discrimination is lost to the user! The easiest user-side solution ends up being to add additional wrapping to each duplicate alternative type before forming the sum type, and then "unwrapping" upon access. This kind of unfortunate workaround becomes necessary more generally any time you need to put types that are parameters into a sum type, if the facility being used to form the sum type implicitly "collapses" duplicates.

Situations like this are not specific to concatenation of ranges and naturally come up when you build more complicated datastructures around algebraic datatypes. Supporting duplicates is a default that makes sense, and users who want de-duplication for a specific case can still do that de-duplication with some kind of type-function if they wish.
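For completeness, here is a minimal sketch of the increment logic described above, assuming both subranges are `std::vector<int>` (the name `concat_iterator` is illustrative, and a real view would carry more state):

```cpp
#include <variant>
#include <vector>

// The discriminator (index 0 vs. 1) records which subrange the iterator
// currently points into; std::variant permits the duplicate alternatives.
struct concat_iterator {
    using underlying = std::vector<int>::iterator;

    std::variant<underlying, underlying> pos; // note: std::get<underlying>(pos)
                                              // would be ambiguous here, so all
                                              // access is by index
    underlying first_end;    // end of the first subrange
    underlying second_begin; // begin of the second subrange

    concat_iterator& operator++() {
        if (pos.index() == 0) {
            // In the first subrange: increment, and hop to the second
            // subrange upon reaching the first one's end.
            if (++std::get<0>(pos) == first_end)
                pos.emplace<1>(second_begin);
        } else {
            ++std::get<1>(pos);
        }
        return *this;
    }

    int& operator*() {
        return pos.index() == 0 ? *std::get<0>(pos) : *std::get<1>(pos);
    }
};
```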