Skip to content

feat: implement Primitive type Literal #117

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jun 4, 2025

Before implement expression, Primitive should be implemented. The spec follows: https://iceberg.apache.org/spec/#schemas-and-data-types

In this patch, only few types of Literals is supported, including int, long, float, double, boolean, string and binary.

TODO:

  • Implement detail logics about comparing
  • Implement detail logics about castTo
  • ToString and printing logics
  • Tests

Would in other patch:

  • More primitive type basic supports ( maybe without decimal now )
  • Implement detail logics about serialize/deserialize

@mapleFU mapleFU force-pushed the primitive-literal-impl branch from 8285c07 to 082c7e1 Compare June 4, 2025 15:47
int64_t, // for long, timestamp, timestamp_tz, time
float, // for float
double, // for double
std::string, // for string
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ string not requires utf8, so I wonder whether std::vector<uint8_t> with string is possible

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small String Optimization (SSO) will make PrimitiveLiteralValue 40 bytes, may be use a unique_ptr to own the string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit, Literal might be used as vector, so more bytes is not critical?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the data part of vector will not be in Literal's layout?
If we are not going to use something like std::vector, I think it's fine with the current design.

Copy link
Member Author

@mapleFU mapleFU Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm, a tiny wrapper like this should be defined if unique_ptr should be added. I'm also ok for the case

  struct StringLiteralValue {
    StringLiteralValue(std::string value);
    StringLiteralValue(const StringLiteralValue&);
    StringLiteralValue(StringLiteralValue&&) noexcept;
    const std::string& str() const {
      return *value_;
    }
    bool operator==(const StringLiteralValue&) const;
    std::strong_ordering operator<=>(const StringLiteralValue&) const;
  private:
    std::unique_ptr<std::string> value_;
  };

In libc++ version, sizeof(PrimitiveLiteralValue) == 48, since std::vector is large, it's still 48b after change to StringLiteralValue.

After change to below:

  template <typename T>
  struct LiteralValueWrapper {
    LiteralValueWrapper(T value);
    LiteralValueWrapper(const LiteralValueWrapper&);
    LiteralValueWrapper(LiteralValueWrapper&&) noexcept = default;
    const std::string& str() const {
      return *value_;
    }
    bool operator==(const LiteralValueWrapper<T>&) const;
    std::strong_ordering operator<=>(const LiteralValueWrapper<T>&) const;
  private:
    std::unique_ptr<T> value_;
  };

  using PrimitiveLiteralValue =
      std::variant<bool,                     // for boolean
                   int32_t,                  // for int, date
                   int64_t,                  // for long, timestamp, timestamp_tz, time
                   float,                    // for float
                   double,                   // for double
                   LiteralValueWrapper<std::string>,              // for string
                   LiteralValueWrapper<std::vector<uint8_t>>,     // for binary, fixed
                   std::array<uint8_t, 16>,  // for uuid and decimal
                   BelowMin, AboveMax>;

The sizeof sizeof(PrimitiveLiteralValue) == 48 becomes 40B, just eliminate 8B 😅. I think it's due to C++ doesn't do any niche optimization for std::variant... If the largest element in PrimitiveLiteralValue is 8B, sizeof(PrimitiveLiteralValue) can be 32B. So I think currently std::string is enough here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the analysis, let's keep std::string :)


private:
PrimitiveLiteralValue value_;
std::shared_ptr<PrimitiveType> type_;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't know std::shared_ptr<PrimitiveType> to heavy for this type

@wgtmac
Copy link
Member

wgtmac commented Jun 5, 2025

As discussed with @mapleFU offline, we need a good abstraction on Literal to represent arbitrary nesting structure (e.g. primitive, struct, map, list, and even geometry and variant?) and optimize specifically for primitive types.

As far as I know, literal values will be used in the following cases:

  1. Literal value in a predicate.
  2. Partition value in the DataFile.
  3. Default value in a SchemaField for Iceberg V3.
  4. Input and output values for a transform function.

For expression evaluator, we can directly evaluate the Arrow arrays as all file readers support reading into Arrow arrays so we don't pay for any data conversion. We can also specialize expression evaluators for ManifestFile, Snapshot and ContentFile. We can also provide a visitor to visit Arrow arrays to get the literal values of a single row with selected columns to make it easy to use.

If literals are all we need, we don't have to implement a StructLike in #77.

WDYT? @dongxiao1198 @gty404 @lidavidm @zhjwpku (names are sorted alphabetically :))

BelowMin, AboveMax>;

/// \brief PrimitiveLiteral is owned literal of a primitive type.
class PrimitiveLiteral {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to Datum since you are using datum.h/cc as the filename?
Personally I prefer the following:

using PrimitiveLiteral = std::variant<bool, ...
class Datum {
    PrimitiveLiteral literal_,
    std::shared_ptr<PrimitiveType>
};

int64_t, // for long, timestamp, timestamp_tz, time
float, // for float
double, // for double
std::string, // for string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small String Optimization (SSO) will make PrimitiveLiteralValue 40 bytes, may be use a unique_ptr to own the string?

@lidavidm
Copy link
Member

lidavidm commented Jun 5, 2025

The idea seems reasonable to me.

@mapleFU mapleFU marked this pull request as ready for review June 5, 2025 15:42
@mapleFU
Copy link
Member Author

mapleFU commented Jun 5, 2025

Would start testing if interface is ok

Copy link
Collaborator

@zhjwpku zhjwpku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mapleFU mapleFU force-pushed the primitive-literal-impl branch from d6b7e8e to 146a86e Compare June 6, 2025 13:33
static PrimitiveLiteral Float(float value);
static PrimitiveLiteral Double(double value);
static PrimitiveLiteral String(std::string value);
static PrimitiveLiteral Binary(std::vector<uint8_t> value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can UUIDLiteral and DecimalLiteral also be created via factory methods?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can delay it into future patches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need. Otherwise we don't know the exact type of the literal.

public:
/// Factory methods for primitive types
static PrimitiveLiteral Boolean(bool value);
static PrimitiveLiteral Int(int32_t value);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should treat Timestamp and Date differently from long/int in expressions or elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can uses more factory here. This patch only supports int, float, string, binary. We can first make this in so we can then write the (1) impl logic for other literals (2) basic implement for predicate expressions.

break;
}

return NotSupported("Cast from {} to {} is not implemented", type_->ToString(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not implemented -> not supported?

Copy link
Contributor

@yingcai-cy yingcai-cy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -24,6 +24,7 @@ set(ICEBERG_SOURCES
expression/expression.cc
file_reader.cc
json_internal.cc
literal.cc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be put into the expression subdir?

int64_t, // for long, timestamp, timestamp_tz, time
float, // for float
double, // for double
std::string, // for string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a string_view variant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrimitiveLiteral is "owned", I think using string_view is high risk and most of time doesn't bring up performance

namespace iceberg {

/// \brief PrimitiveLiteral is a literal value that is associated with a primitive type.
class ICEBERG_EXPORT PrimitiveLiteral {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we simply use Literal instead of PrimitiveLiteral which looks lengthy?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I afraid in the future there would be MapLiteral and StructLiteral. But all is ok to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MapLiteral and StructLiteral may be combination of Literals and the majority use case is still for primitive types. So I believe making them shorter looks better. @Fokko WDYT?

float, // for float
double, // for double
std::string, // for string
std::vector<uint8_t>, // for binary, fixed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reusing std::string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good idea. C++ doesn't have a concept for "utf8 encoded string", here I would like to use std::string to repr this. But std::vector<uint8_t> or std::string all looks ok to me

static PrimitiveLiteral Float(float value);
static PrimitiveLiteral Double(double value);
static PrimitiveLiteral String(std::string value);
static PrimitiveLiteral Binary(std::vector<uint8_t> value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need. Otherwise we don't know the exact type of the literal.

private:
PrimitiveLiteral(PrimitiveLiteralValue value, std::shared_ptr<PrimitiveType> type);

static PrimitiveLiteral BelowMinLiteral(std::shared_ptr<PrimitiveType> type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a type here, right? If your intention is to make them comparable, I think we can add ToAboveMax() and ToBelowMin() to replace these two static functions to keep the original type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a type here, right?

BelowMin and AboveMax should be typed, a BelowMin for int32 and other types have different meaning

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample: Literal(type:int64, value: int32_t::max() + 1) cast to int32_t becomes Literal(type:int32, value: AboveMax)

@mapleFU mapleFU changed the title feat: implement PrimitiveLiteral feat: implement Primitive type Literal Jun 20, 2025
std::strong_ordering operator<=>(const AboveMax&) const = default;
};

using Value = std::variant<bool, // for boolean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make it public so users are able to get the value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can expose this in the future, but currently I think encapsulate in here is ok

case TypeId::kTime:
case TypeId::kTimestamp:
case TypeId::kTimestampTz: {
throw IcebergError("Not implemented: ToString for " + type_->ToString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not throw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm what should we do here? Here I want to first mark it as unimplemented, when all literal is implemented, this should be change to std::string. Should this first as:

return fmt::format("Not implemented: ToString for {}", type_->ToString());

Or

Result<std::string> .ToString()

?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants