-
Notifications
You must be signed in to change notification settings - Fork 37
feat: implement Primitive type Literal #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
8285c07
to
082c7e1
Compare
src/iceberg/datum.h
Outdated
int64_t, // for long, timestamp, timestamp_tz, time | ||
float, // for float | ||
double, // for double | ||
std::string, // for string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C++ string not requires utf8, so I wonder whether std::vector<uint8_t>
with string is possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small String Optimization (SSO) will make PrimitiveLiteralValue 40 bytes, may be use a unique_ptr to own the string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit, Literal might be used as vector, so more bytes is not critical?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data part of vector will not be in Literal's layout?
If we are not going to use something like std::vector, I think it's fine with the current design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm, a tiny wrapper like this should be defined if unique_ptr should be added. I'm also ok for the case
struct StringLiteralValue {
StringLiteralValue(std::string value);
StringLiteralValue(const StringLiteralValue&);
StringLiteralValue(StringLiteralValue&&) noexcept;
const std::string& str() const {
return *value_;
}
bool operator==(const StringLiteralValue&) const;
std::strong_ordering operator<=>(const StringLiteralValue&) const;
private:
std::unique_ptr<std::string> value_;
};
In libc++ version, sizeof(PrimitiveLiteralValue) == 48
, since std::vector
is large, it's still 48b after change to StringLiteralValue
.
After change to below:
template <typename T>
struct LiteralValueWrapper {
LiteralValueWrapper(T value);
LiteralValueWrapper(const LiteralValueWrapper&);
LiteralValueWrapper(LiteralValueWrapper&&) noexcept = default;
const std::string& str() const {
return *value_;
}
bool operator==(const LiteralValueWrapper<T>&) const;
std::strong_ordering operator<=>(const LiteralValueWrapper<T>&) const;
private:
std::unique_ptr<T> value_;
};
using PrimitiveLiteralValue =
std::variant<bool, // for boolean
int32_t, // for int, date
int64_t, // for long, timestamp, timestamp_tz, time
float, // for float
double, // for double
LiteralValueWrapper<std::string>, // for string
LiteralValueWrapper<std::vector<uint8_t>>, // for binary, fixed
std::array<uint8_t, 16>, // for uuid and decimal
BelowMin, AboveMax>;
The sizeof sizeof(PrimitiveLiteralValue) == 48
becomes 40B, just eliminate 8B 😅. I think it's due to C++ doesn't do any niche optimization for std::variant... If the largest element in PrimitiveLiteralValue
is 8B, sizeof(PrimitiveLiteralValue)
can be 32B. So I think currently std::string is enough here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the analysis, let's keep std::string :)
src/iceberg/datum.h
Outdated
|
||
private: | ||
PrimitiveLiteralValue value_; | ||
std::shared_ptr<PrimitiveType> type_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't know std::shared_ptr<PrimitiveType>
to heavy for this type
As discussed with @mapleFU offline, we need a good abstraction on As far as I know, literal values will be used in the following cases:
For expression evaluator, we can directly evaluate the Arrow arrays as all file readers support reading into Arrow arrays so we don't pay for any data conversion. We can also specialize expression evaluators for If literals are all we need, we don't have to implement a WDYT? @dongxiao1198 @gty404 @lidavidm @zhjwpku (names are sorted alphabetically :)) |
src/iceberg/datum.h
Outdated
BelowMin, AboveMax>; | ||
|
||
/// \brief PrimitiveLiteral is owned literal of a primitive type. | ||
class PrimitiveLiteral { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to Datum since you are using datum.h/cc as the filename?
Personally I prefer the following:
using PrimitiveLiteral = std::variant<bool, ...
class Datum {
PrimitiveLiteral literal_,
std::shared_ptr<PrimitiveType>
};
src/iceberg/datum.h
Outdated
int64_t, // for long, timestamp, timestamp_tz, time | ||
float, // for float | ||
double, // for double | ||
std::string, // for string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small String Optimization (SSO) will make PrimitiveLiteralValue 40 bytes, may be use a unique_ptr to own the string?
The idea seems reasonable to me. |
Would start testing if interface is ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
d6b7e8e
to
146a86e
Compare
src/iceberg/literal.h
Outdated
static PrimitiveLiteral Float(float value); | ||
static PrimitiveLiteral Double(double value); | ||
static PrimitiveLiteral String(std::string value); | ||
static PrimitiveLiteral Binary(std::vector<uint8_t> value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can UUIDLiteral and DecimalLiteral also be created via factory methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can delay it into future patches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need. Otherwise we don't know the exact type of the literal.
src/iceberg/literal.h
Outdated
public: | ||
/// Factory methods for primitive types | ||
static PrimitiveLiteral Boolean(bool value); | ||
static PrimitiveLiteral Int(int32_t value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we should treat Timestamp and Date differently from long/int in expressions or elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can uses more factory here. This patch only supports int, float, string, binary. We can first make this in so we can then write the (1) impl logic for other literals (2) basic implement for predicate expressions.
src/iceberg/literal.cc
Outdated
break; | ||
} | ||
|
||
return NotSupported("Cast from {} to {} is not implemented", type_->ToString(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not implemented -> not supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/iceberg/CMakeLists.txt
Outdated
@@ -24,6 +24,7 @@ set(ICEBERG_SOURCES | |||
expression/expression.cc | |||
file_reader.cc | |||
json_internal.cc | |||
literal.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be put into the expression
subdir?
src/iceberg/literal.h
Outdated
int64_t, // for long, timestamp, timestamp_tz, time | ||
float, // for float | ||
double, // for double | ||
std::string, // for string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a string_view
variant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PrimitiveLiteral is "owned", I think using string_view is high risk and most of time doesn't bring up performance
src/iceberg/literal.h
Outdated
namespace iceberg { | ||
|
||
/// \brief PrimitiveLiteral is a literal value that is associated with a primitive type. | ||
class ICEBERG_EXPORT PrimitiveLiteral { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we simply use Literal
instead of PrimitiveLiteral
which looks lengthy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I afraid in the future there would be MapLiteral and StructLiteral. But all is ok to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MapLiteral and StructLiteral may be combination of Literals and the majority use case is still for primitive types. So I believe making them shorter looks better. @Fokko WDYT?
src/iceberg/literal.h
Outdated
float, // for float | ||
double, // for double | ||
std::string, // for string | ||
std::vector<uint8_t>, // for binary, fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not reusing std::string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good idea. C++ doesn't have a concept for "utf8 encoded string", here I would like to use std::string
to repr this. But std::vector<uint8_t>
or std::string
all looks ok to me
src/iceberg/literal.h
Outdated
static PrimitiveLiteral Float(float value); | ||
static PrimitiveLiteral Double(double value); | ||
static PrimitiveLiteral String(std::string value); | ||
static PrimitiveLiteral Binary(std::vector<uint8_t> value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need. Otherwise we don't know the exact type of the literal.
src/iceberg/literal.h
Outdated
private: | ||
PrimitiveLiteral(PrimitiveLiteralValue value, std::shared_ptr<PrimitiveType> type); | ||
|
||
static PrimitiveLiteral BelowMinLiteral(std::shared_ptr<PrimitiveType> type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a type here, right? If your intention is to make them comparable, I think we can add ToAboveMax()
and ToBelowMin()
to replace these two static functions to keep the original type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need a type here, right?
BelowMin
and AboveMax
should be typed, a BelowMin for int32 and other types have different meaning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample: Literal(type:int64, value: int32_t::max() + 1)
cast to int32_t becomes Literal(type:int32, value: AboveMax)
std::strong_ordering operator<=>(const AboveMax&) const = default; | ||
}; | ||
|
||
using Value = std::variant<bool, // for boolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make it public so users are able to get the value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can expose this in the future, but currently I think encapsulate in here is ok
case TypeId::kTime: | ||
case TypeId::kTimestamp: | ||
case TypeId::kTimestampTz: { | ||
throw IcebergError("Not implemented: ToString for " + type_->ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not throw
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Emmm what should we do here? Here I want to first mark it as unimplemented, when all literal is implemented, this should be change to std::string
. Should this first as:
return fmt::format("Not implemented: ToString for {}", type_->ToString());
Or
Result<std::string> .ToString()
?
Co-authored-by: Gang Wu <ustcwg@gmail.com>
Before implement expression, Primitive should be implemented. The spec follows: https://iceberg.apache.org/spec/#schemas-and-data-types
In this patch, only few types of Literals is supported, including int, long, float, double, boolean, string and binary.
TODO:
Would in other patch: