Skip to content

Conversation

davidangb
Copy link
Contributor

@davidangb davidangb commented Dec 17, 2024

This PR is an optimization found while researching CORE-207. This should be performance only; it has no functional impact to end users.

The stringToTypedAttribute() function is called for each cell in an uploaded TSV; this is a lot of invocations so any optimization here is useful.

  1. check the cell's first character. Previously, we attempted to parse the cell as a number, a boolean, and a reference before giving up and calling it a string. In this PR, we grab the first non-whitespace character of the cell and compare it to a known list of possible first characters for the non-string datatypes - for instance, a boolean will always start with "t" or "f". This is faster than trying to parse the whole cell and results in a ~2.4x speedup in our new benchmark.

Raw benchmark numbers:

  ops/sec error +/- vs.baseline avg ops/sec avg vs. baseline
baseline (648c5f2) 13,426.94 598.82   13,227.86  
  12,690.29 564.33 94.51%    
  13,566.35 277.55 101.04%    
first-character checks (ed275ec) 32,275.95 660.82 240.38% 32,325.38 244.37%
  32,994.04 912.17 245.73%    
  31,706.16 1,068.82 236.14%    
more Scala-y (6ad61e4) 32,781.50 564.73 244.15% 32,126.14 242.87%
  31,982.75 1,143.94 238.20%    
  31,614.16 634.61 235.45%    

Have you read CONTRIBUTING.md lately? If not, do that first.

I, the developer opening this PR, do solemnly pinky swear that:

  • I've followed the instructions if I've made any changes to the API, especially if they're breaking changes
  • I've updated the FISMA documentation if I've made any security-related changes, including auth, encryption, or auditing

In all cases:

  • Get two thumbsworth of review and PO signoff if necessary
  • Verify all tests go green
  • Squash and merge. Make sure your branch deletes; GitHub should do this for you.
  • Test this change deployed correctly and works on dev environment after deployment

@davidangb davidangb marked this pull request as ready for review December 18, 2024 13:29
@davidangb davidangb requested a review from a team as a code owner December 18, 2024 13:29
@davidangb davidangb requested review from dvoet and kevinmarete and removed request for a team December 18, 2024 13:29
Copy link
Contributor

@dvoet dvoet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can take or leave my comment

Comment on lines 196 to 248
private def toAttribute(value: String): Attribute = {
val trimmed = value.trim
// empty string
if (trimmed.isEmpty) {
return AttributeString(value)
}
// find the first character of the inbound cell
val firstChar = trimmed.charAt(0)

// first char doesn't match any known starters; this is a string
if (!possibleNonStrings.contains(firstChar)) {
return AttributeString(value)
}

// could this be a number?
if (possibleNumbers.contains(firstChar)) {
Try(java.lang.Integer.parseInt(value)) match {
case Success(intValue) => return AttributeNumber(intValue)
case Failure(_) =>
Try(java.lang.Double.parseDouble(value)) match {
// because we represent AttributeNumber as a BigDecimal, and BigDecimal has no concept of infinity or NaN,
// if we find infinite/NaN numbers here, don't save them as AttributeNumber; instead let them fall through
// to AttributeString.
case Success(doubleValue)
if !Double.NegativeInfinity.equals(doubleValue)
&& !Double.PositiveInfinity.equals(doubleValue)
&& !Double.NaN.equals(doubleValue)
&& !matchesLiteral(value) =>
return AttributeNumber(doubleValue)
case _ => // noop
}
}
}

// could this be a boolean?
if (possibleBooleans.contains(firstChar)) {
Try(BooleanUtils.toBoolean(value.toLowerCase, "true", "false")) match {
case Success(booleanValue) => return AttributeBoolean(booleanValue)
case _ => // noop
}
}

// could this be a reference?
if (possibleReferences.contains(firstChar)) {
Try(value.parseJson.convertTo[AttributeEntityReference]) match {
case Success(ref) => return ref
case _ => // noop
}
}

// it's a string.
AttributeString(value)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early exits bother me in scala code so I asked copilot to rewrite using collectFirst and this is what it came up with.

Suggested change
private def toAttribute(value: String): Attribute = {
val trimmed = value.trim
// empty string
if (trimmed.isEmpty) {
return AttributeString(value)
}
// find the first character of the inbound cell
val firstChar = trimmed.charAt(0)
// first char doesn't match any known starters; this is a string
if (!possibleNonStrings.contains(firstChar)) {
return AttributeString(value)
}
// could this be a number?
if (possibleNumbers.contains(firstChar)) {
Try(java.lang.Integer.parseInt(value)) match {
case Success(intValue) => return AttributeNumber(intValue)
case Failure(_) =>
Try(java.lang.Double.parseDouble(value)) match {
// because we represent AttributeNumber as a BigDecimal, and BigDecimal has no concept of infinity or NaN,
// if we find infinite/NaN numbers here, don't save them as AttributeNumber; instead let them fall through
// to AttributeString.
case Success(doubleValue)
if !Double.NegativeInfinity.equals(doubleValue)
&& !Double.PositiveInfinity.equals(doubleValue)
&& !Double.NaN.equals(doubleValue)
&& !matchesLiteral(value) =>
return AttributeNumber(doubleValue)
case _ => // noop
}
}
}
// could this be a boolean?
if (possibleBooleans.contains(firstChar)) {
Try(BooleanUtils.toBoolean(value.toLowerCase, "true", "false")) match {
case Success(booleanValue) => return AttributeBoolean(booleanValue)
case _ => // noop
}
}
// could this be a reference?
if (possibleReferences.contains(firstChar)) {
Try(value.parseJson.convertTo[AttributeEntityReference]) match {
case Success(ref) => return ref
case _ => // noop
}
}
// it's a string.
AttributeString(value)
}
private def toAttribute(value: String): Attribute = {
val trimmed = value.trim
if (trimmed.isEmpty) {
// empty string
AttributeString(value)
} else {
val firstChar = trimmed.charAt(0)
val possibleConversions = Seq(
() =>
// first char doesn't match any known starters; this is a string
if (!possibleNonStrings.contains(firstChar)) Some(AttributeString(value))
else None,
() =>
// could this be an int?
if (possibleNumbers.contains(firstChar)) Try(java.lang.Integer.parseInt(value)).toOption.map(AttributeNumber(_))
else None,
() =>
// could this be a double?
if (possibleNumbers.contains(firstChar))
Try(java.lang.Double.parseDouble(value)).toOption
// because we represent AttributeNumber as a BigDecimal, and BigDecimal has no concept of infinity or NaN,
// if we find infinite/NaN numbers here, don't save them as AttributeNumber; instead let them fall through
// to AttributeString.
.filterNot(doubleValue =>
Double.NegativeInfinity.equals(doubleValue) || Double.PositiveInfinity.equals(doubleValue) || Double.NaN
.equals(doubleValue) || matchesLiteral(value)
)
.map(AttributeNumber(_))
else None,
() =>
// could this be a boolean?
if (possibleBooleans.contains(firstChar))
Try(BooleanUtils.toBoolean(value.toLowerCase, "true", "false")).toOption.map(AttributeBoolean)
else None,
() =>
// could this be a reference?
if (possibleReferences.contains(firstChar)) Try(value.parseJson.convertTo[AttributeEntityReference]).toOption
else None
)
possibleConversions
.collectFirst { case conversion if conversion().isDefined => conversion().get }
.getOrElse(AttributeString(value))
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dvoet I've rewritten to be more idiomatic - what do you think?

@davidangb davidangb merged commit 36b8583 into develop Dec 18, 2024
13 checks passed
@davidangb davidangb deleted the da_CORE-207_optimizeTsvUpload branch December 18, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants