
[SPARK-52265][SQL][TEST] Fix regex leading to empty PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite #50989


Closed
efaracci018 wants to merge 5 commits into apache:master from efaracci018:fix-testingVersions

Conversation

efaracci018 (Contributor) commented May 22, 2025

What changes were proposed in this pull request?

Fix the version parsing logic in HiveExternalCatalogVersionsSuite to properly handle new artifact paths in https://dist.apache.org/repos/dist/release/spark/ so that the "backward compatibility" test can run.

This change introduces a constant, val SparkVersionPattern = """<a href="spark-(\d.\d.\d)/">""".r, for more precise version matching, and removes the now-redundant .filterNot(_.contains("preview")).
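A minimal, self-contained sketch of the stricter matching (the VersionParsingSketch object and parseVersions helper are illustrative names, not the suite's actual code):

object VersionParsingSketch {
  // Anchors on a literal "spark-" followed directly by a numeric x.y.z
  // version and a closing "/". Sibling artifacts such as
  // spark-connect-swift-0.1.0/ and spark-kubernetes-operator-0.1.0/ have a
  // non-digit right after "spark-", and preview tags such as
  // spark-x.y.z-previewN/ lack the "/" right after the version, so neither
  // matches; that is why .filterNot(_.contains("preview")) became redundant.
  private val SparkVersionPattern = """<a href="spark-(\d.\d.\d)/">""".r

  def parseVersions(listingHtml: String): Seq[String] =
    SparkVersionPattern.findAllMatchIn(listingHtml).map(_.group(1)).toSeq

  def main(args: Array[String]): Unit = {
    val listing =
      """<a href="spark-3.5.5/">spark-3.5.5/</a>
        |<a href="spark-connect-swift-0.1.0/">spark-connect-swift-0.1.0/</a>
        |<a href="spark-kubernetes-operator-0.1.0/">spark-kubernetes-operator-0.1.0/</a>""".stripMargin
    println(parseVersions(listing)) // List(3.5.5)
  }
}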

Why are the changes needed?

The suite fails to execute the "backward compatibility" test because of version-parsing errors. The current implementation cannot parse versions when it encounters new paths such as spark-connect-swift-0.1.0/ and spark-kubernetes-operator-0.1.0/ in https://dist.apache.org/repos/dist/release/spark/.

This leaves PROCESS_TABLES.testingVersions empty, which in turn produces a logError: "Exception encountered when invoking run on a nested suite - Fail to get the latest Spark versions to test". As a result, the condition to run the test is never met.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Executed local build and test for HiveExternalCatalogVersionsSuite:

build/mvn -pl sql/hive -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite test-compile scalatest:test

Verified that the reported error no longer appears, the "backward compatibility" test runs successfully, and PROCESS_TABLES.testingVersions, which was previously empty, now correctly contains "3.5.5" when printed.

Was this patch authored or co-authored using generative AI tooling?

No.

[SPARK-52265][SQL][TEST] Fix regex leading to empty PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite
github-actions bot added the SQL label May 22, 2025
efaracci018 (Contributor, Author) commented:

@cloud-fan please review, thank you

LuciferYang (Contributor) commented May 23, 2025

It seems that new issues may still arise with this PR after the 4.0 release:

build/sbt "hive/testOnly org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite" -Phive
[info] HiveExternalCatalogVersionsSuite:
12:55:31.252 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 4.0.0 from https://dlcdn.apache.org//spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz: https://dlcdn.apache.org//spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz
12:55:32.485 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 4.0.0 from https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz: https://archive.apache.org/dist/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz
12:55:33.414 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 4.0.0 from https://dist.apache.org/repos/dist/release/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz: https://dist.apache.org/repos/dist/release/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3-scala2.13.tgz
12:55:33.597 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: 

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.hive.HiveExternalCatalogVersionsSuite, threads: Keep-Alive-Timer (daemon=true) =====
[info] org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** (3 minutes, 41 seconds)
[info]   Unable to download Spark 4.0.0 (HiveExternalCatalogVersionsSuite.scala:125)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
[info]   at org.scalatest.funsuite.AnyFunSuite.fail(AnyFunSuite.scala:1564)
[info]   at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.tryDownloadSpark(HiveExternalCatalogVersionsSuite.scala:125)
[info]   at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.$anonfun$beforeAll$4(HiveExternalCatalogVersionsSuite.scala:212)
[info]   at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.$anonfun$beforeAll$4$adapted(HiveExternalCatalogVersionsSuite.scala:209)
[info]   at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:619)
[info]   at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:617)
[info]   at scala.collection.AbstractIterable.foreach(Iterable.scala:935)
[info]   at org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:209)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]   at java.base/java.lang.Thread.run(Thread.java:840)
[info] Run completed in 3 minutes, 42 seconds.
[info] Total number of tests run: 0
[info] Suites: completed 0, aborted 1
[info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
[info] *** 1 SUITE ABORTED ***
[error] Error during tests:
[error] 	org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite
[error] (hive / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful

The package name for version 4.0 doesn't include the 'scala2.13' identifier.


@efaracci018 Are you interested in taking another look?

efaracci018 (Contributor, Author) commented:

Thank you @LuciferYang for pointing out this recent issue, I am interested in taking a look.

efaracci018 (Contributor, Author) commented:

Requiring the 'scala2.13' identifier was introduced by an earlier PR, with a comment that noted:

We can change it back to *-hadoop3.tgz at Apache Spark 4.1 after the official Apache Spark 4.0 release.

Given that the official 4.0 release is out, we can drop the 'scala2.13' identifier for 4.0 but need to keep it for any 3.x version.

This fix allows the test to run, and I verified that Scala 2.13 is used:

~/w/O/spark ❯❯❯ ls -al /tmp/test-spark/spark-3.5.6/jars/scala*
-rw-r--r--@ 1 efaracci  wheel      5596 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-collection-compat_2.13-2.7.0.jar
-rw-r--r--@ 1 efaracci  wheel  12097183 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-compiler-2.13.8.jar
-rw-r--r--@ 1 efaracci  wheel   6003601 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-library-2.13.8.jar
-rw-r--r--@ 1 efaracci  wheel   1127123 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-parallel-collections_2.13-1.0.4.jar
-rw-r--r--@ 1 efaracci  wheel    189332 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-parser-combinators_2.13-2.3.0.jar
-rw-r--r--@ 1 efaracci  wheel   3772083 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-reflect-2.13.8.jar
-rw-r--r--@ 1 efaracci  wheel    483090 May 22 23:20 /tmp/test-spark/spark-3.5.6/jars/scala-xml_2.13-2.1.0.jar
~/w/O/spark ❯❯❯ ls -al /tmp/test-spark/spark-4.0.0/jars/scala*      
-rw-r--r--@ 1 efaracci  wheel      5596 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-collection-compat_2.13-2.7.0.jar
-rw-r--r--@ 1 efaracci  wheel  12308669 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-compiler-2.13.16.jar
-rw-r--r--@ 1 efaracci  wheel   5926340 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-library-2.13.16.jar
-rw-r--r--@ 1 efaracci  wheel   1120803 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-parallel-collections_2.13-1.2.0.jar
-rw-r--r--@ 1 efaracci  wheel    189196 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-parser-combinators_2.13-2.4.0.jar
-rw-r--r--@ 1 efaracci  wheel   3801119 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-reflect-2.13.16.jar
-rw-r--r--@ 1 efaracci  wheel    546013 May 19 01:26 /tmp/test-spark/spark-4.0.0/jars/scala-xml_2.13-2.3.0.jar

Please note the following constraints:

  1. We can remove the match expression once 3.x is no longer being tested.
  2. This will likely have to change again when Spark moves to Scala 3.x.
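For reference, a condensed sketch of the filename logic described above; it mirrors the diff excerpts quoted in the review comments below (tarballName is an illustrative wrapper, in the suite the same match sits inline in tryDownloadSpark):

// Pick the tarball name per major version: 3.x release artifacts carry an
// explicit Scala suffix, 4.x artifacts do not.
def tarballName(version: String): String = version match {
  case v if v.startsWith("3") => s"spark-$version-bin-hadoop3-scala2.13.tgz"
  case _ => s"spark-$version-bin-hadoop3.tgz"
}

// tarballName("3.5.6") == "spark-3.5.6-bin-hadoop3-scala2.13.tgz"
// tarballName("4.0.0") == "spark-4.0.0-bin-hadoop3.tgz"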

LuciferYang (Contributor) left a review:

+1, LGTM (pending tests)

@@ -96,7 +96,10 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
       mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
     logInfo(s"Trying to download Spark $version from $sites")
     for (site <- sites) {
-      val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"
+      val filename = version match {
+        case v if v.startsWith("3") => s"spark-$version-bin-hadoop3-scala2.13.tgz"
A Member commented:

nit: use "3." for the version match

-      val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"
+      val filename = version match {
+        case v if v.startsWith("3") => s"spark-$version-bin-hadoop3-scala2.13.tgz"
+        case _ => s"spark-$version-bin-hadoop3.tgz"
A Member commented:

consider using "4." for the version match and failing the test for an unknown/unexpected version.

@@ -96,7 +96,10 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
       mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
     logInfo(s"Trying to download Spark $version from $sites")
     for (site <- sites) {
-      val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"
+      val filename = version match {
A Member commented:

consider

val filename = s"spark-$version-bin-hadoop3$scala_version.tgz"

and define

val scala_version = ...

using version match logic.
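A sketch of that suggested refactor, folding in the earlier review notes to match on "3."/"4." and to fail on anything unexpected (the exact branch conditions are assumptions, not necessarily the merged code; this sits inside the suite's download loop, where version and ScalaTest's fail are in scope):

// Derive only the Scala-suffix portion from the version, build the
// filename once, and fail fast on a version the suite does not expect.
val scalaVersion = version match {
  case v if v.startsWith("3.") => "-scala2.13" // 3.x artifacts keep the suffix
  case v if v.startsWith("4.") => ""           // 4.x artifacts drop it
  case other => fail(s"Unexpected Spark version: $other")
}
val filename = s"spark-$version-bin-hadoop3$scalaVersion.tgz"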

efaracci018 (Contributor, Author) replied:

Thank you, I've put up a revision with this change.

vrozov (Member) commented Jun 4, 2025

LGTM

@@ -96,7 +96,12 @@ class HiveExternalCatalogVersionsSuite extends SparkSubmitTestUtils {
       mirrors.distinct :+ "https://archive.apache.org/dist" :+ PROCESS_TABLES.releaseMirror
     logInfo(s"Trying to download Spark $version from $sites")
     for (site <- sites) {
-      val filename = s"spark-$version-bin-hadoop3-scala2.13.tgz"
+      val scala_version = version match {
A Contributor commented:

Suggested change:
-      val scala_version = version match {
+      val scalaVersion = version match {

A Contributor commented:

Variables should be named using camelCase in Spark.

efaracci018 (Contributor, Author) replied:

Sounds good, I've updated it to scalaVersion, and to sparkVersionPattern as well.

LuciferYang (Contributor) commented:

Once the tests pass, I will merge this PR.

LuciferYang (Contributor) commented:

@efaracci018 Could you please provide your Jira account? I need to add your Jira account to the Spark Contributor group.

efaracci018 (Contributor, Author) commented:

Thanks @LuciferYang, my Jira account is efaracci018.

LuciferYang pushed a commit that referenced this pull request Jun 6, 2025
[SPARK-52265][SQL][TEST] Fix regex leading to empty PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite

(The commit message repeats the PR description above.)

Closes #50989 from efaracci018/fix-testingVersions.

Lead-authored-by: Emilie Faracci <efaracci@amazon.com>
Co-authored-by: efaracci018 <efaracci@amazon.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit a380380)
Signed-off-by: yangjie01 <yangjie01@baidu.com>
LuciferYang (Contributor) commented Jun 6, 2025

Merged into master and branch-4.0. Thanks @efaracci018 and @vrozov

@efaracci018 Welcome to the Apache Spark community.

yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
[SPARK-52265][SQL][TEST] Fix regex leading to empty PROCESS_TABLES.testingVersions in HiveExternalCatalogVersionsSuite

(The commit message repeats the PR description above.)

Closes apache#50989 from efaracci018/fix-testingVersions.

Lead-authored-by: Emilie Faracci <efaracci@amazon.com>
Co-authored-by: efaracci018 <efaracci@amazon.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>