Spark 4 support #452
Conversation
This PR has too many files to review (>50 files).
Force-pushed from e3c873b to db9ed7a
mzitnik left a comment
Since the PR is huge, let's have a 1:1 to go over it, but here are a few comments for now.
.github/workflows/build-and-test.yml
Outdated
  with:
    distribution: zulu
-   java-version: 8
+   java-version: 17
I think we can merge run-tests and run-tests-with-specific-clickhouse into one job. Do you know what version we are running on run-tests?
run-tests runs against 23.8 (the default version in the ClickHouseSingleMixIn file).
We could merge them into one job, but then we'd need to update the matrix to run all tests on all ClickHouse versions. That might actually be a good idea, but it would create a lot of new test jobs; maybe we should change this only once we fix the tests so they can run concurrently.
In general, this test runs against a local ClickHouse, not the cloud, so it does not have the concurrency issue.
.github/workflows/cloud.yml
Outdated
- scala: [ 2.12, 2.13 ]
+ spark: [ '3.3', '3.4', '3.5', '4.0' ]
+ scala: [ '2.12', '2.13' ]
+ java: [ 8 ]
Why not java: [ 8, 17 ]?
Again, this would cause many more test variants to run (today only Java 8 is used).
I'm OK with that, since we might rely on a feature that is not supported on Java 8 and could break, so it's better to test both.
.github/workflows/tpcds.yml
Outdated
- scala: [ 2.12, 2.13 ]
+ spark: [ '3.3', '3.4', '3.5', '4.0' ]
+ scala: [ '2.12', '2.13' ]
+ java: [ 8 ]
Same here: java: [ 8, 17 ].
Same as previous comment
  apply plugin: "java-library"
  apply plugin: "org.scoverage"
+ // Disable scoverage when running Metals' bloopInstall to avoid plugin resolution issues
+ def isBloopInstall = gradle.startParameter.taskNames.any { it.contains('bloopInstall') }
Did you check that it does not affect IntelliJ?
It shouldn't affect IntelliJ, as IntelliJ uses its own build server and doesn't use Bloop at all.
clickhouse-core/src/testFixtures/scala/com/clickhouse/spark/base/ClickHouseSingleMixIn.scala
Outdated (resolved)
- Copied spark-3.5/clickhouse-spark to spark-4.0/clickhouse-spark
- Copied spark-3.5/clickhouse-spark-it to spark-4.0/clickhouse-spark-it
- Copied build.gradle with module name references updated to 4.0
- This is an exact copy of Spark 3.5 code before API adaptations
- No functional changes, only directory structure and module naming
- Updated AnalysisException constructor to require errorClass and messageParameters (see the sketch below)
- Updated ArrowUtils.toArrowSchema to include largeVarTypes parameter (false for compatibility)
- These are breaking API changes in Spark 4.0 that require code updates
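For illustration, a minimal sketch of an error-class-based AnalysisException call site of the kind this commit describes. The error class name and parameter map here are placeholders, not the connector's actual values:

```scala
import org.apache.spark.sql.AnalysisException

// Hedged sketch of a Spark 4.0-style call site; "INTERNAL_ERROR" and the message
// parameters are illustrative placeholders, not the connector's real error classes.
def unsupportedType(typeName: String): Nothing =
  throw new AnalysisException(
    errorClass = "INTERNAL_ERROR",
    messageParameters = Map("message" -> s"Unsupported ClickHouse type: $typeName"))
```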
- Added Spark 4.0.1 version configuration to gradle.properties
- Updated ANTLR version to 4.13.1 to match Spark 4.0 dependencies
- Added Spark 4.0 examples module to settings.gradle
- Updated build.gradle with Spark 4.0 compatibility settings
- Added BeforeAndAfterAll trait to ClickHouseSingleMixIn
- Fixed trait linearization order: BeforeAndAfterAll before ForAllTestContainer (see the sketch below)
- Added logging for container lifecycle debugging
- Resolves Spark 4.0 test compilation issues with trait conflicts
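A rough sketch of what the described trait order looks like, assuming testcontainers-scala's ForAllTestContainer and the 23.8 default image mentioned earlier in this thread; this is not the actual mixin code:

```scala
import com.dimafeng.testcontainers.{ForAllTestContainer, GenericContainer}
import org.scalatest.{BeforeAndAfterAll, Suite}

// Sketch only: listing BeforeAndAfterAll before ForAllTestContainer places the
// container trait later in the linearization, so its lifecycle handling is not
// shadowed by the suite's beforeAll/afterAll overrides.
trait ClickHouseSingleMixInSketch extends BeforeAndAfterAll with ForAllTestContainer {
  this: Suite =>

  // Assumed default image/version for the sketch.
  override val container: GenericContainer =
    GenericContainer("clickhouse/clickhouse-server:23.8", exposedPorts = Seq(8123, 9000))
}
```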
- Added Spark 4.0 with Scala 2.13 to build-and-test workflow
- Added Spark 4.0 to check-license and style workflows with Java 17
- Added Spark 4.0 to cloud and tpcds workflows
- Excluded Spark 4.0 + Scala 2.12 combinations (not supported)
- Configured Java 17 requirement for Spark 4.0 builds
- StreamingRateExample: streaming app using the rate source for continuous debugging
- SimpleBatchExample: simple batch app for basic debugging scenarios
- Comprehensive README with setup and usage instructions
- Gradle and SBT build configurations for IDE support
- Examples allow setting breakpoints in connector code during execution (see the sketch below)
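As a rough illustration of the kind of rate-source debugging app described above. The catalog class name, option key, and console sink here are assumptions for the sketch, not necessarily the example module's actual code:

```scala
import org.apache.spark.sql.SparkSession

// Minimal rate-source app sketch for stepping through connector code with a debugger.
object StreamingRateDebugSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("clickhouse-connector-debug")
      .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog") // assumed class
      .config("spark.sql.catalog.clickhouse.host", "localhost")                          // assumed option
      .getOrCreate()

    val query = spark.readStream
      .format("rate")                 // built-in rate source: generates (timestamp, value) rows
      .option("rowsPerSecond", 10)
      .load()
      .writeStream
      .format("console")              // swap for a ClickHouse write once breakpoints are set
      .start()

    query.awaitTermination()
  }
}
```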
Force-pushed from cb77141 to 631d235
- Added Logging trait to ClickHouseSingleMixIn
- Replaced all println statements with log.info() calls
- Provides structured, filterable logging for test execution
- Maintains test output visibility while enabling proper log management
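A minimal sketch of what such a Logging trait typically looks like; the connector's actual trait may differ in details:

```scala
import org.slf4j.{Logger, LoggerFactory}

// Sketch only: a thin SLF4J-backed Logging trait so tests can call log.info(...)
// instead of println(...).
trait Logging {
  @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass)
}

// Before: println(s"ClickHouse container started on port $port")
// After:  log.info(s"ClickHouse container started on port $port")
```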
Force-pushed from 631d235 to 6d449b3
- def toArrowSchema(schema: StructType, timeZoneId: String): Schema = ArrowUtils.toArrowSchema(schema, timeZoneId, true)
+ def toArrowSchema(schema: StructType, timeZoneId: String): Schema =
+   ArrowUtils.toArrowSchema(schema, timeZoneId, true, false)
Do we know how this boolean largeVarTypes = false will affect us?
largeVarTypes = false uses 32-bit offsets (max 2GB per string/binary value), which is the Spark 3.x behavior, while true would use 64-bit offsets for very large values but would likely break compatibility.
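For readability, the same shim call with the boolean arguments named. The shim package and parameter names are assumed from Spark 4.0's ArrowUtils, so this is a sketch rather than the exact source:

```scala
package org.apache.spark.sql.clickhouse // assumed shim location; ArrowUtils is private[sql]

import org.apache.arrow.vector.types.pojo.Schema
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.ArrowUtils

object ArrowShimSketch {
  // Same call as in the diff above, with the booleans named for clarity.
  // largeVarTypes = false keeps 32-bit offsets (Spark 3.x behavior); true would switch
  // string/binary columns to 64-bit-offset "large" Arrow types.
  def toArrowSchema(schema: StructType, timeZoneId: String): Schema =
    ArrowUtils.toArrowSchema(
      schema,
      timeZoneId,
      errorOnDuplicatedFieldNames = true,
      largeVarTypes = false)
}
```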
@ShimonSte
- Updated build-and-test.yml: added Java 8 & 17 matrix to run-tests-with-specific-clickhouse
- Updated cloud.yml: test all Spark 3.x versions with both Java 8 & 17
- Updated tpcds.yml: test all Spark 3.x versions with both Java 8 & 17
- Spark 4.0 continues to use Java 17 only (requires Java 11+)
- Updated artifact names to include Java version for better debugging
ANTLR 4.13.2 (required for Spark 4.0) needs Java 11+ to compile. The run-tests job already handles Java 8 testing for Spark 3.x versions. This job now only uses Java 17 to avoid ANTLR compilation errors.
gradle.properties
Outdated
  spark_33_scala_212_version=2.12.15
  spark_34_scala_212_version=2.12.17
  spark_35_scala_212_version=2.12.18
+ spark_40_scala_212_version=2.12.18
Spark 4.0 does not support Scala 2.12.
Good catch
gradle.properties
Outdated
  # Align with Apache Spark, and don't bundle them in release jar.
  commons_lang3_version=3.12.0
  commons_codec_version=1.16.0
The Apache Commons libs have good backward compatibility; let's upgrade them to align with the latest stable Spark version.
  kyuubi_version=1.9.2
  testcontainers_scala_version=0.41.2
  scalatest_version=3.2.16
  flexmark_version=0.62.2
Also suggest upgrading scalatest and flexmark to align with Spark 4.0.1.
We should keep flexmark at the current version, as newer versions do not support Java 8.
I think it is time we start talking about deprecating Java 8; I'll raise it as an issue.
Unfortunately, the big data world, and especially the Hadoop ecosystem, is still heavily stuck on Java 8, so I would suggest deferring such a deprecation until we hit significant issues.
Additional note: Spark is not a service but more like a library or programming framework that mixes in user code at runtime, so it's impossible to make aggressive upgrades and ask all users to tune all their jobs. In practice, some Spark jobs may never be able to upgrade to a new Spark version in their lifecycle.
Great point. Makes sense to keep Java as-is for now. Thanks for the context!
.github/workflows/tpcds.yml
Outdated
  # Spark 4.0 only supports Scala 2.13
  - spark: '4.0'
    scala: '2.12'
  # Spark 4.0 requires Java 11+
Suggested change:
- # Spark 4.0 requires Java 11+
+ # Spark 4.0 requires Java 17+
- Fix CI comments: Spark 4.0 requires Java 17+, not Java 11+
- Update commons-codec to 1.17.2 (align with Spark 4.0.1)
- Update scalatest to 3.2.19 (align with Spark 4.0.1)
- Update flexmark to 0.64.8 (latest stable)
Merged run-tests and run-tests-with-specific-clickhouse into a single job that tests all combinations of:
- ClickHouse versions: 25.3, 25.6, 25.7, latest
- Java versions: 8, 17
- Scala versions: 2.12, 2.13
- Spark versions: 3.3, 3.4, 3.5, 4.0

With proper exclusions:
- Spark 4.0 only supports Scala 2.13
- Spark 4.0 requires Java 17+

Total: 52 test jobs (down from 64 with exclusions)
- Spark 3.x: 16 jobs each (4 CH × 2 Java × 2 Scala)
- Spark 4.0: 4 jobs (4 CH × Java 17 × Scala 2.13)
Changed from: 25.3, 25.6, 25.7, latest
Changed to: 25.6, 25.7, 25.8, 25.9, latest

This adds testing for newer ClickHouse versions (25.8, 25.9) while maintaining coverage of recent stable releases. Total test jobs increase from 52 to 65 (5 CH versions instead of 4).
Flexmark 0.64.8 requires Java 11+ which breaks tests running with Java 8. Version 0.62.2 is the last version compatible with Java 8.
Use findProperty with fallback values in build.gradle instead of getProperty. This allows us to omit spark_40_scala_212_version from gradle.properties, since Spark 4.0 doesn't support Scala 2.12. If the property is missing, it falls back to sensible defaults:
- scala_212_version: 2.12.18
- scala_213_version: 2.13.8
pan3793 left a comment
Overall LGTM. (I didn't go through all the files, and I suppose most Scala files are just copy-pasted, maybe with a little tuning to pass compilation.)
Add Spark 4.0 Support
Overview
This PR adds support for Apache Spark 4.0 with Scala 2.13 to the ClickHouse Spark Connector.
Changes
Core Implementation
spark-4.0/ directory with connector implementation and integration tests
Compatibility