
Conversation

@ShimonSte (Collaborator) commented Nov 16, 2025

Add Spark 4.0 Support

Overview

This PR adds support for Apache Spark 4.0 with Scala 2.13 to the ClickHouse Spark Connector.

Changes

Core Implementation

  • Spark 4.0 Module: Added spark-4.0/ directory with connector implementation and integration tests
  • Scala 2.13: Full compatibility with Scala 2.13 (Spark 4.0 requirement)
  • Java 17: Updated build configuration to use Java 17 for Spark 4.0

Compatibility

  • Spark 4.0.x with Scala 2.13
  • Java 17 (minimum requirement for Spark 4.0)
  • Maintains backward compatibility with Spark 3.3, 3.4, and 3.5
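For reference, here is a minimal usage sketch on Spark 4.0 with Scala 2.13 (not part of the PR diff; the catalog class and option keys follow the existing connector documentation, and host, port, and credentials are placeholders):

import org.apache.spark.sql.SparkSession

object Spark40QuickStart {
  def main(args: Array[String]): Unit = {
    // Register the ClickHouse catalog; host/port/credentials below are placeholders.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("clickhouse-spark-4.0-smoke-test")
      .config("spark.sql.catalog.clickhouse", "com.clickhouse.spark.ClickHouseCatalog")
      .config("spark.sql.catalog.clickhouse.host", "localhost")
      .config("spark.sql.catalog.clickhouse.protocol", "http")
      .config("spark.sql.catalog.clickhouse.http_port", "8123")
      .config("spark.sql.catalog.clickhouse.user", "default")
      .config("spark.sql.catalog.clickhouse.password", "")
      .config("spark.sql.catalog.clickhouse.database", "default")
      .getOrCreate()

    // Simple round trip through the catalog to confirm the connector is on the classpath.
    spark.sql("SHOW TABLES IN clickhouse.default").show()
    spark.stop()
  }
}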

windsurf-bot bot commented Nov 16, 2025

This PR has too many files to review (>50 files).

@ShimonSte ShimonSte marked this pull request as draft November 16, 2025 14:32
@ShimonSte ShimonSte marked this pull request as ready for review November 16, 2025 14:33

@mzitnik (Collaborator) left a comment

Since the PR is huge, let's have a 1:1 to go over it, but here are a few comments for now:

with:
distribution: zulu
java-version: 8
java-version: 17
Collaborator

I think we can merge run-tests and run-tests-with-specific-clickhouse into one job. Do you know which ClickHouse version run-tests uses?

Collaborator Author

run-tests runs against ClickHouse 23.8 (the default version in the ClickHouseSingleMixIn file).
We could merge them into one job, but then we would need to update the matrix to run all tests against all ClickHouse versions. That might actually be a good idea, but it would create a lot of new test jobs; maybe we should make that change only once the tests are fixed so that they can run concurrently.

Collaborator

In general, this test runs against a local ClickHouse, not the cloud, so it does not have the concurrency issue.

scala: [ 2.12, 2.13 ]
spark: [ '3.3', '3.4', '3.5', '4.0' ]
scala: [ '2.12', '2.13' ]
java: [ 8 ]
Collaborator

Why not java: [ 8, 17 ]?

Collaborator Author

Again, this would cause many more variants of the tests to run (today they only run on Java 8).

Collaborator

I'm OK with that, since we might have a feature that isn't supported on Java 8 and could break, so it's better to test both.

scala: [ 2.12, 2.13 ]
spark: [ '3.3', '3.4', '3.5', '4.0' ]
scala: [ '2.12', '2.13' ]
java: [ 8 ]
Collaborator

Same here: java: [ 8, 17 ]

Collaborator Author

Same as previous comment

apply plugin: "java-library"
apply plugin: "org.scoverage"
// Disable scoverage when running Metals' bloopInstall to avoid plugin resolution issues
def isBloopInstall = gradle.startParameter.taskNames.any { it.contains('bloopInstall') }
Collaborator

Did you check that it does not affect IntelliJ?

Collaborator Author

It shouldn't be affected, as IntelliJ uses its own build server and doesn't use Bloop at all.

- Copied spark-3.5/clickhouse-spark to spark-4.0/clickhouse-spark
- Copied spark-3.5/clickhouse-spark-it to spark-4.0/clickhouse-spark-it
- Copied build.gradle with module name references updated to 4.0
- This is an exact copy of Spark 3.5 code before API adaptations
- No functional changes, only directory structure and module naming
- Updated AnalysisException construction to the errorClass/messageParameters form required by Spark 4.0 (see the sketch after this list)
- Updated ArrowUtils.toArrowSchema to include largeVarTypes parameter (false for compatibility)
- These are breaking API changes in Spark 4.0 that require code updates
- Added Spark 4.0.1 version configuration to gradle.properties
- Updated ANTLR version to 4.13.1 to match Spark 4.0 dependencies
- Added Spark 4.0 examples module to settings.gradle
- Updated build.gradle with Spark 4.0 compatibility settings
- Added BeforeAndAfterAll trait to ClickHouseSingleMixIn
- Fixed trait linearization order: BeforeAndAfterAll before ForAllTestContainer
- Added logging for container lifecycle debugging
- Resolves Spark 4.0 test compilation issues with trait conflicts
- Added Spark 4.0 with Scala 2.13 to build-and-test workflow
- Added Spark 4.0 to check-license and style workflows with Java 17
- Added Spark 4.0 to cloud and tpcds workflows
- Excluded Spark 4.0 + Scala 2.12 combinations (not supported)
- Configured Java 17 requirement for Spark 4.0 builds
- StreamingRateExample: Streaming app using rate source for continuous debugging
- SimpleBatchExample: Simple batch app for basic debugging scenarios
- Comprehensive README with setup and usage instructions
- Gradle and SBT build configurations for IDE support
- Examples allow setting breakpoints in connector code during execution
- Added Logging trait to ClickHouseSingleMixIn
- Replaced all println statements with log.info() calls
- Provides structured, filterable logging for test execution
- Maintains test output visibility while enabling proper log management
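For context on the AnalysisException item above, here is a minimal before/after sketch of the Spark 4.0 breaking change (the error class and the helper name are illustrative, not necessarily what this PR uses):

import org.apache.spark.sql.AnalysisException

object AnalysisExceptionCompat {
  // Spark 3.x allowed constructing the exception from a plain message string:
  //   throw new AnalysisException(s"Unknown ClickHouse table: $name")
  // Spark 4.0 requires an error class plus message parameters instead.
  def unknownTable(name: String): Nothing =
    throw new AnalysisException(
      errorClass = "INTERNAL_ERROR", // illustrative error class
      messageParameters = Map("message" -> s"Unknown ClickHouse table: $name")
    )
}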

def toArrowSchema(schema: StructType, timeZoneId: String): Schema = ArrowUtils.toArrowSchema(schema, timeZoneId, true)
def toArrowSchema(schema: StructType, timeZoneId: String): Schema =
  ArrowUtils.toArrowSchema(schema, timeZoneId, true, false)

Collaborator Author

largeVarTypes = false uses 32-bit offsets (max 2 GB per string/binary value), which is the Spark 3.x behavior; true would use 64-bit offsets for very large values but would likely break compatibility.
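To spell that out, here is a minimal sketch of the Spark 4.0 call with both flags annotated (the wrapper object and package placement are illustrative; ArrowUtils is Spark-internal, so the real code lives in a package that can access it):

package org.apache.spark.sql.clickhouse // a Spark sub-package, so the internal ArrowUtils is accessible

import org.apache.arrow.vector.types.pojo.Schema
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.ArrowUtils

object ArrowSchemaCompat {
  def toArrowSchema(schema: StructType, timeZoneId: String): Schema =
    // 3rd arg: errorOnDuplicatedFieldNames = true, same as the existing Spark 3.x call path.
    // 4th arg: largeVarTypes = false, i.e. 32-bit offsets (at most 2 GB per string/binary value),
    // matching Spark 3.x behavior; true would switch to 64-bit offsets.
    ArrowUtils.toArrowSchema(schema, timeZoneId, true, false)
}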

@mzitnik (Collaborator) commented Nov 17, 2025

@ShimonSte
Looks like we are missing TPC-DS tests for Spark 4.0.

- Updated build-and-test.yml: Added Java 8 & 17 matrix to run-tests-with-specific-clickhouse
- Updated cloud.yml: Test all Spark 3.x versions with both Java 8 & 17
- Updated tpcds.yml: Test all Spark 3.x versions with both Java 8 & 17
- Spark 4.0 continues to use Java 17 only (requires Java 11+)
- Updated artifact names to include Java version for better debugging
ANTLR 4.13.2 (required for Spark 4.0) needs Java 11+ to compile.
The run-tests job already handles Java 8 testing for Spark 3.x versions.
This job now only uses Java 17 to avoid ANTLR compilation errors.
spark_33_scala_212_version=2.12.15
spark_34_scala_212_version=2.12.17
spark_35_scala_212_version=2.12.18
spark_40_scala_212_version=2.12.18
Collaborator

Spark 4.0 does not support Scala 2.12.

Collaborator Author

Good catch


# Align with Apache Spark, and don't bundle them in release jar.
commons_lang3_version=3.12.0
commons_codec_version=1.16.0
Collaborator

The Apache Commons libs have good backward compatibility; let's upgrade them to align with the latest stable Spark version.

kyuubi_version=1.9.2
testcontainers_scala_version=0.41.2
scalatest_version=3.2.16
flexmark_version=0.62.2
Collaborator

Also suggest upgrading scalatest and flexmark to align with Spark 4.0.1.

Collaborator Author

We should keep flexmark at the current version, as newer versions do not support Java 8.
I think it is time we start talking about deprecating Java 8; I'll raise it as an issue.

Collaborator

Unfortunately, the big data world, and especially the Hadoop ecosystem, is still heavily stuck on Java 8, so I would suggest deferring such a deprecation until we hit significant issues.

Additional note: Spark is not a service, but more like a library or programming framework that mixes in user code at runtime, so it's impossible to make aggressive upgrades and ask all users to re-tune all their jobs. In practice, some Spark jobs may never be able to upgrade to a new Spark version during their lifecycle.

Collaborator Author

Great point. Makes sense to keep Java as-is for now. Thanks for the context!

# Spark 4.0 only supports Scala 2.13
- spark: '4.0'
scala: '2.12'
# Spark 4.0 requires Java 11+
Collaborator

Suggested change:
- # Spark 4.0 requires Java 11+
+ # Spark 4.0 requires Java 17+

- Fix CI comments: Spark 4.0 requires Java 17+, not Java 11+
- Update commons-codec to 1.17.2 (align with Spark 4.0.1)
- Update scalatest to 3.2.19 (align with Spark 4.0.1)
- Update flexmark to 0.64.8 (latest stable)
Merged run-tests and run-tests-with-specific-clickhouse into a single job
that tests all combinations of:
- ClickHouse versions: 25.3, 25.6, 25.7, latest
- Java versions: 8, 17
- Scala versions: 2.12, 2.13
- Spark versions: 3.3, 3.4, 3.5, 4.0

With proper exclusions:
- Spark 4.0 only supports Scala 2.13
- Spark 4.0 requires Java 17+

Total: 52 test jobs (down from 64 with exclusions)
- Spark 3.x: 16 jobs each (4 CH × 2 Java × 2 Scala)
- Spark 4.0: 4 jobs (4 CH × Java 17 × Scala 2.13)
Changed from: 25.3, 25.6, 25.7, latest
Changed to: 25.6, 25.7, 25.8, 25.9, latest

This adds testing for newer ClickHouse versions (25.8, 25.9) while
maintaining coverage of recent stable releases. Total test jobs
increases from 52 to 65 (5 CH versions instead of 4).
Flexmark 0.64.8 requires Java 11+ which breaks tests running with Java 8.
Version 0.62.2 is the last version compatible with Java 8.
Use findProperty with fallback values in build.gradle instead of
getProperty. This allows us to omit spark_40_scala_212_version from
gradle.properties since Spark 4.0 doesn't support Scala 2.12.

If the property is missing, it falls back to sensible defaults:
- scala_212_version: 2.12.18
- scala_213_version: 2.13.8
@ShimonSte ShimonSte requested review from mzitnik and pan3793 November 20, 2025 06:30
@pan3793 (Collaborator) left a comment

Overall LGTM. (I didn't go through all the files, and I suppose most Scala files are just copy-pasted, maybe with a little tuning to pass compilation.)

@ShimonSte ShimonSte merged commit 7f00753 into main Nov 20, 2025
244 of 294 checks passed
@ShimonSte ShimonSte deleted the spark-4-support branch November 20, 2025 06:58