# Azure Cosmos DB Cassandra API - DataStax Spark Connector Sample

This Maven project provides samples and best practices for using the [DataStax Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector) against [Azure Cosmos DB's Cassandra API](https://docs.microsoft.com/azure/cosmos-db/cassandra-introduction).

For the purposes of providing an end-to-end sample, we've made use of an [Azure HDI Spark Cluster](https://docs.microsoft.com/azure/hdinsight/spark/apache-spark-jupyter-spark-sql) to run the spark jobs provided in the example.

All samples provided are in Scala and built with Maven.

*Note - this sample is configured against the 2.0.6 version of the spark connector.*

## Running this Sample

### Prerequisites
- A Cosmos DB account configured with the Cassandra API
- A Spark cluster
## Quick Start

Submitting spark jobs is not covered as part of this sample; please refer to Apache Spark's [documentation](https://spark.apache.org/docs/latest/submitting-applications.html).

To run this sample, configure it against your cluster (as discussed below), build the project to generate the required jar(s), and then submit the job to your spark cluster.
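For orientation, here is a minimal sketch of what such a spark job looks like with the connector's RDD API. The <code>books_ks</code>/<code>books</code> keyspace and table names are hypothetical placeholders for illustration, not objects defined by this sample:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // brings sc.cassandraTable(...) into scope

object ReadSample {
  def main(args: Array[String]): Unit = {
    // Connection settings for the Cassandra API are covered in the next section.
    val conf = new SparkConf(true).setAppName("cosmosdb-cassandra-read-sample")
    val sc = new SparkContext(conf)

    // Read a table as an RDD of CassandraRow and count the rows.
    // "books_ks" / "books" are placeholder names.
    val rows = sc.cassandraTable("books_ks", "books")
    println(s"Row count: ${rows.count()}")

    sc.stop()
  }
}
```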

## Cassandra API Connection Parameters
In order for your spark jobs to connect with Cosmos DB's Cassandra API, you must set the following configurations:

*Note - all these values can be found on the ["Connection String" blade](https://docs.microsoft.com/azure/cosmos-db/manage-account#keys) of your Cosmos DB account.*
<table class="table">
<tr><th>Property Name</th><th>Value</th></tr>
<tr>
…
</table>
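As an illustration, here is a sketch of how these values could be set on a <code>SparkConf</code>. The property names are the DataStax connector's standard connection settings; the host, username, and password shown are placeholders for what you copy from the "Connection String" blade (the usual endpoint shape is <code>ACCOUNTNAME.cassandra.cosmosdb.azure.com:10350</code>, but treat the blade as authoritative):

```scala
import org.apache.spark.SparkConf

// Placeholder values - substitute the ones from your account's
// "Connection String" blade.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "YOUR_ACCOUNT.cassandra.cosmosdb.azure.com")
  .set("spark.cassandra.connection.port", "10350")
  .set("spark.cassandra.connection.ssl.enabled", "true")
  .set("spark.cassandra.auth.username", "YOUR_ACCOUNT")
  .set("spark.cassandra.auth.password", "YOUR_PRIMARY_PASSWORD")
```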

## Configurations for Throughput Optimization

Because Cosmos DB follows a provisioned throughput model, it is important to tune the relevant configurations of the connector to optimize for this model.
General information regarding these configurations can be found on the [Configuration Reference](https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md) page of the DataStax Spark Cassandra Connector GitHub repository.

<table class="table">
<tr><th>Property Name</th><th>Description</th></tr>
…
</table>

Regarding throughput and degree of parallelism, it is important to tune the relevant parameters based on the amount of load you expect from your upstream/downstream flows, the executors provisioned for your spark jobs, and the throughput you have provisioned for your Cosmos DB account.
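Continuing the sketch above, a few of the connector's tuning knobs from the linked reference, with illustrative values only - the right numbers depend on your provisioned RUs, executor count, and workload shape:

```scala
// Illustrative values only - tune against your provisioned throughput.
val tunedConf = conf
  .set("spark.cassandra.output.batch.size.rows", "1")                  // rows per write batch
  .set("spark.cassandra.output.concurrent.writes", "100")              // in-flight batches per task
  .set("spark.cassandra.connection.connections_per_executor_max", "10") // connection pool cap
  .set("spark.cassandra.input.split.size_in_mb", "64")                 // approx. Spark partition size
  .set("spark.cassandra.input.fetch.size_in_rows", "1000")             // rows fetched per round trip
```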

## Connection Factory Configuration and Retry Policy

As part of this sample, we have provided a connection factory and a custom retry policy for Cosmos DB. A custom connection factory is needed because it is the only way to configure a retry policy on the connector - [SPARKC-437](https://datastax-oss.atlassian.net/browse/SPARKC-437).

* <code>CosmosDbConnectionFactory.scala</code>
* <code>CosmosDbMultipleRetryPolicy.scala</code>
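To wire the factory in, the connector's standard <code>spark.cassandra.connection.factory</code> property can point at it by fully-qualified name; a sketch, where the package shown is an assumption rather than this project's actual layout:

```scala
// The connector looks the factory up by fully-qualified name. The package
// below is hypothetical - match it to where CosmosDbConnectionFactory lives.
val factoryConf = conf.set(
  "spark.cassandra.connection.factory",
  "com.microsoft.azure.cosmosdb.cassandra.CosmosDbConnectionFactory")
```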

### Retry Policy

The retry policy for Cosmos DB is configured to handle HTTP status code 429 ("Request Rate Too Large") exceptions. The Cosmos DB Cassandra API translates these exceptions to overloaded errors on the Cassandra native protocol, which we want to retry with back-offs.
Because Cosmos DB follows a provisioned throughput model, this retry policy protects your spark jobs against spikes of data ingress/egress that momentarily exceed the allocated throughput for your collection, which would otherwise surface as request rate limiting exceptions.

*Note - this retry policy is only meant to protect your spark jobs against momentary spikes. If you have not provisioned enough RUs on your collection for the intended throughput of your workload, so that the retries cannot catch up, the retry policy will end in rethrows.*
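The general shape of the idea - not the actual <code>CosmosDbMultipleRetryPolicy</code> code, which plugs into the driver's retry-policy interface - is a bounded retry with growing back-off, roughly:

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Generic sketch of retry-with-growing-back-off. The real policy retries
// only overloaded errors (the 429 translation described above); this sketch
// retries any failure purely for illustration.
@tailrec
def withBackoff[T](retriesLeft: Int, delayMs: Long)(op: => T): T =
  Try(op) match {
    case Success(result) => result
    case Failure(_) if retriesLeft > 0 =>
      Thread.sleep(delayMs)                         // back off before retrying
      withBackoff(retriesLeft - 1, delayMs * 2)(op) // double the delay each time
    case Failure(e) => throw e                      // out of retries: rethrow
  }
```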

## Known Issues

### Left Join API on CassandraTable

Currently, there is an open bug when using <code>leftJoinWithCassandraTable</code> on <code>cassandraTable</code>. This will be addressed soon, but in the meantime, please avoid using this API in the read paths of your spark jobs.

### Tokens and Token Range Filters

We do not currently support methods that make use of tokens for filtering data. Hence, please avoid using any APIs that perform table scans.

## Resources

- [DataStax Spark Cassandra Connector](https://github.com/datastax/spark-cassandra-connector)
- [CosmosDB Cassandra API](https://docs.microsoft.com/en-us/azure/cosmos-db/cassandra-introduction)
- [Apache Spark](https://spark.apache.org/docs/latest/index.html)
- [HDI Spark Cluster](https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql)
