Seamlessly transition your Spark applications between Cloud and OnPrem using IBM Analytics Engine

Mrudula Madiraju
8 min read · May 3, 2024

Depending on the needs of your organization and use case, you may need to run your Spark workloads on a public cloud managed service or on a private on-prem environment. This article shows you how you can run the same Spark applications in either form factor without a drastic context switch. With the recently revamped UI experience, the UX is even more unified for the IBM Analytics Engine service. A few minor differences between Public Cloud (cloud.ibm.com) and CPD (IBM Cloud Pak for Data) are called out to help you navigate as you switch between the two worlds as required.

Creating an Analytics Engine instance

You begin by first creating an Analytics Engine service instance. An instance is a logical representation of your Spark runtime environment. It stores the metadata for the Spark environment, such as the Spark configurations, the Spark runtime version (3.3, 3.4), the instance name and so on.

You can choose the Analytics Engine service from the Catalog in IBM Cloud and from the Services Catalog on the CPD cluster.

IBM Cloud Services Catalog vs CPD Service Catalog

Create an Analytics Engine instance in CPD in a few selections:

On the CPD cluster, you can create an instance with a few simple selections: the name you want to give the instance, the namespace you want the instance to be in, and the “home volume” storage to be associated with the instance.
The storage is a standard CPD volume from one of the storage classes enabled on the CPD cluster at the time of installation. You can either choose an existing volume or create a new one. It is used as the storage area for the instance: the Spark events generated by your Spark applications, the application logs of your Spark applications, and so on.

Instance Creation in CPD — Specify the Name, Namespace and Storage

Create Analytics Engine in IBM Cloud in three selections:

In public cloud, the experience is comparable. Here you need to choose a region/location where you want the instance to be created. In this world, your instance home is an IBM Cloud COS bucket (the bucket is created for you by AE with the name “do-not-delete-<instance-id>”).
When you go through the UI, existing IBM Cloud COS service instances in the same account are automatically discovered. If you want to associate a bucket from some other COS instance, you can do that as well through an API after the instance is created.
Additionally, you can choose a Default Spark Runtime (Apache Spark 3.3 or 3.4, etc.). This means that all applications that run against this instance will run on that version of Spark unless overridden at application submission.
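
If you want one particular application to run on a Spark version other than the instance default, the override goes into the submit payload itself. A minimal sketch, assuming the application API accepts a runtime.spark_version field inside application_details (worth confirming against the current Analytics Engine API reference):

{
  "application_details": {
    "application": "cos://<<BUCKET-CHANGEME>>.mycosservice/test-cos.py",
    "runtime": {
      "spark_version": "3.4"
    }
  }
}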

Choose location, instance name, Spark Default Version in IBM Public Cloud to create instance

Instance Details

Instances List from the Dashboard

Once you have created your instance(s), you can see the list of instances from the Resource List in public cloud and under Service -> Instances on the CPD cluster.

IBM Cloud Resource List vs CPD Instances List

Instance Details Page

The instance details page is a summary of the various parameters and configurations of the instance. From this page you can edit the Default Runtime and the Default Spark Configurations that you want all applications submitted against this instance to inherit (unless overridden at the application level). In CPD, this page shows all of the service endpoints; in Public Cloud, the Service Credentials tab of the instance gives you the endpoints.
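
If you prefer the API over the UI, the same summary can be fetched programmatically. A sketch for the public cloud side, assuming the v3 GET instance endpoint and an IAM bearer token in $token (adjust the regional hostname to match your instance):

curl -X GET https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/$instance_id \
  -H "Authorization: Bearer $token"

The response carries details such as the instance state, the default runtime and the instance home.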

Service Details Page

Submitting a Sample Application

Sample Application (PySpark)

Here’s my favourite starter Spark application: it takes some data, creates a dataframe out of it, saves it to a COS bucket and then reads it back. Note that the code refers to a bucket; in my case, my bucket is matrix2.

Note the URI in the code: cos://matrix2.mycosservice/players.csv. The literal mycosservice is used in the configuration for specifying the bucket access credentials. You can use any literal name here; it is not the COS instance name. For example, it can be mytestcos, devcos, etc. But what you specify here MUST match the conf you specify while submitting the application. We’ll see that later below.

from pyspark.sql import SparkSession

def init_spark():
    spark = SparkSession.builder.appName("demo-cos").getOrCreate()
    sc = spark.sparkContext
    return spark, sc


def main():
    spark, sc = init_spark()

    data = [('Cristiano', 'Ronaldo', 10000),
            ('Sadio', 'Mane', 8000),
            ('Karim', 'Benzema', 8000),
            ('Mohamed', 'Salah', 6000),
            ('Harry', 'Kane', 4000)
            ]
    columns = ["firstname", "lastname", "salary"]
    playersDF = spark.createDataFrame(data=data, schema=columns)
    playersDF.show()

    # First write to COS
    playersDF.write.option("header", True).mode("overwrite").csv("cos://matrix2.mycosservice/players.csv")

    # Then read from COS
    allPlayersDF = spark.read.option("header", True).csv("cos://matrix2.mycosservice/players.csv")
    lowplayersDF = allPlayersDF.filter(allPlayersDF.salary < 5000)
    lowplayersDF.show()

if __name__ == '__main__':
    main()

Upload Application to IBM Cloud COS

In the case of public cloud, you have to upload your application to a bucket. Here again I have uploaded it to the same bucket, matrix2. Later you will see how we submit the application from this location.
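
Any of the usual COS upload paths work: the console UI, the S3-compatible API, or the IBM Cloud CLI. A sketch with the CLI, assuming the cloud-object-storage plugin is installed and configured, and that matrix2 is the target bucket:

# upload the PySpark file into the bucket the application will be submitted from
ibmcloud cos upload --bucket matrix2 --key test-cos.py --file ./test-cos.py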

Upload Application to a Storage Volume

From CPD you can execute an application that is uploaded to IBM COS. You can also upload the application to a CPD volume and run the application from that location. For this you need to create a new Storage Volume.

Note!! You cannot use the home storage volume to upload your applications.

Storage Volumes can be created and managed under Administration. You can upload files into the volume just like you can in the IBM COS world.

Note that while creating the volume, the mount point specified is myapps which we will use later.

Submit Applications

Submit application in IBM Cloud

curl -v -X POST https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/$instance_id/spark_applications \
--header "Authorization: Bearer $token" -H "content-type: application/json" \
-d @submit.json

where submit.json is:

{
  "application_details": {
    "application": "cos://<<BUCKET-CHANGEME>>.mycosservice/test-cos.py",
    "conf": {
      "spark.hadoop.fs.cos.mycosservice.endpoint": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
      "spark.hadoop.fs.cos.mycosservice.access.key": "accesskey",
      "spark.hadoop.fs.cos.mycosservice.secret.key": "secretkey"
    }
  }
}

In this example, I have my instance in us-south (Dallas), hence the hostname api.us-south.ae.cloud.ibm.com. You have to change it according to your region.
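
As for $instance_id, it is the GUID of the Analytics Engine service instance. One way to look it up, assuming you use the IBM Cloud CLI and your instance is named my-ae-instance (a placeholder name):

# prints the service instance details, including its GUID
ibmcloud resource service-instance my-ae-instance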

Submit Application in CPD

curl -k -X POST https://<<CPD_CLUSTER_HOST>>/v4/analytics_engines/1d970398-4a6e-4a4f-a15c-9dade81a6ded/spark_applications \
-H "Authorization: Bearer $token" \
-d @submit.json

where submit.json is:

{
  "application_details": {
    "application": "/mnts/myapps/test-cos.py",
    "arguments": [
      ""
    ],
    "conf": {
      "spark.hadoop.fs.cos.mycosservice.endpoint": "https://s3.us-south.cloud-object-storage.appdomain.cloud",
      "spark.hadoop.fs.cos.mycosservice.access.key": "accesskey",
      "spark.hadoop.fs.cos.mycosservice.secret.key": "secretkey"
    }
  },
  "volumes": [
    {
      "name": "cpd-instance::mrmadira-volume",
      "mount_path": "/mnts/myapps"
    }
  ]
}

A few points to note:

  • Note that we have used the mycosservice literal to specify the credentials and endpoint of the bucket. This MUST tie in with the URI of the object referenced from the code or the application path.
  • The Bearer token that you generate and use in public cloud vs CPD will be very different. Refer to the documentation for each of the methods; in public cloud it will be IAM based, while in CPD it will vary depending on the type of Identity Provider set up at the time of installation (LDAP, IAM, etc.). A sketch of both token flows follows this list.
  • Also note the slight differences in the application payload when you run the application from a volume in the case of CPD. You have to refer to the volume and mount path in the application parameter and also reference the volume that is to be mounted.
  • If you want to run the application from IBM COS even in CPD, you can use the exact same payload for both. Note that in the case of CPD you have to use the public COS endpoint (otherwise it cannot connect), while in the public cloud payload you should use the direct endpoint to COS (for better efficiency and performance).
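
To make the token difference concrete, here is a sketch of both flows. The IAM endpoint is the standard public cloud one; the CPD route and payload depend on your installation, so treat that second call as an assumption to verify against your cluster's documentation:

# Public cloud: exchange an IBM Cloud API key for an IAM bearer token
curl -X POST "https://iam.cloud.ibm.com/identity/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=<API_KEY>"

# CPD: obtain a platform bearer token (route and payload may differ with your identity provider)
curl -k -X POST "https://<<CPD_CLUSTER_HOST>>/icp4d-api/v1/authorize" \
  -H "Content-Type: application/json" \
  -d '{"username": "<USER>", "api_key": "<CPD_API_KEY>"}'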

Applications tab

From the Applications tab you can see the list of applications that have been run so far, the state and time taken.

You can also view the Spark UI of a running application (the link is clickable as long as the application is running).

Knowing the details of a submitted application

Once an application is submitted, you get back an application ID. You can then use that ID to find out the state and other details of the application. The APIs are very similar except for the minor differences you can see below.

curl -X GET https://<<CPD_CLUSTER_HOST>>/v4/analytics_engines/$instance_id/spark_applications/$application_id -H "Authorization: Bearer $token"
curl -X GET https://<<CLOUD_END_POINT>>/v3/analytics_engines/$instance_id/spark_applications/$application_id -H "Authorization: Bearer $token"
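
If you only need the state, a quick way to pull it out of the response, assuming the response JSON carries a top-level "state" field (worth confirming against the API reference for your form factor):

curl -s -X GET https://<<CLOUD_END_POINT>>/v3/analytics_engines/$instance_id/spark_applications/$application_id \
  -H "Authorization: Bearer $token" | grep -o '"state": *"[^"]*"'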

Looking up Logs of Applications

For any kind of troubleshooting and tracing you need to see the logs of the application (the Spark driver and executor logs, and/or the output of the print()/show() statements in your code). For all of this, you have to go to the “instance home”: the automatically created bucket in the case of public cloud, or the home storage volume in the case of CPD. Within that, navigate to the <application id>/logs folder, download the driver log file and see its contents.
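
On the public cloud side, the driver log can also be pulled down with the CLI instead of the console. A sketch, assuming the logs sit under an <application-id>/logs/ prefix in the instance home bucket (list the prefix first rather than guessing the object key):

# list the log objects for the application to find the driver log's key
ibmcloud cos list-objects --bucket do-not-delete-<instance-id> --prefix <application-id>/logs/
# then download it, replacing <driver-log-key> with the key found above
ibmcloud cos download --bucket do-not-delete-<instance-id> --key <driver-log-key> driver.log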

For example — for the application that was run in this blog — the logs would look like this:

Spark History

In both worlds, you can start the Spark History server and further analyze your historical applications to see the Spark execution plan and identify bottlenecks, straggler tasks and so on.
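
In public cloud this can also be driven through the API. A sketch, assuming the v3 Spark history server endpoint (the CPD route is different, so check the CPD documentation for its equivalent):

# start the Spark history server for the instance
curl -X POST https://api.us-south.ae.cloud.ibm.com/v3/analytics_engines/$instance_id/spark_history_server \
  -H "Authorization: Bearer $token"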

Summary

This article gives an overview comparison of the application submission and tracking experience between the public and private cloud forms of IBM Analytics Engine. More advanced topics like customizations will be the subject of another article for another day. Good day!

