Spark Scala Scaffold

Overview

This repository provides a scaffold for Apache Spark applications written in Scala. It includes a structured project setup with essential configurations, making it easier to kickstart Spark-based data processing projects.

Prerequisites

This setup is built using:

  • Java 11
  • Scala 2.13.16
  • sbt 1.9.8
  • Spark 3.5.4

Features

  • Pre-configured SBT build system
  • Sample Spark job implementation
  • Converts Parquet to CSV (see the sketch after this list)
  • Organized project structure
  • Logging setup using SLF4J
  • Easy-to-run example with SparkSession
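
For example, the heart of the Parquet-to-CSV job can be sketched as follows. This is a minimal illustration assuming local "data" and "output" paths; the actual logic lives in Analysis.scala and SparkExampleMain.scala.

import org.apache.spark.sql.SparkSession

object ParquetToCsvSketch {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession; the real app takes these settings from application.conf
    val spark = SparkSession.builder()
      .appName("Spark Scala Basic Setup App")
      .master("local[*]")
      .getOrCreate()

    // Read Parquet input and write it back out as CSV with a header row
    val df = spark.read.parquet("data")
    df.write.option("header", "true").csv("output")

    spark.stop()
  }
}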

Installation & Setup

Clone the repository and navigate to the project directory:

$ git clone https://github.com/Nasruddin/spark-scala-scaffold.git
$ cd spark-scala-scaffold

Building the Project

Compile and package the application using SBT:

$ sbt clean compile
$ sbt package

Running the Application

Run the application locally using SBT:

$ sbt "run --input-path data --output-path output"

Or submit the packaged JAR to a Spark cluster:

$ spark-submit --class com.example.sparktutorial.SparkExampleMain \
  --master local[*] \
  target/scala-2.13/spark-scala-scaffold_2.13-0.1.jar \
  --input-path data --output-path output

Project Structure

spark-scala-scaffold/
├── src/
│   ├── main/
│   │   ├── scala/com/example/
│   │   │   ├── sparktutorial/
│   │   │   │   ├── Analysis.scala
│   │   │   │   ├── Configuration.scala
│   │   │   │   ├── package.scala
│   │   │   │   ├── SparkExampleMain.scala
│   │   ├── resources/
│   │   │   ├── application.conf
├── build.sbt
├── README.md
├── project/
├── target/

  • Analysis.scala - Contains the main transformation logic for the Spark application.
  • Configuration.scala - Handles the application configuration using Typesafe Config.
  • package.scala - Contains utility functions and implicit values for the Spark session (see the sketch after this list).
  • SparkExampleMain.scala - Entry point for running Spark examples.
  • build.sbt - Project dependencies and build configuration.
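
As a rough illustration, package.scala could expose a shared implicit SparkSession along these lines. The settings below are assumptions for the sketch, not the repository's exact code.

package com.example

package object sparktutorial {
  import org.apache.spark.sql.SparkSession

  // Hypothetical shared session, made implicitly available to the rest of the package
  implicit lazy val spark: SparkSession = SparkSession.builder()
    .appName("Spark Scala Basic Setup App")
    .master("local[*]")
    .getOrCreate()
}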

Configuration

The configuration is managed using Typesafe Config. The configuration file application.conf should be placed in the resources directory.

Example application.conf

default {
  appName = "Spark Scala Basic Setup App"
  spark {
    settings {
      spark.master = "local[*]"
      spark.app.name = ${default.appName}
    }
  }
}
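
A minimal sketch of how these settings could be loaded and applied to the SparkSession builder, assuming the default.spark.settings layout above (the object and method names are illustrative, not necessarily what Configuration.scala does):

import com.typesafe.config.ConfigFactory
import org.apache.spark.sql.SparkSession
import scala.jdk.CollectionConverters._

object ConfigSketch {
  def buildSession(): SparkSession = {
    // Load application.conf from the classpath and pick out the settings block
    val settings = ConfigFactory.load().getConfig("default.spark.settings")

    // Apply every key under default.spark.settings to the SparkSession builder
    val builder = SparkSession.builder()
    settings.entrySet().asScala.foreach { entry =>
      builder.config(entry.getKey, settings.getString(entry.getKey))
    }
    builder.getOrCreate()
  }
}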

Dependencies

Add the following dependencies to your build.sbt file:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.5.4",
  "org.apache.spark" %% "spark-sql" % "3.5.4",
  "com.typesafe" % "config" % "1.4.1"
)
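
For reference, a minimal build.sbt consistent with the versions listed under Prerequisites might look like this (the project name and version shown are assumptions; check the repository's build.sbt for the exact values):

ThisBuild / scalaVersion := "2.13.16"

lazy val root = (project in file("."))
  .settings(
    name := "spark-scala-scaffold",
    version := "0.1",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "3.5.4",
      "org.apache.spark" %% "spark-sql" % "3.5.4",
      "com.typesafe" % "config" % "1.4.1"
    )
  )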

Contributing

Contributions are welcome! Feel free to submit issues or pull requests.
