Big Data
- What is Big Data
- Why Big Data came into the picture
- What is Hadoop
- Why Spark came into the picture
- Limitations in Hadoop and RDBMS
Hadoop & Spark installation on Ubuntu (hands-on)
- Create Hadoop & Scala environment in IntelliJ
- Run sample Scala program in IntelliJ
Scala Basics
- Variables, Strings & Numbers
- Arrays, lists, tuples, type hierarchy
- Scala: Expressions and Conditionals
- For loops, match, if/else
- Functions & Objects, class methods
- HDFS: Responsibilities of NameNode and DataNode
- How HDFS replicates data
- Read/Write data from HDFS/local
- NameNode and Application Manager internals
- Power of YARN
- How the Resource Manager functions
- Node Manager responsibilities
- How the Application Master works
- How YARN communicates with HDFS
- Spark on YARN
- How Spark runs on YARN
- What is Mesos?
- Power of containers & Executors
- In-memory concept
AWS Intro
- EC2 creation
- Hortonworks installation on EC2
- Cloudera installation on EC2
- Images, Windows and Linux servers
- EC2 Auto Scaling
AWS RDS
- Create and insert data in Oracle, MySQL, MSSQL, and PostgreSQL databases
- Sqoop import export examples
AWS IAM
- Users
- Groups
- Roles
- Policies
- S3 CLI commands
- S3 bucket privileges
- EMR
- Create multi node cluster
Redshift & Data Pipeline
- Create and process large amounts of data
- Get data from Oracle to Redshift
- Get data from S3 to Redshift
AWS Glue and Athena
- PySpark script in Glue
- Scala Spark script in Glue
- Hive script in Athena
Sqoop Introduction
- Import data from Oracle, MySQL, MSSQL
- Store data in Hive
- Delimiter change
- Incremental data load
- Performance tuning
- Automate Sqoop using a shell script
Sqoop export
- Problems when exporting
- Clean data
Hive Introduction
- Create tables for CSV and JSON data; SerDes; process ORC and Parquet datasets
- Hive Partition, bucketing, advanced techniques
Introduction: Why Spark?
- What is RDD?
- RDD properties
- Spark Architecture
- Why Spark is fast
- Key-Value Pair RDDs
- DAG
- RDD operations (transformations & actions)
- RDD advanced topics (debugging, web UI)
- Most frequently used spark functions
- Processing data easily with RDDs
- RDD to DataFrame
- Spark SQL
- Ways to create DataFrames
- Case classes
- Process CSV data using RDD
- Process sample JSON & complex JSON data
- Process XML, Avro, Parquet, ORC, and Hive data using Spark
Process different types of datasets
- Create a JAR and submit API
- DataFrame operations
- Memory management and Catalyst optimizer internals
- Spark Cassandra integration
- In dev and AWS environments
Introduction to AWS
- IAM & EMR: how to create and practice Spark in EMR
- How to submit a job in EMR
- Spark HBase Phoenix integration
- Dataset API
- Power of Decoder
- Serialization concept in Dataset
- DAG scheduler
- Memory management in Spark
- Web UI & debugging
- Spark Streaming architecture
- DStreams & micro batching
- Batch vs Streaming
- Kafka introduction
- How Kafka works
- Spark & Kafka integration
- Spark Kafka NiFi integration
- Spark Structured Streaming introduction
- Spark Structured Streaming with Kafka
Optional Training
- Flink introduction
- Flink table API
- Flink streaming
- Spark Overview Training Curriculum - Confidential
- Cloudera certification and AWS certification tips
- How to practice Hortonworks, Cloudera, and Databricks
Commercials