Overview
- Responsibilities
- Involved in the analysis, design, and implementation of business user requirements, translating them into technical specifications
- Implemented ETL processes to transform and cleanse data as it moved between MySQL and NoSQL databases
- Leveraged PySpark's capabilities for data manipulation, aggregation, and filtering to prepare data for further processing
- Joined, manipulated, and drew actionable insights from large data sources using Python and SQL
- Developed PySpark ETL pipelines to cleanse, transform, and enrich raw data (an illustrative PySpark sketch follows the Environment line)
- Ingested large data streams from company REST APIs into an EMR cluster through Amazon Kinesis
- Integrated Amazon Kinesis and Apache Kafka for real-time event streaming and used Databricks for real-time analytics, reducing latency by 50%
- Streamed data from fully managed Kafka brokers on AWS using Spark Streaming and processed it with explode transformations
- Created data models and schema designs for Snowflake data warehouse to support complex analytical queries and reporting
- Developed Hive queries for analysts, loading and transforming large sets of structured and semi-structured data
- Automated the resulting scripts and workflows with Apache Airflow and shell scripting to ensure daily execution in production
- Used Spark SQL and the DataFrame API to load structured and semi-structured data from MySQL tables into Spark clusters
- Built data ingestion pipelines from disparate sources and data formats into Snowflake staging to enable real-time data processing and analysis
- Developed ETL workflows using AWS Glue to efficiently load big data sets into the data warehouse
- Implemented and visualized BI reports with Tableau
- Administered users, user groups, and scheduled report instances in Tableau; monitored Tableau Server for high availability
- Deployed web-embedded Power BI dashboards with scheduled refreshes via gateways, configuring workspaces and data sources
- Leveraged SQL scripting for data modeling, streamlining data querying and reporting and improving insights into customer data
- Collaborated with end-users to resolve data and performance-related issues during the onboarding of new users
- Developed Airflow pipelines to efficiently load data from multiple sources into Redshift and monitored job schedules (an illustrative Airflow DAG sketch follows the Environment line)
- Successfully migrated data from Teradata to AWS, improving data accessibility and cost efficiency
- Worked on migrating the reports and dashboards from OBIEE to Power BI
- Assisted multiple users from the data visualization team in connecting to Redshift using Power BI, Power Apps, Excel, Spotfire, and Python
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the output in Parquet format in HDFS (see the streaming sketch after the Environment line)
- Used Kubernetes to orchestrate the deployment, scaling, and management of Docker containers
- Finalized the data pipeline using DynamoDB as a NoSQL storage option
- Created and maintained CI/CD (continuous integration and deployment) pipelines and applied automation across environments and applications
- Planned and executed data migration strategies for transferring data from legacy systems to MySQL and NoSQL databases
- Actively participated in scrum meetings, reporting progress and maintaining good communication with team members and managers
- Environment: Apache Airflow, Kafka, Spark, MapReduce, Hadoop, Snowflake, Hive, Databricks, PySpark, Docker, Kubernetes, AWS, DynamoDB, CI/CD, Tableau, Redshift, Power BI, REST APIs, Teradata, Windows
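
The sketches below illustrate the kinds of jobs described in the bullets above. They are minimal, hedged examples: every table, topic, path, schema, and connection name is an assumed placeholder, not an actual project artifact.

A minimal PySpark ETL sketch of the MySQL-to-Parquet cleansing and aggregation work: read a hypothetical `orders` table over JDBC, deduplicate and filter it, aggregate daily revenue, and write Parquet.

```python
# Minimal PySpark ETL sketch (placeholder names throughout): load a MySQL
# table over JDBC, cleanse and aggregate it, and write the result as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mysql_etl_sketch").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/sales")  # placeholder host/db
    .option("dbtable", "orders")                          # placeholder table
    .option("user", "etl_user")
    .option("password", "change-me")
    .load()
)

cleansed = (
    orders
    .dropDuplicates(["order_id"])              # remove duplicate orders
    .filter(F.col("order_total") > 0)          # drop invalid totals
    .withColumn("order_date", F.to_date("order_ts"))
)

daily_revenue = (
    cleansed.groupBy("order_date")
    .agg(F.sum("order_total").alias("daily_revenue"))
)

# Placeholder output path; assumes the MySQL JDBC driver and S3 access are configured.
daily_revenue.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_revenue/")
```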
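
A sketch of the Kafka-to-HDFS streaming path using Spark Structured Streaming: parse JSON events from an assumed `clickstream` topic, explode the nested `items` array, and append Parquet files to HDFS.

```python
# Structured Streaming sketch (placeholder broker, topic, schema, and paths):
# read JSON events from Kafka, explode a nested array, write Parquet to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

spark = SparkSession.builder.appName("kafka_stream_sketch").getOrCreate()

event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("items", ArrayType(StructType([
        StructField("sku", StringType()),
        StructField("price", DoubleType()),
    ]))),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
    .option("subscribe", "clickstream")                   # placeholder topic
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.user_id", F.explode("e.items").alias("item"))  # one row per item
    .select("user_id", "item.sku", "item.price")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/curated/clickstream/")            # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```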
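
A sketch of the Airflow orchestration pattern, assuming Airflow 2.4+ with the Amazon provider installed: a daily DAG that runs a hypothetical spark-submit step and then copies the curated Parquet output from S3 into Redshift.

```python
# Airflow DAG sketch (placeholder DAG id, script path, bucket, schema, table,
# and connection ids): run a Spark job, then COPY its output into Redshift.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="nightly_etl_sketch",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",        # daily execution in production
    catchup=False,
) as dag:

    run_spark_job = BashOperator(
        task_id="run_spark_job",
        bash_command="spark-submit /opt/etl/daily_revenue_job.py",  # placeholder script
    )

    load_redshift = S3ToRedshiftOperator(
        task_id="load_redshift",
        schema="analytics",                  # placeholder schema
        table="daily_revenue",               # placeholder table
        s3_bucket="example-bucket",
        s3_key="curated/daily_revenue/",
        copy_options=["FORMAT AS PARQUET"],
        redshift_conn_id="redshift_default",
        aws_conn_id="aws_default",
    )

    run_spark_job >> load_redshift
```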