overview
- Responsibilities
- Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms
- Analyzed large and critical datasets using Hive and ZooKeeper
- Developed POCs using Spark and Scala, deployed them on a YARN cluster, and compared the performance of Spark with Hive and SQL
- Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) for storage
- Collaborated with data scientists to document the development process, model configurations, and best practices for using generative AI technologies
- Leveraged Azure Databricks to clean, transform, and analyze large datasets, providing actionable insights for business decision-making
- Designed and implemented robust data architecture solutions using Postgres, Amazon Aurora, and DynamoDB to support high-throughput data processing and analysis
- Capable of using AWS services such as EMR, S3, Glue crawlers, Lambda, and CloudWatch, along with ThoughtSpot, to run and monitor Hadoop and Spark jobs on AWS
- Configured and customized Collibra workflows to streamline data cataloging, classification, and lineage tracking, improving data transparency and accessibility for clinical and administrative staff
- Experience developing Spark applications using Spark SQL and PySpark in Databricks to extract, transform, and aggregate data from multiple file formats for analysis
- Designed and developed complex ETL pipelines using AWS Glue, Snowflake SQL, and Snowflake's PySpark and JavaScript connectors, integrating data from various sources, including APIs, databases, and flat files
- Developed reusable data flows in Azure Data Factory for common data processing tasks, improving code maintainability and reducing development time for future pipelines
- Integrated Azure Data Factory with other Azure services such as Azure Synapse Analytics, Azure Databricks and Azure Analysis Services to create end-to-end data processing solutions
- Spearheaded data governance initiatives that leveraged Collibra to support research and reporting activities, ensuring data compliance and integrity in healthcare studies
- Designed and implemented interactive dashboards and reports in Power BI, providing real-time insights and data visualizations to support business decision-making
- Experience in SQL query optimization for ThoughtSpot to ensure fast and efficient data retrieval
- Leveraged Azure Data Factory's integration runtime to securely orchestrate data movement across hybrid environments, ensuring data governance and compliance
- Led efforts to build and maintain data warehouse and data lake solutions, ensuring they scaled with evolving business requirements
- Facilitated the integration of Collibra with electronic health record (EHR) systems, enabling seamless data exchange and enhancing patient care coordination
- Independently identified and resolved complex issues within Hive and Spark applications
- Developed and enforced data security policies, ensuring the confidentiality, integrity, and availability of sensitive data across cloud platforms
- Environment: HDFS, Python, SQL, Spark, Azure Data Factory, Scala, Kafka, Hive, YARN, Erwin Data Modeler, Sqoop, PySpark, TypeScript, Snowflake, GenAI, AWS Cloud, Glue, GitHub, Node.js, ThoughtSpot, Shell Scripting