Role Overview:
This role focuses on designing and developing robust, scalable ETL/ELT pipelines, with a strong emphasis on Databricks and PySpark. The successful candidate will be responsible for building both batch and real-time streaming data ingestion and transformation frameworks, implementing advanced data architectures, and optimizing data processing workflows for performance and efficiency.
Key Responsibilities:
- Design and develop scalable ETL/ELT pipelines utilizing Databricks and PySpark.
- Build and maintain batch and real-time streaming ingestion frameworks.
- Develop reusable ingestion and transformation frameworks to ensure consistency and efficiency.
- Implement the Medallion architecture (Bronze, Silver, Gold layers) for data organization.
- Develop incremental and Change Data Capture (CDC)-based ingestion pipelines.
- Design and implement real-time streaming pipelines using technologies such as Kafka and Spark Structured Streaming.
- Optimize Spark jobs, SQL queries, and streaming pipelines for enhanced performance.
- Implement Delta Lake-based ingestion and transformation frameworks.
- Tune partitioning, caching, and Spark execution strategies to maximize throughput.
Required Skills:
- Strong proficiency in SQL and data modeling.
- Extensive experience with cloud platforms and distributed systems.
- Familiarity with CI/CD pipelines and DevOps practices.
- Expertise in MySQL.
- Proficiency with Databricks.
- Strong skills in PySpark.
Qualifications:
- Minimum 10 years of overall work experience.
- Minimum 5 years of experience specifically as a Data Engineer.
- 8 to 10 or more years of relevant experience.