5-7 years of experience in data engineering or a related field
Proficiency in Apache Spark architecture and PySpark API
Strong command of HiveQL and ANSI SQL for data management
Hands-on experience with cloud platforms like AWS, Azure, or GCP
In-depth knowledge of big data file formats such as Parquet, ORC, and Avro
Understanding of Dimensional Data Modeling and Data Lakes
Familiarity with CI/CD tools like Git and Jenkins
Responsibilities
Design, build, and maintain scalable ETL/ELT data pipelines using PySpark
Manage and scale data infrastructure components in cloud environments
Optimize data layout and indexing in Apache Hive
Identify and resolve performance issues in Spark jobs
Develop solutions for ingesting diverse datasets from various sources
Implement automated workflows with scheduling tools like Apache Airflow
Collaborate with data scientists and analysts to meet business data needs
Benefits
Opportunity to work with cutting-edge technologies in a dynamic cloud environment
Focus on big data processing and advanced analytics
Collaboration with cross-functional teams and data professionals
Exposure to scalable data infrastructure and cloud architectures
Potential for professional development through cloud certifications
Full Job Description
Role description
Job Title: PySpark Developer
Work Location: Irving, Texas
Job Summary
We are seeking a highly skilled and motivated Data Engineer to play a pivotal role in designing, building, and optimizing our next-generation, scalable data pipelines. This position requires expertise in processing massive datasets using cutting-edge technologies like Apache Spark, PySpark, and Hive within a dynamic cloud environment. Your primary objective will be to ensure the utmost data reliability, speed, and efficiency, providing a robust foundation for downstream business intelligence and advanced analytics initiatives.
Key Responsibilities
Data Pipeline Development and Maintenance: Design, build, and maintain highly scalable and efficient ETL/ELT data pipelines utilizing PySpark and Spark SQL for complex data transformations.
Cloud Data Infrastructure Management: Deploy, manage, and scale critical data infrastructure components on leading cloud platforms such as Amazon Web Services (AWS; e.g., EMR, Glue), Microsoft Azure (e.g., Databricks, Synapse), or Google Cloud Platform (GCP).
Data Warehousing and Storage Optimization: Strategically manage data layout, partitioning, and indexing within Apache Hive and various cloud data lake solutions to optimize performance and accessibility.
Performance Tuning and Optimization: Proactively identify and resolve performance bottlenecks in Spark jobs, leveraging the Spark UI for in-depth analysis, effectively managing data skew, and optimizing memory utilization.
Diverse Data Integration: Develop robust solutions for ingesting high-volume and diverse datasets from both structured relational databases and unstructured flat files into our data ecosystem.
Automated Workflow Orchestration: Implement and manage automated data workflows using industry-standard scheduling tools like Apache Airflow or platform-native schedulers, ensuring timely and reliable data delivery.
Strategic Collaboration: Partner closely with data scientists, business analysts, and cross-functional enterprise teams to translate complex business requirements into technically sound and efficient data solutions.
Required Core Technical Skills
Big Data Frameworks Expertise: Demonstrated high proficiency in Apache Spark architecture, including a deep understanding of drivers, executors, and Directed Acyclic Graphs (DAGs).
Advanced Programming: Exceptional coding skills in Python and extensive experience with the PySpark API for developing intricate data transformations and processing logic.
Querying and Schema Management: Strong command of HiveQL and ANSI SQL, coupled with expertise in data partitioning techniques and effective schema definition.
Optimized Storage Formats: In-depth understanding of and practical experience with optimized big data storage file formats such as Parquet, ORC, and Avro.
Cloud Ecosystem Development: Hands-on development experience utilizing cloud-native big data utilities (e.g., AWS EMR, Azure Databricks) within major cloud platforms.
Data Warehousing Fundamentals: Solid foundation in Dimensional Data Modeling, including Star and Snowflake schemas, and practical experience with Data Lake concepts and implementation.
Preferred Qualifications
CI/CD and DevOps Automation: Experience with Continuous Integration/Continuous Deployment (CI/CD) practices and automation tools like Git, Jenkins, or Ansible.
NoSQL Database Integration: Exposure to and experience with NoSQL databases such as HBase, Cassandra, or MongoDB.
Professional Cloud Certifications: Relevant professional cloud certifications (e.g., AWS Certified Data Engineer, Microsoft Certified: Azure Data Engineer Associate) are highly valued.