Required Qualifications:
- 5+ years of experience in data engineering using Python with a focus on AWS S3, EMR, Glue, Step Functions, Apache NiFi and Spark.
- Proven track record of building scalable data pipelines in cloud environments.
- Proficiency in flow design, processors, and data provenance in Apache NiFi.
- Strong expertise in Spark, Hadoop, and distributed computing on AWS EMR.
- In-depth knowledge of AWS services (S3, Glue, Redshift, RDS, Lambda, Step Functions).
- Experience with data formats (JSON, CSV, Parquet, Avro) and transformation techniques.
- Strong problem-solving skills and ability to troubleshoot complex data processing issues.
- Excellent communication skills with the ability to document and explain technical details clearly.
- AWS Certified Solutions Architect or Data Analytics Specialty.
- Experience with data governance frameworks and compliance requirements.
- Familiarity with CI/CD pipelines and version control (GitLab, Jenkins).
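To give candidates a sense of the data-format work the qualifications above describe, here is a minimal, self-contained sketch of a JSON-to-CSV transformation using only the Python standard library. In the role itself this kind of transformation would typically run in Spark or NiFi; the field names and sample data here are purely illustrative.

```python
import csv
import io
import json

def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat records into CSV text.

    A toy version of the JSON -> CSV transformations listed above;
    production pipelines would do this at scale in Spark or NiFi.
    """
    records = json.loads(json_text)
    if not records:
        return ""
    # Use the first record's keys as the CSV header row.
    fieldnames = list(records[0].keys())
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

sample = '[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]'
print(json_records_to_csv(sample))
```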
Key Responsibilities:
Design & Develop Data Pipelines:
- Architect and implement end-to-end data pipelines using AWS S3, EMR, Glue, Step Functions, Apache NiFi, and Spark.
- Manage data ingestion processes from AWS S3, ensuring secure and efficient data transfer.
- Implement initial data routing, validation, and transformation using Apache NiFi processors and Spark.
- Integrate AWS EMR, Apache NiFi, and Spark to perform complex data transformations and analytics.
- Optimize Spark jobs for processing large-scale datasets with a focus on performance and resource utilization.
- Handle both historical and incremental data loads, ensuring data consistency and integrity.
- Define and implement data storage strategies across S3, RDS, and Redshift, adhering to business requirements.
- Manage data catalog creation and schema management using AWS Glue.
- Develop and manage workflows using Apache Airflow and AWS Step Functions to automate data processing tasks.
- Implement monitoring, error handling, and retries within the orchestration framework.
- Ensure data security with encryption (AES-256, TLS) and IAM role-based access controls.
- Implement data governance policies using AWS Glue Data Catalog to ensure compliance with regulatory requirements.
- Utilize AWS CloudWatch to monitor the performance of EMR clusters, NiFi flows and data storage.
- Continuously optimize Spark job configurations and NiFi data flows for maximum throughput and minimal latency.
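The incremental-load responsibility above can be sketched as a watermark-based split: rows with a timestamp past the last successful load are the only ones ingested. This is a pure-Python illustration; the `updated_at` column name is a hypothetical example, and in the actual pipeline this filter would run as a Spark predicate against partitioned data in S3.

```python
from datetime import datetime, timezone

def split_incremental(records, watermark):
    """Split records into already-loaded vs. still-pending by a timestamp watermark.

    A toy illustration of incremental loading; the real pipeline would
    express this as a Spark filter pushed down to partitioned Parquet.
    """
    loaded = [r for r in records if r["updated_at"] <= watermark]
    pending = [r for r in records if r["updated_at"] > watermark]
    return loaded, pending

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
watermark = datetime(2024, 2, 1, tzinfo=timezone.utc)
loaded, pending = split_incremental(rows, watermark)
print([r["id"] for r in pending])  # rows still to be ingested
```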
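The error-handling and retry responsibility above can be sketched as an Amazon States Language (ASL) fragment with exponential-backoff retries and a catch-all failure state. The state names and Lambda ARN below are hypothetical placeholders, not part of any real deployment.

```python
import json

# Minimal ASL fragment: a task state that retries transient failures with
# exponential backoff, then routes any remaining error to a failure state.
# State names and the Lambda ARN are hypothetical placeholders.
state_machine = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}
            ],
            "End": True,
        },
        "NotifyFailure": {"Type": "Fail", "Error": "PipelineError"},
    },
}

print(json.dumps(state_machine, indent=2))
```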