1. What you'll be responsible for
• Design and develop scalable data pipelines using Spark, Kafka, and distributed data systems (a minimal sketch of such a pipeline follows this list).
• Build and optimize batch and real-time data processing workflows.
• Implement data lake or lakehouse architectures with structured data layering.
• Automate data workflows using orchestration tools like Airflow or dbt.
• Collaborate with analysts, data scientists, and platform teams to deliver usable data products.
• Ensure data quality, lineage, and governance across the pipeline.
• Monitor, tune, and maintain performance of large-scale distributed processing jobs.
• Document technical processes and contribute to team knowledge sharing.
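The sketch below illustrates, at a very high level, the kind of streaming ingestion work this role involves. It is a minimal, illustrative example rather than a description of our actual stack: the broker address, topic name, event schema, and S3 paths are hypothetical placeholders, and running it assumes PySpark with the spark-sql-kafka connector package available.

```python
# Illustrative sketch only: a minimal Spark Structured Streaming job that reads
# JSON click events from a Kafka topic and writes date-partitioned Parquet files.
# Broker, topic, schema, and paths below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (
    SparkSession.builder
    .appName("clickstream-ingest")  # hypothetical job name
    .getOrCreate()
)

# Assumed payload schema for the example.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder brokers
    .option("subscribe", "clickstream")                # placeholder topic
    .load()
)

# Parse the Kafka value bytes as JSON and derive a partition column.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_date", F.to_date("event_time"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/bronze/clickstream/")            # placeholder path
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/clickstream/")
    .partitionBy("event_date")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```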
2. What we're looking for
• 3-5+ years of experience in data engineering or big data environments.
• Proficient in Python, Scala, or Java; hands-on experience with Apache Spark, Kafka, and Hive.
• Experience with workflow orchestration (Airflow, Luigi, Dagster) and CI/CD pipelines for data; a minimal Airflow example is sketched after this list.
• Knowledge of containerization tools (e.g., Docker, Kubernetes) is a plus.
• Strong understanding of distributed computing and big data architecture.
• Experience with cloud platforms (AWS, GCP, or Azure) and object storage (e.g., S3).
• Familiarity with data modeling, ETL/ELT best practices, and pipeline automation.
• Strong problem-solving and documentation skills.
• Excellent verbal and written communication skills in English, with the ability to influence and collaborate effectively across functions and organizational levels.
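To make the orchestration requirement concrete, here is a minimal Airflow DAG sketch. It assumes Airflow 2.4+ (where `schedule` replaces `schedule_interval`); the DAG name, schedule, file paths, and commands are hypothetical placeholders, not a description of our production pipelines.

```python
# Illustrative sketch only: a minimal Airflow 2.x DAG that runs a daily batch
# transform followed by a data-quality check. Commands and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_pipeline",     # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit the Spark batch job (placeholder spark-submit command).
    transform = BashOperator(
        task_id="transform_sales",
        bash_command="spark-submit /opt/jobs/transform_sales.py",
    )

    # Run data-quality checks only after the transform succeeds (placeholder script).
    quality_check = BashOperator(
        task_id="check_sales_quality",
        bash_command="python /opt/jobs/check_sales_quality.py",
    )

    transform >> quality_check
```

In practice a team would likely reach for provider operators (for example, SparkSubmitOperator) instead of raw bash commands, but the dependency pattern shown here is the same.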