Published on 06.04.2021
Data Engineer Apprentice
We started data-centric activities at Sigfox 5 years ago to transform Sigfox into a data-driven company. Among these activities, the current Data Management Service (DMS) team is responsible for developing the big data platform and analytics, for ourselves and for our customers.
The team currently has 6 members, and within it every member, decision and action matters. As an apprentice, you will be fully integrated and considered a full team member.
You will have one main and one secondary activity during your internship:
- You will fully own one subject as your main objective. A data-driven company must have absolute trust in its data. We have defined the business requirements for the Data Quality (DQ) functionality. Your objective, under the supervision of your mentor and with the help of your colleagues, will be to define and implement the MVP solution: find open source components to speed up implementation, design the data quality datamart model and pipelines, and implement a first round of quality probes and the data flow to a DQ datamart. This will allow us to control data quality inside our development projects (faster validation) and maintain a DQ dashboard covering all our production data assets.
- After 5 years, the first generation of our data flows is already legacy, as things move fast. We need to migrate them to a more mature, industrial-grade stack based on Spark, Airflow and Kafka. As a secondary mission, you will work with other data engineers to enhance and migrate parts of these legacy data flows to the new technical stack/environment. That will help you understand how and where DQ is valuable.
We are looking for a Computer Science apprentice, 3rd year master level or equivalent, with a data specialization. You must have:
- Good knowledge of SQL and relational databases, with hands-on practice
- Python development skills are a must; other languages are a plus (e.g. Java, Scala, bash)
- Proficient English, at least in reading and writing (our specs and documentation are in English)
Ideally, you should have some basic understanding, knowledge or experience of:
- Data models (e.g. Codd relational, star schema or fact-dimension)
- Big data paradigms, both batch and streaming
It may help if you already have some knowledge or practical experience of:
- big data tools and frameworks (Hadoop, Spark, Hive, Flink, Kafka, NoSQL databases, …)
- cloud platforms (AWS, GCP, Azure)
- Docker and/or Airflow
The technical environment is the AWS cloud, using many of its services: mainly EC2, EMR (Spark), S3, CloudWatch and Redshift. Outside AWS, we use Tableau and Airflow.