Interviewing Data Engineer Candidates

At my current employer, the interviewing process includes a technical challenge & a team interview (after an initial phone screen). One thing that I really appreciate about where I work now is that when it comes to considering data engineer candidates, we absolutely will bring in people who lack any experience with the databases and/or ETL tool that we use (Pentaho), as long as the candidate has quality previous data engineering experience.

Meanwhile I see countless DE job postings (for which I have almost never been contacted for a phone screen myself) that really are only interested in candidates who have experience with the database(s)/tool(s) that they are currently using. The biggest problem is that it’s very difficult, if not impossible, for an applicant to know at the time they submit their resume whether or not the prospective employer is willing to accept candidates who lack that specific experience or those skills. (Like I put in my last post, there is no penalty in applying for a DE job!)

That aside, there needs to be a way for a potential employer to determine whether a candidate has the skills to match all of those acronyms & claims on their resume, hence the technical challenge.

Data Engineering Technical Challenge

Now if I (gave up a piece of my soul &) worked for one of those companies that was only looking for DE candidates with specific experiences & skills, I’m sure we’d have a technical challenge where we’d ask a candidate to perform a task in that tool. Thankfully–oh so thankfully!–that’s not how we roll.

The purpose of a technical challenge is to see how a candidate thinks. We felt that the best way to have a candidate show their DE skills is a challenge that is tool-independent but speaks the universal language of SQL. So while little actual production work would ever be done purely in SQL, requiring an advanced grasp of SQL to carry out ETL actions should distinguish candidates who really understand what’s going on underneath any tool from those who maybe rely too much on, or take for granted, what ETL tools might actually be doing.

The premise of our technical challenge is pretty simple: we provide 2 tables. One table contains ~100 rows of “raw” source data; the other table contains a few rows of “production” data. The candidate is given a block of time to write a SQL query to transform & migrate as much of the source data into the production table as they can.

We do allow the candidate to ask questions to gain clarification, and they are free to google any technical/functional/syntax issues on their own…just like they would do in a real-world situation.

The source table contains column names that are different from the target but are obviously named. There are a number of pitfalls & conversions in the source data requiring techniques such as:

  • Gracefully handling NULLs & duplicate values
  • Concatenating & sanitizing first & last names to a single column (FIRST_NAME | LAST_NAME -> “Last_name, First_name”)
  • Converting discrete integer values to text labels (1 & 2 -> “Male” & “Female”)
  • Splitting US ZIP codes into ZIP code & Plus_4 columns, which is tricky because not all of the ZIP codes given include the +4, and some are missing their leading zeroes
  • Converting 8-digit integers to dates
  • Using those dates from the previous item to calculate the age in years
  • Performing an aggregate calculation
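
To make the list above a bit more concrete, here is a minimal sketch of what those transformations might look like. It is not our actual challenge: the table names (raw_people, prod_people), the column names, the sample rows & the “Unknown” fallback label are all made up for illustration, and I’m assuming the 8-digit dates are YYYYMMDD integers. It uses Python’s built-in sqlite3 module only so the SQL can be run anywhere; the SQL itself is SQLite dialect and would need adjusting for another database.

```python
import sqlite3

# Hypothetical schema & sample rows (assumptions, not the real challenge data).
SETUP_SQL = """
CREATE TABLE raw_people (
    FIRST_NAME TEXT, LAST_NAME TEXT, GENDER_CD INTEGER, ZIP TEXT, BIRTH_DT INTEGER
);
CREATE TABLE prod_people (
    full_name TEXT, gender TEXT, zip_code TEXT, plus_4 TEXT,
    birth_date TEXT, age_years INTEGER
);
INSERT INTO raw_people VALUES
    ('  jane ', 'DOE',   2,    '2101-2345', 19850315),
    ('  jane ', 'DOE',   2,    '2101-2345', 19850315),  -- exact duplicate row
    ('john',    'smith', 1,    '8540',      19900101),  -- leading zero lost
    ('Alex',    'Lee',   NULL, '10001',     20001231);  -- missing gender code
"""

TRANSFORM_SQL = """
INSERT INTO prod_people
SELECT
    -- names: capitalize first letter, lowercase the rest, join as Last, First
    UPPER(SUBSTR(ln, 1, 1)) || LOWER(SUBSTR(ln, 2)) || ', ' ||
    UPPER(SUBSTR(fn, 1, 1)) || LOWER(SUBSTR(fn, 2)),
    -- integer codes to labels; NULL or unexpected codes fall through to ELSE
    CASE GENDER_CD WHEN 1 THEN 'Male' WHEN 2 THEN 'Female' ELSE 'Unknown' END,
    -- z is either 5 or 9 digits after padding (see the subquery)
    SUBSTR(z, 1, 5),
    NULLIF(SUBSTR(z, 6), ''),
    -- YYYYMMDD integer to an ISO date string
    SUBSTR(d, 1, 4) || '-' || SUBSTR(d, 5, 2) || '-' || SUBSTR(d, 7, 2),
    -- integer division: whole years between two YYYYMMDD integers
    (:as_of - BIRTH_DT) / 10000
FROM (
    SELECT DISTINCT  -- drops exact duplicate source rows
        TRIM(FIRST_NAME) AS fn,
        TRIM(LAST_NAME)  AS ln,
        GENDER_CD,
        BIRTH_DT,
        CAST(BIRTH_DT AS TEXT) AS d,
        -- strip the hyphen, then left-pad with zeroes to 5 or 9 digits
        CASE WHEN LENGTH(REPLACE(ZIP, '-', '')) <= 5
             THEN SUBSTR('00000' || REPLACE(ZIP, '-', ''), -5)
             ELSE SUBSTR('000000000' || REPLACE(ZIP, '-', ''), -9)
        END AS z
    FROM raw_people
)
"""


def run_challenge(as_of):
    """Load the sample data, run the transform, and return (rows, counts).

    as_of is a YYYYMMDD integer used as "today" so the ages are reproducible.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    conn.execute(TRANSFORM_SQL, {"as_of": as_of})
    rows = conn.execute(
        "SELECT * FROM prod_people ORDER BY full_name"
    ).fetchall()
    # the aggregate step: how many people per gender label
    counts = conn.execute(
        "SELECT gender, COUNT(*) FROM prod_people GROUP BY gender ORDER BY gender"
    ).fetchall()
    conn.close()
    return rows, counts


rows, counts = run_challenge(20240101)
for row in rows:
    print(row)
print(counts)
```

A couple of notes on the choices here: the “as of” date is passed in as a parameter rather than taken from the system clock so that the ages come out the same every time you run it, and the ZIP handling assumes that anything longer than 5 digits is a ZIP+4 that may have lost its leading zeroes. A real source table would almost certainly have messier cases than these four rows.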

It has been pretty surprising, not to mention extremely insightful, to see where & why candidates get tripped up. And how do we do that? Well, especially in this time of the COVID-19 pandemic, we do the technical challenges over Zoom. That allows us to watch the candidate work in real time (and to comment amongst ourselves about the decisions they make). And while we give the candidates 45 minutes for the challenge, it often takes maybe half that long to tell whether a candidate has the SQL chops or not. (And to date, only one candidate has completed the entire challenge in the allotted time. We know we’re asking a lot from a candidate, so finishing isn’t expected or necessary if they demonstrate sound SQL skills with a well-thought-out approach.)

Technical Challenge Code Smells

So while we feel that this technical challenge is DE-centric, I have found that there are plenty of times where employers will ask a candidate to perform some not-so-DE-specific tasks. Broadly, I’d suggest that you maybe not consider this employer because if they can’t even be bothered to ask you to perform a directly related technical challenge, what do you think it’s going to be like (if you were to accept the job) when it comes to the actual work?

Examples of technical challenge code smells (for DE):

  • Being asked to perform a technical challenge with no advance warning (read: it should be made clear during a previous conversation/interview that there will be a technical challenge, if for no other reason than to allow the candidate to mentally prepare for it)
  • (Re)implement any sorting algorithm or other algorithm that already exists in a language’s standard library
  • Perform any type of task purely from memory (read: without being able to search for help)
  • Perform any obscure or non-job related task
    • Once upon a time I was asked to perform a bunch of DDL queries (CREATE TABLE, ALTER INDEX, etc.) when applying for a data analyst position.
  • Perform a task using a technology that you expressly told the prospective employer, in a previous conversation, you have little to no experience with

So if it isn’t obvious, these are all red flags that the prospective employer probably doesn’t understand or, worse, doesn’t care what a data engineer is actually supposed to do.

Expectations Beyond The Technical Challenge

So with the technical challenge over, that’s no time for a candidate to relax as they are introduced to the team that they would be working with. While technical questions are still quite likely to be asked, it’s also when potentially suspect items on the candidate’s resume have a light shone on them for clarity. (What did you do for those 6 months between your last 2 jobs? Tell us more about your experience using [fill in the blank of ETL tool or database].) And this is a time for both sides to figure out if the candidate is going to be a good match for the team, both technically & interpersonally. Does the candidate possess a skill set and/or business knowledge that may not already exist in the team? Could you see working with this person/these persons all day, five days a week? What is your approach to learning a new tool or technology?

And it’s respectful to manage a candidate’s expectations up front (remember: no surprise technical challenges!) & stick to a schedule–start on time & don’t run over schedule, at least not without asking if that is OK with the candidate. There’s nothing wrong with scheduling a second team interview for another day if you don’t feel comfortable with the candidate based on your last interview/conversation!

Getting Consensus

While the candidate has been vetted initially by HR and/or management, the team should have a considerable voice in deciding whether or not a candidate is offered the job, because they will be the ones working with the candidate and they also understand the requirements of the job better than anyone else in the organization. We get the team together immediately once the candidate has disconnected from the Zoom meeting, which has the benefit of our thoughts being extremely fresh, not to mention that everyone on the team is already together! Usually these post-interview discussions take 15-20 minutes at most. Doing this also keeps the hiring process moving quickly: whether we are extending an offer or declining, we can move along to the next step & inform the candidate.
