Changing Paradigms of Technical Skills for Data Engineers
[This Proceedings paper was revised and published in the 2018 issue of the journal Issues in Informing Science and Information Technology, Volume 15]
This paper investigates the new technical skills that are needed for Data Engineering. Past research is compared to new research, which produces a list of the top 20 technical skills required by a Data Engineer. The growing availability of Data Engineering jobs is discussed. The research methodology describes the gathering of sample data and then the use of Pig and MapReduce on AWS (Amazon Web Services) to count occurrences of Data Engineering technical skills in 100 Indeed.com job advertisements from July 2017.
A decade ago, Data Engineering relied heavily on the technology of Relational Database Management Systems (RDBMS). For example, Grisham, Krasner, and Perry (2006) described an Empirical Software Engineering Lab (ESEL) that introduced Relational Database concepts to students through hands-on learning that they called “Data Engineering Education with Real-World Projects.” However, as seismic improvements occurred in the processing of large distributed datasets, big data analytics moved to the forefront of the IT industry. As a result, the definition of Data Engineering has broadened and evolved to include newer technology that supports the distributed processing of very large amounts of data (e.g., the Hadoop Ecosystem and NoSQL Databases). This paper examines the technical skills that are needed to work as a Data Engineer in today’s rapidly changing technical environment. Research is presented that reviews 100 job postings for Data Engineers from Indeed (2017) during the month of July 2017 and then ranks the technical skills in order of importance. The results are compared to earlier research by Stitch (2016) that ranked the top technical skills for Data Engineers in 2016 using LinkedIn to survey 6,500 people who identified themselves as Data Engineers.
A sample of 100 Data Engineering job postings was collected from Indeed during July 2017 and analyzed. The job postings were pasted into a text file and then related words were grouped together to make phrases. For example, the word “data” was put into context with other related words to form phrases such as “Big Data”, “Data Architecture” and “Data Engineering”. A text editor was used for this task, and its find/replace functionality proved to be very useful. After the phrases were made, the large text file was uploaded to the Amazon cloud (AWS) and a Pig batch job using MapReduce was leveraged to count the occurrences of phrases and words within the text file.
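The phrase-making and counting steps described above can be sketched locally in Python. This is only an illustrative stand-in, not the Pig script that was run on AWS: the phrase list, the choice of joining phrase words with underscores, the tokenizer pattern, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical phrase list: multi-word terms to be treated as single tokens.
PHRASES = ["Big Data", "Data Architecture", "Data Engineering"]


def join_phrases(text, phrases=PHRASES):
    """Mimic a find/replace pass: turn each phrase into one underscore-joined token."""
    # Longest phrases first, so a longer phrase is never split by a shorter one.
    for phrase in sorted(phrases, key=len, reverse=True):
        text = re.sub(re.escape(phrase), phrase.replace(" ", "_"), text)
    return text


def map_tokens(text):
    """Map step: emit a (token, 1) pair for every word or joined phrase."""
    # Assumed tokenizer: letters, digits and a few symbols seen in skill names.
    for token in re.findall(r"[A-Za-z0-9_+#/]+", text):
        yield token, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted ones per token."""
    counts = Counter()
    for token, n in pairs:
        counts[token] += n
    return counts


def count_tokens(text):
    """Full pipeline: phrase joining, then map, then reduce."""
    return reduce_counts(map_tokens(join_phrases(text)))
```

Counting is deliberately case-sensitive here, so that "Python" and "python" come out as separate entries, as they did in the counts that were later merged by hand.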
The resulting phrases/words with occurrence counts were downloaded to a Personal Computer (PC) and then loaded into an Excel spreadsheet. Using a spreadsheet enabled the phrases/words to be sorted by occurrence count and facilitated the filtering out of irrelevant words. Another task to prepare the data involved combining phrases or words that were synonymous. For example, the occurrence count for the acronym ELT and the occurrence count for the acronym ETL were added together to make an overall ELT/ETL occurrence count. ETL is a Data Warehousing acronym for Extracting, Transforming and Loading data. This task required knowledge of the subject area. Also, some words were counted in lower case and then counted again in mixed or upper case, thus producing two or three occurrence counts for the same word. These different counts were added together to make an overall occurrence count for the word (e.g., the word occurrence counts for Python and python were added together). Finally, the Indeed occurrence counts were sorted to identify a list of the top 20 technical skills needed by a Data Engineer.
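The clean-up described above was done by hand in a spreadsheet. As a rough sketch of the same logic, assuming the raw counts are available as a dictionary, the merging of case variants and synonyms and the final ranking might look like the following. The synonym map, the function names, and the tie-breaking rule are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical synonym map: lower-cased variant -> label used for the merged count.
SYNONYMS = {"elt": "ELT/ETL", "etl": "ELT/ETL"}


def merge_counts(raw_counts, synonyms=SYNONYMS):
    """Add together counts for case variants and for synonymous terms."""
    merged = Counter()
    display = {}  # lower-cased term -> first spelling seen, used as its label
    for term, n in raw_counts.items():
        key = term.lower()
        label = synonyms.get(key)
        if label is None:
            label = display.setdefault(key, term)
        merged[label] += n
    return merged


def top_skills(raw_counts, irrelevant=(), n=20):
    """Drop irrelevant words, then return the n most frequent terms."""
    merged = merge_counts(raw_counts)
    drop = {word.lower() for word in irrelevant}
    kept = [(term, c) for term, c in merged.items() if term.lower() not in drop]
    # Sort by descending count; ties are broken alphabetically (an assumption).
    return sorted(kept, key=lambda item: (-item[1], item[0]))[:n]
```

Deciding which terms belong in the synonym map and in the irrelevant-word list still requires knowledge of the subject area; the code only automates the arithmetic and the sorting.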
This paper provides new information about the technical skills needed by Data Engineers.
Twelve of the 20 Stitch (2016) report phrases/words that are highlighted in bold above matched the technical skills mentioned in the Indeed research. I considered C, C++ and Java a match to the broader category of Programming in the Indeed data. Although the ranked order of the two lists did not match, the top five ranked technical skills for both lists are similar. The reader of this paper might consider SQL, Python, and Hadoop/HDFS to be very important technical skills for a Data Engineer. Although the programming language R is very popular with Data Scientists, it did not make the top 20 skills for Data Engineering, though it did appear in the overall list from Indeed. The R programming language is oriented towards analytical processing (e.g., used by Data Scientists), whereas the Python language is a scripting and object-oriented language that facilitates the creation of Data Pipelines (e.g., used by Data Engineers).
Because the data were collected one year apart and from very different data sources, both the timing of the data collection and the difference in sources could account for some of the differences in the ranked lists. It is worth noting that the Indeed research ranked list introduced the technical skills of Design Skills, Spark, AWS (Amazon Web Services), Data Modeling, Kafka, Scala, Cloud Computing, Data Pipelines, APIs and AWS Redshift Data Warehousing to the top 20 ranked technical skills list. The Stitch (2016) report skills that did not have matches to the Indeed (2017) sample data were Linux, Databases, MySQL, Business Intelligence, Oracle, Microsoft SQL Server, Data Analysis and Unix. Although many of these Stitch top 20 technical skills were on the Indeed list, they did not make the top 20 ranked technical skills.
Some of the skills needed for Database Technologies are transferable to Data Engineering.
There is not much peer-reviewed literature on the subject of Data Engineering, so this paper adds new information to the subject area.
I'm developing a Specialization in Data Engineering for the MS in Data Science degree at our university.