5 points to consider before setting sail in data science projects in the food industry

In the slipstream of the digitalization of the modern food industry, part of Industry 4.0, the need to handle ever larger amounts of data has called for new developments in both the analytical toolbox and the digital infrastructure. This has expanded the possibilities for creating value from data.

Terms like Data Science, Big Data, Machine Learning and, more recently, AI flourish. The possibilities arising from these techniques are immense. However, it should be stressed that it is no Columbus’ egg. One size does not fit all, and going full steam on advanced data science can be a bumpy ride if not thought through and done properly. Here we list a few points to consider before starting a data science project in the food production industry.

WHAT IS OUR GOAL? 

A sharp question calls for a sharp answer, also when the question is put to data. We have experienced data science projects started with a description à la “we would like to learn more about what is going on” or “we must be able to use AI/Big Data on all the data we have”. It would be intriguing if a data scientist could answer that, but it is also highly unlikely. Vague goals yield vague results.

Before proclaiming a need for AI and Big Data, it is advisable to consider what you hope to achieve. What would bring the most value to the company? That might not be insights from fancy, complex algorithms. If a simple display of your data brings value, you may not need fancy AI to begin with.
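As a minimal sketch of such a simple display (the file name production_log.csv and the timestamp and temperature columns are hypothetical), a plain time-series plot can often reveal trends, level shifts and outliers before any modeling is attempted:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical production log with a timestamp and a process measurement.
    df = pd.read_csv("production_log.csv", parse_dates=["timestamp"])

    # A plain time-series plot often exposes trends, shifts and outliers.
    df.plot(x="timestamp", y="temperature", figsize=(10, 4))
    plt.ylabel("Temperature")
    plt.title("Production temperature over time")
    plt.tight_layout()
    plt.show()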

ACCESS TO DATA 

Even though there is a lot of data, how accessible is it truly?  

  • Data kept in paper files. Making paper data suitable for analytical work calls for manual handling. This might be labor-intensive and time-consuming, hence expensive.
  • Data stored in detached databases without keys. The data cannot be linked, and the Big Data dream is in reality just many smaller data nightmares. Detective work in cooperation between the data owner and the data scientist might be able to deduce keys in a somewhat cumbersome, time-consuming joint venture.
  • Data stored in detached databases with keys. When keys do exist, the logistics of gathering the pieces and getting data into shape is doable (see the sketch after this list), but all parties should be realistic in planning the project. This part is time-consuming. Analyses of how data scientists spend their time estimate that between 51% and 79% of the time is spent on cleaning and organizing data [1,2,3].
  • Data stored in a database. Even when your data looks easily accessible, there will be a phase where the data scientist massages his or her way into the data. Often, the variables have cryptic names and lack descriptions. The project that did not find skeletons in the data closet has yet to be seen.
  • Legal green light. Compliance with data-sharing rules, the GDPR and other legal interests should be cleared before starting the project.
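When keys do exist, the joining step itself is straightforward. A minimal sketch in pandas, assuming two hypothetical extracts, batches.csv and quality.csv, that share a batch_id key:

    import pandas as pd

    # Hypothetical extracts from two detached systems sharing a batch_id key.
    batches = pd.read_csv("batches.csv")   # e.g. batch_id, recipe, start_time
    quality = pd.read_csv("quality.csv")   # e.g. batch_id, fat_content, score

    # A left join keeps every batch and exposes which ones lack quality records.
    merged = batches.merge(quality, on="batch_id", how="left")

    # Unmatched rows are a common early finding: keys that exist on paper
    # but not in practice.
    print(merged["score"].isna().sum(), "batches have no quality measurement")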

DATA QUALITY 

Data quality covers a great many things. A few to consider are:

  • Relevance. Data should relate to the questions asked. You may, e.g., be interested in learning about quality variations or occurrences of faults in your production, but these may not be well-defined measures, nor registered in a way that matches your purpose.
  • Correctness. No amount of wrong data will give you the right answer. If data is deeply flawed, there really is not much to do about it. That said, missingness, outliers and other artifacts can (to some extent) be handled by different machine learning methods.
  • Amount. Naturally, there should be an adequate amount of data, but when quality goes up, the amount of data needed to find signals goes down. And vice versa: large amounts of irrelevant data are not useful. Another issue can be that fundamental changes, like new equipment, mean that data cannot be compared before and after the change, which consequently limits the amount of data available for the analysis.
  • Variation. In food production, a recipe is usually followed, leading to very little variation in the data. This is good for the production, but if the objective of the project is to learn how the recipe behaves beyond the usual production frames, then observing variation is crucial.
  • General dirtiness. This is what is usually referred to as janitor work, and it might just still be one of the biggest hurdles to finding insights in data [4]. It can be many things: labels changing over time, finding and handling missing data, finding and handling outliers, dates and timestamps, finding a suitable feature representation, etc. (a small cleaning sketch follows this list). Note that this task usually consumes the majority of the project hours and is crucial for the outcome, yet often receives limited focus in comparison.
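As a minimal sketch of typical janitor work (the file measurements.csv, the line and temperature columns, and the label mapping are all hypothetical), a first cleaning pass might look like this:

    import pandas as pd

    # Hypothetical raw extract with typical dirtiness: inconsistent labels,
    # timestamps stored as text, and missing or extreme values.
    df = pd.read_csv("measurements.csv")

    # Labels changing over time: map old spellings onto one canonical name.
    df["line"] = df["line"].replace({"Line 1": "L1", "line_1": "L1", "L-1": "L1"})

    # Dates and timestamps: parse text into datetimes; invalid entries become NaT.
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

    # Missing data: flagged here; whether to drop, fill or model them is a
    # decision to take together with the domain experts.
    print(df.isna().sum())

    # Crude outlier screen: flag values more than 3 standard deviations out.
    z = (df["temperature"] - df["temperature"].mean()) / df["temperature"].std()
    df["temperature_outlier"] = z.abs() > 3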

ENGAGEMENT 

Success in a data science project requires engagement from all parties, as they bring very different but equally important knowledge and competences to the table. Data science might have been oversold as a ‘silver bullet’: just hand over your (big) data to the experts, leave them simmering for a while, and groundbreaking new knowledge will emerge. This is not how it works. The data owner possesses knowledge crucial for the project.

  • Data collection. Knowledge of changes in equipment, procedures, sensors, etc. over time. Data engineers are an important addition to the data scientist in overcoming data infrastructure issues as well as general dirtiness.
  • Domain knowledge. What are we looking at? What can we expect? What do we already know that can be incorporated in the modeling work? Are there important factors that are not measured or registered (there always are)?
  • End users. What are the expectations and requirements of the users, who can be operators, managers, etc.? Putting the outcomes into context can guide the project towards success in terms of ownership, adoption of results, and value creation.

PLAN FOR THE FUTURE 

Long-term, sustainable solutions based on data science need anchoring of the model in the business. From the beginning, and as it proceeds, the project should consider how the results of the analyses can be implemented and, equally important, maintained for the future benefit of the business. The model will need checkups and updates over time to stay relevant. A solid handover is crucial for the longevity of the project's success.
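As a minimal sketch of such a checkup (the file prediction_log.csv, its columns, and the alert threshold are hypothetical), a rolling error metric can flag when a deployed model is drifting and due for retraining:

    import pandas as pd

    # Hypothetical log of model predictions and the outcomes observed later.
    log = pd.read_csv("prediction_log.csv", parse_dates=["timestamp"])
    log = log.sort_values("timestamp")

    # Rolling mean absolute error over the last 200 predictions.
    log["abs_error"] = (log["predicted"] - log["observed"]).abs()
    rolling_mae = log["abs_error"].rolling(200).mean()

    # A simple alert threshold; in practice it should be agreed with the business.
    THRESHOLD = 2.0
    if rolling_mae.iloc[-1] > THRESHOLD:
        print("Model performance has degraded: time for a checkup and retraining.")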

[Figure: Data from many sources can be quite a challenge to sort out and bring together.]

[Figure: When data science projects are well prepared they can bring a lot of value to the production.]

[Figure: Cleaning and handling data is the most time-consuming and often forgotten task in data science projects [2].]

[1] https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport.pdf 

[2] https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#3038fafb6f63 

[3] https://www.kaggle.com/surveys/2017 

[4] https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html