Celeste Fralick is Principal Engineer for Intel Security.
Big data has arrived, and so have the data scientists. The demand for immediate business intelligence and actionable analytics is encouraging many people to adopt the title of data scientist. However, there is a missing link between big data and effective data models that is too often being filled by people without sufficient background and expertise in statistics and analytics. There is an underlying belief that data modeling is just another tool set, one that can be quickly and easily learned.
While these “fake” data scientists mean well, what they don’t know could substantially hurt your business. Developing, validating, and verifying the model are critical steps in data science, and they require not only skills in statistics and analytics but also creativity and business acumen. It is possible to build a model that appears to address the foundational question but is not mathematically sound. Conversely, it is also possible to build a model that is mathematically correct but does not satisfy the core business requirements.
With past teams, we assembled a data science center of excellence in response to these concerns. One of our first tasks was the creation of an analytic lifecycle framework to provide guidance on the development, validation, implementation, and maintenance of our data science solutions. A key part of this process is an analytic consultancy and peer review board that provides external viewpoints as well as additional coverage for these complex products.
There is a range of possible types of analytics, from descriptive (what is happening) to prescriptive (what will happen and what is recommended), but all require a rigorous development methodology. Our methodology begins with an exploration of the problem to be solved, and runs through planning, development, and implementation.
The first step to developing an effective analytic model is defining the problem to be solved or the questions to be answered. Complementing this is a risk assessment, which includes identifying sources of error, boundary conditions, and limitations that could affect the outcome.
The second step is detailed input planning, which starts with a more complete definition of the requirements necessary to meet the expectations of the ultimate consumer of the output. An assessment of existing models and the current state of analytics should follow, to avoid duplicating efforts or recreating existing work. The first peer review happens during this step, to assess the plan and get comments on the concept.
The third step is development of the actual algorithm to be used. This is usually an iterative approach, beginning with an initial hypothesis, working through one or more prototypes, and refining the algorithm against various cases. When a final version is ready, it is put through two series of tests. Validation confirms that the model meets the requirements, that is, that the right algorithm has been developed. Verification confirms that the model is mathematically correct, that is, that the algorithm has been developed right. There is another peer review during this step, which will include, or be immediately followed by, a review by the customer or end user.
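To make the distinction between the two test series concrete, here is a minimal sketch in Python. It is an illustration only, not the framework described in this article: the model (a simple least-squares line), the function names `fit_line`, `verify`, and `validate`, and the `max_mae` requirement are all assumptions chosen for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def verify(fit, tol=1e-9):
    """Verification: was the algorithm developed right?

    Feed it noise-free data from a known line (y = 2x + 1) and check
    that it recovers the known coefficients.
    """
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [2.0 * x + 1.0 for x in xs]
    slope, intercept = fit(xs, ys)
    return abs(slope - 2.0) < tol and abs(intercept - 1.0) < tol


def validate(fit, train, holdout, max_mae):
    """Validation: was the right algorithm developed?

    Check a (hypothetical) requirement: mean absolute error on
    held-out data must not exceed max_mae.
    """
    slope, intercept = fit(*train)
    xs, ys = holdout
    mae = mean(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))
    return mae <= max_mae, mae
```

A model can pass `verify` and still fail `validate`, or the reverse, which is why both series of tests are run.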
Whether the intended user is an internal department or business unit, or an external customer, there are some key questions that they should be asking during this review:
- How accurate is the model, and how are the ranges and sources of error dealt with?
- How does the model answer the requirements?
- How does the model react to various scenarios in the environment?
- What are the test results, confidence values, and error rates?
- Is there any intellectual property incorporated into the model, and who owns it?
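As one illustration of what an answer to the question about confidence values and error rates might look like, the sketch below reports an observed error rate together with a confidence interval rather than a single number. The choice of the Wilson score interval, the function name `error_rate_with_interval`, and the default `z=1.96` (roughly 95% confidence) are assumptions made for this example, not something prescribed by the review process above.

```python
from math import sqrt


def error_rate_with_interval(errors, trials, z=1.96):
    """Return (rate, low, high) for an observed error count.

    Uses the Wilson score interval for a binomial proportion;
    z=1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = errors / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return p, centre - half, centre + half


# Example: 12 misclassified cases out of 200 test cases (made-up numbers).
rate, low, high = error_rate_with_interval(12, 200)
print(f"error rate {rate:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The width of the interval tells the reviewer how much the reported error rate could move with a different test sample, which a bare percentage hides.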
Finally, once these reviews have been successfully completed and questions answered to the customer’s satisfaction, it is time to implement the model. During the operating life of the model, regular reviews should be conducted to assess whether new or updated data is affecting the results, or whether any improvements are required. Part of this review should be an evaluation of the analytic product and of the criteria for deciding when its output is no longer relevant or its use should be discontinued.
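A regular review of this kind can be partly automated. The following is a minimal sketch, assuming a single numeric input whose recent values are compared with the values seen when the model was built. The function name `needs_review`, the z-score test, and the threshold of 3.0 are illustrative assumptions; a real review would look at more than one statistic.

```python
from math import sqrt
from statistics import mean, stdev


def needs_review(baseline, recent, z_threshold=3.0):
    """Flag the model for review when recent data drifts from the baseline.

    Compares the mean of the recent values with the baseline mean,
    scaled by the standard error implied by the baseline spread.
    Requires at least two baseline values and one recent value.
    """
    base_mean, base_sd = mean(baseline), stdev(baseline)
    standard_error = base_sd / sqrt(len(recent))
    if standard_error == 0:
        # Baseline had no spread: any change at all counts as drift.
        return mean(recent) != base_mean
    z = abs(mean(recent) - base_mean) / standard_error
    return z > z_threshold
```

A flag from a check like this does not mean the model is wrong; it means the data has moved far enough that the results should be looked at again.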
Data science is rapidly growing as a tool to improve a wide range of decisions and business outcomes. It is important to know what questions to ask, both about the qualifications of your data scientists and about the proposed analytic model. A good analytic process can bring the “fake” data scientists into the fold without too much heavy lifting, and your team, and its output, will be stronger for it. There is a way to do analytics correctly, and doing it wrong can be worse than doing nothing.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.