How Fake Data Scientists Could Hurt Your Business

Celeste Fralick
Intel Security

Celeste Fralick is Principal Engineer for Intel Security.

Big data has arrived, and so have the data scientists. The demand for immediate business intelligence and actionable analytics is encouraging many people to adopt the title of data scientist. However, there is a missing link between big data and effective data models that is too often being filled by people without sufficient background and expertise in statistics and analytics. There is an underlying belief that data modeling is just another tool set, one that can be quickly and easily learned.

While these “fake” data scientists mean well, what they don’t know could substantially hurt your business. Developing, validating, and verifying the model are critical steps in data science, and they require not only skills in statistics and analytics but also creativity and business acumen. It is possible to build a model that appears to address the foundational question but is not mathematically sound. Conversely, it is also possible to build a model that is mathematically correct but does not satisfy the core business requirements.

With past teams, we assembled a data science center of excellence in response to these concerns. One of our first tasks was the creation of an analytic lifecycle framework to provide guidance on the development, validation, implementation, and maintenance of our data science solutions. A key part of this process is an analytic consultancy and peer review board that provides external viewpoints as well as additional coverage for these complex products.

There is a range of possible types of analytics, from descriptive (what is happening) to prescriptive (what will happen and what is recommended), but all require a rigorous development methodology. Our methodology begins with an exploration of the problem to be solved, and runs through planning, development, and implementation.

The first step to developing an effective analytic model is defining the problem to be solved or the questions to be answered. Complementing this is a risk assessment, which includes identifying sources of error, boundary conditions, and limitations that could affect the outcome.

The second step is detailed input planning, which starts with a more complete definition of the requirements necessary to meet the expectations of the ultimate consumer of the output. An assessment of existing models and the current state of analytics should follow, to avoid duplicating efforts or recreating existing work. The first peer review happens during this step, to assess the plan and get comments on the concept.

The third step is development of the actual algorithm to be used. This is usually an iterative approach, beginning with an initial hypothesis, working through one or more prototypes, and refining the algorithm against various cases. When a final version is ready, it is put through two series of tests: validation, which confirms that the model meets the requirements (that the right algorithm has been developed), and verification, which confirms that the model is mathematically correct (that the algorithm has been developed right). There is another peer review during this step, which will include, or be immediately followed by, a review by the customer or end user.

Whether the intended user is an internal department or business unit, or an external customer, there are some key questions that they should be asking during this review:

  • How accurate is the model, and how are the ranges and sources of error dealt with?
  • How does the model answer the requirements?
  • How does the model react to various scenarios in the environment?
  • What are the test results, confidence values, and error rates?
  • Is there any intellectual property incorporated into the model, and who owns it?
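As an illustration of the kind of answer a reviewer might expect to the question about test results, confidence values, and error rates, here is a minimal Python sketch. It is not taken from the author's framework. The function name, the sample labels, and the choice of a Wilson score interval at roughly 95% confidence are all assumptions made for this example.

```python
import math


def error_rate_with_interval(y_true, y_pred, z=1.96):
    """Return (error_rate, lower, upper) for a set of test predictions.

    Hypothetical helper: the interval is a Wilson score interval, and
    z=1.96 corresponds to roughly 95% confidence. Both are illustrative
    choices, not something prescribed by the article.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    errors = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    p_hat = errors / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom

    return p_hat, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Made-up test labels: 2 of the 10 predictions are wrong.
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
rate, low, high = error_rate_with_interval(actual, predicted)
print(f"error rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only ten test cases the interval is very wide. That is the point of asking for confidence values alongside a headline error rate.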

Finally, once these reviews have been successfully completed and questions answered to the customer’s satisfaction, it is time to implement the model. During the operating life of the model, regular reviews should be conducted to assess if any new or updated data is affecting the results, or if any improvements are required. Part of this review should be an evaluation of the analytic product and the criteria for when its output is no longer relevant or its use should be discontinued.
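One simple way to support such a regular review is to check whether recent input data has moved away from the data the model was built on. The sketch below is only an assumption about how that might look. The function name, the threshold of three standard errors, and the sample values are hypothetical, and a real review would likely use more than a single mean comparison.

```python
import math
import statistics


def mean_shift_detected(baseline, recent, threshold=3.0):
    """Flag when the mean of recent data has drifted from the baseline.

    Hypothetical check: compares the recent mean with the baseline mean,
    measured in standard errors derived from the baseline's spread.
    """
    base_mean = statistics.fmean(baseline)
    base_std = statistics.stdev(baseline)  # needs at least two baseline values
    recent_mean = statistics.fmean(recent)

    if base_std == 0:
        # A constant baseline: any change at all counts as a shift.
        return recent_mean != base_mean

    standard_error = base_std / math.sqrt(len(recent))
    z_score = abs(recent_mean - base_mean) / standard_error
    return z_score > threshold


# Made-up measurements of one model input.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
print(mean_shift_detected(baseline, [10.1, 9.9, 10.0, 10.2]))   # similar data
print(mean_shift_detected(baseline, [12.0, 12.1, 11.9, 12.2]))  # shifted data
```

A flag from a check like this does not say the model is wrong. It only indicates that a review of the results is due.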

Data science is rapidly growing as a tool to improve a wide range of decisions and business outcomes. It is important to know what questions to ask, both about the qualifications of your data scientists, and about the proposed analytic model. A good analytic process can bring the “fake” data scientists into the fold without too much heavy lifting, and your team – and output – will be stronger for it. There is a way to do analytics correctly, and doing it wrong can be worse than doing nothing.

Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.

Industry Perspectives is a content channel at Data Center Knowledge highlighting thought leadership in the data center arena.

One Comment

  1. A Real Data-Scientist

    The term “Scientist” is awarded to a chosen few, and that means something. A scientist (whether a “data-scientist” or any other) cares about answering “Why” before “How”. A data-scientist is more interested in understanding the intrinsic nature of data. His/her analytical mind constantly tries to find underlying patterns in data. To be officially accepted as a “scientist”, one must have a Ph.D. degree. Ph.D. means Doctor/Doctorate in Philosophy. It is awarded to those who have demonstrated abilities in a particular field that are accepted by the well-known scientific community. These abilities are demonstrated not just once, and not just in one way; they are proven over and over in different ways. Some examples of how these abilities are proven are: (1) First getting into a reputable Ph.D. program at a competitive university. Just getting there is a big challenge. You need to have a track record of a good undergraduate and/or master’s degree, references, etc. (2) Winning funding from competitive sources, such as NSF, NASA, DoD, DoE, etc. (3) Teaching undergraduates by performing TA duties. (4) Being a research fellow/assistant for someone who has a lot more knowledge and experience than you. (5) Winning funding to attend conferences and to present your research findings before the scientific community. (6) Winning funding for scientific workshops, camps, etc. (7) Passing advanced-level courses in your disciplines. (8) Passing qualifying/comprehensive exams in your areas of research. (9) Being known to the current scientific community AND knowing the current scientific community. (10) Developing some meaningful scientific methods and discoveries. (11) Publishing your findings in reputable conferences and/or journals (not some crappy, low-level conferences/journals). (12) Convincing scientific and/or federal organizations to give you a research grant (which is a very big achievement, as you are anonymously evaluated by your peers and they trust your ability to carry out research in a meaningful way by providing you with funding, which comes from taxpayers’ money). (13) Graduating with a Ph.D.
It is estimated that only 64% of students enrolled in a Ph.D. program in the USA actually graduate with a Ph.D. degree. Most drop out as they are unable to complete the above-mentioned steps. They are able to complete only a few steps but not all, so they fail. Also, only 1% in the USA are Ph.D. graduates. So earning a Ph.D. from a reputable institution under the supervision of a real scientist is a big achievement. It changes your title from Mr./Ms. to Dr. Does that mean something to undergraduates/Masters? I could continue defining the characteristics of a “Real Scientist”. Giving someone the title of “Data Scientist” without him/her having any of the solid track record listed above is a joke and an insult to the scientific community. The above-mentioned scales are standard at most universities. It means that not everyone has the ability to sustain the pressure of carrying out research and to prove himself/herself before the scientific community. This scale filters out those who do not deserve to be scientists. It could be due to their personal problems, or perhaps due to mother nature, who did not give them the ability to cross the line that separates a “Real scientist” from a “Non-real scientist”. Many industries nowadays are freely awarding the data scientist title, but for the scientific community it means nothing. It is said that when you cannot give someone a monetary promotion, you give them a title (recognition) award so that they calm down and feel better about themselves. It is common in most IT companies to award someone a “Data-Scientist” position with only an undergraduate/master’s degree. So the definition of “Data-Scientist” is being misused by industries.
Then come articles, like the one I am responding to, that try to distinguish between “Real vs. Fake” data scientists. Industries bring business problems on themselves by breaking the standard norm of “Scientist” vs. “Analyst” positions. A real data scientist would not care what industries think. S/he cares about “science” and not “business”. In general, any scientist (data or non-data) would be interested in discovering scientific facts rather than in existing tools for implementation. A scientist would invent new tools if necessary. S/he would prove theoretically and empirically “why” these tools work, where they work, etc. An analyst would work with existing tools and would care only about learning “how” they work. Being able to perform some basic statistical analysis or to write a regression/classification/clustering model in R, Python, etc. does not make you a data-scientist. These are just the tools. Every day new tools appear in the market. Nothing is special about them. If you have only an undergraduate/master’s degree, you can for sure call yourself a “Data-Analyst”, but trust me, if you meet an actual scientist, s/he will be amused to hear you calling yourself a “Data-Scientist” with an undergraduate/master’s degree. A real data-scientist does not give a damn about business, profit, etc. For a data-scientist, data is just a mixture of numbers, characters, etc. A data-scientist is interested only in finding hidden patterns in data. It is similar to what a patient is to a medical doctor: just a subject who has some known/unknown symptoms. A doctor’s job is to treat those symptoms whether the patient is a president or a criminal.