Back in early February, I attended a webinar hosted by the Section on Pharmacy Informatics and Technology about data science and (by inference) artificial intelligence. It was quite good and covered much of the ground that people need to know about how data science and artificial intelligence converge. It is available on the ASHP Learning Center, and I recommend viewing it.
The webinar brought to mind several things I find myself teaching repeatedly, so I am writing them down here in the hope that others find them useful. They are the product of many hours spent training machine learning systems and trying to get data from disparate systems to drive meaningful data models. I should note here that I am not a data scientist; I am privileged to work with several who are patient teachers and eager students of our practice environment.
Data science is about a lot more than knowing where to find the data. One cannot and should not accept data at face value without knowing a lot about where it came from (its provenance), what assumptions were present in its creation, the tendency of users who produce the data to use the system inappropriately (or, perhaps, “creatively”), and how the semantics of one source of data relate to the semantics of other sources.
Semantics, you say? Does data have semantics? The answer is that it most certainly does and, just like in English class, understanding the semantics of the data is key to understanding the story that data tells. In the case of automated systems, the semantics may arise from the way terms are used within the software, from the assumptions of the designers of that software, and from the sometimes-creative ways users actually use those systems.
Imagine, for example, that you have one system that expects to have one, and only one, National Drug Code (NDC) associated with a medication concept, while another expects there to be many NDCs for each medication concept. Further, let us presume that the first system requires manual entry of NDC data while the second acquires NDCs from a feed from the electronic medical record (EMR). This is an example of semantic dissonance: each system uses the same term (NDC) but has different rules about how NDCs behave. If one is to acquire data from the system with only one NDC per medication, one must be aware that the NDC it reports may not be current, since users must intervene to change the NDC associated with a particular medication, and that task may not always be done correctly, or done at all.
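A sketch of what checking for that kind of dissonance might look like. Everything here is invented for illustration — the medication identifiers, the NDC values, and the idea that you can extract both systems' records into simple mappings:

```python
# Hypothetical sketch: reconciling NDC data between two systems.
# System A stores one (possibly stale) manually entered NDC per medication;
# System B stores the full set of NDCs fed from the EMR.
# All identifiers and codes below are invented for illustration.

system_a = {  # med_id -> single manually entered NDC
    "med001": "00001-1111-01",
    "med002": "00002-2222-02",
}

system_b = {  # med_id -> set of NDCs from the EMR feed
    "med001": {"00001-1111-01", "00001-1111-05"},
    "med002": {"00002-9999-99"},  # NDC changed; System A was never updated
}

def find_stale_ndcs(single, many):
    """Flag medications whose single NDC is absent from the EMR-fed set."""
    return [med for med, ndc in single.items()
            if ndc not in many.get(med, set())]

stale = find_stale_ndcs(system_a, system_b)
```

The point is not the code itself but the habit it represents: before consuming data from either system, ask which system's rules about the shared term actually hold.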
Semantics aside, what may often be a bigger issue is that the systems generating the data may not facilitate keeping that data current and accurate, with the result that the data are incomplete, inconsistent, or inaccurate. So data scientists often must perform “data wrangling”: working with the data in the context of other sources of data to ensure that it is used properly and is not misleading when consumed.
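A minimal example of what such wrangling can involve — normalizing inconsistently entered values and excluding incomplete records rather than guessing. The records, field names, and rules are all invented for illustration:

```python
# Toy data-wrangling sketch (invented records): normalize inconsistent
# unit strings and drop incomplete rows before the data is consumed.

raw_records = [
    {"drug": "heparin", "dose": "5000", "unit": "units"},
    {"drug": "heparin", "dose": "5000", "unit": "UNITS "},  # inconsistent entry
    {"drug": "insulin", "dose": None,   "unit": "units"},   # incomplete record
]

def wrangle(records):
    clean = []
    for rec in records:
        if rec["dose"] is None:       # incomplete: exclude rather than guess
            continue
        clean.append({
            "drug": rec["drug"],
            "dose": float(rec["dose"]),             # one consistent type
            "unit": rec["unit"].strip().lower(),    # one consistent spelling
        })
    return clean

cleaned = wrangle(raw_records)
```

Real wrangling is far messier — it usually means cross-checking one source against another — but the principle is the same: the data must be made consistent before it can be trusted.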
Finally, for the data to be useful to any kind of artificial intelligence, there must be some notion of what reality looks like, and how that reality is reflected in the data. The presenters in the webinar talked about modeling but didn’t really talk about how a model gets built. For the model to be useful, the behaviors it must evaluate must be unambiguously visible within the data. That may mean that the data set contains explicit data demonstrating those behaviors, or that the data can be manipulated to demonstrate them.
In my experience, models wind up getting built iteratively because our understanding of both reality and its reflection in the data are often incomplete (and sometimes inaccurate), and will change as we learn.
So, one starts with a training data set and a general impression of how the process being modeled works, and how that work is reflected in the data, and builds a computational model based on that impression.
From that point, the model is iteratively refined against the same training data set until it produces results that generally reflect what appears to be reality. This often involves review by subject matter experts (or against rules generated by subject matter experts) until the model produces an acceptable approximation of what is believed to be contained in the data. I have participated in such reviews, and they can be mind-numbing in their detail.
Eventually, when the model appears to be generating appropriate results, it is tested against a validation data set (a different data set than was used to build the model) and is judged useful if it produces comparably good results on data it has never seen.
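The train-then-validate workflow above can be boiled down to a toy example. Here the "model" is nothing but a single threshold, which is vastly simpler than anything a data scientist would actually deploy, but the loop — refine against training data, then check once against held-out data — is the same. All numbers and labels are invented:

```python
# Toy illustration of the train-then-validate loop.
# Each pair is (feature value, expected label); 1 = should be flagged.
train = [(2.0, 0), (3.1, 0), (6.5, 1), (7.2, 1)]   # training data set
valid = [(2.8, 0), (6.9, 1)]                        # held-out validation set

def accuracy(threshold, data):
    """Fraction of cases where 'value > threshold' matches the expected label."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

# Iterative refinement: try candidate thresholds against the TRAINING set only.
best = max((t / 10 for t in range(100)), key=lambda t: accuracy(t, train))

train_acc = accuracy(best, train)
valid_acc = accuracy(best, valid)   # only now do we touch the validation set
```

If the validation accuracy collapses while the training accuracy stays high, the model has learned the training data rather than the underlying reality — which is exactly why the second, unseen data set matters.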
Note that, for both data sets, expected outcomes may be known (a supervised data set) or unknown (an unsupervised data set). Unsupervised data sets can still reveal behaviors and relationships within the data, but it can be hard to tell whether those behaviors and relationships represent what we are trying to find or demonstrate.
Sometimes we get lucky and get additional outcomes we didn’t expect but that turn out to be real.
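To make the supervised/unsupervised distinction concrete, here is a small sketch with invented numbers. Without labels, structure can still be found — for instance by splitting one-dimensional data at its largest gap — but nothing in the data itself says what that structure means:

```python
# Supervised: each observation carries a known outcome (a label).
supervised = [(4.1, "normal"), (4.3, "normal"), (9.8, "outlier")]

# Unsupervised: observations only; any structure must be inferred.
unsupervised = [4.1, 4.3, 9.8, 4.2, 9.5]

# One naive inference: split the sorted values at the largest gap.
values = sorted(unsupervised)
gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
_, split = max(gaps)
clusters = [values[:split + 1], values[split + 1:]]
# Two groups emerge -- but the data alone cannot tell us which group is
# "normal" and which is the behavior we were hoping to find.
```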
The presenters in the webinar noted that the tools used in data science are statistical. That is true, but they are far more complex than the simple normal-distribution statistics often used in clinical trials. Indeed, even though I am reasonably fluent in basic statistics, and have been exposed to isolation forests, kernel density estimation, and other methods you have likely never heard of, the statistical tools used by data scientists in machine learning are not for the faint of heart, and may not be amenable to the “tweaking” we have become accustomed to when dealing with means and standard deviations.
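To give a flavor of one method named above, here is a bare-bones kernel density estimate written out in plain Python. It is still "just" statistics — an average of Gaussian bumps, one per data point — but already richer than a mean and standard deviation, since it makes no assumption that the data follow a single normal distribution. The data points and bandwidth are invented:

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a density function: an average of Gaussian bumps, one per point."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
                          for p in points)
    return density

# Three points cluster near 1.1; one sits alone at 5.0 (invented data).
density = gaussian_kde([1.0, 1.2, 1.1, 5.0], bandwidth=0.5)
# The estimated density is higher inside the cluster than near the lone point,
# which is how such estimates expose structure a single mean would hide.
```

Real implementations add bandwidth selection and other refinements that are very much not amenable to casual tweaking — which is the point being made above.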
It is important to realize that these statistical models are approximations of reality. In my experience, it is unreasonable to expect them to be correct 100% of the time. The longer they are used, and the more data they experience, the closer they will come to reflecting reality in a useful way. As George E. P. Box famously stated, “All models are wrong, but some are useful.”
The webinar focused on machine learning, which is precisely this process of model development. Note that, in most implementations, machine learning may not be able to correct or advance its model on its own, but rather may have to rely on regular review of the model’s output in the real world.
There is a variant of machine learning called deep learning, which runs on an artificial neural network, reviews extremely large amounts of data, and comes to its own conclusions. Some of these conclusions may be obvious (e.g., feathers usually come from birds), but, with a large enough data set and enough time, it can build up an extremely complex approximation of reality that may yield some truly unexpected and useful insights. The machine learning link above references a discussion of the differences between machine learning and deep learning that you may find useful; the deep learning link goes into more detail about how they differ.
Sorry for the rambling. If you learn anything from this it should be:
- AI isn’t magic; it takes a lot of work to set it up and keep it running.
- AI requires well-maintained and well-managed data; there are things you could do in your practice that would make your data more useful.
- It is unlikely that an AI system will always make the decisions you expect it to, especially as it is learning your practice.
- Before you discard an AI program because you think it is wrong, it may be worth exploring whether your perception of reality is correct. Sometimes the AI will surprise you.
What do you think?
As always, the views expressed in this blog are my own, and not necessarily those of ASHP or of my employer, BD.
Dennis A. Tribble, PharmD, FASHP
Ormond Beach, FL 32174