Deep Learning: Hospital Readmissions

Figure 1: Visualization of diagnosis code profiles for 430,000 hospital discharges

“Hospital readmission” has a simple enough definition: a patient is readmitted to a hospital within a given window of time after leaving a hospital. The causes and consequences of hospital readmissions are much less simple to understand. Recent changes in healthcare policy reflect the fact that many readmissions are avoidable and include legislation to provide economic disincentives to hospitals that produce large numbers of readmissions. Patients who are readmitted to hospitals are often at risk for severe illness or even death. Readmissions also increase the national expense of Medicare. Therefore, it is beneficial for the patients, the hospitals, and the country to minimize the number of hospital readmissions as much as possible.

If you knew that a particular person leaving the hospital was at high risk for readmission, is there anything you could do to stop it? Hospitals think so, which is why there are significant efforts to identify the patients with the highest chances of readmission before they leave. If hospitals know who is at risk, then they can make follow-up interventions to reduce the danger.


Non-Linear Effects

There are several factors involved in hospital readmissions, including the patients’ diagnoses, the quality of the hospital and its infection rate, and even unavoidable or random accidents. Previous research has found that different combinations of simultaneous diagnoses (comorbidities) have a significant impact on readmissions. The risk of having two diagnoses at the same time can be dramatically higher than the sum of the risks associated with each condition on its own. These comorbidities are therefore a “nonlinear” effect, something that will end up being very important in our model selection later. A non-linear effect is one that doesn’t “add up”. For example, suppose P(d1) is the probability of readmission due to diagnosis 1 alone, P(d2) is the probability of readmission due to diagnosis 2 alone, and P(d1+d2) is the probability of readmission for someone with both diagnoses 1 and 2. If

$$P(d_1 + d_2) = P(d_1) + P(d_2)$$

then the effects are linear in nature, and the diagnoses do not interfere with each other. If, on the other hand,

$$P(d_1 + d_2) \neq P(d_1) + P(d_2)$$

then the effects are non-linear, and there is “something else” at work that kicks in when these diagnoses are mixed. The research literature on the subject suggests that hospital readmissions are a strongly non-linear function of diagnoses (Donzé et al. 2013).
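As a small illustration of the “add up” test, the sketch below measures how far a joint risk departs from the additive prediction. The function name and all of the probabilities are invented for illustration; they are not taken from the dataset or from the cited study.

```python
def interaction_excess(p_d1, p_d2, p_both):
    """Return the joint risk minus the additive (linear) prediction."""
    return p_both - (p_d1 + p_d2)


# Hypothetical readmission probabilities, made up for this example.
additive = interaction_excess(0.10, 0.05, 0.15)     # risks simply add up
synergistic = interaction_excess(0.10, 0.05, 0.30)  # joint risk is much higher

print(f"additive case excess:    {additive:+.3f}")
print(f"synergistic case excess: {synergistic:+.3f}")
```

An excess near zero corresponds to the linear case; a clearly non-zero excess is the kind of comorbidity effect a linear model cannot capture.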


Health In High-Dimensional Spaces

Explicitly modeling all the different combinations of diagnoses is a computationally intractable problem. There are about 13k different diagnosis codes in the International Statistical Classification of Diseases and Related Health Problems, version 9 (ICD-9). Assuming that a typical patient has about 5 diagnoses, the number of unique combinations can be represented using the binomial coefficient,

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

which gives the number of ways to select “k” things from “n” things, without replacement and regardless of order. “N!” is called “N factorial” and is defined as

$$N! = N \times (N-1) \times (N-2) \times \cdots \times 2 \times 1$$

So 1! = 1, 2!=2*1=2, 3!=3*2*1=6, and so on. In our case, n=13,000, and k=5. All three of my efforts to use the top Google hits for an “n choose k” calculator to compute this number ended in failure, probably because they naively began by computing the numerator, 13000!, a number with roughly 48,000 digits. That’s a huge number, much larger than a googol (one followed by 100 zeros). By comparison, the total number of protons, neutrons, and electrons in the entire observable Universe is estimated to be a number with a puny ~80 digits. Armed with a little algebra, we can skip the calculators: the 12995! in the denominator cancels all but the five largest factors of 13000!, so for n=13,000 and k=5 we get

$$\binom{13000}{5} = \frac{13000!}{5!\,12995!} = \frac{13000 \times 12999 \times 12998 \times 12997 \times 12996}{5 \times 4 \times 3 \times 2 \times 1} \approx 3.1 \times 10^{18}$$

About three billion billion, still a huge number.
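The same arithmetic can be checked in a few lines of Python. This is only a sketch using the standard library: `math.comb` computes the binomial coefficient with exact integers, and `math.lgamma` gives the logarithm of a factorial, so the size of 13000! can be estimated without printing it.

```python
import math

n, k = 13_000, 5

# Exact number of unordered 5-code combinations.
combos_5 = math.comb(n, k)

# The same value from the cancelled form n(n-1)...(n-k+1) / k!
numerator = math.prod(range(n - k + 1, n + 1))
by_hand = numerator // math.factorial(k)

# Number of decimal digits in 13000!, using lgamma(n + 1) = ln(n!).
digits_in_factorial = math.floor(math.lgamma(n + 1) / math.log(10)) + 1

print(f"13000 choose 5 is about {combos_5:.3e}")
print(f"13000! has about {digits_in_factorial} digits")
```

Cancelling the common factors first keeps every intermediate number small, which is presumably where the naive calculators went wrong.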

So there are on the order of a billion billion combinations of ICD-9 codes, just for patterns of 5. How many for patterns of 10? About 4E34, roughly ten million billion (1E16) times more than for patterns of 5. Since only 1 to 10 of the 13,000 codes are used at one time for one patient discharge, each record in our dataset is extremely sparse. If we represent every patient discharge as a vector with 13,000 dimensions, then a typical discharge with 5 codes would have “1”s in 5 entries and zeros in the other 12,995. Representing that vector in CSV format would be terribly inefficient, since 99.962% of the vector is zero.
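To make the sparsity concrete, here is a minimal sketch that builds one discharge both ways: as a dense 13,000-long vector and as a short list of active positions. The code positions are hypothetical placeholders, not real ICD-9 codes.

```python
N_CODES = 13_000  # approximate size of the ICD-9 code vocabulary

# Hypothetical discharge: positions of the five codes that are present.
active_codes = [41, 250, 4280, 5849, 12001]

# Dense representation: one slot per possible code.
dense = [0] * N_CODES
for idx in active_codes:
    dense[idx] = 1

# Sparse representation: keep only the positions that hold a 1.
sparse = sorted(active_codes)

zero_fraction = dense.count(0) / N_CODES
print(f"dense length: {len(dense)}, sparse length: {len(sparse)}")
print(f"fraction of zeros in the dense vector: {zero_fraction:.5f}")
```

The sparse form stores 5 numbers instead of 13,000, which is the property that sparse-aware tools exploit.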


How do you look at a 13,000 dimensional space?

Singular Value Decomposition (SVD) is a numerical technique commonly used in machine learning to reduce the dimensionality of a dataset to the smallest number of dimensions that still explains a specified share of the dataset’s variance. For example, imagine you have a dataset with 50 columns, where each column measures the distance between different parts of the body: column one is the floor-to-waist distance, column two is the shoulder-to-fingertip distance, column three is the wrist-to-fingertip distance, and so on, for a collection of people of varying ages. By running SVD on this dataset, you would likely find that all the measurements are highly correlated and can be explained with a single new variable “M”. Although we might guess that a single factor such as age or body mass predicts all these measurements, the algorithm itself doesn’t tell us what the factor is, and it may even be something that is not easy to describe or understand.

When we ran SVD on our 430k diagnosis profiles, we were able to explain 50% of the variance in the data with just 70 “factors”. The other half requires the remaining 12,930 factors to explain. This means that a large chunk of all the ICD-9 data can be summarized by just 70 high-level categories. For example, a high-level category might be “hips and knees injury”, which is simple to understand but may cover tens or hundreds of codes and thousands of different ICD-9 patterns in individual profiles. Having this insight about the data helps us craft an effective machine learning model (see the next section).
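The “how many factors explain half the data” question can be sketched with NumPy on synthetic data. Everything below is an assumption made for illustration: the matrix is random, much smaller than the real one, and deliberately built from 5 hidden factors. It is not the analysis that was run on the 430k profiles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the diagnosis matrix: 500 "discharges" x 200 "codes",
# generated from 5 hidden factors plus a little noise.
n_rows, n_cols, n_factors = 500, 200, 5
hidden = rng.normal(size=(n_rows, n_factors))
loadings = rng.normal(size=(n_factors, n_cols))
X = hidden @ loadings + 0.1 * rng.normal(size=(n_rows, n_cols))

# Center the columns, then compute only the singular values.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Squared singular values are proportional to the variance each factor explains.
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of factors whose cumulative share reaches 50%.
n_for_half = int(np.searchsorted(cumulative, 0.5) + 1)
print(f"factors needed for 50% of the variance: {n_for_half}")
```

On the real data the same cumulative curve is what tells you that about 70 factors reach the 50% mark.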

In order to visualize the 70 different “conditions” within the larger space of 13,000 codes, we ran the K-means clustering algorithm on our SVD-transformed data and let it label the conditions for us. Clustering is a form of unsupervised machine learning, where an algorithm segments a dataset into self-similar partitions based on some metric of comparison. In Figure 1, each point represents a complete ICD-9 diagnosis profile, and the colors represent which condition or “cluster” it belongs to based on the K-means decisions. You can see that the full space of ICD-9 code combinations is actually fragmented into larger “clusters” of similarity.
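For readers who have not seen K-means before, here is a minimal sketch of the basic algorithm (Lloyd's iterations) in NumPy. It runs on two synthetic blobs rather than the SVD-transformed profiles, and the function and its parameters are illustrative rather than the code used for Figure 1.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points; keep it if the cluster is empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic blobs standing in for SVD-reduced profiles.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(100, 2))
labels, centroids = kmeans(np.vstack([blob_a, blob_b]), k=2)
print("cluster sizes:", np.bincount(labels, minlength=2))
```

The metric of comparison here is Euclidean distance; with a different metric the partitions would generally differ.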


Deep Learning with DSSTNE

Figure 2: Visualization of diagnosis code profiles for 430,000 hospital discharges, color-coded by readmission outcome.

Understanding that our data has high dimensionality (13,000 codes), high sparsity (typically only 1-10 codes per person), high non-linearity (combinations of codes will be more important than single codes), and can be approximately explained by ~70 hidden “factors” helps us choose an appropriate machine learning model to represent the data. We selected the feed-forward artificial neural network, a model in the category of “deep learning”. In particular, we used Amazon’s Deep Scalable Sparse Tensor Network Engine (DSSTNE, pronounced “destiny”) since it is optimized for sparse features, like our diagnosis code profiles.

Deep learning is also used by Amazon to predict which products you might like to purchase and by Google to automatically label your photographs. One way to think about deep learning applied to diagnosis codes is that the algorithm looks at an “image” of 13,000 “pixels” (roughly a 114×114-pixel thumbnail) in which all but ten pixels are 0, and then classifies the image as “readmission” or “not readmission”, rather than, e.g., “cat” or “not a cat”, as Google’s image classifier might try to do.

DSSTNE is designed to use onboard GPUs for parallel computation during training and scoring of models. Our training dataset contains 430k discharges from CMS-affiliated hospitals into Skilled Nursing Facilities. By using a pre-configured Amazon Machine Image (AMI) on a GPU-enabled instance (g2.2xlarge), we were able to get up and running with DSSTNE in minimal time. The g2.2xlarge instance has 1 GPU, 8 CPUs, and 15GB of RAM. Figure 2 shows the same dataset as Figure 1, but now it’s color-coded according to “readmission” or “non-readmission”. You can see that the two groups are now spread much more uniformly across the spatial patterns. The fact that a single plane could never separate all the red points from all the blue points (that is, the data are not linearly separable) tells us again that a non-linear model will be important for distinguishing the two cases.

One drawback of using artificial neural networks for modeling is that there are many important model parameters that must be chosen, like the number of hidden layers in the network, and the number of “neurons” in each hidden layer. Luckily, our SVD analysis gives us a little insight here. We know that in our data there are close to 70 abstract “conditions” that can explain a lot of the original 13,000 codes. Therefore, we set the number of hidden layers to 1 and use 64 “neurons” as a starting point.
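To show what a 13,000-input, 64-neuron, single-hidden-layer network looks like in terms of shapes, here is a forward-pass sketch in NumPy. It is not DSSTNE code and does not use DSSTNE's configuration format; the weights are random and untrained, the sigmoid activation is an assumption, and the code indices are hypothetical.

```python
import numpy as np

N_CODES, N_HIDDEN = 13_000, 64
rng = np.random.default_rng(0)

# Randomly initialised (untrained) weights, for shape illustration only.
W1 = rng.normal(scale=0.01, size=(N_CODES, N_HIDDEN))
b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(scale=0.01, size=N_HIDDEN)
b2 = 0.0


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_readmission(active_codes):
    """Forward pass for one discharge, given the indices of its active codes."""
    # With a 0/1 input vector, x @ W1 is just the sum of the active rows of W1.
    hidden = sigmoid(W1[active_codes].sum(axis=0) + b1)
    return float(sigmoid(hidden @ W2 + b2))


# Hypothetical code indices for one discharge.
active = [41, 250, 4280, 5849, 12001]
p = predict_readmission(active)

# Same result from the dense 13,000-long vector, to confirm the shortcut.
x = np.zeros(N_CODES)
x[active] = 1.0
p_dense = float(sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2))
print(f"sparse path: {p:.6f}, dense path: {p_dense:.6f}")
```

The sparse path touches 5 rows of the first weight matrix instead of 13,000, which hints at why an engine optimized for sparse inputs is a good fit for this data.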



After scanning through different network architectures, we found little performance gain beyond 1 hidden layer with about 64 neurons, a reassuring confirmation of our understanding of the data based on the SVD analysis. To benchmark the model, we compared it against a simpler linear binomial classifier. The linear model consistently stayed between 0.5 and 0.6 area under the receiver operating characteristic curve (AUC), while DSSTNE reached 0.6 to 0.7 AUC. The AUC is a performance metric for binary classification models that ranges from 0.5 (“the model is randomly guessing”) to 1.0 (“perfect prediction”). Out of the box, deep learning therefore gave a sizable improvement over the linear baseline. There are dozens of ways to push this further and improve the modeling performance using more data or other algorithmic steps, but we’ll need to end here.
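For readers unfamiliar with the AUC, one way to read it is as the probability that a randomly chosen readmitted patient receives a higher score than a randomly chosen non-readmitted one. The sketch below computes it directly from that definition on toy data; the labels and scores are invented, and the pairwise loop is meant for clarity rather than for 430k records.

```python
def auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = readmitted) and model scores, invented for illustration.
y_true = [0, 0, 1, 1]
perfect = auc(y_true, [0.1, 0.2, 0.8, 0.9])
uninformative = auc(y_true, [0.5, 0.5, 0.5, 0.5])
mixed = auc(y_true, [0.1, 0.4, 0.35, 0.8])
print(perfect, uninformative, mixed)
```

A model that ranks every readmission above every non-readmission scores 1.0, and one that gives every patient the same score lands at 0.5.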

This post was designed to give a sense of the day-to-day research and work involved in a typical data science/machine learning project. I hope this post was entertaining, or even useful for some readers. Even with all these words, we’ve just scratched the surface when it comes to the topic of predicting hospital readmissions. Thank you for reading!



Donzé J, Lipsitz S, Bates DW, Schnipper JL. BMJ. 2013;347:f7171.