It’s common in a number of fields to try to develop a screener to identify people who might have depression, have suicidal tendencies, etc. For example, you might have a 50-question survey on depression, but as a psychologist or therapist, you wouldn’t want to have to ask each person coming into your office all 50 questions, right? So, the solution is to develop a screener – a small subset of questions that will identify as many depressed people as possible.

There’s several simpler ways to do this, but I received a request to find the *best possible combination* of items – the combination, including up to 5 variables, that would identify the greatest number of candidates. The best possible way? To try every combination.

So, how does one do this? The full surveys consisted of 15 items, which means there are 32,767 combinations all together (this is equal to 2^{15}-1). We want to find the smallest combination of items, of up to 5 items, that captures at least 95% of people with that condition (this is called sensitivity). A colleague and I had decided that this could best be done by creating an Excel template. First, I had to come up with some way of representing all 32k+ possible combinations of variables. I realized that by counting in binary one by one, from 1 up to 15 1’s, you’d be able to represent every possible combination. To illustrate:

1 = 000000000000001 = only item 1

2 = 000000000000010 = only item 1

3 = 000000000000011 = items 1 and 2

4 = 000000000000100 = only item 3

5 = 000000000000101 = items 1 and 3

…

32,767 = 111111111111111 = all 15 items

It was simpler to do this for all combinations and then create a filter in Excel to select only 5 or less variables. Excel has a limitation in that it can’t convert decimal to binary for over 9 binary digits, or something like that, so I wrote a really short QBasic program to convert decimal to binary. I had it do that for every number from 1 to 32,767 and to also export it as text file. I then copied this in Excel. I then could create a series of equations in Excel such that if a user pasted in the dependent variable (say, depression), and then pasted in the 15 items, Excel, in a new column, would create a list of all variables in that combination.

For example:

## 000000000000101 item3 item1

Now, with this, I was able to create another column, using a VERY LONG formula, that would generate the appropriate SPSS syntax to look at the crosstabs of depression by this combination, which was dichotomized. Within this crosstabulation is all the information you need to calculate sensitivity. I also created filters so you could choose the maximum number of variables to include in the combination. At this point, all that was needed was to copy and paste this syntax into SPSS and run it. Now, this was interesting: I never ran anything in SPSS that took more than a few minutes, and these crosstabs took about 6 hours (times 8, as we were creating 8 different screeners). At this point, I promptly took over the lab.

Next, I needed to create a Word macro to edit the SPSS output into a format that we could copy back into another Excel template. This was difficult due to the 20,000+ pages of output in Word. On our outdated computers, I ended up having to run the macro manually, but found that by putting MS Word in “draft” mode and by turning off autosave, I was able to run the macro in 20-30 minutes per screener.

Finally, to create the second Excel template. This was the easiest part, and only took maybe 30 minutes. After pasting in the edited output, I created four columns which would calculate the specificity, sensitivity, positive predictive value, and negative predictive value. Also, I created filters so you could sort by the total number of variables in the combination as well as the sensitivity. Finally, I was done! It looked like an impossible task at the beginning, but I was able to get it done, and in the end, really enjoyed myself.