In this post, I apply three statistical and machine learning tools to my panic recovery journal data: linear regression/correlation, the Fast Fourier Transform, and maximum entropy modelling.
First, A Word About Tools
I suppose it is tempting, if the only tool you have is a hammer, to treat
everything as if it were a nail.
Now, A Necessary Disclaimer
My experiment has fewer than 50 samples, which is nowhere near enough to draw statistically significant conclusions. That’s not the point. The primary purpose of this post is to demonstrate analysis techniques by example. These same methods can be wielded on larger datasets, where they are much more useful.
Getting Ready
To follow along with the examples here, you’ll need the excellent Python toolkits scipy, matplotlib, and nltk.


Linear Regression
What?
Linear regression answers this question:
What is the line that most closely fits this data?
Given points $ P_i = (x_i, y_i) $, the goal is to find the line $ y = mx + b $ such that some error function is minimized. A common one is the least squares function:
$$ E(m, b) = \sum_i \left( y_i - (m x_i + b) \right)^2 $$
The Pearson correlation coefficient $ R $ and p-value $ p $ are also useful here, as they measure the strength of the linear correlation and its statistical significance, respectively.
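As a minimal sketch of such a fit, scipy.stats.linregress returns the least squares slope and intercept along with $ R $ and $ p $. The exercise numbers below are made up for illustration and are not taken from the journal data.

```python
from scipy import stats

# Hypothetical data: minutes of exercise on eight consecutive days.
days = [0, 1, 2, 3, 4, 5, 6, 7]
exercise_minutes = [20, 25, 15, 30, 35, 30, 40, 45]

# Least squares fit of exercise_minutes = m * days + b.
fit = stats.linregress(days, exercise_minutes)

print("slope m = %.3f, intercept b = %.3f" % (fit.slope, fit.intercept))
print("R = %.4f, p = %.4f" % (fit.rvalue, fit.pvalue))
```

A slope near zero together with a small $ R $ and a large $ p $ would suggest there is no linear trend worth reading into.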
Why?
In a self-tracking context, you might ask the following questions:
 Have I been exercising more over time?
 Does exercise affect mood? By how much and in what direction?
Linear regression can help address both questions. However, it can only find linear relationships between datasets. Many dynamic processes are locally linear but not globally linear. For instance, there are practical limits to how much you can exercise in a day, so no linear model with nonzero slope will accurately capture your exercise duration for all time.
The Data
You can see the code for this analysis here. I look at only the first 31 days, that being the longest consecutive run for which I have data.
My alcohol consumption did not decrease over time, but rather stayed fairly constant: with $ R = 0.0098 $, there is essentially no linear correlation between alcohol and time.
Sugar consumption is a similar story: although the best-fit slope is slightly negative, $ R = 0.0671 $ indicates essentially no correlation with time. It seems that neither my alcohol nor my sugar consumption changed significantly over the tracking period.
I decided to graph alcohol and sugar together. It looks like they might be related, as the peaks in each seem to coincide on several occasions. Let’s test this hypothesis:
The positive slope is more pronounced this time, but $ R = 0.1624 $ still indicates only a weak correlation. We can also look at the p-value: with $ p = 0.3827 $, it is fairly easy to write this off as a random effect.
Finally, let’s take another look at a question from a previous blog post:
On days where I drink heavily, do I drink less the day after?
There’s a negative slope there, but the correlation and p-value statistics are in the same uncertain zone as before. I likely need more data to investigate these last two effects properly.
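One way to set up this day-after question is to regress each day's count against the following day's count. The sketch below assumes that framing and uses invented drink counts, so it shows the mechanics rather than reproducing the analysis.

```python
from scipy import stats

# Hypothetical daily drink counts, in date order.
drinks = [4, 1, 5, 0, 3, 1, 4, 2]

# Pair each day with the day after it.
today = drinks[:-1]     # days 0 .. N-2
tomorrow = drinks[1:]   # days 1 .. N-1

fit = stats.linregress(today, tomorrow)
print("slope = %.3f, R = %.4f, p = %.4f" % (fit.slope, fit.rvalue, fit.pvalue))
```

A negative slope here would mean that heavier days tend to be followed by lighter ones, but the p-value still decides whether that is worth believing.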
Fast Fourier Transform
What?
Fourier analysis answers this question:
What frequencies comprise this signal?
Given a sequence $ x_n $ of $ N $ samples, a Discrete Fourier Transform (DFT) computes
$$ X_k = \sum_{n=0}^{N-1} x_n e^{-2 \pi i k n / N}, \quad k = 0, \ldots, N - 1 $$
The $ X_k $ encode the amplitude and phase of frequencies $ \frac{f k}{N} $ Hz, where $ T $ is the time between samples and $ f = 1 / T $ is the sampling frequency.
As described here, the DFT requires $ \mathcal{O}(N^2) $ time to compute. The Fast Fourier Transform (FFT) uses divide-and-conquer on this sum of complex exponentials to compute the DFT in $ \mathcal{O}(N \log N) $ time. Further speedups are possible for real-world signals that are sparse in the frequency domain.
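To make the difference concrete, here is a small sketch that evaluates the DFT sum directly in $ \mathcal{O}(N^2) $ time and checks it against numpy's FFT. It assumes the common sign convention $ e^{-2 \pi i k n / N} $, which is the one numpy.fft.fft uses; the function name naive_dft and the sample values are made up for illustration.

```python
import numpy as np

def naive_dft(x):
    """Evaluate the DFT sum directly: O(N^2) multiply-adds."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    indices = np.arange(n)
    # N x N matrix whose (k, n) entry is exp(-2*pi*i*k*n/N).
    twiddles = np.exp(-2j * np.pi * np.outer(indices, indices) / n)
    return twiddles @ x

# Hypothetical signal: eight daily samples.
samples = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]

slow = naive_dft(samples)
fast = np.fft.fft(samples)   # same result in O(N log N)
print(np.allclose(slow, fast))
```

Both versions return the same coefficients; only the amount of work differs.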
Why?
In a self-tracking context, you might ask the following questions:
 Do I have regular exercising patterns?
 Do these patterns cycle weekly? biweekly? monthly?
 How much does my amount of exercise fluctuate during a cycle?
With the FFT, Fourier analysis can help address these questions. However, it can only find periodic effects. Unlike linear regression, it does not help find trends in your data.
The Data
You can see the code for this analysis here. Again, I look at the first 31 days to ensure that the frequency analysis is meaningful.
There are some apparent maxima there, but it’s hard to tell what they mean. Part of the difficulty is that these are frequencies rather than period lengths, so let’s deal with that:
If you’re not familiar with the Fourier transform, the last line might be a bit mysterious. That corresponds to $ X_0 $, which is just the sum of the original samples:
$$ X_0 = \sum_{n=0}^{N-1} x_n $$
Other than that, the most pronounced cycles have period lengths of 2.21 days and 7.75 days. The former might be explained by a seesaw drinking pattern, whereas the latter is likely related to the day-of-week effects we saw in the previous post.
Which day of the week? The phase is 3.67 days, and our sample starts on a Monday, placing the first peak on Thursday. The period is slightly longer than a week, though, and the data runs for 31 days, so these peaks gradually shift to cover the weekend.
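For readers who want to try this kind of period and phase reading themselves, here is a minimal sketch using numpy. It assumes one sample per day, and it reports the phase as the offset in days of a cosine component's first peak; the normalization and phase convention behind the numbers quoted above may differ. The function name dominant_cycles and the test signal are invented for illustration.

```python
import numpy as np

def dominant_cycles(samples, top=3):
    """Return (period_days, magnitude, peak_offset_days) for the strongest cycles."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0)   # cycles per day, one sample per day
    # Skip bin 0: it is the sum of the samples and has no finite period.
    order = np.argsort(np.abs(spectrum[1:]))[::-1][:top] + 1
    cycles = []
    for k in order:
        period = 1.0 / freqs[k]
        # cos(2*pi*(t - d)/period) shows up with phase angle -2*pi*d/period.
        offset = (-np.angle(spectrum[k]) / (2 * np.pi) * period) % period
        cycles.append((period, abs(spectrum[k]), offset))
    return cycles

# Hypothetical signal: a weekly cycle peaking on day 3, sampled for four weeks.
days = np.arange(28)
signal = 2.0 + np.cos(2 * np.pi * (days - 3) / 7.0)

for period, magnitude, offset in dominant_cycles(signal, top=1):
    print("period %.2f days, magnitude %.2f, first peak at day %.2f"
          % (period, magnitude, offset))
```

With only 31 samples the available periods are 31/1, 31/2, 31/3 and so on, which is why the reported period lengths are not round numbers.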
There are two caveats:
 I have no idea whether a Fourier coefficient of about 3 is significant here. If it isn’t, I’m grasping at straws.
 Again, the small amount of data means the frequency domain data is sparse. To accurately test for bidaily or weekly effects, I need more fine-grained period lengths.
Maximum Entropy Modelling
What?
Maximum entropy modelling answers this question:
Given observations of a random process, what is the most likely model
for that random process?
Given a discrete probability distribution $ p(X = x_k) = p_k $, the entropy of this distribution is given by
$$ H(p) = -\sum_k p_k \log_2 p_k $$
(Yes, I’m conflating the concepts of random variables and probability distributions. If you knew that, you probably don’t need this explanation.)
This can be thought of as the average number of bits needed to encode outcomes drawn from this distribution. For instance, if I have a double-headed coin, I need no bits: I already know the outcome. Given a fair coin, though, I need one bit: heads or tails?
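The coin examples are easy to check numerically. This is a small sketch of the entropy calculation in bits (base-2 logarithm), with outcomes of probability zero contributing nothing; the function name entropy_bits is made up for illustration.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with p = 0 contribute nothing, so skip them to avoid log2(0).
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

double_headed = entropy_bits([1.0, 0.0])   # heads is certain
fair_coin = entropy_bits([0.5, 0.5])       # heads or tails, equally likely
biased_coin = entropy_bits([0.9, 0.1])     # somewhere in between

print(double_headed, fair_coin, biased_coin)
```

The biased coin lands between the two extremes: the more predictable the outcome, the fewer bits it takes on average.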
After repeated sampling, we get observed expected values for $ p_k $; let these be $ p'_k $. Since we would like the model to accurately reflect what we already know, we impose the constraints $ p_k = p'_k $. The maximum entropy model is the model that, subject to those constraints, also maximizes $ H(p) $.
This model encodes what is known while remaining maximally noncommittal on what is unknown.
Adam Berger (CMU) provides a more concrete example. If you’re interested in learning more, his tutorial is highly recommended reading.
Why?
In a self-tracking context, you might ask the following questions:
 Which treatments have the greatest effect in preventing panic attacks? Which have the least effect?
 Today I exercised for at least 30 minutes and had four drinks. Am I likely to get a panic attack?
 What treatments should I try next?
Maximum entropy modelling can help address these questions. It is often used to classify unseen examples, and would be fantastic in a data commons scenario with enough data to provide recommendations to users.
Feature Extraction
Since I’m now effectively building a classifier, there’s an additional step. I need features for my classifier, which I extract from my existing datasets:
Note that the features listed here are binary. I use my daily goals as thresholds on caffeine, sweets, and alcohol.
(If you know how to get float-valued features working with NLTK, let me know! Otherwise, there’s always megam or YASMET.)
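A minimal sketch of this kind of binary feature extraction might look like the following. The field names, the goal values, and the helper extract_features are all hypothetical stand-ins, not the actual journal schema or thresholds.

```python
# Hypothetical daily goals, used as thresholds (not the real values).
DAILY_GOALS = {"caffeine": 2, "sweets": 1, "alcohol": 2}

def extract_features(day):
    """Turn one day's journal record into a dict of binary features."""
    features = {
        "exercise": bool(day["exercise"]),
        "relaxation_breathing": bool(day["relaxation_breathing"]),
        "supplements": bool(day["supplements"]),
    }
    # A consumption feature is True when the day's count exceeds the goal.
    for name, goal in DAILY_GOALS.items():
        features[name + "_over_goal"] = day[name] > goal
    return features

# Hypothetical journal entry for a single day.
sample_day = {
    "exercise": True,
    "relaxation_breathing": False,
    "supplements": True,
    "caffeine": 3,
    "sweets": 1,
    "alcohol": 4,
}
print(extract_features(sample_day))
```

NLTK classifiers accept feature dictionaries of this shape, paired with a label for each day.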
The Data
You can see the code for this analysis here. This time I don’t care about having consecutive dates, so I use all of the samples.
After building a MaxentClassifier, I print out the most informative features with show_most_informative_features():
Exercise, relaxation breathing, and vitamin supplements help with panic. Caffeine, alcohol, and sweets do not. I knew that already, but this suggests which treatments or dietary factors have the greatest impact.
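For reference, the general shape of that workflow in NLTK is sketched below. The training set is a made-up toy example with invented feature names and labels, and the choice of the "iis" algorithm with ten iterations is an assumption for illustration, so the printed weights say nothing about the real data.

```python
from nltk.classify import MaxentClassifier

# Made-up toy data: (binary features, label) pairs, NOT the journal data.
train_set = [
    ({"exercise": True, "alcohol_over_goal": False}, "calm"),
    ({"exercise": True, "alcohol_over_goal": False}, "calm"),
    ({"exercise": True, "alcohol_over_goal": True}, "calm"),
    ({"exercise": False, "alcohol_over_goal": True}, "panic"),
    ({"exercise": False, "alcohol_over_goal": True}, "panic"),
    ({"exercise": False, "alcohol_over_goal": False}, "panic"),
]

# trace=0 silences the per-iteration log; max_iter caps the training loop.
classifier = MaxentClassifier.train(
    train_set, algorithm="iis", trace=0, max_iter=10
)

# Print the features with the largest weights.
classifier.show_most_informative_features(5)

# Classify an unseen day.
today = {"exercise": True, "alcohol_over_goal": False}
prediction = classifier.classify(today)
print(prediction)
```

The same classifier can also report probabilities through prob_classify(), which is closer to answering "am I likely to get a panic attack?" than a bare label.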
Let’s consider the supplements finding more closely. Of the 45 days, I took supplements on all but two. It’s dangerous to draw any conclusions from a feature for which there are very few negative samples. This illustrates some important points about data analysis:
 Know your data: otherwise, you may ascribe undue meaning to outliers or noise.
 Know your features: supplements are probably not a good feature here. A feature inclusion threshold on the number of positive and negative samples might be helpful.
 Beware magic: even when you understand their inner workings, machine learning algorithms can produce results that are difficult to interpret.
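The feature inclusion threshold mentioned above could be sketched as follows. The function name well_supported_features and the cutoff of five samples are arbitrary choices for illustration; the toy data mirrors the supplements situation, where 43 of 45 days are positive.

```python
def well_supported_features(featuresets, min_count=5):
    """Keep binary features seen at least min_count times as True and as False."""
    names = set()
    for features in featuresets:
        names.update(features)
    kept = []
    for name in sorted(names):
        positives = sum(1 for features in featuresets if features.get(name))
        negatives = len(featuresets) - positives
        if positives >= min_count and negatives >= min_count:
            kept.append(name)
    return kept

# Hypothetical 45 days: supplements skipped on only two days,
# exercise on roughly every other day.
toy_days = [
    {"supplements": i >= 2, "exercise": i % 2 == 0}
    for i in range(45)
]
print(well_supported_features(toy_days))
```

Features that fail the threshold can be dropped before training, so the classifier never gets the chance to overfit to a couple of unusual days.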
Up Next
In my next post, I look at a panic recovery dataset gathered using qscounters, a simple utility I built to reduce friction in self-tracking. I perform these same three analyses on the qscounters dataset, then compare it to the recoveryjournal dataset.