## Arduino Flickering Candle

Update December 24, 2013: Mokus refined his code so that the distribution is now well-behaved (nearly normal) and the PSD no longer turns up at high frequencies). The plots and post have been updated to reflect this change. He will push code to the same link as available.

In my previous post on Candle Flame Flicker I describe the statistics of the optical intensity flicker from a candle in terms of the probability density of the brightness samples and in terms of the power spectral density of the brightness. In this article I discuss how to make an Arduino drive an LED to flicker in a way that is statistically similar to the measurements I took on a real candle.

In the measurements I observed the spectral roll-off of the candle to start between about 5 and 8 Hz, and to decline at a nominal rate of around 44 dB for each decade increase in frequency. The 2nd-order infinite impulse response filter is computationally efficient in terms of using a small amount of memory and requiring few calculations. However, the Arduino is not good at floating point arithmetic. On the Arduino, floating point is done in software, has relatively few bits of precision, and is about 4 to 40 times slower than fixed point (integer) math. It is quite difficult to find useful benchmarks. The basic process is to create white noise and then filter it to the correct spectral shape. Afterward, the signal is scaled to have the appropriate variance and offset with a constant to represent the average brightness of the “flame”.

The approach I used was to design the IIR in Python with scipy’s signal module. I specified a 2nd order lowpass Butterworth filter, specifying a cutoff frequency of 8 Hz, and a sample frequency of 60 Hz. I normalized the coefficients to work in a 16 bit integer space, following Randy Yates’ 2010 Practical Considerations in FIR Fixed Filter Implementations, mainly. From a synthesis perspective, there is some prior art. Philip Ching, a student at Cornell synthesized candle noise quite cleverly, though he neither reported nor replicated the correct statistics. A fellow with the handle Mokus did a very, very tiny implementation for a microcontroller with only 64 bytes of RAM. He helped me modify his code so I could compare his statistics, and after adjustment his spectrum and distribution match fairly well. The high-frequency of his PSD looks a little different from the other methods, but these may not be noticeable to the observer. Finally, there was Eric Evenchick’s coincidental post on hackaday. Mokus reported that Evanchick’s implementation has too slow an update rate; I noticed that Evanchick did not report on the statistics he was targeting nor what he achieved. I did not recreate his work to test.

Then, on to the tests. I really was interested in building and comparing statistics from a 16 bit implementation, a 32 bit implementation in both a Direct Form I and a Direct Form II implementation. Indeed, I had great difficulty getting the filters to perform because I kept misdesigning the integer coefficients and overflowing the registers. As I sought a solution to the 2nd-order filter approach, I also created a 4-stage digital equivalent of an RC filter. The 4-stage RC approach was always stable though it needed a higher sample rate and used much more processor power to create statistically consistent results. On the other hand, it has a more accurate spectrum. A comparison of three different 16-bit methods to Mokus’ and to the actual measurements is shown in the figure below. The legend shows the mean, standard deviation, and their ratio to the right of the label. The All my filters did a credible job of reconstructing the histogram.

The power spectral density (PSD) of the different methods tells a different story. The Direct Form II 16 bit filter is the most visually appealing of the methods I tried. It rolls off more like the physical data than the other methods, except compared to the 4-stage RC filter. The Direct Form II filter is more computationally efficient.

The results for the 32-bit versions show greater variance than the 16-bit versions, but the quality is not markedly better.

I wrote a proof code for the Arduino UNO both to see it flicker and to test the processor speed—separate parts of the code. The results are that compiling with 1.0.3 resulted in a 4,722 byte program that calculated 10,000 new values in 6,292 ms, or 629 microseconds per value. In theory this could produce a sample rate of nearly 1.6 KHz. Or another way of thinking about this is that the final code uses about 629 us/17 ms or about 4% of the processor capability of the Arduino UNO. That leaves a lot of resources available to do other Arduino work or maybe means it can fit in a cheaper processor.

I have released two pieces of code under the GNU Public License 3, you can get the Python I used for filter design and the Arduino test code at the links. If you want the data, please contact me through the comments and I am willing to provide it.

## Ballistics and Repeatability – A Fuller Solution

The last post relied on the Monte Carlo method to determine the probability of correctly assessing relative accuracy. The linked document is a more a complete answer without any of the Monte Carlo drawbacks.

## Ballistics and Repeatability

I have a rifle that shoots 3-inch 5-shot groups at 100 yards.  This is not adequate for hunting where commonly shots are 200 to 300 yards distant.  As I work toward resolving this, there are some interesting statistical problems of practical importance.

In each experiment, how many shots should I fire to measure the spread, 3, 5, 10?

Having fired these shots, how should I measure the “spread”?

When comparing different groups, how likely is it that I will reach a true conclusion with my measurement?

The two methods for measuring spread are (1) the distance between the two farthest points or (2) the standard deviation of the radial distance.  The standard deviation is much more difficult to measure, probably you would record an x and y position for each shot, and then calculate the average position, and each point’s distance to the average position, and then take the standard deviations of each point’s distance.  The maximum point-to-point distance is very easy to measure.  Given the work load, the standard deviation would have to provide better results than the maximum distance.  For example, if the standard deviation of a 3-shot group provided a better measurement than the 5-shot group, then it would save ammunition and precious range time.

To compare the performance of the two measurements I assumed a model.  The model consists of two different ammunitions (different brand of bullet, amount of powder, whatever).  Ammunition A produces a group with standard deviation of 1 inch at 100 yards.  This is just an example—a standard deviation of 1 inch would constitute really dreadful performance.  Ammunition B produces a group with a standard deviation between one and two inches.

The contrast ratio is defined as the standard deviation of B divided by the standard deviation of A.  If the contrast ratio is 1 then both kinds of ammunition perform the same.  We expect that it will be hard to tell the difference between ammunitions when the contrast ratio is near one.  The other critical element of the model is the statistical distribution.  Just like everyone else, I used the Gaussian.  I think that the results don’t generalize—heavy-tailed distributions probably produce different outcomes.

Then I simulated 10,000 groups, each containing 3, 4, 5, …,10 shots.  I simulated contrast ratios between 1 and 1.5.  With the data sets I considered a few metrics:

• What are my chances of correctly determining which ammunition produced a larger group, using the sample standard deviation and the maximum spread measurements?
• On average how well can I measure the contrast ratio, using the sample standard deviation and the maximum spread measurements?

The following figure shows that, using the maximum spread measurement, for contrast ratio of 1.5, a 3-shot group provides about a 76% chance of correctly determining that ammunition B produces a larger group.  A 5-shot group has taken the chance to about 85%.

Next, the same measurement using the sample standard deviations shows the same three shot group can differentiate correctly about 70% of the time, or for a  5-shot group about 75% of the time.  The difference is stark—no sane person would ever prefer measurements of the standard deviation in favor of the maximum.

How well can you quantify how much better A is than B?  The following chart shows the average ratio of the maximum spread from A divided by the maximum spread from B.  One observation is that by 10 shots the results appear to be biased.  On average the contrast ratio is underestimated.  The bias seems to be consistent enough to be corrected.

The ratio of the sample standard deviations, as the next plot shows, may be less biased.  By 10 shots, though, the estimator is shown to be terrible.  As with the maximum spread technique, the bias may be correctable.  The slow rate of convergence is not heartening.

My conclusion is that for small ratios a 5 shot group is barely adequate, though coarse differences can be revealed.  I will not waste time with exotic measurements of the spread—the maximum spread appears to be superior than the any of the other methods I have tried.  Without any information in advance of the test I would prepare for a five shot comparison.  Some test of statistical power should probably be applied, analogous to Student’s t-test but for the maximum spread.  Obviously, similar spreads should be treated with caution.

This study was performed with Monte Carlo processes because the analytical solutions for maximum spread probabilities are relatively complex—n-fold convolutions of distributions.  I believe the solutions to be available, but I lazily elected computer horsepower to Park-power.

## Coffee: Model Fitting Goodness

You have seen the response function contours through the stationary point, and you have seen the 3D response function visualizations. You may recall that I have two competing models, one is derived from the experimental data design for the response surface, and the other is all that data and the screening results too. Is either model any good? Which one is better?

First, a quick review of the model:

q = b0T2 + b1t2 + b2r2 + b3Tt + b4tr + b5Tr + b6T + b7t + b8r + b9

Fitting produces numbers for all those coefficients, b0, b1, etc. Look at the goodness of individual coefficients in the following table. The more asterisks in the significance column, the better the coefficient corresponds to the data. Notice that the “all data” model has an overall terrible goodness, with the residual standard error of 0.7, compared to the residual error of 0.358 for the RSO data only.

 All Data RSO Data Only Std Error 0.749 0.358 Estimate Std. Error Signif. Estimate Std. Error Signif. (Intercept) 3.344 0.3963 *** 3.6667 0.2064 *** temp -0.355 0.2369 -0.4375 0.1264 * time 0.235 0.2369 0.0375 0.1264 cwrat 0.4 0.2369 0.275 0.1264 . temp:time -0.5 0.2901 -0.75 0.1788 ** temp:cwrat 0.02 0.2901 -0.375 0.1788 . time:cwrat 0.09 0.2901 -0.075 0.1788 temp^2 -0.401 0.3592 -0.6833 0.1861 * time^2 -0.201 0.3592 -0.4833 0.1861 * cwrat^2 0.174 0.3592 -0.1083 0.1861 Significance codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1

The two graphs below show the quantile normal plots for the residuals from each model. The all-data model looks more systematically erroneous than the RSM data only. The systematic error suggests that the choice of model is not particularly good. Furthermore, it suggests that the estimates of F-statistic should be considered hesitantly, as F-statistics are calculated by assuming normally distributed independent errors.

With hesitation then, let’s examine the F-statistic for these models. The RSO data model produces an F-statistic of 6.4, well above the Box, Hunter, and Hunter criterion that the F-statistic exceeds 4. On the other hand, the all data model’s F-statistic is a miserly value of 1.

 All Data RSO Data Only Degrees of Freedom 9 5 Residual Standard Error 0.749 0.358 F-statistic 1.06 6.43

In conclusion, I get a better fit—to the indicated model—using only some of the data. This is, frankly, an indictment of the model. Anyone can carefully select data to produce a good fit to a model, and the fact that I have done so may be considered a round condemnation of my objectivity. I may reconsider the model, and attempt to drop the least important terms while adding either a three-way interaction or two-way involving squares. Perhaps more creative math will be required.

## Coffee: Visualizing the Response Function

I have been calling my 3D response function a “surface”—which is linguistically feckless at best. Of course, I have some company, the immortal Box, Hunter, and Hunter call the experiment design a response surface method. So note carefully, the response function is not a surface, it completely fills a volume.

Naturally, this is hard to visualize. The contour plots are slices through the response space, but I really have a hard time inferring what my coffee quality function is telling me. Better visualization to the rescue.

First, put the contour plots together into slices, using just the shaded parts and not the lines. Note that these cuts run through the center of the design point, and not the stationary point.

This is still quite difficult to interpret. The blue and green corners of the areas are clearly to be avoided, while the red-tinted areas near the top are to be embraced. Red, in all these plots, is a higher value of q than blue. The weird little balls are the points on the steepest ascent experiment path. Their radius corresponds to their predicted q values. Tufte, of course, would have a fit; clearly the volume should be proportional to the indicated variable.

Those planar cuts are nice to look at, but we can do even better. Add “isosurfaces”—these are surfaces that lie on constant values of q. In two dimensions a topo map shows lines of constant elevation. For 3D data, the lines become surfaces. In the following figure I’ve kept the horizontal cut, and added isosurfaces. The big black line shows the steepest ascent.

This is getting useful. To make a good cup of coffee, you want to be somewhere in the red cup on the upper-left-front corner. The zone is not enormous. To more clearly visualize this, here is the same concept from a few different angles.

So, here we are, and I still don’t have a good idea what the response function looks like outside the design space. Clearly there is a region in the middle which is relatively high quality, and it appears that higher C/W ratios are favored.

Finally, to get a solid sense of the response function I thought it would be helpful to zoom out. The next graph shows the response function in extrapolated over a much wider range.

The little box in the middle is the design space. In this case there are four isosurfaces drawn, for q=1,2,3,4. With this visualization you can clearly see the shapes represented by the response function. I’m sure that the function is a poor representation of truth in the regions far from the design space. Nevertheless, this diagram is more helpful to me than any other, as I can see there is clearly a cup-shaped region in which my coffee should be really awesome. The region is surprisingly narrow over C/W ratio, time, and temperature. A poor-man’s sensitivity analysis.

All the previous analysis has been on the response surfaces for the RSO data set only, that is, I left out the original data from the screening experiment. If I include the alternate data the fit remains qualitatively similar in that there is a tasty region in the upper left hand part of the original design space. However, the rest seems very different. There is now a saddle point, and more of the design space is better than before. Now, too, the steepest ascent path points toward increasing C/W ratio and decreasing temperature. Unlike the previous fit, though, there is substantial decrease in time inferred. Furthermore, this surface admits quality values in excess of 5. The upper isosurface is q = 5.

# Resources

On the off chance you’re wondering how I made these really cool plots…

Analysis of the experiment and fitting of the response model to the data was accomplished in the wonderful language R. I couldn’t find any way to do cut-planes in R, and in any case am not terribly good with it. So I used Python(x,y) with the Enthought Tool Suite, in particular the mlab and Mayavi interfaces. Many, many thanks to the good people of the Pythonic world, and especially Enthought, for the quality product they have created. Who needs Matlab anyway?

My source code is available upon request.

## Coffee: Analysis & Results of the Response Surface

I have executed the experiment design discussed Coffee: Design of the Experiments (Part 2: RSO). This “response surface method” experiment design is carefully crafted to provide the data necessary to fit a formula of the form

q = b0T2 + b1t2 + b2r2 + b3Tt + b4tr + b5Tr + b6T + b7t + b8r + b9

This model describes quadratic terms (T2, t2, r2), main effects (T, t, r), and first-order interactions (Tt, tr, Tr). The model basically assumes that quality is locally curved or locally linear, but not an undulating surface. This assumption of local smoothness is a good one in this case—were it not the experience of buying a cup of a coffee would be a dangerous gamble where small changes made by the brewer resulted in big changes in our cup.

Some possible findings in our analysis

• There is a maximum value of q at some point in the design space, whose approximate location we can find with the preceding method.
• There is a maximum value of q at the edge of the design space, that is, we might find that the best q occurs at some maximum value of concentration (a face of the design cube), or even, for example, the maximum value of concentration and temperature (an edge of the design cube), or a maximum of all the values (a corner of the design cube).
• Some variables don’t matter, and the entire experience is governed by only one variable—indeed the screening experiment hinted at this outcome.

If the optimum is inside the design space, then I know immediately how to brew the best cup of coffee—within some amount of error. If the optimum is outside the design space, at least I know which experimental trend to pursue. Namely, new experiments are positioned along the path of most rapid ascent, or a new design cube is positioned outside the current design cube, with the center moved along the path of most rapid ascent.

# Results

The following three graphs show slices through the response surface fit to the experiments. The slices are positioned so that they intersect the “stationary point”. If there is a local maximum, then it is the stationary point; the plots then display the local sensitivity around that stationary point. The quadratic fit can also produce saddle-shaped responses, in which case the center of the saddle is a stationary point.

Note that these plots are all based on the fit to just the response surface design. A different fit is obtained by including the data from the screening results.

The graphs all work in “coded variables”, that is, they are all adjusted so that the values lie between -1 and 1. It is necessary to decipher these values into measureable numbers, which the following equations do. The coded value is indicated with a c subscript.

Time tc = 1.35log10(t) – 2.35

Temperature Tc = 0.1(T – 195)

C/W Ratio rc = 50(r – 0.055)

These equations are easily inverted, though below I provide uncoded (normal) values for the points in the steepest ascent.

The fitted space suggests that water temperature is important—certainly more important than the screening experiment suggested. However, my cup quality actually improves as the temperature declines. Almost all the brewing words I’ve read, including the latest Consumer Reports, harp on the importance of getting water hot enough. My experiment suggests that modulating extraction duration and C/W ratio are at least as important.

## Steepest Ascent

The response surface methodology encourages further experiment design. Starting from the stationary point, you can follow the path of steepest ascent to move toward ever better quality—at least in theory. To my surprise, this worked for a one-trial qualitative test; more on that later. The steepest ascent predicted using the fit to just the RSO data is shown in the next table. Note that the concentration of coffee is quickly leaves the design space, which was limited to concentrations less than 0.075.

 Steepest (RSO only) Time [sec] Coffee [g] Water [fl oz] Temp [F] C/W Ratio [g/ml] Predicted Q 55 22.3 13.7 195 0.055 3.7 69 22.3 12.1 192 0.062 3.9 90 22.3 10.7 190 0.071 4.1 116 22.3 9.5 187 0.079 4.3 148 22.3 8.6 185 0.088 4.4 188 22.3 7.8 183 0.096 4.6 238 22.3 7.2 181 0.105 4.7 301 22.3 6.6 179 0.114 4.8 380 22.3 6.2 177 0.123 4.9 479 22.3 5.7 175 0.131 5.0 601 22.3 5.4 173 0.140 5.0

A similar fit using all the available data produced the steepest ascent trials in the next table. It is, perhaps, less believable since the quality values rise in excess of 10.

 Steepest (RSO + Screening) Time [sec] Coffee [g] Water [fl oz] Temp [F] C/W Ratio [g/ml] Predicted Q 55 22.3 13.7 195 0.055 3.3 78 22.3 12.0 193 0.063 3.7 105 22.3 10.4 192 0.072 4.0 132 22.3 9.2 191 0.082 4.5 161 22.3 8.2 190 0.092 5.1 192 22.3 7.4 190 0.102 5.7 226 22.3 6.8 190 0.112 6.4 264 22.3 6.2 189 0.121 7.2 307 22.3 5.7 189 0.131 8.1 355 22.3 5.3 189 0.141 9.1 409 22.3 5.0 188 0.151 10.2

Despite the implausibility, I conducted a trial of the 3rd experiment in the combined table, that is, 105 seconds extraction at 192 F for a 0.072 concentration coffee. My results:

“Quality = 4.2. Toasted marshmallow flavor! Aroma is rich and without char. Acid and bitter are nicely balanced in the first sip but the bitter dominates the aftertaste.”

The quality agreement (and I really didn’t peak) is unexpectedly close—4.2 instead 4. Furthermore, there I was genuinely surprised at entirely new flavors. It really was a great cup of coffee. Damned strong though.

For comparison, the contour plots produced by the fit to the RSO data and the screening data are shown next.

A future post will discuss the goodness of fit, and whether the models are meaningful. For the present, rest assured that predicting a quality of 4 and measuring the same convinces me that the results are useful.

# Data

The actual data, along with my notes at the time, is shown in the following table.

 Coffee [g] Water [fl oz] C/W [g/ml] Time [sec] Temp [F] Trial Order Q Date Comment 22.8 14.0 0.055 300 205 1 1.5 2/20/2009 Not good. Bitter flavor with simple aroma. Too bitter 22.8 22.0 0.035 55 205 2 2.5 2/23/2009 Simple flavor with muted cemplexity. Aroma and flavor dominated by char. Neither too bitter nor too acid. Not balanced though. 22.8 14.0 0.055 10 205 3 2.5 2/24/2009 Too bitter. Aroma fairly simple. Mouth feel too watery. Little acidity, poorly balanced between acid and bitter. 22.8 10.3 0.075 55 185 4 4 2/25/2009 Aroma is excellent w/ complex interesting smells. There is some crema. Nicely acidic and reasonably balanced w/ bitter. 22.8 14.0 0.055 55 195 5 3.5 3/2/2009 Acid-bitter balance too heavy on bitter side. Nice crema and nice aroma. Rich mouth feel. 22.8 22.0 0.035 10 195 6 3 3/3/2009 Moderately well balanced. Aroma is simple. Mouth feel watery. 22.8 14.0 0.055 55 195 7 4 3/4/2009 Well balnced w/ good aroma. A little too bitter, but not unpleasant. 22.8 10.3 0.075 10 195 8 3.5 3/5/2009 Aroma somewhat simple. Flavor strong or even rich. Too bitter overall and acid/bitter balance poor. 22.8 14.0 0.055 10 185 9 2 3/6/2009 Bitter simple flavor without much acid. Aroma is farily good but not as rich as I like. 22.8 10.3 0.075 300 195 10 3 3/9/2009 Smelled bitter. Taste moderately well balanced with a little too much bitter. Rich feel. 22.8 22.0 0.035 300 195 11 2.8 3/10/2009 Watery. Yet implanaced toward bitter. Flavor is well-bodied and aroma good. 22.8 10.3 0.075 55 205 12 2.5 3/11/2009 Burned and charred taste. Too bitter while also being nicely acidic. Not well balanced. Aroma dominated by char. 22.8 22.0 0.035 55 185 13 2.5 3/12/2009 Aroma dominated by char and relatively uninteresting. Taste is poorly balanced–too bitter. Strong flavor. 22.8 14.0 0.055 300 185 14 4 3/16/2009 Well balanced and nicely aromatic. A little bitter on the finish. 22.8 14.0 0.055 55 195 15 3.5 3/17/2009 Bitter flavor with weak mouth feel and low complexity. Aroma nicely balanced though.

# In Review

The results of my 4-trial fractional factorial experiment (main effects) showed that extraction time is the most important variable, and that temperature and C/W ratio are less important. In the interest of completeness this experiment will not simply vary extraction time, but rather seek to efficiently find the maximum.

# Lessons from the Screening Experiment

I am accustomed to making my coffee with about 22 fluid ounces of water, which makes two nice cups for my morning. During the screening I used 22 fl oz as a baseline, and increased or decreased the amount of coffee accordingly. The result was that some experiments required 48.8 g of coffee. You can’t possibly imagine how much coffee that is. It almost completely fills a coffee cup with just the beans!

In subsequent experiments I will limit the amount of coffee beans to something more modest, perhaps about 20 g, and vary the amount of water. With this more cautious approach I may live to see the final results.

# The Design

NIST offered three designs for response surface objective (RSO) experiments. Specifically there are two interpretations of the Box-Wilson central composite design, each of which requires 20 total runs. An alternative design reduces the number of experiments to 15, which is the reason I chose it.

Specifically, the Box-Behnken design provides an RSO approach for 3 factors in 15 experiments. That is the same number I originally suggested through my naïve approach, but without all the baggage. Effectively, my original 5 values for each factor get reduced to three, and those are shuffled about in a way that is well-suited to estimating their effects. In my case, the specific experiments would be:

 Response Surface Objective Actual Values Statistics Jargon Trial r (g/ml) t (sec) T (F) R T T 1 0.035 10 195 -1 -1 0 2 0.075 10 195 +1 -1 0 3 0.035 300 195 -1 +1 0 4 0.075 300 195 +1 +1 0 5 0.035 55 185 -1 0 -1 6 0.075 55 185 +1 0 -1 7 0.035 55 205 -1 0 +1 8 0.075 55 205 +1 0 +1 9 0.055 10 185 0 -1 -1 10 0.055 300 185 0 +1 -1 11 0.055 10 205 0 -1 +1 12 0.055 300 205 0 +1 +1 13, 14, 15 0.055 55 195 0 0 0

This set will take me about three weeks. With luck by the time you read this I will have trickled out previous articles, and you will have results in two weeks or less. Sorry for the suspense.

Oh yes, I’m interested in other people’s Q functions too, so if you want to run my exact experiments, please post your data and I will post your results. Public or anonymous, at your discretion.

## Coffee: The Detestitron

Please, when you are shopping for a grinder, don’t believe anything you hear (except what I tell you…). A blade grinder is probably as good a grinder as you want to afford. Don’t presume that a burr grinder will automatically produce a more consistent grind. Take, for example, my manual burr machine. Compare the grind consistencies from my cheapo blade grinder with my burr grinder in the following picture. Neither is “consistent”, and the variation from the burr grinder goes from about 4 mm to dust. The blade grinder produces 1 mm to dust. Neither appears to produce a dominant particle.

You may want to cover your children’s eyes; even a brief look at this grinder could give them nightmares.

Detestitron is a heavily modified portable burr grinder. It was about $20, but it uses conical ceramic burrs and is quite ingeniously designed. The modifications I made to the original grinder maintain concentricity between the inner and outer cones, and facilitate holding the cursed thing to the table while turning the crank. The grind consistency, for all that, did not actually improve noticeably. The Detestitron’s grind axle is centered in ball bearings. The original shaft was extended with a small brass rod, brazed on. The bottom bearing is fixed in an aluminum bar, which is in turn screwed to the chassis. Note that the center cone is not perfectly round, and so this keeps the grinder very stable—but it is not really centered. The burrs are basically the same as you would see on a pepper mill, but much larger. There is a screw-like mechanism to force the beans into the narrow grinding slot. Detestitron’s fundamental problem is that the length of the surfaces over which there is a consistent separation is very short—the coarser the grind, the shorter the overlapping length. I believe that any grinder with similar burrs would produce similar results. There are other designs, but they seem to start at about$200. Did I mention donations are welcome?

## Coffee: Analysis & Results of the Main Effects Study

I have conducted the trials in the following table.

 Actual Values Statistics Jargon Trial r (g/ml) t (sec) T (F) r t T Q 1 0.035 10 205 -1 -1 +1 2.5 2 0.075 10 185 +1 -1 -1 3 3 0.035 300 185 -1 +1 -1 4 4 0.075 300 205 +1 +1 +1 4.1

This study is a 2-factor partial factorial design, indicated as 23-1. The good people who write for Wikipedia tell me that this experiment design gives “main effects” but those may be confounded with two-factor interactions. Indeed, there is some reason to be concerned for the simple quality function evaluation in this situation—for example, I worry that row 1 and row 2 could easily produce similar effects. C/W is lower in 1, but the temperature is higher, so extraction should be similar to a higher ratio with lower temperature. In any case, if the result is not revealing, the experiment may be augmented with the remaining 4 cases in the factorial design.

The main effects plot shows how each parameter influences the result, though recognize that interactions may be masking main effects. The main effects plot for this experiment is shown below. From this it would appear that extraction time is the most sensitive parameter—to me a surprising observation. I had certainly expected the coffee/water ratio (cwrat) to be the dominant variable, but here it does not appear to be so. The negative correlation with temperature is very surprising, and indeed I am skeptical. Still, the variation appears to be quite small, and is likely due to two-factor interactions.

I would be inclined to fix the temperature and the C/W ratio at their midpoints, and vary the time alone; however, I believe the response hypersurface is more complex. I shall press forward with the response surface methodology on all three variables.

In case you want to see my data.

 Trial Order Q Date Comment 1 2.5 2/16/09 Dishwatery taste and aroma. Very simple palate with almost no mouth feel. Equivalent to mediocre drip. 2 3 2/17/09 Not very aromatic and too simple. Has nice bitterness and some acidity but not well balanced between them. 3 4 2/18/09 Mildly aromatic. Good mouth feel with acid/bitter well balanced. Slight char flavor. 4 4.1 2/19/09 Too bitter. The aroma is marvelous being very rich and tantalizing. Mouth feel is creamy. If bitterness where attenuated this would be sublime.

I wasn’t going to use fractions, but I decided I needed them. Note that trial 4 was assigned a quality of 4.1, which is really a note saying that this was a little bit better than trial 3, but not much.

## Coffee Design of Experiments (Part 1: Screening)

I will invoke a variety of principals from experiment design methods. The truth is, though, that I’m self-taught in experiment design. It is altogether probable that there are better ways of conducting this experiment which my ignorance has hidden from me. I’m open to suggestions!

# Quick Review

The variables in control are temperature t, extraction time T, and coffee-to-water ratio (by weight) r. Each of these will be considered over a domain of 5 values. Temperature and coffee-to-water ratio are linearly divided into 5 steps between their minimum and maximum values, which I selected based on published coffee brewing advice. Extraction time is divided logarithmically over its domain, since my intuition is that extraction is an approximately logarithmic process in time.

# Linear Uncorrelated Approach

I might assume that t, T, and r are uncorrelated and independent. Then, without assuming that Q is linear in t, T, or r, I could easily conduct the following experiments (if the mathese bothers you, skip ahead to the text):

• T = 195 F, t = 55 seconds, r =
(0.035, 0.045, 0.055, 0.065, 0.075) g/ml
• T = 195 F, t = (10, 23, 55, 128, 300) seconds, r = rmax = argmaxr
Q(r, T = 195, t = 55)
• T = (185, 190, 195, 200, 205) F, t = tmax = argmaxt Q(r = rmax, T = 195, t)

In English, first find the best point by finding the optimum coffee-to-water ratio for a fixed extraction period and temperature. My experience suggests that the ratio is likely to be one of the more sensitive parameters, so sweeping this first should provide good insight. Then, using that optimum C/W ratio, sweep the extraction time to find the best duration. Finally, sweep the temperature.

Really, this is a sort of bastardized gradient search. It doesn’t account for correlation among parameters. For example, you might imagine that a very high C/W ratio combined with a short extraction time would produce a very different flavor than the same high C/W ratio and a long extraction time. In fact, one might be delicious and the other dreadful. The proposed methodology, however, would not reveal that condition.

Another problem is that it really isn’t safe to assume only 15 trials would be required. The quality function (my enjoyment) is likely to be quite subjective and quite variable—I’ll assume for argument that it is also Gaussian. Q can take integer values from 1 to 5, and it is quite reasonable to assume that my variability has a standard deviation is 1.5 or so. To know the actual quality accurate to a single value of Q with 95% certainty requires the standard deviation to be about 0.25. We can reduce the standard deviation of the estimated Q by averaging multiple experiments. How many? (1.5/0.25)2 = 25. Now the original 15 trials have turned into 15×25 = 375, and at 1 experiment per day pushed the answer a year into the future.

A savvy person might ask if there is a way to find an equally good answer with fewer experiments or if there is a better way to arrange those 375 experiments to get more broadly useful answers.

NIST provides a sublime search tree for selecting an experiment design, and it is clear that what I’ve outlined is best served by a response surface objective (RSO) on 3 factors, or possibly a main effects design first, followed by a response surface objective on 2 or 3 factors, or even a simple 1-factor search.

While I expect to consider the RSO approach in terms of number of experiments, and then examine a screen+RSO approach to see if there is a design which offers possibly reduced number of experiments. The Box-Behnken design for RSO in 3-factors requires a paltry 15 experiments. Still, that is three weeks away, and I would prefer to have some data to work with sooner.

The screening test is much smaller, and perhaps worthwhile in that the data can be included in the RSO experiment, even though not strictly required. For my 3-factor system the Level III screening design requires a modest 4 runs. This would be small enough that I could even do two trials of each, which would be of great help in reducing my variability. The experiments are listed in the following table.

 Screening Design Actual Values Statistics Jargon Trial r (g/ml) t (sec) T (F) r t T 1 0.035 10 205 -1 -1 +1 2 0.075 10 185 +1 -1 -1 3 0.035 300 185 -1 +1 -1 4 0.075 300 205 +1 +1 +1

# If I Had a Grinder

The other major effect that I would like to measure is grind size, predicated on substantially uniform product. My grinder doesn’t do it. Consider, though, the 4-factor version of this same exercise. The four-factor RSO requires 33, 46, or 52 experiments, depending on the method. The four-factor screen requires 8 experiments. Even if the screening doesn’t reduce the number of factors substantially, the experiment is tractable at 41 trials. Of course all this assumes Q has tolerable variance.

If you donate the grinder, I’ll do the experiments…