The Hyper-Greco-Latin Square Experimental Design as a Formulation
Ingredient Selection Tool
This paper was presented at the NLGI National Meeting, October 1995.
Introduction
Formulations are functional mixtures. As such, each ingredient is present to provide or
modify one or more properties or functions for the final product. A dilemma facing the
development team is this. For each function, there are usually at least a few and often
several possible ingredient choices. Therefore, one of the early tasks in developing a new
formulation is to find a particularly good combination of ingredients to start an
optimization process. This must usually be done in the face of an often overwhelming
number of possible combinations of the choices.
This paper is a survey of the risks one faces when embarking on such a task. It features
an industrial example of the use of the Latin-square experimental design to minimize those
risks during formulation development. Finally, it describes an expansion of the Latin
square to a hyper-Greco-Latin square (HGLS) in which 6 different ingredients, each having 5
different choices, can be examined in 25 experiments. The HGLS is a method of ingredient
selection that has proven to be a superior strategy for use in the early stages of
formulation development.
Comparisons
One simplifying view of ingredient selection is to consider each choice as a chain of
comparisons between the performance of the several ingredient choices. One wishes to know
which of the several choices is best. The basic framework is contained in the following
question. "Is Treatment A sufficiently different from Treatment B to be of
significant interest?" Examples of comparisons occur in all stages of R&D. These
comparisons could be addressing the relative worth of two different additives in a
lubricant, two different vendors for the same additive, two different processing
treatments (high versus low temperature, time, or shear), or perhaps two different machines
or analytical instruments. The essential nature of the question is that it can be answered
with a yes or no as to whether or not there is an interesting difference between them.
Decisions and Risk
There are two obvious positions one might take depending on the outcome of one's
laboratory effort. The experimentalist might declare that "Yes, there is an important
difference," or "No, there is not sufficient difference to be of interest."
Our task is to discover the truth about which additive is better. However, we can never
actually know this truth, we can only infer it from our experience. There are two ways to
draw a conclusion that is congruent with this physical truth. We can declare that a
difference between the two treatments exists, when it does in fact exist, and we can declare
that no significant difference exists when there is truly none.
Conversely, there are two ways to draw a wrong or incorrect conclusion. In such cases, our
belief is not congruent with the physical world. We can declare that one system is better
than the other, when we have actually mistaken the background noise for an interesting
signal. Or, we may declare that there is nothing interesting in the difference, when there
is, in fact, a difference that would be important to know about.
False Positives and False Negatives
We will call the former erroneous outcome a 'False Positive.' We make such an error
when we mistakenly declare that we have a treatment difference when all we have been doing
is observing the background noise. This is also called a Type I error, and the probability
of such an event happening is called the α-risk. Experience shows that for many
experimentalists (and their supervisors and managers), the rate of false positives can be
over 30% in the absence of appropriate statistical defenses. We will discuss the defenses
below.
Returning to the issue of errors, we will term the error in which we overlook a true
signal a 'False Negative.' We make this error when we mistakenly declare that we do not
see a treatment difference, when one actually exists. In such cases, we have overlooked
something we would be interested in knowing. This is called a Type II error, and the
probability of such an event happening is called the β-risk.
Our situation is shown in Figure 1 below. There are two ways to draw the correct
conclusion, and two ways to get it wrong.
Figure 1
The terms Type I and Type II are largely used by statisticians.
Errors versus Mistakes
At this point, we must distinguish between errors and mistakes. We make an error when
we draw conclusions that do not agree with the long-term behavior of our system due to the
effects of background noise alone. This is not a mistake. A mistake is a blunder. If there
is a special situation or event that has a consequence that influences our decision, we
have made a mistake. An example of a mistake is using the wrong amount of an additive
(mis-weighing), but concluding that the resulting difference was due to the ingredient
choice.
Symptoms
The basic symptom of the false positive is irreproducible results. A false positive is
usually discovered after further study of the 'interesting' difference that has been
declared. The experimentalist finds that the result is not repeatable. Additional
consequences often include expenditure of significant resources trying to figure out how
to get the 'good' result again, when actually the result was due to noise. Another
significant consequence is the loss of credibility for the researcher (or the lab that
generated the data; more about this later). His or her work comes under question if a
significant number of these situations occur.
The worst-case scenario for programs with high rates of false positives is new product
candidate leads that do not pan out under further investigation. Usually, these further
investigations involve more expensive tests, additional tests, and an increased workload. In the
worst cases, field trial failures can result when candidates are moved forward based on
false positives. Often it is concluded that the candidates were moved forward too fast.
On the other hand, there are typically no symptoms of false negatives. These are silent
errors. Typically, we walk away from the experimental area of study when we declare that
there is nothing of interest in the results. Rarely do we go back and re-examine areas
that we believe are without interest. Finally, if there is an important discovery that we
have overlooked, we create an opportunity for the competition with a false negative.
Noise and Its Impact
Background noise and its misinterpretation put a scientist, and his or her business
patrons, at risk of drawing incorrect conclusions and expending R&D resources in
pursuit of erroneous hypotheses. The minimization of these errors is the reason to engage
in experimental design. It is the major reason to use techniques like the
hyper-Greco-Latin square in formulation development.
Variation
These difficulties stem from the fact that all things vary. We can never actually
'repeat' a batch, a run, or a reaction exactly. We will get results that are (usually)
similar to but not exactly like our previous experience. Thus, in any comparison we are
measuring both the treatment differences and the background noise simultaneously.
Strategy
Minimizing false positive declarations requires a clear expectation of what magnitude
of differences we might reasonably expect due to background noise alone. This is like the
'blank' used by analytical chemists. We run comparisons and ask the following question.
"Is the difference I measure unusual or unexpected based upon what I know about the
background noise of my system?"
We then protect ourselves from false positives by making positive declarations only if we
can answer this question affirmatively. Furthermore, we can decide before we run the
experiments how unusual the difference needs to be for us to call it a signal. We ask the
following question:
"What is the probability that such a difference could come from background noise
alone?"
We use a table of background noise to address this question. This table is the (often
misunderstood) Student's t-table.
What is the key to not overlooking true signals? Minimizing false negative declarations
requires obtaining enough of the right kind of information to defend against the
background noise of the system. We need to average together replicated experiments to
'smooth out' the noise so that the signals are more evident. By using averages we make our
comparisons more sensitive.
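As an illustration of this idea, the short sketch below (with invented numbers, not data from any real grease program) compares two sets of five replicate results by asking whether their difference is large relative to the pooled background noise, which is the Student's t ratio in code form.

from statistics import mean, stdev
from math import sqrt

treatment_a = [4.1, 3.8, 4.4, 4.0, 4.2]   # hypothetical replicate results
treatment_b = [4.6, 4.9, 4.5, 4.7, 4.8]

na, nb = len(treatment_a), len(treatment_b)
difference = mean(treatment_b) - mean(treatment_a)

# Pooled estimate of the background noise (standard deviation).
pooled_sd = sqrt(((na - 1) * stdev(treatment_a) ** 2 +
                  (nb - 1) * stdev(treatment_b) ** 2) / (na + nb - 2))

# Signal-to-noise ratio: how many noise units apart are the two averages?
t_ratio = difference / (pooled_sd * sqrt(1 / na + 1 / nb))

# Declare a real difference only if t_ratio exceeds the tabled Student's t
# value for na + nb - 2 degrees of freedom at the chosen risk level.
print(f"difference = {difference:.2f}, pooled sd = {pooled_sd:.2f}, t = {t_ratio:.2f}")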
How Much Data Do I Need?
Another outcome of the above discussion is this. The amount of data to be averaged can
usually be calculated at the outset of any experimental project. One can decide whether the
information is worth the cost and do something else with one's resources if it is not. This
is not a bad thing to decide. It is far better than engaging in a project with the vain
hope that we can get an answer quickly and/or cheaply.
There are two major benefits to this approach. The first is that we don't fall into the
'just a few more experiments' trap. In my experience, a significant portion of some
R&D budgets get consumed this way. The other benefit is that enough data will be
collected before the opportunity to get more data is gone. Field trials with insufficient
tests are a classic example that can be avoided. Another is that we will manufacture
enough test grease of each type, once we know how many tests must be performed.
How much data one needs is dependent upon three things:
- How variable is the system? What is the magnitude of the background noise of the
system?
- How small a difference do you want to be reasonably confident of detecting? What is
the magnitude of your signal threshold?
- How confident do you want to be in your conclusions? How much risk are you willing to
accept? What is the likelihood that you will make an incorrect declaration?
The mathematical formula is given below:

N = 2(kσ/Δ)^2

in which:
- N is the number of replicated experiments required at each of the two levels,
- σ is the standard deviation of the population from which the experiments will be taken,
- Δ is the true signal size that one wants to be reasonably sure of detecting, should it
really exist, and
- k is a constant that incorporates the level of risk that the experimentalist is willing
to accept; k = 3.0 if we are willing to accept 5% false positives and 15% false negatives.
We will not dwell on this equation, but it is the fundamental relationship between
sample size, signal, noise, and confidence in decision-making.
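For readers who prefer to see the arithmetic, the sketch below implements the sample-size relationship in the assumed form N = 2(kσ/Δ)^2; the function name and the example values of sigma and delta are hypothetical.

from math import ceil

def replicates_per_level(sigma, delta, k=3.0):
    # Replicates needed at each of the two levels to see a true signal of
    # size delta against background noise sigma; k = 3.0 corresponds to
    # roughly 5% false positives and 15% false negatives.
    return ceil(2 * (k * sigma / delta) ** 2)

# Hypothetical example: noise sigma = 2 units, smallest interesting
# difference delta = 4 units.
print(replicates_per_level(sigma=2.0, delta=4.0))   # prints 5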
This relationship has several consequences. For example, the larger the background noise,
the more data one needs to detect a difference of any given size at the same level of
confidence. Equivalently, for a fixed amount of data, the larger the noise, the less
confident one can be in a decision about a signal of the same size.
R&D Strategy & Experimental Design
Let us return to the issue of developing a formulation, and choosing winning
combinations of ingredients. We now have in hand a framework to explain why the
one-experiment-at-a-time approach is fraught with difficulties, and why we need to average
together several (sometimes many) replicated experiments before we make a decision.
However, without some care, this averaging increases our work load and hence our costs. We
cannot afford to do this without some way to improve our efficiency.
What we need is a method to combine together the right number of replicated experiments in
a particular way that gives answers to more than one question at a time. There are many of
these schemes. Collectively, they are called 'experimental design.'
Experimental design is a collection of rigorous methods for obtaining information about
any experimental system under study. The reason to learn and use them is to obtain
unambiguous results at minimum cost.
There are several basic types of designs. Those that examine many possible factors and
separate the potent from the weak are called screening designs. These screen
out the interesting from the useless. We can subsequently focus our attention on the
interesting.
Early in a formulation development plan, we often wish to pick winning combinations of
ingredients. We will optimize their levels, and the production factors such as temperature
and time later. First we need to know which ingredients we wish to study. One such family
of designs, the Latin squares, has been used repeatedly in functional mixture development.
This author collaborated with Dr. Carl E. Ward in planning to use such a design for grease
formulation ingredient selection.
The Hyper-Greco-Latin Square
Let us consider some of the work of Ward and Littlefield, which was reported at last
year's NLGI conference. This publication received the Clarence E. Earl Memorial Award
for the best technical paper of last year's meeting. Briefly, this work represents the
sequential use of experimental design tools in a formulation development program. They
used a Latin square experimental design as a first step. This method allowed them to
select the best of a group of 5 choices for each of 3 ingredients. This was their
opening move in their program of developing a new grease. The diagram of the Latin square
is shown in Figure 2.

Figure 2
Ward and Littlefield examined 5 different EP/anti-wear agents, 5 different rust
inhibitors, and 5 different copper passivators. Note that the columns and rows contain,
respectively, the 5 levels of EP/anti-wear agents and rust inhibitors. Each of the squares
contains one copper passivator. Each square represents a single experimental preparation.
Note that no pair-wise combination of any two ingredients occurs more than once. For
example, the EP/anti-wear agent "C2" combined with the rust inhibitor "R4" occurs only
once. This is true for each column and row. Each copper passivator occurs once in each row
and once in each column. Each square has a single choice of each ingredient type. Also,
the plan is balanced in that each ingredient of a given type is used in 5 of the
formulations. Thus, the average of the 5 responses from Row 1 is to be compared with the
averages of the responses of the 5 experiments in each of Rows 2 through 5. Likewise, the
averages of the 5 columns will be compared among each other to determine the
best-performing EP/anti-wear agent. Finally, the averages of the 5 experiments for each
letter, A through E, will be compared. Thus, the best choice for each can be found easily.
Note that in each average, all levels of each other factor occur exactly once, so their
contributions, in effect, cancel out.
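The balance and averaging just described can be checked mechanically. The sketch below (not the authors' code; the responses are randomly generated placeholders) builds a 5x5 Latin square by cyclic shifts, verifies that each letter appears once per row and column, and forms the row, column, and letter averages that would be compared.

import random

n = 5
letters = "ABCDE"
# Cyclic-shift construction of a 5x5 Latin square.
square = [[letters[(row + col) % n] for col in range(n)] for row in range(n)]

# Balance check: every letter appears exactly once in each row and column.
for i in range(n):
    assert sorted(square[i]) == list(letters)                        # row i
    assert sorted(square[r][i] for r in range(n)) == list(letters)   # column i

# Placeholder responses for the 25 preparations (one per cell).
response = {(r, c): random.gauss(50, 5) for r in range(n) for c in range(n)}

# Each factor-level average pools 5 runs; every other factor's levels each
# appear once in that pool, so their contributions tend to cancel.
row_avg = [sum(response[r, c] for c in range(n)) / n for r in range(n)]
col_avg = [sum(response[r, c] for r in range(n)) / n for c in range(n)]
letter_avg = {L: sum(response[r, c] for r in range(n) for c in range(n)
                     if square[r][c] == L) / n for L in letters}
print(row_avg, col_avg, letter_avg, sep="\n")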
This design represents 125 possible combinations. Ward and Littlefield made all of their
comparisons with the power of averages of 5 vs. averages of 5. They did so with only 25
experiments. They then prepared a prediction equation and calculated the expected outcomes
of all 125 possible combinations. The predictive ability of the equation was tested by
preparing those combinations that were expected to have superior performance. Thus, they
were able to identify winning combinations with a minimum of effort and a maximum of
confidence.
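The paper does not state the form of Ward and Littlefield's prediction equation; one common choice for a design like this is an additive main-effects model, sketched below with placeholder data so that the mechanics of ranking all 125 combinations are visible.

from itertools import product
import random

n = 5
letters = "ABCDE"
square = [[letters[(r + c) % n] for c in range(n)] for r in range(n)]
y = {(r, c): random.gauss(50, 5) for r in range(n) for c in range(n)}  # placeholder data

grand = sum(y.values()) / (n * n)
row_eff = [sum(y[r, c] for c in range(n)) / n - grand for r in range(n)]
col_eff = [sum(y[r, c] for r in range(n)) / n - grand for c in range(n)]
let_eff = {L: sum(y[r, c] for r in range(n) for c in range(n)
                  if square[r][c] == L) / n - grand for L in letters}

def predict(row_level, col_level, letter):
    # Additive main-effects prediction for any of the 5*5*5 = 125 combinations.
    return grand + row_eff[row_level] + col_eff[col_level] + let_eff[letter]

# Rank all 125 combinations and inspect the most promising few.
ranked = sorted(((predict(r, c, L), r, c, L)
                 for r, c, L in product(range(n), range(n), letters)),
                reverse=True)
print(ranked[:3])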
Twenty-five may seem like a lot of experiments, but they examined a total of 125 possible
combinations (5^3) with the power of averaging 5 vs. 5. This represents the equivalent of
625 separate preparations, examined with the effort of only 25 experiments. This is an
extremely efficient use of one's experimental and testing effort.
Adding One More Variable
Yet, as efficient as buying the results of an effort of 625 experiments for the cost of
25 appears, this design can accommodate additional variables. Consider studying an
oxidation inhibitor, in addition to the Ward-Littlefield list of 3. If we do this, we now
have 25 experiments in which we examine 5 levels of each of 4 different ingredients, all
with an average of 5 vs. an average of 5. This is shown in Figure 3.

Figure 3
This Greco-Latin square represents a total of 625 combinations (5^4)
and, with averaging, the effort of 3,125 separate experiments. Again, this is done in only
25 experiments. Each row has each of A, B, C, D, and E only once. Each row also has each of
α, β, γ, δ, and ε only once. The same is true for each column. Therefore, we have maintained
the balance of the Latin square. Now that we have added Greek letters, the design is called a
Greco-Latin square design.
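The orthogonality that makes this work can be verified directly. The sketch below superimposes two cyclically constructed 5x5 Latin squares (an assumed construction, not necessarily the one behind Figure 3) and confirms that every Latin-Greek pair occurs exactly once among the 25 cells.

n = 5
latin = "ABCDE"
greek = ["alpha", "beta", "gamma", "delta", "epsilon"]

# Two cyclically constructed Latin squares: (r + c) mod 5 and (r + 2c) mod 5.
pairs = [(latin[(r + c) % n], greek[(r + 2 * c) % n])
         for r in range(n) for c in range(n)]

# Orthogonality: all 25 (Latin, Greek) pairs are distinct.
assert len(set(pairs)) == n * n
print(sorted(pairs))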
Does it stop here? No! In fact, two more factors may be examined at 5 levels each with
these same 25 experiments. We add each as a group of 5 levels. These highly saturated
designs are called hyper-Greco-Latin squares. Consider Figure 4, below.

Figure 4
We wish to examine 6 different factors, each with 5 possible choices. The factors could
be anti-oxidants, metal passivators, E/P-anti-wear agents, viscosity modifiers,
thickeners, thickener complexes, dyes, or any other additive to our formulation. In this
6-ingredient, 5-level case we have 15,625 possible combinations of ingredients (5^6).
We will still consider the contribution of each with the confidence of an average of 5
compared with averages of 5. This represents the effort of 78,125 separate experiments.
Again, we will examine them with only 25 experiments. This is efficiency!
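One convenient way to build such a design, since 5 is a prime number, is to derive four mutually orthogonal Latin squares from cyclic shifts and let the rows, columns, and those squares carry the six factors. The sketch below uses that assumed construction with hypothetical factor names; it generates the 25 recipes and checks that every factor, and every pair of factors, stays balanced.

from itertools import combinations

n = 5
# Hypothetical factor labels; any six 5-choice ingredient groups would do.
factors = ["antioxidant", "copper_passivator", "ep_antiwear",
           "rust_inhibitor", "thickener", "viscosity_modifier"]

# Rows, columns, and the four squares (r + k*c) mod 5, k = 1..4, carry the
# six factors; because 5 is prime these squares are mutually orthogonal.
runs = []
for r in range(n):
    for c in range(n):
        levels = [r, c] + [(r + k * c) % n for k in range(1, 5)]
        runs.append(dict(zip(factors, levels)))

# Balance: each factor sees every one of its 5 levels exactly 5 times.
for f in factors:
    counts = [[run[f] for run in runs].count(level) for level in range(n)]
    assert counts == [5] * n

# Orthogonality: every pair of factors sees all 25 level combinations once.
for f, g in combinations(factors, 2):
    assert len({(run[f], run[g]) for run in runs}) == n * n

for run in runs[:3]:
    print(run)   # first three of the 25 recipes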
Although there is not time to go through an actual case study of this design, the method
has been used in many industrial settings. Paints, inks, plastics, pesticides, and now
greases have been developed through the use of this experimental design.
Conclusions
The Hyper-Greco-Latin Square is a robust design for screening the best combinations of
ingredients from the many possibilities. The mathematics for its effective use is simple
and understandable. This method deserves to be considered the next time a functional
mixture formulation is the goal of an R&D effort.
References
- Box, George E. P., William G. Hunter, and J. Stuart Hunter; 'Statistics for
Experimenters: An Introduction to Design, Data Analysis, and Model Building'; John Wiley,
1978.
- Ward, Carl E., and Carlos E. Littlefield; 'Experimental Design in New Grease
Development'; presented to the 61st Annual Meeting of the National Lubricating Grease
Institute; October 23-26, 1994; Palm Springs, CA.
1125-B Arnold Drive, Suite 271
Martinez, CA 94553-4108