Live Free or Dichotomize
/tags/rms/index.xml
Recent content on Live Free or DichotomizeHugo -- gohugo.ioen-usThe dire consequences of tests for linearity
/2017/02/18/the-dire-consequences-of-tests-for-linearity/
Sat, 18 Feb 2017 09:03:06 -0600/2017/02/18/the-dire-consequences-of-tests-for-linearity/<!-- BLOGDOWN-BODY-BEFORE
/BLOGDOWN-BODY-BEFORE -->
<p>This is a tale of the dire <strong>type 1 error</strong> consequences that occur when you test for linearity.</p>
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/The_Scream.jpg/603px-The_Scream.jpg" target="_blank"> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f4/The_Scream.jpg/603px-The_Scream.jpg" width=50% alt="the scream"> </a>
<p style="color:#EB6864; font-size: 10pt;LINE-HEIGHT:15px;">
<em>Edvard Munch’s The Scream (1893), coincidentally also the face <a href="https://twitter.com/f2harrell">Frank Harrell</a> makes when he sees students testing for linearity.</em>
</p>
<p>First, my favorite explanation of <strong>type 1 error</strong> 🐺:</p>
<center>
<blockquote class="twitter-tweet" data-conversation="none" data-lang="en">
<p lang="en" dir="ltr">
<a href="https://twitter.com/jgschraiber"><span class="citation">@jgschraiber</span></a> <a href="https://twitter.com/eagereyes"><span class="citation">@eagereyes</span></a> Pro-tip that changed my life: in The Boy Who Cried Wolf, the villagers first make a Type 1, and then a Type 2 error.
</p>
— Sam (<span class="citation">@geometrywarrior</span>) <a href="https://twitter.com/geometrywarrior/status/781162199540719616">September 28, 2016</a>
</blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</center>
<p>We generally fix (or claim to fix) this <strong>type 1 error</strong> at 0.05, but sometimes our procedures can make this go awry!</p>
<p>I’ve prepared a <strong>very</strong> basic simulation.</p>
<ul>
<li>generate 100 data points from two independent random normal distributions, an outcome <span class="math inline">\(y\)</span> and a predictor <span class="math inline">\(x\)</span><span style="color:#EB6864"> (Since these are generated randomly, we would <strong>not</strong> expect there to be an association between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. If all goes as planned, our <strong>type 1 error</strong> would be 0.05) </span></li>
<li>fit simple linear model with a restricted cubic spline on the predictor <span class="math inline">\(x\)</span></li>
<li>test whether the nonlinear terms are significant</li>
<li>if they are, leave them in and test the association between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span></li>
<li>if they are not, remove them and refit the model with only a linear term for <span class="math inline">\(x\)</span> & proceed to test the association between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>.<br />
</li>
<li>calculate the <strong>type 1 error</strong>, how many times we detected a spurious significant association between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>.</li>
</ul>
<p>Here’s my simulation code (run it yourself!):</p>
<pre class="r"><code>library('rms')
sim <- function(wrong = TRUE){
#generate completely random data
y <- rnorm(100)
x <- rnorm(100)
#fit a model with a restricted cubic spline
mod <- ols(y~rcs(x))
if (wrong == TRUE & anova(mod)[3,5] > 0.05){
#if the test for non-linearity is not "significant", remove nonlinear terms
mod <- ols(y~x)
}
#save the p-value
anova(mod)[1,5]
}</code></pre>
<center>
<p><span style="color:#EB6864; font-size: 20pt"> [Type 1 error when removing non-significant nonlinear terms]</p>
<p></span></p>
</center>
<pre class="r"><code>test <- replicate(10000, sim())
cat("The type 1 error is", mean(test <= 0.05)) </code></pre>
<pre><code>## The type 1 error is 0.0853</code></pre>
<p>Uh oh! That <strong>type 1 error</strong> is certainly higher than the nominal 0.05 we claim!</p>
<center>
<p><span style="color:#EB6864; font-size: 20pt"> [Type 1 error when not removing non-significant nonlinear terms]</span></p>
</center>
<p>We would expect the <strong>type 1 error</strong> to be 0.05 – I perform the same simulation omitting the step of removing non-significant nonlinear terms and calculate the <strong>type 1 error</strong> again.</p>
<pre class="r"><code>test <- replicate(10000, sim(wrong = FALSE))
cat("The type 1 error is", mean(test <= 0.05))</code></pre>
<pre><code>## The type 1 error is 0.0501</code></pre>
<p>Much better 👯</p>
<p>The conclusion: <a href="http://livefreeordichotomize.com/2017/01/27/yoga-for-modeling/">fit flexible models</a> - skip the tests for linearity!</p>
<p><em>This has been elegently demonstrated by others, check out <a href="http://onlinelibrary.wiley.com/doi/10.1002/sim.4780100504/full">Grambsch and O’Brien</a>.</em></p>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
if (location.protocol !== "file:" && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
<!-- BLOGDOWN-HEAD
/BLOGDOWN-HEAD -->
Yoga for modeling
/2017/01/27/yoga-for-modeling/
Fri, 27 Jan 2017 14:30:26 -0600/2017/01/27/yoga-for-modeling/<!-- BLOGDOWN-BODY-BEFORE
/BLOGDOWN-BODY-BEFORE -->
<p>A New Year’s resolution for all of our models: get more flexible!</p>
<p><em>As an aside, I’m better at implementing yoga for my models than yoga for myself, most of the time I end up like this:</em></p>
<div class="figure">
<img src="http://imgs.xkcd.com/comics/picture_a_grassy_field.png" />
</div>
<p>Anyways, let’s make our models flexible! By flexible, we mean let’s be more intentional about fitting nonlinear parametric models.</p>
<div id="what-do-you-mean-by-nonlinear-modeling" class="section level2">
<h2>What do you mean by nonlinear modeling</h2>
<p>By nonlinear modeling, we mean fitting parametric models with nonlinear terms. In particular, Professor Harrell suggests restricted cubic splines (more on this below). So by this definition, you could be fitting a “linear” model (as on ordinary least squares), with nonlinear terms. We brought up in class that this can cause some confusion, and setting on calling it “Gaussian regression” So you heard it here first, folks, from now on, refer to your ordinary least squares models as “Gaussian”.</p>
<p style="text-align:right; color:#EB6864; font-size: 10pt; LINE-HEIGHT:15px;">
Update: Professor Harrell proofed this and suggested <br> “what about just calling it OLS?”, so nevermind on <br> Gaussian, it had a good run…
</p>
<p>I spent a really long time trying to think up a whitty pun/slogan for this name change, alas, all you get is this:</p>
<div class="figure">
<img src="http://imgs.xkcd.com/comics/math_paper.png" />
</div>
<p><em>If you have a particulary clever slogan, <a href="http://twitter.com/LucyStats">send it to me.</a></em></p>
<p style="text-align:right; color:#EB6864; font-size: 10pt; LINE-HEIGHT:15px;">
Update 2: Actually! We’re putting it to a vote. <br> Gaussian may have another chance. <br> Will update in a further post.
</p>
<p>To do this nonlinear modeling, restricted cubic splines are suggested because</p>
<ol style="list-style-type: decimal">
<li><p><strong>they allow for flexibility without costing too many degrees of freedom</strong> (it only costs <span class="math inline">\(k-1\)</span> degrees of freedom, here <span class="math inline">\(k\)</span> is the total number of knots) <em>for comparison, unrestricted cubic splines require <span class="math inline">\(k+3\)</span> parameters to be estimated</em></p></li>
<li><p><strong>they force linearity in the tails</strong> (that is before the first knot and after the last knot) <em>this is good because cubic spline functions can have weird behavior in the tails</em></p></li>
<li><p><strong>regression coeffcients can be estimated using standard techniques</strong></p></li>
<li><p><strong>there are easy functions in <code>R</code> to do it</strong> Harrell’s <code>rms</code> package makes restricted cubic spines very easy to implement. Simply wrap your predictor in the <code>rcs()</code> function within your model statement. For example, if I were to fit a Gaussian model (see what I did there 😉), I would run the following code to fit <code>x</code> flexibly with 4 knots:</p></li>
</ol>
<pre><code>library('rms')
k <- 4
ols(y~rcs(x, k))</code></pre>
</div>
<div id="why-should-i-care-about-nonlinear-modeling" class="section level2">
<h2>Why should I care about nonlinear modeling</h2>
<p>Fitting flexible models is an excellent way to avoid having to test a whole bunch of assumptions that are hard to prove/disprove! Additionally, once you’ve fit a model, changing it based on “model checking” procedures has the danger of inflating your type 1 error 😱. We discuss an excellent paper by <a href="http://onlinelibrary.wiley.com/doi/10.1002/sim.4780100504/full">Grambsch and O’Brien</a> demonstrating that testing for nonlinear effects & then proceeding to drop from from the model if the test is not significant has real implications for the type 1 error. Might as well just model everything flexibly from the get-go! <em>But what about my degrees of freedom? ah yes, the next section is just for you!</em></p>
</div>
<div id="what-if-i-dont-have-enough-degrees-of-freedom-for-nonlinear-modeling" class="section level2">
<h2>What if I don’t have enough degrees of freedom for nonlinear modeling</h2>
<p>In predictive modeling, we are always concerned with <em>overfitting</em>, or creating models that describe the population we’ve sampled very well, but are not generalizable. In the extreme case, you can imagine if you fit a model with 50 participants and included 50 covariates, one for each participant, you could perfectly predict any outcome (but that would be a pretty silly model). In order to avoid overfitting disasters, there are useful rules of thumb we try to follow. In general, we try to fit models that have around <span class="math inline">\(m/15\)</span> covariates, where <span class="math inline">\(m\)</span> varies depending on the type of model in the following manner (for those following along, this is on page 74 of <a href="https://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/3319194240/ref=sr_1_1?ie=UTF8&qid=1484713166&sr=8-1&keywords=regression+modeling+strategies">Regression Modeling Strategies</a>).</p>
<table>
<thead>
<tr class="header">
<th>Response</th>
<th>m</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>continuous</td>
<td>the total sample size</td>
</tr>
<tr class="even">
<td>binary</td>
<td># of events</td>
</tr>
<tr class="odd">
<td>survival</td>
<td># of failures</td>
</tr>
</tbody>
</table>
<p>Now that we’ve convinced you that degrees of freedom matter, I’m sure you’re thinking “but nonlinear terms add degrees of freedom!” True. Restricted cubic splines are a bit less aggressive that regular cubic splines in terms of degrees of freedom usage, but even still sometimes that is not enough. In those cases, we need to determine how to best utilize our degrees of freedom. Professor Harrell recommends the following (page 67 of <a href="https://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/3319194240/ref=sr_1_1?ie=UTF8&qid=1484713166&sr=8-1&keywords=regression+modeling+strategies">RMS</a>):</p>
<ul>
<li>Let all continuous predictors be represented as restricted cubic splines</li>
<li>Let all categorical predictors retain their original categories except for pooling of very low prevalence categories (e.g., ones containing < 6 observations).</li>
<li>Fit this general main effects model.</li>
<li>Compute the partial <span class="math inline">\(\chi^2\)</span> statistic for testing the association of each predictor with the response, adjusted for all other predictors.</li>
<li>Sort the partial association statistics in descending order.</li>
<li>Assign more degrees of freedom for variables high on the list (ie allow for nonlinear terms/more knots for highly ranked covariates)</li>
</ul>
<p><span style="color:#EB6864; font-size: 15pt"> VERY IMPORTANT: Make certain that tests of nonlinearity are not revealed as this would bias the analyst.</span></p>
<p>Here is a little snippet of <code>R</code> code to do just that in a logistic regression case. We are using the <code>support2</code> dataset as an example.</p>
<pre class="r"><code>library('rms')
getHdata(support2)
f <- lrm(hospdead ~ rcs(age,5) + rcs(temp, 5) + rcs(hrt, 5) + rcs(pafi, 5) + edu + sex, data = support2)
plot(anova(f))</code></pre>
<p><img src="#####../content/post/2017-01-27-yoga-for-modeling_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<p>And there you have it.</p>
<p><img src = "https://raw.githubusercontent.com/LucyMcGowan/lucymcgowan.github.io/master/figs/df_meme.png" width = 60%> <em>meme cred: Nick Strayer</em></p>
</div>
<div id="take-away" class="section level2">
<h2>Take Away</h2>
<ul>
<li>Always fit flexible models using restricted cubic splines</li>
<li>Never test for linearity & proceed to remove a nonlinear term based on the result</li>
<li>If necessary be choosy about where to spend degrees of freedom</li>
</ul>
</div>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
if (location.protocol !== "file:" && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
<!-- BLOGDOWN-HEAD
/BLOGDOWN-HEAD -->
Regression modeling strategies: a student's perspective
/2017/01/17/regression-modeling-strategies-a-students-perspective/
Tue, 17 Jan 2017 22:15:13 -0600/2017/01/17/regression-modeling-strategies-a-students-perspective/<!-- BLOGDOWN-BODY-BEFORE
/BLOGDOWN-BODY-BEFORE -->
<p>Frank Harrell teaches an amazing course “Regression Modeling Strategies” based on his <a href="https://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/3319194240/ref=sr_1_1?ie=UTF8&qid=1484713166&sr=8-1&keywords=regression+modeling+strategies">book</a> each spring at Vanderbilt.</p>
<p>This was one of my all time favorite courses. It has just the right amount of practical strategies, brilliant statistical insight, and zealous disdain for all things stepwise regression. In fact, Frank’s valiant fight against dichotomization inspired the name of this very blog.</p>
<p>This semester, Nick is enrolled in the course, and I will be TAing it, so we thought it would be a perfect time to do what all good 21st century students do, blog about it. We will be using the #rms tag to thread the series to make it easy to follow along.</p>
<p>To kick things off, I encourage everyone to get a flavor of our beloved professor’s style by trying out the following <code>R</code> code:</p>
<pre><code>library("fortunes")
fortune("Harrell")</code></pre>
<p>Here is one of my personal favorites:</p>
<pre><code>##
## The fact that some people murder doesn't mean we should copy them. And
## murdering data, though not as serious, should also be avoided.
## -- Frank E. Harrell (answering a question on categorization of
## continuous variables in survival modelling)
## R-help (July 2005)</code></pre>
<p>We have been planning on kicking this series off for a while, but it happens to serendipitously coincide with Frank jumping online via <a href="www.twitter.com/f2harrell">twitter</a> and a <a href="http://www.fharrell.com">blog</a>! We are so excited to see how this goes.</p>
<!-- BLOGDOWN-HEAD
/BLOGDOWN-HEAD -->