Thursday, April 21, 2016

A Basic Guide to Randomization in Excel

Throughout this semester, we've discussed many many many times how important randomization is to proper experimental design. In order to infer anything about a population, your experimental sample must randomly choose samples from the population. Any bias in choosing some samples over others will invalidate one of the fundamental assumptions of proper statistical analysis. So, on a more practical note, I wanted to provide a (hopefully) quick and easy guide to randomizing a list of numbers in Excel.

For this tutorial, I made up an imaginary data set. Let's assume I'm examining individual cells in live cell microscopy and timing how long they stay in prophase (one stage of the cell cycle). In my preliminary data, I counted all 40 cells imaged in this group, but I only want the data from 20 randomly selected cells. Below is the table of all 40 cells and their time in prophase.

Next, make a new column to the right of the data points titled "RANDOM". In the first cell of the column, add the formula "=RAND()". Excel will produce a random number between 0 and 1 in this cell.


Apply this same formula to every cell in the column corresponding to a data point by double clicking or dragging down the highlighted bottom right corner of the first random cell (or just copy/paste).


Importantly, you must convert all of the random formulas to permanent numbers that won't keep changing every time you refresh the page or perform another function. So, copy all cells in the Random column, and right click on those same cells to choose Paste Special --> Values. This will replace the formula with a permanent number of the same value. This is really really important to do or the next steps won't work.


Next, go to the Data menu in the top bar and select the Sort button (see button in upper right corner in image below).


A window will appear. Choose to sort the RANDOM column by values, smallest to largest.




Now, your list will be sorted by smallest to largest random value, meaning the data points are now in a completely random order. The entire row is maintained when you sort in this way, meaning that the same ID number, data point, and random number are sorted together as a group. Now you can select the first n number of cells that you need for random selection. Since I wanted 20 random data points in this scenario, I highlighted the first 20 entries that appear in the list.



And voila! You have a randomly selected sample from your list! Have any of you used Excel for randomizing data before? Do you prefer a different program? Any other tips or tricks for people learning to randomize data? Comment below!

4 comments:

  1. Ok 1) This is so helpful! I have a hard time with Excel sometimes, so thank you for putting this together.
    2) Was it mentioned in class that Excel doesn't actually make random numbers and that the numbers it generates are based on your computer/IP address or something like that? Would this effect your outcome?
    3) Does it count as throwing away your data for the 20 cells you didn't chose to examine here? Why would you count all 40 if you new you only needed a sample size of 20?

    ReplyDelete
  2. Thanks for posting this! It's probably the most helpful entry on this blog. My labs' current system for randomization is forcing someone who has nothing to do with the project give each animal a 'codename,' usually from a random movie, that blinds the experimenter when they're analyzing data (we do a lot of IHC and image analysis). But this doesn't necessarily randomize the order of analysis, which the system you've described could be used to do too. I've always dreaded having to write something like "mice were randomized by naming them after hobbits. Frodo was excluded as an outlier."

    In regards to the above question, I can't think of why the slight non-randomness of this excel technique would ever skew your results. This is especially the case if you're just trying to blind yourself rather than select subsets of your data for analysis. Unless you're brain is capable of recognizing the algorithm, I think you're safe.

    ReplyDelete
  3. Helpful post! Since we are getting to the end to the semester, it is really nice to refresh the experimental design basics.
    I was trying to apply this trick to the design that I used to work with: bacteria strains and extract treatments. If we want to do randomization, it would not be a good idea to test all strain or the same extract at a same batch, but it is hard to stay on track if you just pick one you like.
    Thus, it would be really nice if there is a way to randomize design in form of text.

    ReplyDelete
  4. This is fantastic! Thanks, Emily! I think one of the main problems with adopting randomization in the lab is actually knowing HOW to do so easily. Let's just say the activation energy for doing so is quite high when you are unsure of a straightforward, easily accessible method (as you have demonstrated here). You have effectively lowered the activation energy for me. I'm looking foward to trying this and investigating some other approaches to randomization in my own experiments.

    ReplyDelete