Sampling methodologies, the basics:
- Andrea Osika
- Nov 30, 2020
- 4 min read
Updated: Dec 9, 2020

To gather data on a population, we can make observations either overtly - where the participants are aware they are being observed - or covertly - where the participants are unaware. In the digital age, recommendation engines and other machine learning advancements let us easily extract insights and trends from observed behavior. These trends are only implied, though. Songs listened to in a certain genre might be due to a holiday or a mood (that can last weeks or months), so the signal can be clouded with extraneous noise. Sometimes we need to gather explicit data, so we ask directly through surveys.
When we survey a population, we get information that's stated outright, without the vagueness or ambiguity of information inferred by an algorithm. These insights collected directly from the individual are explicit data. The different modalities for data collection can even be used together, augmenting the performance of an algorithm with the explicit input. When Amazon surveys its customers with a question like: "Are you buying this as a gift?" it can track the order separately to avoid altering future suggestions (AND upsell the gift-wrapping).
See how that works?

In most circumstances, gathering data on the entire population (aka a census) can be unrealistic due to cost and time constraints and really - human nature. Enter survey sampling: the process of selecting a sample of elements from a target population.
A sample survey usually offers greater scope than a census. Sampling may make it possible to study a population of a larger geographical area. It can help to find out more about the same population by examining an area in greater depth through a smaller sample.
Some considerations:
* Diversity - It's important that the sample is truly representative of the population. Age, gender, race, socioeconomic status, geography, and many other considerations should be examined.
* Consistency - Standardized procedures help to collect accurate data, especially over time.
* Transparency - By sharing methods used for selecting the sample, researchers can offer their perspectives to help with how the results are interpreted.

With these considerations in mind, we can move along to some basic sampling methods:
Random Sampling:
This comes in handy when the population is large - very large - or when it's hard to define or identify every member of the population. The principle of this technique is that every member has the same probability of being chosen. So unless your population all lives on the corner, randomly asking 5 people on the corner isn't true random sampling.
Pros: Every member has an equal chance of being represented
Cons: When not every member of the population can be identified, it's difficult to ensure that the diversity consideration is met.
Use Case: Perhaps we want to understand cereal consumption in the United States. It would be unrealistic to survey every household that eats cereal - BUT we could randomly sample throughout varied geographies depending on what we were trying to find out and other constraints.
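To make the "same probability of being chosen" idea concrete, here is a minimal sketch of a simple random draw. It assumes we already have a list of household IDs to draw from; the list, its size, and the sample size of 500 are made up for illustration.

```python
import random

# Hypothetical sampling frame: 10,000 household IDs (invented for this sketch).
households = [f"household_{i}" for i in range(10_000)]

# A seeded generator keeps the draw reproducible, which helps with consistency.
rng = random.Random(42)

# Draw 500 households without replacement; each one is equally likely to be picked.
sample = rng.sample(households, k=500)

print(len(sample))
```

Recording the seed alongside the results is one small way to stay transparent about how the sample was selected.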
Systematic Sampling:
This works well if your population is somewhat the 'same' in nature - homogeneous. If you're sampling gym members at a specific location, this is a pretty specific and homogeneous population. Once you've established your sample size, you can pick at regular intervals from the list.
Pros: It's much simpler than random sampling, and you have pretty much assured even sampling across the list.
Cons: There MIGHT be a pattern in the list you create that you don't know about - see how careful you have to be?
Use Case: The gym wants to understand the reasons for membership level choices so it picks every 25th member. (I'd make sure to pick a larger interval here to represent the varied hours of the day)
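Picking at regular intervals can be sketched in a few lines. This is one way to do it, assuming a member list of 1,000 (a made-up number) and the every-25th interval from the gym example. The random starting offset is an addition of this sketch, used so the first pick isn't always the first name on the list.

```python
import random

def systematic_sample(items, interval, rng=None):
    """Pick every `interval`-th item, starting from a random offset."""
    rng = rng or random.Random()
    start = rng.randrange(interval)  # random start in 0 .. interval - 1
    return items[start::interval]

# Hypothetical membership list (invented for this sketch).
members = [f"member_{i}" for i in range(1000)]

picked = systematic_sample(members, interval=25, rng=random.Random(0))
print(len(picked))  # 1000 members / interval of 25 = 40 picks
```

If the list happens to be ordered in a way that repeats every 25 entries, every pick lands on the same kind of member, which is exactly the hidden-pattern risk mentioned in the cons.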
Benchmark or Stratified Sampling:
This is where things get compartmentalized. For real: age, gender, food allergies, household earnings, you name it. So if the percentage of people aged 20-24 is 6.7% (in the US it is), then 6.7% of the sample should be 20-24 years old. It's often used when a segment of the population has a low occurrence compared to the other segments. (Imbalanced data - happens a lot!) Another case would be having 3 sub-populations of varying sizes - let's say 150, 200, and 250. With a sampling fraction of 0.5, the samples would be 150*0.5 = 75, 200*0.5 = 100, and 250*0.5 = 125. Here the constant factor is the sampling fraction applied to each population subset. In any case, a random sample is taken from each segment, or stratum.
Pros: This is an attempt at being pro-active about getting an accurate representation.
Cons: It's the opposite of simple. Trying to make exactly 6.7% of your sample aged 20-24 while meeting all the other criteria gets pretty labor-intensive.
Use Case: This applies best when the sub-populations themselves are what we are examining. For example, trying to understand what age group would be most apt to buy our product.
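The proportional arithmetic above can be sketched as follows, using strata of 150, 200, and 250 members and a sampling fraction of 0.5. The group names and member IDs are placeholders invented for this example.

```python
import random

def stratified_sample(strata, fraction, rng=None):
    """Draw a simple random sample of the same fraction from every stratum."""
    rng = rng or random.Random()
    sample = {}
    for name, members in strata.items():
        k = round(len(members) * fraction)  # proportional allocation
        sample[name] = rng.sample(members, k)
    return sample

# Hypothetical strata (names and sizes are for illustration only).
strata = {
    "group_a": [f"a_{i}" for i in range(150)],
    "group_b": [f"b_{i}" for i in range(200)],
    "group_c": [f"c_{i}" for i in range(250)],
}

result = stratified_sample(strata, fraction=0.5, rng=random.Random(1))
sizes = {name: len(members) for name, members in result.items()}
print(sizes)  # {'group_a': 75, 'group_b': 100, 'group_c': 125}
```

Because the same fraction is applied everywhere, each group keeps the share of the sample that it has in the population.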
Then there are the not-so-statistical sampling techniques that don't rely on probability. These methods therefore don't necessarily reflect the true nature of the population but give us a general sense:
Convenience Sampling:
This is just like the name sounds. Samples are selected based on their availability. Typically this happens in the very early stages of research.
Pros: As easy as it gets
Cons: NOT representative
Use case: to get a very high-level understanding
Snowball Sampling:
This is where participants are asked for referrals to take part in the survey. Here you rely heavily on your initial respondents for future participants. But in cases where your population is hard to locate or is closely connected - this is the method you'd use.
Pros: Homogeneous representation
Cons: Dependent on initial respondents
Use case: Drug user research or organized crime intel
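The referral process can be pictured as walking outward from the first respondents. The sketch below is only an illustration: the names and the `referrals` mapping are invented, and real referral chains are much messier than a tidy dictionary.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Follow referrals outward from the seed respondents, wave by wave."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for referred in referrals.get(person, []):
            if referred not in seen:  # don't survey anyone twice
                seen.add(referred)
                queue.append(referred)
    return sample

# Hypothetical referral network: who each respondent would refer.
referrals = {
    "ana": ["ben", "cy"],
    "ben": ["dee"],
    "cy": ["dee", "eli"],
    "dee": [],
    "eli": ["ana"],
    "fay": ["ana"],  # nobody refers fay, so she is never reached from ana
}

wave = snowball_sample(referrals, seeds=["ana"], max_size=10)
print(wave)  # ['ana', 'ben', 'cy', 'dee', 'eli']
```

Notice that "fay" never shows up: anyone outside the seeds' network is invisible to this method, which is the dependence on initial respondents listed in the cons.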
Quota Sampling:
This is a non-probabilistic method that resembles stratified or benchmark sampling. Let's say you just need 50 men and 50 women.
Pros: Simple and various features can be measured simultaneously
Cons: Because respondents within each quota aren't chosen at random, the results don't support the statistical or probabilistic inference that probability samples allow.
Use case: If you need a rough estimate from respondents in preliminary research that does not require statistical or probabilistic inference.
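A quota like "50 men and 50 women" can be filled by taking respondents as they come until each group is full. This is a minimal sketch under that assumption; the respondent stream and its field names are hypothetical.

```python
def quota_sample(respondents, quotas, key):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    selected = []
    for person in respondents:
        group = key(person)
        if remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):  # all quotas are full
            break
    return selected

# Hypothetical stream of respondents in the order they showed up.
respondents = [
    {"id": i, "gender": "man" if i % 3 else "woman"} for i in range(400)
]

chosen = quota_sample(
    respondents, {"man": 50, "woman": 50}, key=lambda p: p["gender"]
)
print(len(chosen))  # 100
```

Nothing here is random: whoever arrives first gets in, which is why the result resembles a stratified sample in shape but not in statistical footing.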
In any case, there's more than one way to survey to gather explicit data. It really depends on what question you're trying to answer, your audience, and your resources. Once you pick your method, remember to aim to have a diverse sample, and be consistent and transparent in your methodologies. Happy sampling!