A common rule of thumb says that surveying 10% of a population is enough to estimate the results for the whole population. But if you have a huge dataset, such as 1 billion records, 10% of the population is still very large. Instead, you can look for the optimal (minimum) amount of data to survey.
A standard equation from survey statistics defines the appropriate sample size, that is, the number of people to include in a survey.
This equation is commonly applied to big data projects, with the dataset as the population, in order to define the appropriate sample of data that should be analyzed.
The parameters to define the sample size are:
Confidence level: how certain you need to be that the survey result reflects the whole population (for example, 95%)
Confidence interval: the error tolerance for the survey, that is, the margin of error you accept around the result
Accuracy: the data quality or trustworthiness of the information in the data
Data size: the total population (or number of records in the database)
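
To make the role of these parameters concrete, here is a minimal Python sketch. It assumes the equation in question is Cochran's sample size formula with a finite population correction, n0 = z^2 * p * (1 - p) / e^2 and n = n0 / (1 + (n0 - 1) / N), which is one widely used form and may not be the exact variant intended here. The function name sample_size, its parameter names, and the mapping of "accuracy" to an expected proportion p (with the conservative default of 0.5) are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence_level=0.95, margin_of_error=0.05,
                proportion=0.5):
    """Estimate a survey sample size (hypothetical helper, see assumptions above).

    population       -- data size N (total number of records)
    confidence_level -- for example 0.95 for 95%
    margin_of_error  -- confidence interval half-width e, for example 0.05
    proportion       -- expected proportion p; 0.5 gives the largest sample
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin_of_error ** 2)

    # Finite population correction, so the result never exceeds the population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    print(sample_size(1_000_000_000))  # 1 billion records
    print(sample_size(1_000))          # a small population
```

Under these assumptions, a population of 1 billion records with a 95% confidence level and a 5% margin of error calls for 385 records, far fewer than 10% of the population, while a population of 1,000 calls for 278. The required sample is driven mainly by the confidence level and the margin of error, and it barely grows once the population is large.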