Create dataset with observational error for log likelihood calculation using clustering. — create.dataset.for.parameter.fit • ambrosia

The function performs several processing steps on a sample dataset (see Details).

create.dataset.for.parameter.fit(
  min_price_pd = 20,
  min_cal_fd = 1000,
  min_clusters = 300,
  lower_limit_sigma = 0.01,
  data = NULL,
  outdir = "tests/testthat/test_outputs/",
  print_progress = FALSE
)

Arguments

min_price_pd	Minimum price paid for non-staples.
min_cal_fd	Minimum calories for the food demand model.
min_clusters	Minimum number of clusters to be generated by the clustering algorithm. It is recommended not to lower this parameter below 20.
lower_limit_sigma	Lower limit for the calculated sigma values.
data	A data.frame or data.table with the raw data. The data should contain the following columns: s_cal_pcap_day_thous (calories per capita per day for staples, in thousands), ns_cal_pcap_day_thous (calories per capita per day for non-staples, in thousands), gdp_pcap_thous (GDP per capita), s_usd_p1000cal (price of 1000 calories for staples per person per day), ns_usd_p1000cal (price of 1000 calories for non-staples per person per day).
outdir	Directory in which to store the output CSV. The default is tests/testthat/test_outputs/.
print_progress	Logical. Set to TRUE to track the progress of the function.
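
As an illustration of the input that the data argument expects, the sketch below builds a small data frame with the five required column names and checks that none are missing. The values and the names toy_data and required_cols are made up for this example; the check only mirrors the kind of validation described under Details and is not the package's own code.

```r
# Column names required by the `data` argument (taken from the list above).
required_cols <- c(
  "s_cal_pcap_day_thous",
  "ns_cal_pcap_day_thous",
  "gdp_pcap_thous",
  "s_usd_p1000cal",
  "ns_usd_p1000cal"
)

# Hypothetical toy input; the numbers are invented purely for illustration.
toy_data <- data.frame(
  s_cal_pcap_day_thous  = c(1.4, 1.2, 1.1),
  ns_cal_pcap_day_thous = c(0.9, 1.5, 1.8),
  gdp_pcap_thous        = c(2.5, 12.0, 35.0),
  s_usd_p1000cal        = c(0.10, 0.15, 0.20),
  ns_usd_p1000cal       = c(0.50, 0.70, 0.90)
)

# Report any required columns that are absent from the input.
missing_cols <- setdiff(required_cols, names(toy_data))
if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

A real dataset would need far more rows than this, since the function checks for a minimum number of clusters.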

Value

A data frame called Processed_Data_for_MC with the following columns:

sig2Qn: the observational error for non-staples

sig2Qs: the observational error for staples

Details

The function performs the following steps:

1) It will ensure that the user's dataset contains all columns required for parameter estimation.

2) It will filter out anomalies and outliers using parameterized cutoff values selected by the user. This step is necessary because data on food consumption and prices are often incomplete, which may lead to unrealistically high or low values of consumption or prices in the dataset.

3) After this, the function will create clusters of observations from the dataset based on income levels and the prices of staples and non-staples. This step is necessary because, for economic data of this kind, the observational error can only be calculated within clusters. The code will also check for a user-specified minimum number of clusters (if there are anomalies within the dataset, the clustering can be incorrect, resulting in a small number of clusters). The clustering is implemented using the Divisive Analysis Clustering Algorithm (DIANA).

4) Once the clustering is completed, the code will calculate the observational error, which is the variance in food demand for staples and non-staples. Note that the user can choose a lower limit on the calculated observational error. The default value of the lower limit is 0.01.
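
Steps 2 to 4 can be sketched roughly as follows, using diana() from the cluster package on simulated data. This is a simplified illustration, not the function's actual implementation. The filter rule, the feature scaling, the number of clusters k, the handling of single-observation clusters, and applying the lower limit directly to the variance are all assumptions made for this example.

```r
library(cluster)  # provides diana()

set.seed(1)
n <- 60
# Simulated data using the column names listed under Arguments.
d <- data.frame(
  s_cal_pcap_day_thous  = runif(n, 0.8, 1.8),
  ns_cal_pcap_day_thous = runif(n, 0.5, 2.0),
  gdp_pcap_thous        = runif(n, 1, 40),
  s_usd_p1000cal        = runif(n, 0.05, 0.30),
  ns_usd_p1000cal       = runif(n, 0.30, 1.20)
)

# Step 2 (assumed rule): drop rows whose total calories fall below a cutoff.
hypothetical_min_total_cal <- 1.5  # thousand calories per day, illustrative only
total_cal <- d$s_cal_pcap_day_thous + d$ns_cal_pcap_day_thous
d <- d[total_cal >= hypothetical_min_total_cal, ]

# Step 3: divisive clustering on income and prices (scaling is an assumption).
features <- scale(d[, c("gdp_pcap_thous", "s_usd_p1000cal", "ns_usd_p1000cal")])
dv <- diana(features)
k <- 10  # hypothetical number of clusters for this toy example
d$cluster <- cutree(as.hclust(dv), k = k)

# Step 4: within-cluster variance of demand, floored at the lower limit.
lower_limit_sigma <- 0.01
floor_var <- function(v) {
  v[is.na(v)] <- lower_limit_sigma  # single-row clusters have no variance (assumed handling)
  pmax(v, lower_limit_sigma)
}
d$sig2Qs <- floor_var(ave(d$s_cal_pcap_day_thous, d$cluster, FUN = var))
d$sig2Qn <- floor_var(ave(d$ns_cal_pcap_day_thous, d$cluster, FUN = var))
```

Each row ends up carrying the variance of its own cluster, so the sig2Qs and sig2Qn columns line up with the observations they describe.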

Author

KBN 2020