<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>retrospective-harmonization | Reprex</title><link>https://reprex-next.netlify.app/tag/retrospective-harmonization/</link><atom:link href="https://reprex-next.netlify.app/tag/retrospective-harmonization/index.xml" rel="self" type="application/rss+xml"/><description>retrospective-harmonization</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sat, 06 Mar 2021 00:00:00 +0000</lastBuildDate><image><url>https://reprex-next.netlify.app/media/icon_hub9491570ac57158c0eeecc95c95b13e5_20247_512x512_fill_lanczos_center_3.png</url><title>retrospective-harmonization</title><link>https://reprex-next.netlify.app/tag/retrospective-harmonization/</link></image><item><title>Where Are People More Likely To Treat Climate Change as the Most Serious Global Problem?</title><link>https://reprex-next.netlify.app/post/2021-03-06-individual-join/</link><pubDate>Sat, 06 Mar 2021 00:00:00 +0000</pubDate><guid>https://reprex-next.netlify.app/post/2021-03-06-individual-join/</guid><description>&lt;pre>&lt;code>library(regions)
library(lubridate)
library(dplyr)
if ( dir.exists('data-raw') ) {
data_raw_dir &amp;lt;- &amp;quot;data-raw&amp;quot;
} else {
data_raw_dir &amp;lt;- file.path(&amp;quot;..&amp;quot;, &amp;quot;..&amp;quot;, &amp;quot;data-raw&amp;quot;)
}
&lt;/code>&lt;/pre>
&lt;p>The first results of our longitudinal table &lt;a href="post/2021-03-05-retroharmonize-climate/">were difficult to
map&lt;/a>, because the surveys used
an obsolete regional coding. We will correct the coding where
possible, and join the data with the European Environment Agency’s (EEA)
Air Quality e-Reporting (AQ e-Reporting) data on environmental
pollution. We recoded the annual level for every available reporting
station [&lt;em>not shown here&lt;/em>]; all values are in μg/m³. The period
under observation is 2014-2016. Data file:
&lt;a href="https://www.eea.europa.eu/data-and-maps/data/aqereporting-8" target="_blank" rel="noopener">https://www.eea.europa.eu/data-and-maps/data/aqereporting-8&lt;/a> (European
Environment Agency 2021).&lt;/p>
&lt;h2 id="recoding-the-regions">Recoding the Regions&lt;/h2>
&lt;p>Recoding means that the boundaries of a region are unchanged, but the
country changed its name and code because of boundary changes elsewhere
that did not affect our observation unit. We explain the
problem and the solution in greater detail in &lt;a href="http://netzero.dataobservatory.eu/post/2021-03-06-regions-climate/" target="_blank" rel="noopener">our
tutorial&lt;/a>
that aggregates the data on regional levels.&lt;/p>
&lt;pre>&lt;code>panel &amp;lt;- readRDS(file.path(data_raw_dir, &amp;quot;climate-panel.rds&amp;quot;))
# derive the survey year from the interview date, then fix the obsolete NUTS codes
climate_data_geocode &amp;lt;- panel %&amp;gt;%
mutate ( year = lubridate::year(date_of_interview)) %&amp;gt;%
recode_nuts()
&lt;/code>&lt;/pre>
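&lt;p>The idea behind recoding can be sketched with a small lookup table. This is only an illustration: the region codes are made up and &lt;code>nuts_lookup&lt;/code> is a hypothetical name, while in the post itself &lt;code>recode_nuts()&lt;/code> does the actual work.&lt;/p>
&lt;pre>&lt;code># Hypothetical lookup table: obsolete code -&amp;gt; current code (made-up codes)
nuts_lookup &amp;lt;- c(&amp;quot;XX11&amp;quot; = &amp;quot;XXA1&amp;quot;, &amp;quot;XX12&amp;quot; = &amp;quot;XXA2&amp;quot;)
survey_codes &amp;lt;- c(&amp;quot;XX11&amp;quot;, &amp;quot;XXA2&amp;quot;, &amp;quot;XX12&amp;quot;, &amp;quot;YY01&amp;quot;)
# Replace obsolete codes; keep every code that is not in the lookup table
is_obsolete &amp;lt;- survey_codes %in% names(nuts_lookup)
recoded &amp;lt;- ifelse(is_obsolete, nuts_lookup[survey_codes], survey_codes)
recoded &amp;lt;- unname(recoded)
recoded
&lt;/code>&lt;/pre>
&lt;p>Only the labels change in this sketch; the observations attached to each region stay exactly where they were.&lt;/p>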
&lt;p>Let’s load the air pollution data and join it by the corrected geocodes:&lt;/p>
&lt;pre>&lt;code>load(file.path(&amp;quot;data&amp;quot;, &amp;quot;air_pollutants.rda&amp;quot;)) ## good practice to use system-independent file.path
climate_awareness_air &amp;lt;- climate_data_geocode %&amp;gt;%
rename ( region_nuts_codes = code_2016 ) %&amp;gt;%
left_join ( air_pollutants, by = &amp;quot;region_nuts_codes&amp;quot; ) %&amp;gt;%
select ( -all_of(c(&amp;quot;w1&amp;quot;, &amp;quot;wex&amp;quot;, &amp;quot;date_of_interview&amp;quot;,
&amp;quot;typology&amp;quot;, &amp;quot;typology_change&amp;quot;, &amp;quot;geo&amp;quot;, &amp;quot;region&amp;quot;))) %&amp;gt;%
mutate (
# remove special labels and create NA_numeric_
age_education = retroharmonize::as_numeric(age_education)) %&amp;gt;%
mutate_if ( is.character, as.factor) %&amp;gt;%
mutate (
# we only have responses from 4 years, and this should be treated as a categorical variable
year = as.factor(year)
) %&amp;gt;%
filter ( complete.cases(.) )
&lt;/code>&lt;/pre>
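&lt;p>To see what the join and the final filter do, here is a minimal sketch on toy data. The region codes and pollution values are invented; only &lt;code>left_join()&lt;/code> and &lt;code>complete.cases()&lt;/code> are taken from the code above.&lt;/p>
&lt;pre>&lt;code>library(dplyr)
# Toy data with made-up region codes and values
respondents &amp;lt;- data.frame(
rowid = c(&amp;quot;a_1&amp;quot;, &amp;quot;a_2&amp;quot;, &amp;quot;a_3&amp;quot;),
region_nuts_codes = c(&amp;quot;XXA1&amp;quot;, &amp;quot;XXA2&amp;quot;, &amp;quot;YY01&amp;quot;)
)
pollution &amp;lt;- data.frame(
region_nuts_codes = c(&amp;quot;XXA1&amp;quot;, &amp;quot;XXA2&amp;quot;),
pm10 = c(21.5, 33.0)
)
# Every respondent is kept; regions without pollution data get NA
joined &amp;lt;- left_join(respondents, pollution, by = &amp;quot;region_nuts_codes&amp;quot;)
# Rows with any missing value are dropped
complete &amp;lt;- joined[complete.cases(joined), ]
nrow(joined) # 3
nrow(complete) # 2
&lt;/code>&lt;/pre>
&lt;p>Dropping incomplete cases changes the sample, so it is worth checking how many rows are lost at this step.&lt;/p>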
&lt;p>The &lt;code>climate_awareness_air&lt;/code> data frame contains the answers of 75086
individual respondents. 17.07% thought that climate change was the most
serious world problem and 33.6% mentioned climate change as one of the
three most important global problems.&lt;/p>
&lt;pre>&lt;code>summary ( climate_awareness_air )
## rowid serious_world_problems_first
## ZA5877_v2-0-0_1 : 1 Min. :0.0000
## ZA5877_v2-0-0_10 : 1 1st Qu.:0.0000
## ZA5877_v2-0-0_100 : 1 Median :0.0000
## ZA5877_v2-0-0_1000 : 1 Mean :0.1707
## ZA5877_v2-0-0_10000: 1 3rd Qu.:0.0000
## ZA5877_v2-0-0_10001: 1 Max. :1.0000
## (Other) :75080
## serious_world_problems_climate_change isocntry
## Min. :0.000 BE : 3028
## 1st Qu.:0.000 CZ : 3023
## Median :0.000 NL : 3019
## Mean :0.336 SK : 3000
## 3rd Qu.:1.000 SE : 2980
## Max. :1.000 DE-W : 2978
## (Other):57058
## marital_status age_education
## (Re-)Married: without children :13242 18 :15485
## (Re-)Married: children this marriage :12696 19 : 7728
## Single: without children : 7650 16 : 5840
## (Re-)Married: w children of this marriage: 6520 still studying: 5098
## (Re-)Married: living without children : 6225 17 : 5092
## Single: living without children : 4102 15 : 4528
## (Other) :24651 (Other) :31315
## age_exact occupation_of_respondent
## Min. :15.0 Retired, unable to work :22911
## 1st Qu.:36.0 Skilled manual worker : 6774
## Median :51.0 Employed position, at desk : 6716
## Mean :50.1 Employed position, service job: 5624
## 3rd Qu.:65.0 Middle management, etc. : 5252
## Max. :99.0 Student : 5098
## (Other) :22711
## occupation_of_respondent_recoded
## Employed (10-18 in d15a) :32763
## Not working (1-4 in d15a) :37125
## Self-employed (5-9 in d15a): 5198
##
##
##
##
## respondent_occupation_scale_c_14
## Retired (4 in d15a) :22911
## Manual workers (15 to 18 in d15a) :15269
## Other white collars (13 or 14 in d15a): 9203
## Managers (10 to 12 in d15a) : 8291
## Self-employed (5 to 9 in d15a) : 5198
## Students (2 in d15a) : 5098
## (Other) : 9116
## type_of_community is_student no_education
## DK : 34 Min. :0.0000 Min. :0.000000
## Large town :20939 1st Qu.:0.0000 1st Qu.:0.000000
## Rural area or village :24686 Median :0.0000 Median :0.000000
## Small or middle sized town: 9850 Mean :0.0679 Mean :0.008151
## Small/middle town :19577 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
##
## education year region_nuts_codes country_code
## Min. :14.00 2013:25103 LU : 1432 DE : 4531
## 1st Qu.:17.00 2015: 0 MT : 1398 GB : 3538
## Median :18.00 2017:25053 CY : 1192 BE : 3028
## Mean :19.61 2019:24930 SK02 : 1053 CZ : 3023
## 3rd Qu.:22.00 EL30 : 974 NL : 3019
## Max. :30.00 EE : 973 SK : 3000
## (Other):68064 (Other):54947
## pm2_5 pm10 o3 BaP
## Min. : 2.109 Min. : 5.883 Min. : 66.37 Min. :0.0102
## 1st Qu.: 9.374 1st Qu.: 28.326 1st Qu.: 90.89 1st Qu.:0.1779
## Median :11.866 Median : 33.673 Median :102.81 Median :0.4105
## Mean :12.954 Mean : 38.637 Mean :101.49 Mean :0.8759
## 3rd Qu.:15.890 3rd Qu.: 49.488 3rd Qu.:110.73 3rd Qu.:1.0692
## Max. :41.293 Max. :123.239 Max. :141.04 Max. :7.8050
##
## so2 ap_pc1 ap_pc2 ap_pc3
## Min. : 0.0000 Min. :-4.6669 Min. :-2.21851 Min. :-2.1007
## 1st Qu.: 0.0000 1st Qu.:-0.4624 1st Qu.:-0.49130 1st Qu.:-0.5695
## Median : 0.0000 Median : 0.4263 Median : 0.02902 Median :-0.1113
## Mean : 0.1032 Mean : 0.1031 Mean : 0.04166 Mean :-0.1746
## 3rd Qu.: 0.0000 3rd Qu.: 0.9748 3rd Qu.: 0.57416 3rd Qu.: 0.3309
## Max. :42.5325 Max. : 2.0344 Max. : 3.25841 Max. : 4.1615
##
## ap_pc4 ap_pc5
## Min. :-1.7387 Min. :-2.75079
## 1st Qu.:-0.1669 1st Qu.:-0.18748
## Median : 0.0371 Median : 0.01811
## Mean : 0.1154 Mean : 0.06797
## 3rd Qu.: 0.3050 3rd Qu.: 0.34937
## Max. : 3.2476 Max. : 1.42816
##
&lt;/code>&lt;/pre>
&lt;p>Let’s fit a simple CART tree! We remove the regional codes, because
climate awareness differs very strongly across regions.
These differences, together with education level and the survey year,
are the most important predictors of whether a respondent in Europe
thinks of climate change as the most serious global problem.&lt;/p>
&lt;pre>&lt;code># Classification Tree with rpart
library(rpart)
# grow tree
fit &amp;lt;- rpart(as.factor(serious_world_problems_first) ~ .,
method=&amp;quot;class&amp;quot;, data=climate_awareness_air %&amp;gt;%
select ( - all_of(c(&amp;quot;rowid&amp;quot;, &amp;quot;region_nuts_codes&amp;quot;))),
control = rpart.control(cp = 0.005))
printcp(fit) # display the results
##
## Classification tree:
## rpart(formula = as.factor(serious_world_problems_first) ~ .,
## data = climate_awareness_air %&amp;gt;% select(-all_of(c(&amp;quot;rowid&amp;quot;,
## &amp;quot;region_nuts_codes&amp;quot;))), method = &amp;quot;class&amp;quot;, control = rpart.control(cp = 0.005))
##
## Variables actually used in tree construction:
## [1] age_education isocntry
## [3] serious_world_problems_climate_change year
##
## Root node error: 12817/75086 = 0.1707
##
## n= 75086
##
## CP nsplit rel error xerror xstd
## 1 0.0240566 0 1.00000 1.00000 0.0080438
## 2 0.0082703 3 0.92783 0.92783 0.0078055
## 3 0.0050000 5 0.91129 0.91425 0.0077588
plotcp(fit) # visualize cross-validation results
&lt;/code>&lt;/pre>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="&amp;amp;ldquo;Visualize cross-validation results&amp;amp;rdquo;" srcset="
/post/2021-03-06-individual-join/rpart-1_hu9f1f775a32eec3a67a573c0d2df50ef4_4271_8ce48ac0f7ba6b1d3752385b96368cc3.webp 400w,
/post/2021-03-06-individual-join/rpart-1_hu9f1f775a32eec3a67a573c0d2df50ef4_4271_b20e6dca7fcadd4576da216956498a35.webp 760w,
/post/2021-03-06-individual-join/rpart-1_hu9f1f775a32eec3a67a573c0d2df50ef4_4271_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://reprex-next.netlify.app/post/2021-03-06-individual-join/rpart-1_hu9f1f775a32eec3a67a573c0d2df50ef4_4271_8ce48ac0f7ba6b1d3752385b96368cc3.webp"
width="672"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
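&lt;p>The cross-validation table can also be used to prune a tree. The sketch below is not part of the original analysis: it uses the &lt;code>kyphosis&lt;/code> example data that ships with rpart, and picking the lowest cross-validated error is just one common rule of thumb.&lt;/p>
&lt;pre>&lt;code>library(rpart)
set.seed(2021) # cross-validation folds are random
# kyphosis is an example data set that ships with rpart
fit_toy &amp;lt;- rpart(Kyphosis ~ Age + Number + Start,
data = kyphosis, method = &amp;quot;class&amp;quot;)
# Pick the complexity parameter with the lowest cross-validated error
cp_table &amp;lt;- fit_toy$cptable
best_cp &amp;lt;- cp_table[which.min(cp_table[, &amp;quot;xerror&amp;quot;]), &amp;quot;CP&amp;quot;]
# Prune the tree back to that complexity
pruned &amp;lt;- prune(fit_toy, cp = best_cp)
printcp(pruned)
&lt;/code>&lt;/pre>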
&lt;pre>&lt;code>summary(fit) # detailed summary of splits
## Call:
## rpart(formula = as.factor(serious_world_problems_first) ~ .,
## data = climate_awareness_air %&amp;gt;% select(-all_of(c(&amp;quot;rowid&amp;quot;,
## &amp;quot;region_nuts_codes&amp;quot;))), method = &amp;quot;class&amp;quot;, control = rpart.control(cp = 0.005))
## n= 75086
##
## CP nsplit rel error xerror xstd
## 1 0.024056592 0 1.0000000 1.0000000 0.008043837
## 2 0.008270266 3 0.9278302 0.9278302 0.007805478
## 3 0.005000000 5 0.9112897 0.9142545 0.007758824
##
## Variable importance
## serious_world_problems_climate_change isocntry
## 31 26
## country_code BaP
## 20 8
## pm2_5 ap_pc1
## 4 3
## age_education pm10
## 2 2
## education ap_pc2
## 2 1
## year
## 1
##
## Node number 1: 75086 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.1706976 P(node) =1
## class counts: 62269 12817
## probabilities: 0.829 0.171
## left son=2 (25229 obs) right son=3 (49857 obs)
## Primary splits:
## serious_world_problems_climate_change &amp;lt; 0.5 to the right, improve=2214.2040, (0 missing)
## isocntry splits as RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve= 728.0160, (0 missing)
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve= 673.3656, (0 missing)
## BaP &amp;lt; 0.4300347 to the right, improve= 310.6229, (0 missing)
## pm2_5 &amp;lt; 13.38264 to the right, improve= 296.4013, (0 missing)
## Surrogate splits:
## age_education splits as ----RRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRRRRRR-RRRRRL-RRR-RRRRRRRRR--RRRLLR--R-R, agree=0.664, adj=0, (0 split)
## pm10 &amp;lt; 7.491315 to the left, agree=0.664, adj=0, (0 split)
##
## Node number 2: 25229 observations
## predicted class=0 expected loss=0 P(node) =0.3360014
## class counts: 25229 0
## probabilities: 1.000 0.000
##
## Node number 3: 49857 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.2570752 P(node) =0.6639986
## class counts: 37040 12817
## probabilities: 0.743 0.257
## left son=6 (34631 obs) right son=7 (15226 obs)
## Primary splits:
## isocntry splits as RRLLLRRRLLRLRLLLLLLLLLLRRLLLRLL, improve=1454.9460, (0 missing)
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, improve=1359.7210, (0 missing)
## BaP &amp;lt; 0.4300347 to the right, improve= 629.8844, (0 missing)
## pm2_5 &amp;lt; 13.38264 to the right, improve= 555.7484, (0 missing)
## ap_pc1 &amp;lt; -0.005459537 to the left, improve= 533.3579, (0 missing)
## Surrogate splits:
## country_code splits as RRLLLRRLLRLLLLLLLLLLRRLLLRLL, agree=0.987, adj=0.957, (0 split)
## BaP &amp;lt; 0.1749425 to the right, agree=0.775, adj=0.264, (0 split)
## pm2_5 &amp;lt; 5.206993 to the right, agree=0.737, adj=0.140, (0 split)
## ap_pc1 &amp;lt; 1.405527 to the left, agree=0.733, adj=0.126, (0 split)
## pm10 &amp;lt; 25.31211 to the right, agree=0.718, adj=0.076, (0 split)
##
## Node number 6: 34631 observations
## predicted class=0 expected loss=0.1769802 P(node) =0.4612178
## class counts: 28502 6129
## probabilities: 0.823 0.177
##
## Node number 7: 15226 observations, complexity param=0.02405659
## predicted class=0 expected loss=0.4392487 P(node) =0.2027808
## class counts: 8538 6688
## probabilities: 0.561 0.439
## left son=14 (11607 obs) right son=15 (3619 obs)
## Primary splits:
## isocntry splits as LL---LLR--L-L----------LL---R--, improve=337.5462, (0 missing)
## country_code splits as LL---LR--L-L--------LL---R--, improve=337.5462, (0 missing)
## age_education splits as ----LLLLLL-LLLRRRRRRR-RRRRRRRRRL-RRRRRRLLRR-RRRRLLRLRL-RRLRRR-RRR-LLLLRRR-----LR-----L-R, improve=294.0807, (0 missing)
## education &amp;lt; 22.5 to the left, improve=262.3747, (0 missing)
## BaP &amp;lt; 0.053328 to the right, improve=232.7043, (0 missing)
## Surrogate splits:
## BaP &amp;lt; 0.053328 to the right, agree=0.878, adj=0.485, (0 split)
## pm2_5 &amp;lt; 4.810361 to the right, agree=0.827, adj=0.271, (0 split)
## ap_pc2 &amp;lt; 0.8746175 to the left, agree=0.792, adj=0.124, (0 split)
## so2 &amp;lt; 0.3302972 to the left, agree=0.781, adj=0.078, (0 split)
## age_education splits as ----LLLLLL-LLLLLLLRLR-LRRLRRRRRR-RRRRLLLLLR-LRLRLLRRLL-LLRLLR-LLR-RRLLLLL-----RR-----R-L, agree=0.779, adj=0.071, (0 split)
##
## Node number 14: 11607 observations, complexity param=0.008270266
## predicted class=0 expected loss=0.3804601 P(node) =0.1545827
## class counts: 7191 4416
## probabilities: 0.620 0.380
## left son=28 (7462 obs) right son=29 (4145 obs)
## Primary splits:
## age_education splits as ----LLLLLL-LRRRRRRRRR-RRLRRLRRLL-RRRRLRLLRR-RLRLLLRLRL-RR-RR--RRL-L-LLRRR------------L-R, improve=123.71070, (0 missing)
## year splits as R-LR, improve=107.79460, (0 missing)
## education &amp;lt; 20.5 to the left, improve= 90.28724, (0 missing)
## occupation_of_respondent splits as LRRLRRRRRLRLLLRLLL, improve= 84.62865, (0 missing)
## respondent_occupation_scale_c_14 splits as LRLLLRRL, improve= 68.88653, (0 missing)
## Surrogate splits:
## education &amp;lt; 20.5 to the left, agree=0.950, adj=0.861, (0 split)
## occupation_of_respondent splits as LLLLRLLRRLRLLLRLLL, agree=0.738, adj=0.267, (0 split)
## respondent_occupation_scale_c_14 splits as LRLLLLRL, agree=0.733, adj=0.251, (0 split)
## is_student &amp;lt; 0.5 to the left, agree=0.709, adj=0.186, (0 split)
## age_exact &amp;lt; 23.5 to the right, agree=0.676, adj=0.094, (0 split)
##
## Node number 15: 3619 observations
## predicted class=1 expected loss=0.3722023 P(node) =0.04819807
## class counts: 1347 2272
## probabilities: 0.372 0.628
##
## Node number 28: 7462 observations
## predicted class=0 expected loss=0.326052 P(node) =0.09937938
## class counts: 5029 2433
## probabilities: 0.674 0.326
##
## Node number 29: 4145 observations, complexity param=0.008270266
## predicted class=0 expected loss=0.4784077 P(node) =0.05520337
## class counts: 2162 1983
## probabilities: 0.522 0.478
## left son=58 (2573 obs) right son=59 (1572 obs)
## Primary splits:
## year splits as L-LR, improve=40.13885, (0 missing)
## occupation_of_respondent splits as LRLLRRRRRLRLLLRLLL, improve=18.33254, (0 missing)
## marital_status splits as LRRRLRRRLRRLRLLRRRRRRLRLRLLRR, improve=17.86888, (0 missing)
## type_of_community splits as LRLRL, improve=17.55254, (0 missing)
## age_education splits as ------------LLRRRRRRR-RR-RL-RR---LRRR-R--LR-R-R---R-R--RR-RR--RR------RRR--------------R, improve=14.66121, (0 missing)
## Surrogate splits:
## type_of_community splits as LLLRL, agree=0.777, adj=0.412, (0 split)
## marital_status splits as RRLLLLLRLLLLLLLRRRLLLLLLRLRLL, agree=0.680, adj=0.155, (0 split)
## isocntry splits as LL---LL---L-R----------LL------, agree=0.669, adj=0.127, (0 split)
## country_code splits as LL---L---L-R--------LL------, agree=0.669, adj=0.127, (0 split)
## o3 &amp;lt; 83.06345 to the right, agree=0.650, adj=0.076, (0 split)
##
## Node number 58: 2573 observations
## predicted class=0 expected loss=0.4240187 P(node) =0.03426737
## class counts: 1482 1091
## probabilities: 0.576 0.424
##
## Node number 59: 1572 observations
## predicted class=1 expected loss=0.43257 P(node) =0.02093599
## class counts: 680 892
## probabilities: 0.433 0.567
# plot tree
plot(fit, uniform=TRUE,
main=&amp;quot;Classification Tree: Climate Change Is The Most Serious Threat&amp;quot;)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
## Warning in labels.rpart(x, minlength = minlength): more than 52 levels in a
## predicting factor, truncated for printout
&lt;/code>&lt;/pre>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="&amp;amp;ldquo;predicting factor, truncated for printout&amp;amp;rdquo;" srcset="
/post/2021-03-06-individual-join/rpart-2_hu8765078af843fd2a25e4b77d7cba4bfb_9882_0bdd94d7f6c1efcc2575c1adeb6917c8.webp 400w,
/post/2021-03-06-individual-join/rpart-2_hu8765078af843fd2a25e4b77d7cba4bfb_9882_daf3b553e16b54a4b23a242bc9ef1e6b.webp 760w,
/post/2021-03-06-individual-join/rpart-2_hu8765078af843fd2a25e4b77d7cba4bfb_9882_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://reprex-next.netlify.app/post/2021-03-06-individual-join/rpart-2_hu8765078af843fd2a25e4b77d7cba4bfb_9882_0bdd94d7f6c1efcc2575c1adeb6917c8.webp"
width="672"
height="480"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;pre>&lt;code>saveRDS ( climate_awareness_air , file.path(tempdir(), &amp;quot;climate_panel_recoded.rds&amp;quot;), version = 2)
# not evaluated
saveRDS( climate_awareness_air, file = file.path(&amp;quot;data-raw&amp;quot;, &amp;quot;climate-panel_recoded.rds&amp;quot;))
&lt;/code>&lt;/pre></description></item><item><title>What is Retrospective Survey Harmonization?</title><link>https://reprex-next.netlify.app/post/2021-03-04_retroharmonize_intro/</link><pubDate>Thu, 04 Mar 2021 00:00:00 +0000</pubDate><guid>https://reprex-next.netlify.app/post/2021-03-04_retroharmonize_intro/</guid><description>&lt;h2 id="reproducible-ex-post-harmonization-of-survey-microdata">Reproducible ex post harmonization of survey microdata&lt;/h2>
&lt;p>Retrospective survey harmonization allows the comparison of opinion poll
data collected in different countries or at different times. In this example we are
working with data from surveys that were ex ante harmonized to a certain
degree; in our tutorials we choose questions that were asked in
the same way in many natural languages. For example, you can compare
what percentage of Europeans in various countries, provinces
and regions thought climate change was a serious world problem back in
2013, 2015, 2017 and 2019.&lt;/p>
&lt;p>We developed the
&lt;a href="https://retroharmonize.dataobservatory.eu/" target="_blank" rel="noopener">retroharmonize&lt;/a> R package
to help this process. We have tested the package extensively with about 80
Eurobarometer and 5 Afrobarometer survey files, and to a lesser extent with
Arab Barometer files. This allows the comparison of various survey
answers in about 70 countries. These policy-oriented survey programs were
designed to be harmonized to a certain degree, but their ex post
harmonization is still necessary, challenging and error-prone.
Retrospective harmonization includes harmonizing the different
codings used for questions and answer options, the post-stratification
weights, and the different file formats.&lt;/p>
&lt;p>&lt;a href="https://ec.europa.eu/commfrontoffice/publicopinion/index.cfm" target="_blank" rel="noopener">Eurobarometer&lt;/a>,
&lt;a href="https://www.afrobarometer.org/" target="_blank" rel="noopener">Afrobaromer&lt;/a>, &lt;a href="https://www.arabbarometer.org/" target="_blank" rel="noopener">Arab
Barometer&lt;/a> and
&lt;a href="https://www.latinobarometro.org/lat.jsp" target="_blank" rel="noopener">Latinobarómetro&lt;/a> make survey
files that are harmonized across countries available for research under
various terms. Our
&lt;a href="https://retroharmonize.dataobservatory.eu/" target="_blank" rel="noopener">retroharmonize&lt;/a> package is not
affiliated with them, and to run our examples, you must visit their
websites, carefully read their terms, agree to them, and download their
data yourself. Our added value is that we help to connect their
files across time (from different years) or across these programs.&lt;/p>
&lt;p>The survey programs mentioned above publish their data in the
proprietary SPSS format. This file format can be imported and translated
to R objects with the haven package; however, we needed to re-design
&lt;a href="https://haven.tidyverse.org/" target="_blank" rel="noopener">haven’s&lt;/a>
&lt;a href="https://haven.tidyverse.org/reference/labelled_spss.html" target="_blank" rel="noopener">labelled_spss&lt;/a>
class, which is in turn a modification of
the &lt;a href="">labelled&lt;/a> class, to maintain far more metadata. The haven package was designed and tested with
data stored in individual SPSS files.&lt;/p>
&lt;p>Joseph Larmarange, the author of labelled, describes two main approaches
to working with labelled data, such as SPSS’s method of storing categorical
data, in the &lt;a href="http://larmarange.github.io/labelled/articles/intro_labelled.html" target="_blank" rel="noopener">Introduction to
labelled&lt;/a>.&lt;/p>
&lt;figure id="figure-two-main-approaches-of-labelled-data-conversion">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="img/larmarange_approaches_to_labelled.png" alt="Two main approaches of labelled data conversion." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption data-pre="Figure&amp;nbsp;" data-post=":&amp;nbsp;" class="numbered">
Two main approaches of labelled data conversion.
&lt;/figcaption>&lt;/figure>
&lt;p>Our approach is a further extension of &lt;strong>Approach B&lt;/strong>. Survey
harmonization in our case always means joining data from several
SPSS files, which requires consistent coding across the data
sources. This means that data cleaning and recoding must take place
before conversion to factors, character or numeric vectors. This is
particularly important with factor data (and their simple character
conversions) and numeric data that occasionally contains labels, for
example, to describe the reason why certain data is missing. Our
tutorial vignette
&lt;a href="https://retroharmonize.dataobservatory.eu/articles/labelled_spss_survey.html" target="_blank" rel="noopener">labelled_spss_survey&lt;/a>
gives you more information about this.&lt;/p>
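&lt;p>Why cleaning has to come before conversion can be sketched with haven’s &lt;code>labelled_spss()&lt;/code> constructor. The survey item, its codes and its labels are invented for illustration; the classes in &lt;code>retroharmonize&lt;/code> carry more metadata than this minimal example.&lt;/p>
&lt;pre>&lt;code>library(haven)
# Toy survey item: 0 and 1 are valid answers, 9 is a user-defined missing value
x &amp;lt;- labelled_spss(
c(1, 0, 9, 1),
labels = c(&amp;quot;Not mentioned&amp;quot; = 0, &amp;quot;Mentioned&amp;quot; = 1, &amp;quot;Declined&amp;quot; = 9),
na_values = 9
)
# The labelled vector still knows that code 9 is a special missing value
is.na(x)
# Convert to a factor using the value labels
as_factor(x)
# A plain numeric conversion loses both the labels and the missing-value flag
as.numeric(x)
&lt;/code>&lt;/pre>
&lt;p>After the plain numeric conversion, a 9 is indistinguishable from a valid answer, which is why recoding must happen while the labels are still attached.&lt;/p>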
&lt;p>In the next series of tutorials, we will deal with an array of problems.
These are not for the faint of heart: you need a solid
intermediate level of R to follow.&lt;/p>
&lt;h2 id="tidy-joined-survey-data">Tidy, joined survey data&lt;/h2>
&lt;ul>
&lt;li>The identifiers in the original files may not be unique, so we have to create
new, truly unique identifiers. Weighting may not be straightforward.&lt;/li>
&lt;li>Neither the number of observations nor the number of variables (which
represent the survey questions and their translation to coded data)
is the same across surveys. Certain data may be present in only one survey and not
the other. This means that you will likely run loops on lists and
not data.frames, but eventually you must carefully join them.&lt;/li>
&lt;/ul>
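&lt;p>Both points can be sketched on a list of toy survey waves. The file identifiers &lt;code>ZA0001&lt;/code> and &lt;code>ZA0002&lt;/code> and all values are made up, and this is a simplified illustration, not the package’s own workflow.&lt;/p>
&lt;pre>&lt;code>library(dplyr)
# Two toy survey waves with overlapping ids and different variables
waves &amp;lt;- list(
ZA0001 = data.frame(id = 1:2, age = c(34, 51)),
ZA0002 = data.frame(id = 1:2, age = c(28, 67),
region = c(&amp;quot;XXA1&amp;quot;, &amp;quot;XXA2&amp;quot;))
)
# Prefix the within-file id with the file identifier to make it unique
waves &amp;lt;- lapply(names(waves), function(survey_id) {
wave &amp;lt;- waves[[survey_id]]
wave$rowid &amp;lt;- paste(survey_id, wave$id, sep = &amp;quot;_&amp;quot;)
wave
})
# bind_rows() fills variables that are missing from a wave with NA
panel_toy &amp;lt;- bind_rows(waves)
panel_toy
&lt;/code>&lt;/pre>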
&lt;h2 id="class-conversion">Class conversion&lt;/h2>
&lt;ul>
&lt;li>Similar questions may be imported from a non-native R format, in our
case, from SPSS files, in an inconsistent manner. SPSS’s variable
formats cannot be translated unambiguously to R classes.
&lt;code>retroharmonize&lt;/code> introduced a new S3 class system that handles this
problem, but eventually you will have to choose if you want to see a
numeric or character coding of each categorical variable.&lt;/li>
&lt;li>The harmonized surveys, with harmonized variable names and
harmonized value labels, must be brought to consistent R
representations (most statistical functions will only work on
numeric, factor or character data) and carefully joined into a
single data table for analysis.&lt;/li>
&lt;/ul>
&lt;h2 id="harmonization-of-variables-and-variable-labels">Harmonization of variables and variable labels&lt;/h2>
&lt;ul>
&lt;li>The same variables may come with dissimilar variable names and variable
labels. It may be a challenge to match age with age. We need to
harmonize the names of variables.&lt;/li>
&lt;li>The harmonized variables may have different labeling. One survey may label
refused answers as &lt;code>declined&lt;/code> and another as &lt;code>refusal&lt;/code>. On a simple
choice, climate change may be &lt;code>Climate change&lt;/code> or
&lt;code>Problem: Climate change&lt;/code>. Binary choices may have survey-specific
coding conventions. Value labels must be harmonized. There are good
tools to do this in a single file, but we have to work with several
of them.&lt;/li>
&lt;/ul>
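&lt;p>Value label harmonization can be sketched in base R with a named vector that maps every original label to a harmonized one. The table &lt;code>harmonized_labels&lt;/code> is hypothetical and only covers the labels mentioned above.&lt;/p>
&lt;pre>&lt;code># Toy answers from two surveys with different labels for the same choices
answers &amp;lt;- c(&amp;quot;declined&amp;quot;, &amp;quot;refusal&amp;quot;, &amp;quot;Climate change&amp;quot;,
&amp;quot;Problem: Climate change&amp;quot;)
# Hypothetical harmonization table: original label -&amp;gt; harmonized label
harmonized_labels &amp;lt;- c(
&amp;quot;declined&amp;quot; = &amp;quot;declined&amp;quot;,
&amp;quot;refusal&amp;quot; = &amp;quot;declined&amp;quot;,
&amp;quot;Climate change&amp;quot; = &amp;quot;climate_change&amp;quot;,
&amp;quot;Problem: Climate change&amp;quot; = &amp;quot;climate_change&amp;quot;
)
# Look up each answer; labels missing from the table would become NA
harmonized &amp;lt;- unname(harmonized_labels[answers])
harmonized
&lt;/code>&lt;/pre>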
&lt;h2 id="missing-value-harmonization">Missing value harmonization&lt;/h2>
&lt;ul>
&lt;li>There are likely to be various types of &lt;code>missing values&lt;/code>. Working
with missing values is probably where most human judgment is needed.
Why are some answers missing: was the question not asked in some
questionnaires? Is there a coding error? Did the respondent refuse
to answer the question, or say that she did not have an answer?
&lt;code>retroharmonize&lt;/code> has a special labelled vector type that retains this
information from the raw data, if it is present, but you must make
the judgment yourself; in R, eventually you will either create a
missing category, or use &lt;code>NA_character_&lt;/code> or &lt;code>NA_real_&lt;/code>.&lt;/li>
&lt;/ul>
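&lt;p>The two options for the final representation can be sketched as follows. The labels &lt;code>declined&lt;/code> and &lt;code>inap&lt;/code> (question not asked) are assumed for illustration; which option is appropriate depends on the analysis.&lt;/p>
&lt;pre>&lt;code># Toy harmonized answers with two kinds of special values
answers &amp;lt;- c(&amp;quot;climate_change&amp;quot;, &amp;quot;declined&amp;quot;, &amp;quot;inap&amp;quot;, &amp;quot;poverty&amp;quot;)
# Option 1: treat every special label as a regular missing value
as_na &amp;lt;- ifelse(answers %in% c(&amp;quot;declined&amp;quot;, &amp;quot;inap&amp;quot;), NA_character_, answers)
# Option 2: keep refusals as an explicit category, drop only &amp;quot;not asked&amp;quot;
as_category &amp;lt;- factor(ifelse(answers == &amp;quot;inap&amp;quot;, NA_character_, answers))
sum(is.na(as_na)) # 2
levels(as_category)
&lt;/code>&lt;/pre>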
&lt;p>That’s a lot to put on your plate.&lt;/p>
&lt;p>It is unlikely that you will be able to work with completely unfamiliar
survey programs if you do not have a strong intermediate level of R. Our
package comes with tutorials for
&lt;a href="https://retroharmonize.dataobservatory.eu/articles/eurobarometer.html" target="_blank" rel="noopener">Eurobarometer&lt;/a>,
&lt;a href="https://retroharmonize.dataobservatory.eu/articles/afrobarometer.html" target="_blank" rel="noopener">Afrobarometer&lt;/a>
and our development version already covers Arab Barometer. They highlight
some peculiar issues with these survey programs, which we hope will give a
head start to less experienced R users.&lt;/p></description></item></channel></rss>