conjoint-cleaning
Clean and reshape Qualtrics conjoint exports to analysis-ready long format.
| name | conjoint-cleaning |
| description | Clean and reshape Qualtrics conjoint exports to analysis-ready long format. |
| argument-hint | [describe your Qualtrics export or paste data sample] |
Export format: The right Qualtrics export choice depends on the pipeline. cjoint::read.qualtrics() (Method A) hard-errors when the choice-response columns contain text — export "Use numeric values" for it. projoint::reshape_projoint() parses the chosen profile from a label at the end of the outcome string (default A/B), and manual reshaping benefits from readable attribute labels — "Use choice text" works for those. When unsure, export both versions; they differ only in response coding. If working with non-Latin scripts (Chinese, Korean, Arabic), export as XLSX rather than CSV to avoid UTF-8/ANSI encoding issues. On Windows with East Asian locales, read.csv() may still require Sys.setlocale() to match the file encoding before import (see ?cjoint::read.qualtrics East Asian Language Support).
Metadata rows: Current Qualtrics CSV exports include 3 header rows before respondent data: (1) variable identifiers, (2) question text/descriptions, (3) ImportId JSON. Legacy exports have 2 rows. The cjoint::read.qualtrics() parameter new.format = TRUE (set explicitly; default is FALSE) handles the 3-row format. For manual import via readxl::read_excel() or readr::read_csv(), skip the appropriate number of metadata rows after reading headers.
Randomization order columns: If "Export viewing order data" is enabled, Qualtrics adds _DO_ columns (e.g., Block1_DO) containing pipe-separated integers showing element display order. These are useful for task-order robustness checks but are not needed for the core reshape.
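The metadata-row and display-order points above can be sketched in base R as follows. This is a minimal illustration, not the only way to do it: the file contents, the `Q17` question, and the `Block1_DO` values are made up, and `n_meta` is a hypothetical helper variable. Reading everything as text without a header avoids relying on a line-based skip count.

```r
# Hypothetical current-format export (3 metadata rows), written to a temp file.
path <- tempfile(fileext = ".csv")
writeLines(c(
  'ResponseId,Q17,Block1_DO',
  'Response ID,"Which profile do you prefer?",Block1 - Display Order',
  '"{""ImportId"":""_recordId""}","{""ImportId"":""QID17""}","{""ImportId"":""FL_1_DO""}"',
  'R_001,1,2|1|3',
  'R_002,2,3|1|2'
), path)

# Read all rows as character with no header row.
raw <- read.csv(path, header = FALSE, colClasses = "character")

n_meta <- 3L  # assumption: current format; use 2L for legacy exports
dat <- raw[-seq_len(n_meta), , drop = FALSE]       # drop the metadata rows
names(dat) <- unlist(raw[1, ], use.names = FALSE)  # row 1 holds variable identifiers
rownames(dat) <- NULL

# Split the pipe-separated display order; keep the element shown first.
do_list <- strsplit(dat$Block1_DO, "|", fixed = TRUE)
first_shown <- vapply(do_list, function(x) as.integer(x[1]), integer(1))
```

Keeping the untouched `raw` object around makes it easy to look up the question text in row 2 when mapping choice questions to tasks later.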
Qualtrics conjoint experiments use one of three implementation methods, each producing different column naming conventions:
Method A — Conjoint Survey Design Tool (Strezhnev): Generates JavaScript that Qualtrics executes to randomize profiles. Column naming follows F-{task}-{profile}-{attribute} for attribute levels and F-{task}-{attribute} for attribute names. The cjoint R package's read.qualtrics() function is purpose-built for this format.
Method B — Custom JavaScript + Embedded Data: Researchers write JavaScript to randomize attributes and store values in Qualtrics embedded data fields. Column naming is researcher-defined. Two common conventions: (i) C{x}-F-{task}-{idx} for attribute names and C{x}-F-{task}-{profile}-{idx} for profile values; (ii) the Graham (2020) convention, choice{task}_{attr}{profile} with fixed attribute order (e.g., choice1_bread1) or c{task}_attrib{pos}_name / c{task}_attrib{pos}_sand{profile} when attribute order is also randomized. Requires manual reshaping (Section 4).
Method C — Loop & Merge: Each loop iteration represents one conjoint task. Embedded data fields are referenced via ${e://Field/variable_name} and displayed with ${lm://Field/N}. Column names reflect the embedded data field structure. Requires manual reshaping.
Before writing any cleaning code: Inspect the actual column headers, the QSF survey definition file, or any JavaScript in the survey to determine which method was used. Do not assume a column naming convention.
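One way to inspect the headers is to count how many column names match each convention described above. The sketch below is only a first-pass heuristic: the regular expressions are my reading of the naming patterns in this section, the pattern names are hypothetical labels, and the column names are invented. A match count is a hint, not a substitute for checking the QSF or the survey JavaScript.

```r
# Invented header names standing in for names(dat) from a real export.
cols <- c("ResponseId", "F-1-1", "F-1-1-1", "F-1-2-1", "Q17",
          "choice1_bread1", "c1_attrib1_name", "c1_attrib1_sand2")

# Assumed regexes for the conventions described in this section.
patterns <- c(
  method_a_name     = "^F-[0-9]+-[0-9]+$",                # F-{task}-{attribute}
  method_a_level    = "^F-[0-9]+-[0-9]+-[0-9]+$",         # F-{task}-{profile}-{attribute}
  method_b_prefixed = "^C[^-]*-F-[0-9]+(-[0-9]+){1,2}$",  # C{x}-F-{task}[-{profile}]-{idx}
  graham_fixed      = "^choice[0-9]+_[A-Za-z]+[0-9]+$",   # choice{task}_{attr}{profile}
  graham_random     = "^c[0-9]+_attrib[0-9]+_(name|[A-Za-z]+[0-9]+)$"
)

# Number of columns matching each convention.
hits <- vapply(patterns, function(p) sum(grepl(p, cols)), integer(1))
print(hits)
```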
Before writing custom reshaping code, check whether an existing package handles the data format:
cjoint::read.qualtrics() — Purpose-built for Conjoint SDT exports (Method A). Reads Qualtrics CSV directly, handles metadata rows, outputs one row per profile with a selected column. Parameters: responses (choice column names), covariates (respondent-level variables), respondentID, new.format (TRUE for 3-row headers), ranks (for rank/rating/top-L designs). Supports binary forced choice, profile ranks, per-profile ratings, and top-L choices; see ?cjoint::read.qualtrics Details for the four response types. Requires PHP/JS output from the Conjoint Survey Design Tool.
cjdata::reshape_conjoint() — Lightweight alternative. Functions: read_Qualtrics() + reshape_conjoint(). Handles basic wide-to-long conversion. Requires the terminal character of each outcome string to be {"1","2"} or {"A","B"} (so "Candidate A" works; Japanese zenkaku digits supported). Respondent covariates merged separately.
projoint::reshape_projoint(): For measurement-error-corrected analysis per Clayton, Horiuchi, Kaufman, King, and Komisarchik (2023). Built-in support for repeated tasks (IRR estimation), missing-agreement imputation (.fill = TRUE), and bias-corrected AMCEs. Outcome column names must contain task-ID digits, and the repeated-task outcome must be the last element of .outcomes. Expects wide columns named K-{task}-{attribute} and K-{task}-{profile}-{attribute} by default (.alphabet = "K"); selected profile is parsed from the final character of each outcome string via .choice_labels (default c("A","B")). Specify .flipped = TRUE when the repeated task presents profiles in reversed left/right order (see exampleData1 vs. exampleData2 in the manual); this changes how agreement is computed. Trap: projoint::read_Qualtrics() hard-codes a 2-row metadata skip (legacy format). For current 3-row Qualtrics exports, pre-strip the third metadata row, or read the file manually before calling reshape_projoint(): read the header row first, then read the data with readr::read_csv(skip = 3, col_names = ...) so the first respondent is not consumed as the header.
cregg::cj_tidy() — Reshapes wide data across the three-level respondent/task/profile hierarchy via two named lists: profile_variables (features and profile-specific outcomes that vary within a task) and task_variables (variables that vary by task but not across profiles within it). Crucially, a choice variable that names the chosen profile ("left"/"right") goes in task_variables and must be recoded after reshaping, whereas per-profile "chosen" indicators go in profile_variables — getting this wrong silently corrupts the outcome. Constraint handling is not a cj_tidy feature; two-way design constraints are specified downstream via * in the amce()/cj() formula.
Package decision rule (default then escape hatch):
- cjoint::read.qualtrics()
- projoint::reshape_projoint()
- cjdata::reshape_conjoint()
- cregg::cj_tidy()

When existing packages cannot handle the data format, reshape manually. The goal: one row per respondent x task x profile, one column per attribute.
Step 1: Build a long table of (ResponseId, task, profile, attribute_name, attribute_value)
Iterate over tasks, profiles, and attribute positions. For each combination, read the attribute name from the name column and the corresponding value from the value column. This naturally handles randomized attribute order.
    # n_tasks / P / K = tasks, profiles per task, attributes per profile.
    # Never name the task count `T`: it aliases TRUE in R, so seq_len(T) would
    # silently run exactly one task instead of erroring.
    # Assumes `data` (the wide export) and `prefix` (the C{x} part of the
    # column names) are already defined.
    rows <- vector("list", n_tasks * P * K)
    i <- 0L
    for (task in seq_len(n_tasks)) {
      # Columns holding the attribute names shown in this task
      name_cols <- paste0(prefix, "-F-", task, "-", seq_len(K))
      for (profile in seq_len(P)) {
        # Columns holding this profile's attribute values
        val_cols <- paste0(prefix, "-F-", task, "-", profile, "-", seq_len(K))
        for (idx in seq_len(K)) {
          i <- i + 1L
          rows[[i]] <- data.frame(
            ResponseId      = data$ResponseId,
            task            = task,
            profile         = profile,
            attribute_name  = data[[name_cols[idx]]],
            attribute_value = data[[val_cols[idx]]],
            stringsAsFactors = FALSE
          )
        }
      }
    }
    long <- data.table::rbindlist(rows)
For Graham-style embedded fields (e.g., choice1_bread1, c1_attrib1_name/c1_attrib1_sand1), a tidyr::pivot_longer(names_pattern = ...) one-shot is often cleaner than the triple loop — match the numeric indices into task/profile/attribute-position columns, then pivot back wide on attribute_name. Use data.table::rbindlist() for performance on large datasets.
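For the fixed-attribute-order case, the pivot_longer() approach might look like the sketch below. It assumes columns named choice{task}_{attr}{profile} with lowercase attribute names; the sandwich data are invented, and the regular expression would need adjusting for other naming schemes. Because the attribute name is part of the column name here, no separate name column is needed.

```r
library(tidyr)

# Invented wide data: one task, two profiles, two attributes.
wide <- data.frame(
  ResponseId     = c("R_001", "R_002"),
  choice1_bread1 = c("Rye", "White"),
  choice1_bread2 = c("White", "Rye"),
  choice1_meat1  = c("Ham", "Turkey"),
  choice1_meat2  = c("Turkey", "Ham"),
  stringsAsFactors = FALSE
)

# Split each column name into task, attribute name, and profile.
long <- pivot_longer(
  wide,
  cols = -ResponseId,
  names_to = c("task", "attribute_name", "profile"),
  names_pattern = "^choice([0-9]+)_([a-z]+)([0-9]+)$",
  values_to = "attribute_value"
)
long$task <- as.integer(long$task)
long$profile <- as.integer(long$profile)

# Back to one row per respondent x task x profile, one column per attribute.
profiles <- pivot_wider(long,
                        names_from = attribute_name,
                        values_from = attribute_value)
```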
Step 2: Filter missing data. Remove rows where attribute_name or attribute_value is NA — these indicate respondents who skipped the conjoint section.
Step 3: Apply attribute name and level merges before pivoting. Fix typos, encoding variants, or synonymous levels.
Step 4: Pivot to wide-by-attribute using data.table::dcast() or tidyr::pivot_wider():
    data.table::dcast(long, ResponseId + task + profile ~ attribute_name, value.var = "attribute_value")
If dcast warns about duplicate row/column combinations, two positions in the same task share an attribute name for some respondents — investigate the name merge.
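A duplicate check run before pivoting can surface the problem earlier than the dcast warning. The following base-R sketch uses invented data and a hypothetical `name_map` in which "Schooling" is merged into "Education" even though "Education" already appears in the same profile, which is exactly the situation described above.

```r
# Invented long table for one respondent, task, and profile.
long <- data.frame(
  ResponseId = "R_001", task = 1L, profile = 1L,
  attribute_name  = c("Age", "Education", "Schooling"),
  attribute_value = c("34", "BA", "BA"),
  stringsAsFactors = FALSE
)

# Hypothetical attribute-name merge under review (Step 3).
name_map <- c(Schooling = "Education")
hit <- long$attribute_name %in% names(name_map)
long$attribute_name[hit] <- unname(name_map[long$attribute_name[hit]])

# Any repeated (respondent, task, profile, attribute) key signals a bad merge.
key <- long[, c("ResponseId", "task", "profile", "attribute_name")]
dup_keys <- unique(key[duplicated(key), , drop = FALSE])
print(dup_keys)
```

An empty `dup_keys` means the pivot in Step 4 will have exactly one value per cell.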
Choice variables are separate Qualtrics questions (MC type), one per task, placed after each conjoint display.
Identify choice columns: These are NOT part of the embedded data fields. Map each choice question to its task number (Q17 -> task 1, Q19 -> task 2, etc.). Inspect the QSF or survey flow to confirm the mapping.
Text vs. numeric encoding: Text exports produce labels like "Person A"/"Person B" or "Profile 1"/"Profile 2". Numeric exports produce 1/2. Always verify from the actual data — do not assume.
Create the binary outcome: Prefer matching the terminal character of the choice string so the same code handles "Person A"/"Person B", "Profile 1"/"Profile 2", "Candidate A"/"Candidate B", and "Sandwich 1"/"Sandwich 2" uniformly. If numeric exports are used, coerce directly.
    last_char <- stringr::str_sub(raw_choice, -1L)
    chosen_profile <- dplyr::case_when(
      last_char %in% c("A", "1") ~ 1L,
      last_char %in% c("B", "2") ~ 2L,
      TRUE ~ NA_integer_
    )
    # Merge by (ResponseId, task), then:
    chosen <- as.integer(profile == chosen_profile)
Handle missing choices: Drop rows where chosen is NA. Some tasks may have higher dropout rates than others (especially the last task).
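Putting the choice-handling steps together, an end-to-end base-R sketch might look like this. The data are invented, `choice_map` reuses the Q17/Q19 example mapping from above, and the `profiles` skeleton stands in for the reshaped attribute table; a real mapping has to be confirmed against the QSF or survey flow.

```r
# Invented wide export: Q17 and Q19 are the choice questions for tasks 1 and 2.
dat <- data.frame(
  ResponseId = c("R_001", "R_002"),
  Q17 = c("Person A", "Person B"),
  Q19 = c("Person B", NA),
  stringsAsFactors = FALSE
)
choice_map <- c(Q17 = 1L, Q19 = 2L)  # assumption: confirm against the survey flow

# Stack the choice questions into one row per respondent x task.
choices <- do.call(rbind, lapply(names(choice_map), function(q) {
  data.frame(ResponseId = dat$ResponseId, task = choice_map[[q]],
             raw_choice = dat[[q]], stringsAsFactors = FALSE)
}))

# Terminal character -> chosen profile number (NA when the choice is missing).
n <- nchar(choices$raw_choice)
last_char <- substr(choices$raw_choice, n, n)
choices$chosen_profile <- ifelse(last_char %in% c("A", "1"), 1L,
                          ifelse(last_char %in% c("B", "2"), 2L, NA_integer_))

# Profile-level skeleton standing in for the reshaped attribute table.
profiles <- expand.grid(profile = 1:2, task = 1:2,
                        ResponseId = dat$ResponseId, stringsAsFactors = FALSE)

# Merge by (ResponseId, task), build the outcome, drop missing choices.
merged <- merge(profiles, choices[, c("ResponseId", "task", "chosen_profile")],
                by = c("ResponseId", "task"), all.x = TRUE)
merged$chosen <- as.integer(merged$profile == merged$chosen_profile)
analysis <- merged[!is.na(merged$chosen), ]
```

Tabulating `is.na(merged$chosen)` by task before dropping rows shows whether missing choices concentrate in later tasks.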
Ratings may be per-profile (one per profile per task — usable as a continuous conjoint DV) or per-task (one rating of the chosen profile — descriptive only, not a standard conjoint DV).
Endpoint label recoding: Qualtrics text exports encode scale endpoints as text labels (e.g., "Strongly disagree" = 1, "Strongly agree" = 7). Recode these before converting to numeric. Intermediate scale points are already numeric.
    rating_num <- dplyr::case_when(
      raw_rating == "Strongly disagree" ~ 1,
      raw_rating == "Strongly agree" ~ 7,
      TRUE ~ suppressWarnings(as.numeric(raw_rating))
    )
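Use recode_factor() with deliberate level ordering. The order of arguments sets the factor level order, which determines the reference (baseline) category used in estimation and the order in which levels appear in tables and plots.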
For Stata/SPSS/Qualtrics exports that already carry labels (e.g., loaded via haven::read_stata()), haven::as_factor() converts all labelled columns to factors in one pass (Heiss 2023). Combine with forcats::fct_relevel() to set the baseline without re-typing every level. Use recode_factor() when you also need translation or level merges.
Reference category principles:
Catching unexpected values: Set .default = NA_character_ in recode_factor() to force unrecognized values to NA. Default handling of unmatched values has varied across dplyr versions, and unexpected pilot-only levels can otherwise pass through as new factor levels or quietly coerce to NA without surfacing. Set .default explicitly so pilot contamination and typos are visible in a post-reshape table() check, not masked by silent defaults.
Drop unused levels: After filtering, call droplevels() to remove factor levels with zero observations. cregg::cj() requires 2+ realized levels per factor.
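The same ideas can be illustrated with a base-R analogue. This is not recode_factor() itself: factor() with an explicit levels vector also fixes the level order and turns unlisted values into NA, which makes a useful cross-check. The education values and the typo are invented.

```r
# Invented raw attribute values; the third entry is a typo.
raw_edu <- c("High school", "University", "Univeristy", "High school")

# Explicit level order: the first level becomes the baseline.
# Values not listed in `levels` become NA instead of new levels.
edu <- factor(raw_edu, levels = c("High school", "University", "Postgraduate"))

# Count values that were present in the raw data but did not match any level.
n_unmatched <- sum(is.na(edu) & !is.na(raw_edu))
print(table(edu, useNA = "always"))

# After filtering, drop levels with zero observations and check that
# at least two realized levels remain.
edu_used <- droplevels(edu[!is.na(edu)])
has_two_levels <- nlevels(edu_used) >= 2L
```

A nonzero `n_unmatched` is the signal to go back to the level merges before estimating anything.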
Pilot detection: Compare unique attribute levels in the data against the final design document. Extra levels (e.g., a country not in the final design) indicate pilot/pre-test respondents. Exclude these as respondents (all their rows), not just the anomalous rows — their entire randomization was generated by a different design.
Pilot exclusion is an analytic choice, not just cleaning hygiene. Under DA-RT/APSA transparency norms, the pilot-vs-final boundary should be specified in the pre-analysis plan. If it was not pre-registered, treat the exclusion as a researcher degree of freedom: log every excluded ResponseId with a reason code in a machine-readable exclusions file, and report both with-pilot and without-pilot estimates as a sensitivity check. See the pre-registration-writing and methods-reporting skills.
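Pilot detection and the exclusion log can be sketched together in base R. Everything here is illustrative: `design_levels` is a hypothetical encoding of the final design document, the data are invented, and the file name and the "pilot" reason code are placeholders for whatever the project's exclusions file uses.

```r
# Hypothetical final design: allowed levels per attribute.
design_levels <- list(Country = c("Germany", "France"),
                      Age     = c("30", "50"))

# Invented reshaped data; R_002 saw a level ("Poland") not in the final design.
conjoint <- data.frame(
  ResponseId = rep(c("R_001", "R_002", "R_003"), each = 2),
  task = 1L, profile = rep(1:2, times = 3),
  Country = c("Germany", "France", "Germany", "Poland", "France", "Germany"),
  Age     = c("30", "50", "50", "30", "30", "50"),
  stringsAsFactors = FALSE
)

# Flag rows with any level outside the design.
bad_row <- rep(FALSE, nrow(conjoint))
for (a in names(design_levels)) {
  bad_row <- bad_row | !(conjoint[[a]] %in% design_levels[[a]])
}

# Exclude whole respondents, not just the anomalous rows.
pilot_ids <- unique(conjoint$ResponseId[bad_row])
final <- conjoint[!(conjoint$ResponseId %in% pilot_ids), ]

# Machine-readable log: one row per excluded respondent with a reason code.
exclusions <- data.frame(ResponseId = pilot_ids,
                         reason = rep("pilot", length(pilot_ids)),
                         stringsAsFactors = FALSE)
log_path <- file.path(tempdir(), "exclusions.csv")
write.csv(exclusions, log_path, row.names = FALSE)
```

Keeping both `conjoint` and `final` available makes the with-pilot versus without-pilot sensitivity check a matter of running the same model twice.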
Missing conjoint data: Respondents who skipped the conjoint section produce all-NA attribute columns. The NA filter in Step 2 removes them. Respondents who dropped out mid-conjoint will have fewer than n_tasks x P rows (tasks x profiles per task); this is acceptable for cregg::cj().
Duplicate attribute names: Each task should have exactly K unique attribute names. If a merge creates duplicates within a task, the merge is incorrect.
Choice completeness: chosen should sum to n_tasks per respondent (one chosen profile per task). A lower sum indicates missing choices for some tasks.
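The completeness checks above can be run as a few base-R summaries. The data below are invented, with one respondent (R_002) who dropped out after the first task; `n_tasks` and `P` follow the naming used in the reshaping code.

```r
n_tasks <- 2L  # tasks per respondent
P <- 2L        # profiles per task

# Invented analysis table; R_002 completed only task 1.
d <- data.frame(
  ResponseId = rep(c("R_001", "R_002"), times = c(4, 2)),
  task    = c(1L, 1L, 2L, 2L, 1L, 1L),
  profile = c(1L, 2L, 1L, 2L, 1L, 2L),
  chosen  = c(1, 0, 0, 1, 0, 1)
)

# Respondents with fewer than n_tasks x P rows (mid-conjoint dropout).
rows_per_resp <- table(d$ResponseId)
incomplete <- names(rows_per_resp)[rows_per_resp < n_tasks * P]

# Every observed task should have exactly one chosen profile.
chosen_per_task <- aggregate(chosen ~ ResponseId + task, data = d, FUN = sum)
bad_tasks <- chosen_per_task[chosen_per_task$chosen != 1, ]

# Chosen profiles per respondent; equals n_tasks for complete respondents.
chosen_per_resp <- tapply(d$chosen, d$ResponseId, sum)
```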
Include respondent-level covariates in the analysis-ready dataset even for main-effects-only analysis. Future subgroup and interaction analyses should not require re-running data prep.
Merge demographics (age, gender, education, urban/rural, ethnicity, party membership), treatment assignments, randomization indicators, and open-text responses by ResponseId after reshaping.
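A minimal base-R sketch of the covariate merge follows, with invented data and covariate names. The two guards are the important part: the covariate table must have one row per ResponseId, and the merge must not change the number of profile-level rows.

```r
# Invented profile-level table and respondent-level covariates.
conjoint <- data.frame(
  ResponseId = c("R_001", "R_001", "R_002", "R_002"),
  task = 1L, profile = c(1L, 2L, 1L, 2L),
  chosen = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)
covars <- data.frame(
  ResponseId = c("R_001", "R_002"),
  age = c(34, 51),
  gender = c("F", "M"),
  stringsAsFactors = FALSE
)

# Guard 1: one covariate row per respondent, otherwise the merge duplicates rows.
stopifnot(anyDuplicated(covars$ResponseId) == 0L)

# Left join keeps every profile-level row even if covariates are missing.
out <- merge(conjoint, covars, by = "ResponseId", all.x = TRUE)

# Guard 2: row count unchanged.
stopifnot(nrow(out) == nrow(conjoint))
```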
Target shape is not arbitrary. The "one row per respondent x task x profile" unit of observation is dictated by the Hainmueller, Hopkins, and Yamamoto (2014) potential-outcomes framework, under which the AMCE is defined at the profile level with profile-level randomization and a no-carryover / stability assumption across tasks. Collapsing to one-row-per-task (e.g., keeping only the chosen profile's attributes) destroys the identification strategy. Ratings used as a continuous DV must also be per-profile, not per-task.
Save as .rds files (one per conjoint). The output should have:
- Attribute columns: factors (cregg::cj() will error on character columns)
- ResponseId: respondent identifier (character or numeric)
- task: task number (integer)
- profile: profile number (integer)
- chosen: binary outcome (numeric 0/1, not logical or factor)
- rating (numeric)

cregg compatibility: cj(data, chosen ~ Attr1 + Attr2 + ..., id = ~ResponseId). The id parameter requires a tilde formula (~ResponseId), not a bare name. The estimate parameter accepts "amce", "mm", "mm_differences", "amce_differences", and "frequencies".
cjoint compatibility: Expects a selected column (integer 0/1, as read.qualtrics() produces) and attributes named with the F-based convention. Use cjoint::amce() for estimation.
projoint compatibility: Requires a projoint_data object created via reshape_projoint(). Supports repeated-task IRR estimation and bias-corrected AMCEs.
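The output requirements can be enforced just before saving. The sketch below uses invented data, a hypothetical `attr_cols` vector naming the attribute columns, and a placeholder file name; the cregg call is shown only as a comment because it needs the package installed.

```r
# Invented table with the wrong types: character attribute, logical outcome.
d <- data.frame(
  ResponseId = c("R_001", "R_001"),
  task = c(1, 1), profile = c(1, 2),
  chosen = c(TRUE, FALSE),
  Country = c("Germany", "France"),
  stringsAsFactors = FALSE
)
attr_cols <- "Country"  # hypothetical list of attribute columns

# Coerce to the types listed above.
d[attr_cols] <- lapply(d[attr_cols], factor)
d$task    <- as.integer(d$task)
d$profile <- as.integer(d$profile)
d$chosen  <- as.numeric(d$chosen)

# Fail loudly if anything is still the wrong type.
stopifnot(all(vapply(d[attr_cols], is.factor, logical(1))),
          is.integer(d$task), is.integer(d$profile),
          is.numeric(d$chosen), all(d$chosen %in% c(0, 1)))

# One .rds file per conjoint.
out_path <- file.path(tempdir(), "conjoint1.rds")
saveRDS(d, out_path)

# Downstream, per the compatibility note above:
# cregg::cj(d, chosen ~ Country, id = ~ResponseId)
```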
- Checked whether cjoint, cjdata, or projoint can handle the data format before writing custom reshaping code
- chosen column is numeric, not logical or factor
- recode_factor(.default = NA_character_) used to surface unexpected values
- Every excluded ResponseId has a machine-readable reason (pilot, no-consent, all-NA conjoint, etc.); raw export is preserved unmodified
- sessionInfo() captured; cjoint, projoint, cregg, dplyr behaviors differ across versions and should be fixed for reproducibility
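Capturing the session information can be as simple as writing it next to the cleaned data. The file name below is a placeholder.

```r
# Write the R version and loaded package versions to a text file.
log_path <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), log_path)
```

Running this at the end of the cleaning script, after all packages are loaded, records the versions that actually produced the output.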