Delimited mr #154

Merged
merged 19 commits into master from delimited_MR
Mar 15, 2018

gshotwell
Contributor

Create Multiple Response variables from delimited text or categorical variables. The process this function uses to generate the MR is:

  • Pull data into R
  • Construct and upload subvariable definitions
  • Call makeMR to bind those subvariables into an MR variable
  • Hide the original variable
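
As a rough illustration of the second step, the sketch below builds one selected/not_selected indicator per distinct item from a delimited character vector, using base R only. The helper name `delimited_to_indicators` and the hard-coded labels are assumptions made for this example; it is not the PR's `createSubvarDef()`, and it skips the upload, `makeMR`, and hide steps, which need a Crunch dataset.

```r
# Hypothetical helper (not part of the crunch package): returns a named list
# with one factor per distinct item found in the delimited responses.
delimited_to_indicators <- function(v, delim = "; ") {
  # Split each response on the literal delimiter; NA responses stay NA.
  parts <- strsplit(v, delim, fixed = TRUE)
  items <- sort(unique(unlist(parts[!is.na(v)])))
  indicators <- lapply(items, function(item) {
    # Whole-item comparison, so "maple" does not match "sugar maple".
    hit <- vapply(parts, function(p) item %in% p, logical(1))
    out <- ifelse(hit, "selected", "not_selected")
    out[is.na(v)] <- NA  # unanswered rows carry no selection
    factor(out, levels = c("not_selected", "selected"))
  })
  names(indicators) <- items
  indicators
}

v <- c("maple; birch", "oak; maple; birch", "birch; sugar maple",
       "maple butter; oak", NA)
subvars <- delimited_to_indicators(v)
names(subvars)
as.character(subvars$maple)
```

Splitting on the delimiter and comparing whole items keeps multi-word items such as "sugar maple" distinct from "maple".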

codecov bot commented Nov 1, 2017

Codecov Report

Merging #154 into master will decrease coverage by 0.12%.
The diff coverage is 65.51%.

@@            Coverage Diff             @@
##           master     #154      +/-   ##
==========================================
- Coverage   89.21%   89.08%   -0.13%     
==========================================
  Files          92       92              
  Lines        5478     5507      +29     
==========================================
+ Hits         4887     4906      +19     
- Misses        591      601      +10
Impacted Files Coverage Δ
R/make-array.R 68.31% <65.51%> (-1.13%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cdd0104...81f2797. Read the comment docs.

codecov bot commented Nov 1, 2017

Codecov Report

Merging #154 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #154      +/-   ##
==========================================
- Coverage   90.04%   89.91%   -0.13%     
==========================================
  Files         105      104       -1     
  Lines        6466     6414      -52     
==========================================
- Hits         5822     5767      -55     
- Misses        644      647       +3
Impacted Files Coverage Δ
R/make-array.R 82.94% <100%> (+13.5%) ⬆️
R/misc.R 98.91% <100%> (-0.27%) ⬇️
R/variable-update.R 86.23% <0%> (-2%) ⬇️
R/dataset.R 93.28% <0%> (-1.57%) ⬇️
R/variable.R 82.6% <0%> (-0.73%) ⬇️
R/AllGenerics.R 92.59% <0%> (-0.52%) ⬇️
R/cube-result.R 99.38% <0%> (-0.02%) ⬇️
R/progress.R 100% <0%> (ø) ⬆️
R/variable-as-methods.R
... and 2 more

Powered by Codecov. Last update c7726bb...3144e12. Read the comment docs.

Contributor

@jonkeane left a comment

This is very exciting. I know that we've struggled with making this easy, and it will be great to be able to do this.

R/make-array.R Outdated
@@ -80,6 +80,101 @@ makeMR <- function (subvariables, name, selections, ...) {
return(vardef)
}


#' Create Multiple Response Variable from Delimited

Contributor

Super nitpicky: "Create a Multiple Response Variable from Delimited Lists" or "Create Multiple Response Variables from Delimited Lists"

R/make-array.R Outdated
#' @param unanswered Character string indicating non-response
#' @param ... Other arguments to be passed on to [makeMR()]
#'
#' @return

Contributor

Add a return string here

R/make-array.R Outdated
#' @param delim The delimiter separating the responses
#' @param name The name of the resulting MR variable
#' @param selected A character string used to indicate a selection
#' @param not_selected Character string identifying non-selection

Contributor

Add references to the default values that are specified

Contributor

I don't think "selected" and "not_selected" arguments are worth having. And I'm guessing that if you convert to the derived expression (using ~= on the server), you won't have room for defining subvariable category names anyway.

R/make-array.R Outdated
v <- as.vector(var)
} else {
halt(dQuote("var"), " must be a Categorical or Text Crunch Variable.")
}

Contributor

Here, dQuote("var") could be dQuote(substitute(var)) to get the actual variable string that was given
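
A minimal sketch of that idea in base R, with two assumptions: `stop()` stands in for the package's `halt()`, and the validator name and its type check are hypothetical. It wraps `substitute(var)` in `deparse()` so that an argument such as `ds$age` is reported as a single string.

```r
# Hypothetical validator; stop() stands in for the package's halt().
require_text_or_categorical <- function(var) {
  # deparse(substitute(var)) recovers the expression as the caller typed it.
  label <- deparse(substitute(var))
  if (!(is.character(var) || is.factor(var))) {
    stop(dQuote(label), " must be a Categorical or Text variable.",
         call. = FALSE)
  }
  invisible(var)
}

ds <- list(age = 1:3, trees = c("oak; maple", "birch"))
msg <- tryCatch(require_text_or_categorical(ds$age), error = conditionMessage)
msg
```

The error message then names `ds$age` rather than the literal word "var".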

R/make-array.R Outdated
ds <- loadDataset(datasetReference(var))
addVariables(ds, vardefs)
hide(var)
ds <- refresh(ds)

Contributor

You can get rid of this refresh line if you change line 122 to ds <- addVariables(ds, vardefs) and possibly also change line 123 to var <- hide(var)

Contributor

You can get rid of all of this if we accept #162 and you use that approach instead.

c("not_selected", "selected", "No Data"))
expect_identical(as.vector(ds$mr_5$maple),
structure(c(2L, 2L, 1L, 1L), .Label = c("not_selected", "selected"
), class = "factor"))

Contributor

I think this would be more readable as factor(c("selected", "selected", "not_selected", "not_selected"), levels=c("not_selected", "selected"))
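
For reference, the two spellings build the same object, so the swap would not change what the test asserts. A quick base R check (here the `structure()` call uses `levels =`, which sets the same attribute as the `.Label` shorthand in the test):

```r
# Attribute-level construction, as in the test.
from_structure <- structure(c(2L, 2L, 1L, 1L),
                            levels = c("not_selected", "selected"),
                            class = "factor")

# The more readable spelling suggested in the review.
from_factor <- factor(c("selected", "selected", "not_selected", "not_selected"),
                      levels = c("not_selected", "selected"))

identical(from_structure, from_factor)
```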

unanswered = v[is.na(v)])
expect_equivalent(varDef, expected)
})
c("maple; birch", "oak; maple; birch", "birch; sugar maple", "maple butter; oak")

Contributor

Is this line leftover?

Contributor Author

Yup!

not_selected = "No",
unanswered = v[is.na(v)])
expect_equivalent(varDef, expected)
})

Contributor

This expectation doesn't seem quite right: I would have expected "oak" to be matched a few times given those values.

})

test_that("createSubvarDef generates the correct variable definition", {
v <- c("maple; birch", "oak; maple; birch", "birch; sugar maple", "maple butter; oak", NA)

Contributor

It would be good to also test the case where one of the items is selected in every response.

R/make-array.R Outdated
vardefs <- lapply(cats, function(x) createSubvarDef(v, x, delim,
selected, not_selected, unanswered, missing = is.na(v)))
ds <- loadDataset(datasetReference(var))
addVariables(ds, vardefs)

Contributor

I'm not sure this is avoidable, but maybe it would be good to add a check that these variables don't already exist / if they do rename them? It seems a bit problematic that someone could only ever use this command once on a variable (without going through and manually deleting the subvariables that were created).

Contributor Author

How about just changing the subvariable names to varname_selection? In other words, if the delimited variable was called trees, then the subvariables would be trees_oak, trees_maple, etc. My preference would be to keep the function consistent so that it always names things in the same way and doesn't change behaviour based on the other variables in a dataset.

Contributor

varname_selection is nice, but it doesn't ensure uniqueness. A safer way is to create the variable in one request (not upload and then bind, which is not atomic). And in doing it in a single request, you may be able to omit aliases for the subvariables, in which case the backend should generate (guaranteed valid) aliases for them.

@gshotwell
Contributor Author

It looks like there's a zz9 bug which prevents creating an MR variable with derived subvariable definitions.

https://www.pivotaltracker.com/story/show/152975005 &
https://www.pivotaltracker.com/story/show/152975139

The ideal scenario is to create the multiple response variable definition in one step, without having to first create the subvariables, but for the time being we probably need to use the two-step process until the bug is resolved.

R/make-array.R Outdated
deriv <- zfunc("case", new_cat)
deriv$args[[2]] <- zfunc("is_missing", var)
deriv$args[[3]] <- zfunc("~=", var, buildDelimRegex(str, delim))
deriv$references <- list(name = str, alias = paste0(alias(var), "_", str))

Contributor

We should probably check that these aliases are unique in the dataset. Currently they concatenate the source variable's alias with the item name (str).

It should be doable to check that this alias doesn't exist, and if it does do something else.
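
One possible shape for such a check, as a sketch only: it assumes the dataset's current aliases are already available as a character vector, and the helper name `unique_aliases` is hypothetical. It relies on base R's `make.unique()` with an underscore separator, so no periods are introduced. A client-side check like this is still not atomic, so the server could reject an alias created in the meantime.

```r
# Hypothetical helper: de-duplicate candidate aliases against existing ones.
unique_aliases <- function(candidates, existing) {
  # make.unique() appends "_1", "_2", ... to any name that is already taken.
  all_names <- make.unique(c(existing, candidates), sep = "_")
  tail(all_names, length(candidates))
}

existing <- c("age", "trees_oak")
candidates <- paste0("trees", "_", c("oak", "maple"))
unique_aliases(candidates, existing)
```

Here the first candidate collides with the existing `trees_oak` and gets a numeric suffix, while `trees_maple` passes through unchanged.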

GShotwell and others added 4 commits February 13, 2018 14:22
- Changed map names to accommodate no periods in Mongo
- Added escape functionality for regex metacharacters
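
To illustrate why escaping matters for delimiter-aware matching, here is a hedged sketch in base R. The helpers `escape_regex` and `delim_regex` are hypothetical and are not the PR's `buildDelimRegex()`, whose exact pattern is not shown in this thread; the sketch also matches locally with `grepl()` rather than with the server-side `~=` function.

```r
# Hypothetical helper: backslash-escape regex metacharacters in a literal.
escape_regex <- function(x) {
  gsub("([\\\\.^$|?*+()\\[\\]{}])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: match `item` only as a whole delimited entry, i.e.
# preceded by the start of the string or the delimiter, and followed by the
# delimiter or the end of the string.
delim_regex <- function(item, delim) {
  d <- escape_regex(delim)
  paste0("(^|", d, ")", escape_regex(item), "(", d, "|$)")
}

v <- c("maple; birch", "oak; maple; birch", "birch; sugar maple",
       "maple butter; oak")
grepl(delim_regex("maple", "; "), v, perl = TRUE)
grepl(delim_regex("oak", "; "), v, perl = TRUE)
```

Anchoring on the delimiter keeps "maple" from matching "sugar maple" or "maple butter", and escaping keeps an item such as "a.b" from matching "axb".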
@nealrichardson nealrichardson merged commit f9cbe3c into master Mar 15, 2018
@nealrichardson nealrichardson deleted the delimited_MR branch March 15, 2018 21:19