Creating variable descriptions for datasets not provided
Authors should provide a codebook or dataset description that is precise enough for future replicators, who obtain what is purported to be the same data from a source, to verify that they plausibly received the same data. It is acceptable to point to codebooks or otherwise clear descriptions provided by the data source.
When creating a codebook, authors should be aware that summary statistics may be subject to confidentiality protection. This is unlikely to be relevant for commercial datasets, but is very likely for administrative data. Editing of codebooks for this purpose, or modification of the data before creation of the codebook, is acceptable.
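As a language-neutral illustration of the two paragraphs above, the following Python sketch builds a minimal codebook from CSV text. For each variable it reports an inferred type, the counts of missing and non-missing values, and the number of distinct values. It deliberately omits means, ranges, and example values, on the assumption that those are the statistics most likely to need confidentiality protection. The helper name `minimal_codebook`, the made-up sample data, and the choice of reported fields are assumptions for illustration, not a prescribed format.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def minimal_codebook(csv_text):
    """Describe each CSV column without printing any data values.

    Hypothetical helper: the reported fields are an illustrative choice.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    codebook = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        present = [v for v in values if v not in ("", None)]
        numeric = bool(present) and all(_is_number(v) for v in present)
        codebook.append({
            "variable": name,
            "type": "numeric" if numeric else "string",
            "non_missing": len(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        })
    return codebook


# Made-up sample data, not taken from any real dataset.
sample = "id,income,region\n1,52000,north\n2,,south\n3,61000,north\n"
for entry in minimal_codebook(sample):
    print(entry)
```

Even counts of distinct values can be disclosive for small samples, so the output of a sketch like this may still need review before release.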
In Stata, the codebook command produces such a description.
Example:
. sysuse auto
(1978 automobile data)
. codebook
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
make Make and model
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type: String (str18), but longest is str17
Unique values: 74 Missing "": 0/74
Examples: "Cad. Deville"
"Dodge Magnum"
"Merc. XR-7"
"Pont. Catalina"
Warning: Variable has embedded blanks.
...
In R, multiple packages can be used. The following describes the use of the codebook package.
library(haven)
library(codebook)
library(rmarkdown)
# various additional dependencies
new_codebook_rmd() # will generate a new Rmarkdown file called `codebook.Rmd`
# edit the codebook.Rmd to your liking
render("codebook.Rmd") # will generate an HTML codebook
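Checksums are computed from files, or from file contents. Different files (almost) never produce the same checksum, so a matching checksum is strong evidence that a replicator holds the same file as the author. While a few datafile-agnostic formats exist, we will focus here on general checksums.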
We focus on sha256 checksums, as they suffer less from collisions (different files with the same checksum), but md5 checksums are still widely used. Such standards-based checksums can be checked through a variety of mechanisms. Stata has its own checksum function.
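To make the notion of (non-)collision concrete, the short Python sketch below hashes two inputs that differ in a single character. Python is used only as a neutral illustration, and the byte strings are made up. The two digests are unrelated even though the inputs are nearly identical, and the digest length is fixed regardless of input size.

```python
import hashlib

# Two made-up inputs that differ in exactly one character.
original = b"price,mpg\n4099,22\n"
modified = b"price,mpg\n4099,23\n"

for label, data in (("original", original), ("modified", modified)):
    print(label, "sha256", hashlib.sha256(data).hexdigest())
    print(label, "md5   ", hashlib.md5(data).hexdigest())

# A sha256 digest is always 64 hex characters, an md5 digest 32.
same = hashlib.sha256(original).hexdigest() == hashlib.sha256(modified).hexdigest()
print("digests equal:", same)
```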
Various operating systems, notably Linux and macOS, may have native checksum commands. From a terminal/command line,
sha256sum file.txt
or
md5sum file.txt
will output something like
8aada5c6f554e426181cd22006c20291119fe85cab1d4d50893d64292802e2de file.txt
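Where these commands are unavailable, or when verification needs to be scripted, the same standard digests can be computed in other tools. The following is a minimal Python sketch rather than part of the original guidance. `checksum_file` and `verify_file` are hypothetical helper names, and the file is read in chunks so that large data files need not fit in memory. For the same file, the hex string should match what sha256sum or md5sum prints.

```python
import hashlib
import os
import tempfile


def checksum_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file's contents (hypothetical helper)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a previously recorded value."""
    return checksum_file(path, algorithm) == expected_hex.strip().lower()


# Demonstration on a temporary, empty file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

recorded = checksum_file(demo_path)
print(recorded, os.path.basename(demo_path))  # digest, then file name
print("verified:", verify_file(demo_path, recorded))
os.remove(demo_path)
```

Recording the digest at the time the data are obtained, and calling something like `verify_file` later, lets a replicator confirm that a newly obtained file is byte-for-byte identical.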
The Stata command checksum uses its own algorithm, so its result cannot be compared with a sha256 or md5 checksum; you will need Stata to verify it.
. checksum file.txt
will output
Checksum for file.txt = 1964867009, size = 670
The R package tools has checksums:
> tools::md5sum("file.txt")
file.txt
"c5212cac825e7932be0c01877e344a96"
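The R package openssl has a few other checksums (hash functions). Pass the file as a connection: a plain character string is hashed as text, so openssl::sha256("file.txt") would return the hash of the file name rather than of the file's contents.
> openssl::sha256(file("file.txt"))
The result should match the output of sha256sum for the same file.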
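The tools package is part of base R. The openssl package is a separate CRAN package, though it is often already installed as a dependency of other packages.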