Skip to content

Commit 2537862

Browse files
authored
Fix fread to preserve literal header column name "NA" while parsing data "NA" as missing (#7686)
1 parent eddec87 commit 2537862

File tree

3 files changed

+20
-4
lines changed

3 files changed

+20
-4
lines changed

NEWS.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,8 @@
4848

4949
8. `frollapply()` no longer produces output longer than the input when the window length is also longer than the input [#7646](https://github.com/Rdatatable/data.table/issues/7646). Thanks to @hadley-johnson for reporting and @jangorecki for the fix.
5050

51+
9. `fread()` no longer replaces a literal header column name `"NA"` with an auto-generated `Vn` name when `na.strings` includes `"NA"`, [#5124](https://github.com/Rdatatable/data.table/issues/5124). Data rows still continue to parse `"NA"` as missing. Thanks @Mashin6 for the report and @shrektan for the fix.
52+
5153
### Notes
5254

5355
1. {data.table} now depends on R 3.5.0 (2018).
@@ -118,15 +120,15 @@
118120
119121
5. Negative and missing values of `n` argument of adaptive rolling functions trigger an error.
120122
121-
### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
123+
### NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES
122124
123125
1. `data.table(x=1, <expr>)`, where `<expr>` is an expression resulting in a 1-column matrix without column names, will eventually have names `x` and `V2`, not `x` and `V1`, consistent with `data.table(x=1, <expr>)` where `<expr>` results in an atomic vector, for example `data.table(x=1, cbind(1))` and `data.table(x=1, 1)` will both have columns named `x` and `V2`. In this release, the matrix case continues to be named `V1`, but the new behavior can be activated by setting `options(datatable.old.matrix.autoname)` to `FALSE`. See point 5 under Bug Fixes for more context; this change will provide more internal consistency as well as more consistency with `data.frame()`.
124126
125127
2. The behavior of `week()` will be changed in a future release to calculate weeks sequentially (days 1-7 as week 1), which is a potential breaking change. For now, the current "legacy" behavior, where week numbers advance every 7th day of the year (e.g., day 7 starts week 2), remains the default, and a deprecation warning will be issued when the old and new behaviors differ. Users can control this behavior with the temporary option `options(datatable.week = "...")`:
126128
* `"sequential"`: Opt-in to the new, sequential behavior (no warning).
127129
* `"legacy"`: Continue using the legacy behavior but suppress the deprecation warning.
128130
See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. Thanks @MichaelChirico for the report and @venom1204 for the implementation.
129-
131+
130132
### NEW FEATURES
131133
132134
1. New `sort_by()` method for data.tables, [#6662](https://github.com/Rdatatable/data.table/issues/6662). It uses `forder()` to improve upon the data.frame method and also matches `DT[order(...)]` behavior with respect to locale. Thanks @rikivillalba for the suggestion and PR.
@@ -405,7 +407,7 @@ See [#2611](https://github.com/Rdatatable/data.table/issues/2611) for details. T
405407
9. Fixed incorrect sorting of merges where the first column of a key is a factor with non-`sort()`-ed levels (e.g. `factor(1:2, 2:1)` and it is joined to a character column, [#5361](https://github.com/Rdatatable/data.table/issues/5361). Thanks to @gbrunick for the report, Benjamin Schwendinger for the fix, and @MichaelChirico for a follow-up fix caught by revdep testing.
406408
407409
10. Spurious warnings from internal code in `cube()`, `rollup()`, and `groupingsets()` are no longer surfaced to the caller, [#6964](https://github.com/Rdatatable/data.table/issues/6964). Thanks @ferenci-tamas for the report and @venom1204 for the fix.
408-
410+
409411
11. `droplevels()` works on 0-row data.tables, [#7043](https://github.com/Rdatatable/data.table/issues/7043). The result will have factor columns `factor(character())`, consistent with the data.frame method. Thanks @advieser for the report and @MichaelChirico for the fix.
410412
411413
12. `print(..., col.names = 'none')` now correctly adapts column widths to the data content, ignoring the original column names and producing a more compact output, [#6882](https://github.com/Rdatatable/data.table/issues/6882). Thanks to @brooksambrose for the report and @venom1204 for the PR.
@@ -587,7 +589,7 @@ rowwiseDT(
587589
3. Tagging/naming arguments of `c()` in `j=c()` should now more closely follow base R conventions for concatenation of named lists during grouping, [#2311](https://github.com/Rdatatable/data.table/issues/2311). Naming an `lapply(.SD, FUN)` call as an argument of `c()` in `j` will now always cause that tag to get prepended (with a single dot separator) to the resulting column names. Additionally, naming a `list()` call as an argument of `c()` in `j` will now always cause that tag to get prepended to any names specified within the list call. This bug only affected queries with (1) `by=` grouping (2) `getOption("datatable.optimize") >= 1L` and (3) `lapply(.SD, FUN)` in `j`.
588590
589591
While the names returned by `data.table` when `j=c()` will now mostly follow base R conventions for concatenating lists, note that names which are completely unspecified will still be named positionally, matching the typical behavior in `j` and `data.table()`. according to position in `j` (e.g. `V1`, `V2`).
590-
592+
591593
Thanks to @franknarf1 for reporting and @myoung3 for the PR.
592594
593595
```r

inst/tests/tests.Rraw

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2767,6 +2767,12 @@ test(946, fread('A,B,,D\n1,3,foo,5\n2,4,bar,6\n'), data.table(A=1:2,B=3:4,c("foo
27672767
test(947, fread('0,2,,4\n1,3,foo,5\n2,4,bar,6\n'), data.table(0:2,2:4,c("","foo","bar"),4:6))
27682768
test(948, fread('A,B,C\nD,E,F\n',header=TRUE), data.table(A="D",B="E",C="F"))
27692769
test(949, fread('A,B,\nD,E,F\n',header=TRUE), data.table(A="D",B="E",V3="F"))
2770+
# #5124 fread should preserve literal "NA" header names while still parsing data "NA" as missing
2771+
test(949.1, names(fread('A,NA,C\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
2772+
ans = data.table(A=1L, tmp=as.logical(NA), C=3L)
2773+
setnames(ans, "tmp", "NA")
2774+
test(949.2, fread('A,NA,C\n1,NA,3\n', header=TRUE), ans)
2775+
test(949.3, names(fread('"A","NA","C"\n1,NA,3\n', header=TRUE)), c("A", "NA", "C"))
27702776

27712777
# +/- with no numbers afterwards should read as character
27722778
test(950, fread('A,B,C\n1,+,4\n2,-,5\n3,-,6\n'), data.table(A=1:3,B=c("+","-","-"),C=4:6))

src/fread.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2325,6 +2325,12 @@ int freadMain(freadMainArgs _args)
23252325
.targets = targets,
23262326
.anchor = colNamesAnchor,
23272327
};
2328+
const char * const* savedNAstrings = NAstrings;
2329+
const bool savedBlankIsNAString = blank_is_a_NAstring;
2330+
// Column names should preserve literal header text, even when it matches na.strings.
2331+
// Blank headers still keep len==0 from Field() and are assigned default V<n> names later.
2332+
NAstrings = NULL;
2333+
blank_is_a_NAstring = false;
23282334
ch--;
23292335
for (int i = 0; i < ncol; i++) {
23302336
// Use Field() here as it handles quotes, leading space etc inside it
@@ -2345,6 +2351,8 @@ int freadMain(freadMainArgs _args)
23452351
if (ch[1] == '\r' || ch[1] == '\n' || ch[1] == '\0') { ch++; break; }
23462352
}
23472353
}
2354+
NAstrings = savedNAstrings;
2355+
blank_is_a_NAstring = savedBlankIsNAString;
23482356
if (eol(&ch)) pos = ++ch;
23492357
else if (*ch == '\0') pos = ch;
23502358
else INTERNAL_STOP("reading colnames ending on '%c'", *ch); // # nocov

0 commit comments

Comments
 (0)