In this series of posts we will be looking at a number of ways to store data using R in as little space as possible and also consider the portability of the different solutions. As an example, the New York City Flights data set for 2013 is used (available through CRAN). The first rows and columns are shown below:
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
For portability across programming languages, CSV’s, optionally compressed, are a great option, as any platform for doing anything with data should be able to read CSV. Alternative formats such as fst, parquet and ORC have a number of advantages, such as smaller sizes, better fidelity and built-in integrity checking; these will be examined in a later post. For now, gzipped CSV is used as the reference. Written with the R data.table package, it takes 2 cores a bit less than one second to produce the file containing the 2013 flights. The file size is 6.6 MB.
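As a minimal sketch of how this works (assuming data.table >= 1.12.4, which gzips automatically when the file name ends in .gz), the same fwrite() call produces either a plain or a compressed file depending only on the extension:

```r
library(data.table)

# A small repetitive table, standing in for the flights data
dt <- data.table(x = rep(1:10, 5000), y = rep(letters[1:10], 5000))

plain <- file.path(tempdir(), "tiny.csv")
gz <- file.path(tempdir(), "tiny.csv.gz")

fwrite(dt, plain)
fwrite(dt, gz)  # same call; gzip compression is inferred from the .gz extension

file.size(gz) < file.size(plain)  # TRUE: the gzipped file is smaller
```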
library(dplyr)
library(data.table)
# We drop the column with timestamps for reasons explained below.
flights <- nycflights13::flights %>% select(-time_hour)
flightsDt <- as.data.table(flights)
system.time(data.table::fwrite(flightsDt, 'flights.csv.gz'))
user system elapsed
1.886 0.027 0.984
If you want to read or write a large amount of data to CSV using R and you want to do it rather quickly, there are two good options at the moment: data.table and vroom.
data.table, and vroom in its current development version, support writing gzip-compressed csv’s as well. Roughly, zip-like compression algorithms work by building a mapping from longer sequences of bits to shorter ones, in such a way that the mapping plus the remapped input take up less space than the original. Compression algorithms use clever techniques to create and maintain this mapping, but for reasons of speed and memory use the mapping can’t grow without bounds. We can help the algorithm a bit by first transposing the data.
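A minimal sketch of this effect, using base R’s memCompress(): two strings of equal length compress to very different sizes depending on how repetitive they are.

```r
set.seed(42)

# Highly repetitive input, similar to the year/month/day columns
repetitive <- paste(rep("2013,1,1", 10000), collapse = "\n")

# Random input of exactly the same length
random <- paste(sample(c(letters, 0:9), nchar(repetitive), replace = TRUE),
                collapse = "")

sizeRep <- length(memCompress(charToRaw(repetitive), type = "gzip"))
sizeRnd <- length(memCompress(charToRaw(random), type = "gzip"))

sizeRep < sizeRnd  # TRUE: repetition is what the compressor feeds on
```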
Transpose for a free lunch
There is a lot of repetition in the first three columns: year, month and day.
> flightsDt[,
lapply(.SD, function(.) length(unique(.))),
.SDcols = c('year', 'month', 'day')
]
year month day
1: 1 12 31
If all these are put close together, it helps the compression algorithm a bit:
TimeAndSize <- function(FUN, fileName) {
filePath <- file.path(tempdir(), fileName)
on.exit(unlink(filePath))
timing <- unclass(system.time(FUN(filePath)))
fileSize <- file.info(filePath)$size
c(timing, FileSizeInMB = round(fileSize / 1024L / 1024L, 1))
}
dt_csv_fun <- function(fileName) fwrite(flightsDt, fileName)
dt_gz_fun <- function(.) fwrite(flightsDt, .)
library(vroom)
vroom_csv_fun <- function(.) vroom_write(flights, .)
vroom_gz_fun <- function(.) vroom_write(flights, pipe(paste('gzip >', .)))
vroom_mgz_fun <- function(.) vroom_write(flights, pipe(paste('pigz >', .)))
vroom_zstd_fun <- function(.) vroom_write(flights, pipe(paste('zstd >', .)))
# fwrite gzips automatically when the file name ends in .gz, so the two
# transposed writers differ only in the extension they are handed below.
dt_csv_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
dt_gz_fun_transposed <- function(.) fwrite(transpose(flightsDt), .)
experiments <- list(
dt_csv_fun = 'flights.csv', dt_gz_fun = 'flights.csv.gz',
vroom_csv_fun = 'flights.csv', vroom_gz_fun = 'flights.csv.gz',
vroom_mgz_fun = 'flights.csv.gz', vroom_zstd_fun = 'flights.csv.zstd',
dt_csv_fun_transposed = 'flights.tcsv',
dt_gz_fun_transposed = 'flights.tcsv.gz'
)
PerformExperiments <- function(experiments) {
t(sapply(names(experiments), function(.) {
TimeAndSize(match.fun(.), experiments[[.]])
}))
}
PerformExperiments(experiments)
user.self sys.self elapsed user.child sys.child FileSizeInMB
dt_csv_fun 0.403 0.012 0.215 0.000 0.000 22.8
dt_gz_fun 1.891 0.000 0.961 0.000 0.000 7.5
vroom_csv_fun 1.327 0.019 0.360 0.000 0.000 22.9
vroom_gz_fun 1.359 0.030 1.517 1.474 0.024 7.5
vroom_mgz_fun 1.341 0.046 0.943 1.560 0.039 7.5
vroom_zstd_fun 1.364 0.037 0.600 0.181 0.030 6.6
dt_csv_fun_transposed 2.318 0.047 1.964 0.000 0.000 22.8
dt_gz_fun_transposed 3.004 0.003 1.893 0.000 0.000 5.5
Comparing the results of data.table and vroom, we see that timings and file sizes are almost equal: both are equally good at writing CSV files. However, if the table is first transposed and then written, the total process takes a bit longer but the file size is reduced by around 25%!
Warts
The column time_hour needs to be dropped because data.table’s transpose() unpacks this column, which is saved as more data, and transposing it back doesn’t recover the original values in any case. Moreover, transposing is slow even with data.table, and handling very wide tables can be problematic for some software packages.
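The lossiness of the round trip is easy to demonstrate on a toy table. A minimal sketch, assuming data.table’s transpose(): when a row mixes types, each output column is coerced to the highest type among them, so the original column types are not recovered.

```r
library(data.table)

# One integer and one character column
dt <- data.table(n = 1:3, s = c("x", "y", "z"))

# Transposing mixes types within rows, forcing coercion to character;
# transposing back therefore cannot restore the integer column.
back <- transpose(transpose(dt))

sapply(back, class)  # both columns come back as character
```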
Conclusion
Compressed csv’s are a portable data format that requires significantly less space than the uncompressed equivalent. Their size can be further reduced by transposing the tables before writing and compressing. Whether this extra compression is worth it depends on the use case.