csv_to_disk.frame running extremely slow #368

bryan-rt · 2021-11-04T14:46:38Z

bryan-rt
Nov 4, 2021

I am fairly new at handling medium sized data so I could very well be doing something basic wrong, but I am not seeing what my issue could be.

I have a 9 GB csv file, and am running on an 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00 Ghz with 4 cores, 8 logical processors, and 15.7 GB of Memory. I am using R 3.6.1 (can't update due to employer). 120M rows and 19 columns.

When I run the fallowing code the stage1splitter runs for hours with no results. and the cpu usage for the R for Windows front-end workers is 0% most of the time.

Code

library(dplyr)
library(purrr)
library(disk.frame)

# this willl set disk.frame with multiple workers
setup_disk.frame()

# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)

path_to_data = "F:/<file_path>"

# read 1 million at once
in_chunk_size = 1e7

system.time(
  csv_to_disk.frame(
    paste0(path_to_data, "<file_name>.csv"), 
    in_chunk_size = in_chunk_size
  )
)

Output

 ----------------------------------------------------- 
Stage 1 of 2: splitting the file F:/30000/30000/Credit Policy/BryanT/TU Scorecard CL Monitoring/perf_v3.csv into smallers files:
Destination: C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f4627257a4
 ----------------------------------------------------- 
Stage 1 of 2 took: 02:44:06 elapsed (45.4s cpu)
 ----------------------------------------------------- 
Stage 2 of 2: Converting the smaller files into disk.frame
 ----------------------------------------------------- 
csv_to_disk.frame: Reading multiple input files.
Please use `colClasses = `  to set column types to minimize the chance of a failed read
=================================================

 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 1 of 2:

Converting 13 CSVs to 20 disk.frames each consisting of 20 chunks

  Progress: ---------------------------------------------------------------- 100%-- Converting CSVs to disk.frame -- Stage 1 or 2 took: 33.0s elapsed (0.120s cpu)
 ----------------------------------------------------- 
 
 ----------------------------------------------------- 
-- Converting CSVs to disk.frame -- Stage 2 of 2:

Row-binding the 20 disk.frames together to form one large disk.frame:
Creating the disk.frame at C:\Users\B9800\AppData\Local\Temp\Rtmp4AoDer\file42f47e5437.df

Appending disk.frames: 
Stage 2 of 2 took: 32.4s elapsed (0.190s cpu)
 ----------------------------------------------------- 
Stage 1 & 2 in total took: 00:01:05 elapsed (0.310s cpu)
Stage 2 of 2 took: 00:01:08 elapsed (0.340s cpu)
 ----------------------------------------------------- 
Stage 2 & 2 took: 02:45:15 elapsed (45.7s cpu)
 ----------------------------------------------------- 
   user  system elapsed 
  45.75   28.98 9915.55

Is this an issue with using 3.6.1? When I load disk.frame I get

Warning message: package ‘disk.frame’ was built under R version 3.6.3

xiaodaigh · 2022-01-24T11:03:51Z

xiaodaigh
Jan 24, 2022
Maintainer

I think the larger the chunk size the better the performance

in_chunk_size = 1e7

Try removing the above or set it to much higher number e.g. in_chunk_size = 1e8

Are the files share-able? If yes I can take a look.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

csv_to_disk.frame running extremely slow #368

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

csv_to_disk.frame running extremely slow #368

Uh oh!

Uh oh!

bryan-rt Nov 4, 2021

Code

Output

Replies: 1 comment

Uh oh!

xiaodaigh Jan 24, 2022 Maintainer

bryan-rt
Nov 4, 2021

xiaodaigh
Jan 24, 2022
Maintainer