More Detail on lapply() and combining multiple data files

Continuing the discussion from FAQ: Data Cleaning in R - Dealing with Multiple Files:

Hi everyone,

This exercise gives a quick overview of combining several CSV files into one using lapply().

Can anyone elaborate on the code below and offer more detail on how to understand and use the lapply() function in R? Thanks.

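library(readr)  # provides read_csv()
library(dplyr)  # provides bind_rows()
files <- list.files(pattern = "file_.*csv")  # names of matching files in the working directory
df_list <- lapply(files, read_csv)           # one data frame per file
df <- bind_rows(df_list)                     # stack them into a single data frame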

Simply put, lapply() applies a function to each element of a list or vector and always returns the results as a list. That is why bind_rows() works above: it accepts a list of data frames and stacks them into one. Generally, you reach for lapply() when you need to apply the same function to every element of a list or vector; that function is often something like read_csv, but not always. For example, you can pass a data frame column to lapply() to apply some manipulation to each value in it; the result comes back as a list, so you would typically unlist() it before putting it back in the data frame.
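To make the file-combining pattern concrete, here is a minimal sketch that can be run anywhere. It is not the exercise's own code: the temporary directory, the file contents and the column names are made up for illustration, and base R's read.csv() and do.call(rbind, ...) are swapped in for read_csv() and bind_rows() so that no packages are needed.

```r
# Create three small CSV files in a temporary directory (made-up data).
tmp_dir <- tempfile("lapply_demo_")
dir.create(tmp_dir)
for (i in 1:3) {
  scores <- data.frame(student = c("a", "b"), score = c(10 * i, 10 * i + 1))
  write.csv(scores, file.path(tmp_dir, paste0("file_", i, ".csv")), row.names = FALSE)
}

# list.files() returns a character vector of the matching paths.
files <- list.files(tmp_dir, pattern = "^file_.*\\.csv$", full.names = TRUE)

# lapply() calls read.csv() once per path and returns a list of data frames.
df_list <- lapply(files, read.csv)

# Stack the data frames into one (base R stand-in for bind_rows()).
df <- do.call(rbind, df_list)

# Remove the temporary files; df stays in memory.
unlink(tmp_dir, recursive = TRUE)
```

Here df_list has one element per file, and df contains all of their rows stacked together.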

I use R as a large part of my job, and honestly it can be hard to find good applications for lapply, sapply, etc. They have their uses, but in production you tend to work mostly with data frames, using packages like dplyr, tidyr and data.table. These have their own built-in functions, which are easier to use (see dplyr's mutate, for example) and do effectively the same job. But the gist of it is that lapply applies a function to each element of a list.
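As a rough illustration of that comparison, the sketch below derives the same column twice: once with lapply() and once with dplyr's mutate(). The data frame and column names are invented for the example, and the dplyr part only runs if dplyr happens to be installed.

```r
# Made-up data for illustration.
df <- data.frame(name = c("ann", "bob", "cy"), height_cm = c(160, 175, 182))

# lapply() version: the function is called once per value and the result
# is a list, so it has to be flattened with unlist() before assignment.
df$height_m_lapply <- unlist(lapply(df$height_cm, function(x) x / 100))

# dplyr version: mutate() works on the whole column at once.
# Guarded so the sketch still runs when dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  df <- dplyr::mutate(df, height_m_mutate = height_cm / 100)
}
```

Both columns hold the same values; the mutate() call is just shorter and stays inside the data frame workflow.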

Hi, thanks for replying. After reading your comment and doing some more research, the lapply() function makes more sense to me now. There are actually many applications for the apply() family when manipulating datasets.

Do you use lapply() or any apply() function at work? If so, could you share a brief example of the practical use of the function in action?

If none comes to mind right now, that is OK. Thank you anyway.

I’ve used sapply occasionally to pick things out of a data frame, but honestly it’s incredibly rare for me. I work a ton with data, and so with data frames over anything else, and I tend to find that for anything the apply() functions (lapply, sapply, etc.) can do, there’s a dplyr or data.table function that does it better. I think it had its place, and maybe still does, but I don’t find it so useful anymore!
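One possible example of "picking things out" with sapply is selecting columns by type. This is only a sketch of that kind of use, not necessarily what I do at work, and the data frame is made up for illustration.

```r
# Made-up data for illustration.
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  score = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# sapply() runs is.numeric() on each column and simplifies the result
# to a named logical vector instead of a list.
is_num <- sapply(df, is.numeric)

# Use that vector to keep only the numeric columns.
numeric_df <- df[, is_num, drop = FALSE]

# sapply() again: one mean per remaining column, as a named numeric vector.
col_means <- sapply(numeric_df, mean)
```

The difference from lapply() is only in the return value: sapply() tries to simplify the list into a vector or matrix when it can.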

OK, thanks for giving the input. I should explore more.