Question on csv programming exercise and encoding rows missing specific header keys #176

Open
@CoreyWinkelmannPP

Problem Description

I need to take a collection of CSV documents in a folder and merge them into one very large CSV document. The files share some columns, and each file may also have columns of its own. The script reads each of these files, merges them, and then writes out the new CSV file.
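
To make the merge concrete, here is a small sketch of the intended result on in-memory data. The `Row` and `Table` types, `mergeTables`, `toCells`, and the two sample files are hypothetical names made up for illustration; the sketch assumes that a column missing from a row should come out as an empty cell.

```haskell
import Data.List (nub)
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for one parsed CSV file:
-- its header plus rows keyed by column name.
type Row = M.Map String String
type Table = ([String], [Row])

-- Union of the two headers in first-seen order, with the rows appended.
mergeTables :: Table -> Table -> Table
mergeTables (h1, r1) (h2, r2) = (nub (h1 ++ h2), r1 ++ r2)

-- Lay a row out against a header; columns the row lacks become "".
toCells :: [String] -> Row -> [String]
toCells header row = [M.findWithDefault "" h row | h <- header]

-- Two made-up files that overlap on "id" only.
fileA, fileB :: Table
fileA = (["id", "name"], [M.fromList [("id", "1"), ("name", "Ann")]])
fileB = (["id", "email"], [M.fromList [("id", "2"), ("email", "b@example.com")]])

-- Every row of the merged table, aligned to the merged header.
mergedCells :: [[String]]
mergedCells = map (toCells header) rows
  where
    (header, rows) = mergeTables fileA fileB
```

Here the merged header is the union of both headers, and each row carries an empty cell for any column that only the other file has.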

Solution I have working (but it seems a little slow compared to a Go or Rust implementation)

On my data set, the Rust and Go versions run this scenario in 100 to 200 ms. The Haskell version below takes 300 to 400 ms. A Python version also ran in the 300 to 400 ms range, which is why I think Haskell should be able to do this faster.

I have coded the following. Originally I was hoping to stream through the files with conduit, processing them and building up the result as I went, but I ended up bailing on that and collecting the file contents first, then going through those and processing them one at a time. I want a more efficient and idiomatic Haskell version of this and was wondering if anyone could give me some insight into what that might look like.

One issue I did come across with the solution below: I had to change the cassava code so that an empty string is returned when the map lookup for a header key returns Nothing, instead of failing as the current version does.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import Conduit
import System.FilePath (takeExtension)
import Data.Csv
import qualified Data.Vector as V
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map as M
import Data.Either
import Data.List (nub)

type Column = M.Map BS.ByteString LBS.ByteString
type Rows = V.Vector Column
type CsvDocument = (Header, Rows)
type CsvDocuments = V.Vector BS.ByteString
type ErrorMsg = String

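-- Collect the contents of every .csv file under the current directory,
-- one strict ByteString per file. sourceFile streams a file in chunks,
-- so a large file would otherwise be split across several elements;
-- reading each file whole avoids that.
getCsvDocuments :: ConduitM a c (ResourceT IO) CsvDocuments
getCsvDocuments = sourceDirectoryDeep True "."
        .| filterC (\fp -> takeExtension fp == ".csv")
        .| mapMC (liftIO . BS.readFile)
        .| sinkVector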

mergeHeader :: Header -> Header -> Header
mergeHeader h1 h2 = V.fromList . nub . V.toList $ (h1 V.++ h2)

-- Fold step: decode one file and merge it into the accumulated document.
-- Note that a file which fails to parse is silently treated as empty.
combineCsvDocuments :: CsvDocument -> BS.ByteString -> CsvDocument
combineCsvDocuments acc csv = (mergedHeader, mergedBody)
    where
        decodedCsv = fromRight (V.empty, V.empty) . decodeByName . LBS.fromStrict $ csv
        mergedHeader = mergeHeader (fst decodedCsv) (fst acc)
        mergedBody = snd acc V.++ snd decodedCsv

mapFiles :: CsvDocuments -> CsvDocument
mapFiles = V.foldl' combineCsvDocuments (V.empty, V.empty)

main :: IO ()
main = do
    files <- runConduitRes getCsvDocuments
    let document = mapFiles files
    -- The "output" directory must already exist.
    LBS.writeFile
        "output/combined_response.csv"
        (encodeByName
            (fst document)
            (V.toList . snd $ document))
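
One possible cost in the fold above is that `V.++` builds a new vector of all rows accumulated so far on every step, so the copying grows with the number of files. A minimal sketch of an alternative, assuming all files fit in memory: decode everything first, then build the header and body once with `V.concat`. The names `Row`, `decodeAll`, `mergeAll`, and `sampleFiles` are hypothetical. Unlike the original, this version reports the first parse error instead of dropping the file, and it keeps header names in first-seen order. Whether it is actually faster on a given data set would need measuring.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Csv (Header, decodeByName)
import Data.List (nub)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map.Strict as M
import qualified Data.Vector as V

-- Hypothetical row type: column name to cell contents.
type Row = M.Map BS.ByteString BS.ByteString

-- Decode every file; the first parse error aborts the whole merge.
decodeAll :: [BS.ByteString] -> Either String [(Header, V.Vector Row)]
decodeAll = traverse (decodeByName . LBS.fromStrict)

-- Build the merged header and the merged body in one pass each.
mergeAll :: [(Header, V.Vector Row)] -> (Header, V.Vector Row)
mergeAll docs =
  ( V.fromList (nub (concatMap (V.toList . fst) docs))
  , V.concat (map snd docs)
  )

-- Made-up input standing in for the contents of two files.
sampleFiles :: [BS.ByteString]
sampleFiles =
  [ "id,name\r\n1,Ann\r\n"
  , "id,email\r\n2,b@example.com\r\n"
  ]
```

`nub` is quadratic in the number of header names, which is probably fine for ordinary headers; very wide files might want a `Set`-based de-duplication instead.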

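On the cassava change mentioned above: one way to avoid patching the library might be to pad every row with an empty value for each header name it lacks before calling `encodeByName`, so the lookup never misses. This is only a sketch. `Row`, `blankRow`, `padRow`, `encodePadded`, and the example values are hypothetical names, and the row type here uses strict `ByteString` values rather than the lazy ones in the `Column` type above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Csv (Header, encodeByName)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Map.Strict as M
import qualified Data.Vector as V

-- Hypothetical row type: column name to cell contents.
type Row = M.Map BS.ByteString BS.ByteString

-- A row that binds every header name to the empty string.
blankRow :: Header -> Row
blankRow header = M.fromList [(name, BS.empty) | name <- V.toList header]

-- M.union is left-biased, so values already present in the row win
-- and only the missing columns are filled from the blank row.
padRow :: Row -> Row -> Row
padRow blanks row = M.union row blanks

-- Pad all rows against the merged header, then encode as usual.
encodePadded :: Header -> [Row] -> LBS.ByteString
encodePadded header rows = encodeByName header (map (padRow blanks) rows)
  where
    blanks = blankRow header

-- Made-up data: the row has no "email" column.
exampleHeader :: Header
exampleHeader = V.fromList ["id", "name", "email"]

exampleRow :: Row
exampleRow = M.fromList [("id", "1"), ("name", "Ann")]
```

The blank row is built once per header, so the extra work per row is a single map union.
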
About me

I am learning concepts from Haskell but am still a beginner. I pick out some challenges and try them in Haskell, and I am always looking for feedback and better ways of doing them from more experienced people. I develop in object-oriented languages, which I understand well, in my current role. I am trying to expand my knowledge by gaining a better understanding of how functional programming can improve my development skills.

Thanks in advance for any help you can give!
