encoding #96

dmvianna · 2017-11-21T03:06:18Z

Hi, Andrew.

λ> printValidAddresses
{id :-> "1992010803", abstract :-> POBoxAddress "3898" "sydney" "nsw" "2001"}
{id :-> "1992010807", abstract :-> POBoxAddress "4164" "sydney" "nsw" "2001"}
{id :-> "1999042444", abstract :-> POBoxAddress "1285" "k,melbourne" "vic" "3001"}
*** Exception: ../data/IPGOD.IPGOD122B_PAT_ABSTRACTS.csv: hGetLine: invalid argument (invalid byte sequence)

How would one change the encoding prior to parsing a CSV file? I thought of decodeLatin1, but there isn't an obvious place to put it.

current commit

acowley · 2017-11-21T16:58:52Z

I can't reproduce, but I don't think you're using the same data file as is in your repository. The file there has no record with id 1999042444, for example.

dmvianna · 2017-11-21T18:42:15Z

I’m not using the same file. I didn’t want to include a massive text file in the repo. I can add either a `head` of the file that includes the offending line, or the whole thing. Which do you think is the best approach?


acowley · 2017-11-21T19:11:34Z

You could try using something from here, but I must confess I have very little experience with that. Moreover, any such change to the runtime code has a danger of hiding a problem with some TemplateHaskell reading the file at compile time; I'm not sure how best to handle that. Perhaps ParserOptions would need a TextEncoding field that could be overridden.

If possible, I'd actually just write a program to convert the file encoding before further processing. If you think it would be helpful to handle different encodings from within Frames, the first step is figuring out how to read the lines of that file into a Text value, and then we can add what is needed.
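As a rough illustration of that first step, here is a minimal sketch that reads a Latin-1 file into lines of `Text` and, separately, re-encodes a file as UTF-8. It uses only `bytestring` and `text`; the function names (`latin1Lines`, `readLatin1Lines`, `latin1ToUtf8`) are made up for this example and are not part of Frames. It reads the whole file strictly, so it is not the streaming solution discussed later in the thread.

```haskell
import qualified Data.ByteString    as B
import           Data.Text          (Text)
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- | Decode raw Latin-1 bytes and split them into lines.
-- decodeLatin1 is total: every byte maps to exactly one code point,
-- so it cannot fail with an "invalid byte sequence" error.
-- Note: T.lines splits on '\n' only, so CRLF input keeps a trailing '\r'.
latin1Lines :: B.ByteString -> [Text]
latin1Lines = T.lines . TE.decodeLatin1

-- | Read a Latin-1 encoded file into lines of Text (hypothetical helper).
readLatin1Lines :: FilePath -> IO [Text]
readLatin1Lines path = latin1Lines <$> B.readFile path

-- | Standalone conversion: rewrite a Latin-1 file as UTF-8 (hypothetical helper).
latin1ToUtf8 :: FilePath -> FilePath -> IO ()
latin1ToUtf8 src dst =
  B.readFile src >>= B.writeFile dst . TE.encodeUtf8 . TE.decodeLatin1
```

Running `latin1ToUtf8` once over the input would be the "convert the file before further processing" route; `readLatin1Lines` is closer to what a reader inside a library would need.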

dmvianna · 2017-11-21T19:22:15Z

Munging files from external stakeholders is what I do for a living. I would think the most convenient approach would be to stream the file in one pass. It would be great if we could do that with `pipes`, that is, translate the encoding and then get `Frames` to break it into columns and types. I’ll work on the `head` and the `Text` once I’m on the train going to work. That should be an hour or two away.


acowley · 2017-11-21T19:59:58Z

Okay, end should incorporate this machinery at compile- and run-time.

dmvianna · 2017-11-21T21:02:41Z

The offending line is now in the file, and the code points to it. Please see if you can reproduce. As a side note, I'm testing it on a Mac, but last Friday this ran perfectly fine on Windows 10. However, I will not be able to reproduce it on Windows, as I had to make many changes to my Haskell setup there.
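The `hGetLine: invalid argument (invalid byte sequence)` error is raised when a text-mode handle cannot decode the bytes it reads. By default, GHC handles use the system's locale encoding, so one plausible (but unconfirmed in this thread) explanation for the Mac/Windows difference is that the two machines had different default encodings. A minimal sketch using only `System.IO` from `base`, with the hypothetical helper name `readLinesWith`, shows how the encoding can be set explicitly instead of inherited from the environment:

```haskell
import System.IO

-- | Read all lines of a file, decoding with an explicitly chosen encoding
-- instead of the locale default (hypothetical helper, not part of Frames).
readLinesWith :: TextEncoding -> FilePath -> IO [String]
readLinesWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc            -- override the locale-derived default
    contents <- hGetContents h
    -- hGetContents is lazy: force the whole string before withFile
    -- closes the handle.
    length contents `seq` pure (lines contents)
```

For example, `readLinesWith latin1 "data/pat_abstracts.csv"` decodes every byte as a Latin-1 character, while `localeEncoding` (also exported by `System.IO`) can be inspected to see what a given machine would have used by default.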

acowley · 2017-11-22T19:05:05Z

The below modification of your program now works with the pipe-everything branch of Frames! We should poke at that branch a bit more before merging it into master since it's a pretty large change, but I'm hoping it also addresses #77 and #92, though I did a poor job of incorporating the start offered in #93, so I'm not sure if the current changes cover everyone's needs.

{-# LANGUAGE ConstraintKinds   #-}
{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE FlexibleContexts  #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}
{-# LANGUAGE TemplateHaskell   #-}
{-# LANGUAGE TypeFamilies      #-}
{-# LANGUAGE TypeOperators     #-}
{-# OPTIONS_GHC -Wall #-}

module Patents where

-- import           Lens.Micro            ((%~), (^.))
import           Pipes                 (Producer, (>->))
import qualified Pipes.Prelude         as P

import           Data.Vinyl            (Rec)
import           Frames                (runSafeEffect, (:->), Text, MonadSafe)

import           Frames.CSV            (declareColumn,
                                        pipeTableMaybe, readFileLatin1Ln)

import           Frames.Rec

import           PatAbstracts

declareColumn "patId" ''Text
declareColumn "abstract" ''Address
type PatColumns = '["id" :-> Text, "abstract" :-> Address]
type PA = Record PatColumns
type PAMaybe = Rec Maybe PatColumns

patStreamM :: MonadSafe m => Producer PAMaybe m ()
patStreamM =  readFileLatin1Ln "data/pat_abstracts.csv" >-> pipeTableMaybe

printValidAddresses :: IO ()
printValidAddresses =
  runSafeEffect $ patStreamM >-> P.map recMaybe >-> P.concat >-> P.print

dmvianna · 2017-11-22T21:54:21Z

I love it, Anthony! Works for me, and being modular, we could change the input and keep the rest. Would that work with things like http streams, at least conceptually?

acowley · 2017-11-23T03:34:53Z

Yes, I think so!

A couple things need to happen:

- We should make this encoding example part of a test suite
- We need to verify the pipes stuff works with row type inference (this involves customizing a RowGen record with a custom Producer)

After that, I think it'd be great to have an example that pulled the data from a web site. I'm not sure how great an idea that is for the compile-time part, but it'd be neat to show that such a bad idea can be made real! 😃
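The "change the input and keep the rest" idea can be sketched with plain `pipes`, independent of Frames. In this sketch the decoding stage is a `Pipe` that does not care where its bytes come from, so a file reader, an in-memory buffer, or (conceptually) an HTTP body could feed it. The names `byteLines`, `decodeLatin1P`, and `textLines` are hypothetical, and the in-memory source only stands in for a real file or network producer; this is not the implementation on the pipe-everything branch.

```haskell
import qualified Data.ByteString       as B
import qualified Data.ByteString.Char8 as BC
import           Data.Text             (Text)
import qualified Data.Text.Encoding    as TE
import           Pipes                 (Pipe, Producer, each, (>->))
import qualified Pipes.Prelude         as P

-- | Hypothetical source: yields one ByteString per '\n'-separated line.
-- A file- or network-backed Producer could be substituted here.
byteLines :: Monad m => B.ByteString -> Producer B.ByteString m ()
byteLines bytes = each (BC.lines bytes)

-- | Source-agnostic decoding stage: Latin-1 bytes in, Text out.
decodeLatin1P :: Monad m => Pipe B.ByteString Text m r
decodeLatin1P = P.map TE.decodeLatin1

-- | Source and decoder composed; a downstream parser stage would be
-- attached with another (>->).
textLines :: Monad m => B.ByteString -> Producer Text m ()
textLines bytes = byteLines bytes >-> decodeLatin1P
```

Only `byteLines` would need to change to support a different input; everything downstream of the decoder sees the same stream of `Text` lines.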

dmvianna · 2017-11-23T04:50:33Z

Hmm. Did you fix the test suite to build properly yet? And are you available on messaging? I am keen to work on this and learn.

acowley · 2017-11-23T15:31:41Z

Did running the getdata program not get the tests working for you?

dmvianna · 2017-11-23T20:42:57Z

Oh wow. `stack exec getdata`. Would you consider a pull request with some trivial documentation for Haskell beginners? Only now do I understand how to use the executables. You assume a certain level of familiarity with the Haskell build system (cabal or stack), which people won't have when they are first dipping a toe in to check whether this unfamiliar language might offer something useful for data analysis.

acowley · 2017-12-14T00:49:48Z

I think across this and the latin1 PR and the initial changes to the README, we've hit the main pieces of this. More documentation is an ongoing effort.

dmvianna mentioned this issue Nov 24, 2017

test stub for latin1-encoded data #97

Closed

acowley closed this as completed Dec 14, 2017
