encoding #96

Closed
dmvianna opened this issue Nov 21, 2017 · 13 comments

@dmvianna
Contributor

Hi, Andrew.

λ> printValidAddresses
{id :-> "1992010803", abstract :-> POBoxAddress "3898" "sydney" "nsw" "2001"}
{id :-> "1992010807", abstract :-> POBoxAddress "4164" "sydney" "nsw" "2001"}
{id :-> "1999042444", abstract :-> POBoxAddress "1285" "k,melbourne" "vic" "3001"}
*** Exception: ../data/IPGOD.IPGOD122B_PAT_ABSTRACTS.csv: hGetLine: invalid argument (invalid byte sequence)

How would one change the encoding prior to parsing a CSV file? I thought of decodeLatin1, but there isn't an obvious place to put it.

current commit

@acowley
Owner

acowley commented Nov 21, 2017

I can't reproduce, but I don't think you're using the same data file as is in your repository. The file there has no record with id 1999042444, for example.

@dmvianna
Contributor Author

dmvianna commented Nov 21, 2017 via email

@acowley
Owner

acowley commented Nov 21, 2017

You could try using something from here, but I must confess I have very little experience with that. Moreover, any such change to the runtime code has a danger of hiding a problem with some TemplateHaskell reading the file at compile time; I'm not sure how best to handle that. Perhaps ParserOptions would need a TextEncoding field that could be overridden.
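One runtime-side option can be sketched with `System.IO` alone: set the handle's `TextEncoding` before reading. This is only an illustration, not what Frames does; it assumes the file is Latin-1 (ISO-8859-1), and `readLinesLatin1` is a hypothetical helper name. The `hGetLine: invalid argument (invalid byte sequence)` error is what a handle reports when a byte does not fit the encoding it is using, and that default encoding follows the system locale.

```haskell
import           Data.Text    (Text)
import qualified Data.Text    as T
import qualified Data.Text.IO as TIO
import qualified System.IO    as IO

-- | Hypothetical helper: read a whole file as lines of Text,
-- decoding it as Latin-1 instead of the locale's default encoding.
readLinesLatin1 :: FilePath -> IO [Text]
readLinesLatin1 path =
  IO.withFile path IO.ReadMode $ \h -> do
    -- Override the handle's encoding (assumption: the data is Latin-1).
    IO.hSetEncoding h IO.latin1
    -- The strict hGetContents finishes reading before the handle closes.
    T.lines <$> TIO.hGetContents h
```

This only covers code that opens the file itself; anything reading the same file at compile time would need the same treatment, which is the concern raised above.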

If possible, I'd actually just write a program to convert the file encoding before further processing. If you think it would be helpful to handle different encodings from within Frames, the first step is figuring out how to read the lines of that file into a Text value, and then we can add what is needed.
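Both suggestions in this comment can be sketched with the `bytestring` and `text` packages. The sketch assumes the source file is Latin-1; `latin1ToUtf8`, `latin1Lines`, and `convertFile` are hypothetical names, not part of Frames.

```haskell
import qualified Data.ByteString    as B
import           Data.Text          (Text)
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- | Re-encode Latin-1 bytes as UTF-8. Every byte value is a valid
-- Latin-1 code point, so the decoding step cannot fail.
latin1ToUtf8 :: B.ByteString -> B.ByteString
latin1ToUtf8 = TE.encodeUtf8 . TE.decodeLatin1

-- | Decode Latin-1 bytes and split them into lines of Text,
-- dropping the trailing '\r' left behind by CRLF line endings.
latin1Lines :: B.ByteString -> [Text]
latin1Lines = map (T.dropWhileEnd (== '\r')) . T.lines . TE.decodeLatin1

-- | Standalone conversion step: write a UTF-8 copy of a Latin-1 file.
convertFile :: FilePath -> FilePath -> IO ()
convertFile src dst = B.readFile src >>= B.writeFile dst . latin1ToUtf8
```

Reading the bytes first and decoding them explicitly keeps the result independent of the system locale, which may be why the same file behaved differently on different machines.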

@dmvianna
Contributor Author

dmvianna commented Nov 21, 2017 via email

@acowley
Owner

acowley commented Nov 21, 2017

Okay, we should incorporate this machinery at compile- and run-time.

@dmvianna
Contributor Author

The offending line is now in the file, and the code points to it. Please see if you can reproduce. As a side note, I'm testing it on a Mac, but last Friday this ran perfectly fine on Windows 10. However, I will not be able to reproduce it on Windows, as I had to make many changes to my Haskell setup there.

@acowley
Owner

acowley commented Nov 22, 2017

The below modification of your program now works with the pipe-everything branch of Frames! We should poke at that branch a bit more before merging it into master since it's a pretty large change, but I'm hoping it also addresses #77 and #92, though I did a poor job of incorporating the start offered in #93, so I'm not sure if the current changes cover everyone's needs.

{-# LANGUAGE ConstraintKinds   #-}
{-# LANGUAGE DataKinds         #-}
{-# LANGUAGE FlexibleContexts  #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes       #-}
{-# LANGUAGE TemplateHaskell   #-}
{-# LANGUAGE TypeFamilies      #-}
{-# LANGUAGE TypeOperators     #-}
{-# OPTIONS_GHC -Wall #-}

module Patents where

-- import           Lens.Micro            ((%~), (^.))
import           Pipes                 (Producer, (>->))
import qualified Pipes.Prelude         as P

import           Data.Vinyl            (Rec)
import           Frames                (runSafeEffect, (:->), Text, MonadSafe)

import           Frames.CSV            (declareColumn,
                                        pipeTableMaybe, readFileLatin1Ln)

import           Frames.Rec

import           PatAbstracts

declareColumn "patId" ''Text
declareColumn "abstract" ''Address
type PatColumns = '["id" :-> Text, "abstract" :-> Address]
type PA = Record PatColumns
type PAMaybe = Rec Maybe PatColumns

patStreamM :: MonadSafe m => Producer PAMaybe m ()
patStreamM =  readFileLatin1Ln "data/pat_abstracts.csv" >-> pipeTableMaybe

printValidAddresses :: IO ()
printValidAddresses =
  runSafeEffect $ patStreamM >-> P.map recMaybe >-> P.concat >-> P.print 

@dmvianna
Contributor Author

I love it, Anthony! Works for me, and being modular, we could change the input and keep the rest. Would that work with things like http streams, at least conceptually?

@acowley
Owner

acowley commented Nov 23, 2017

Yes, I think so!

A couple things need to happen:

  1. We should make this encoding example part of a test suite
  2. We need to verify the pipes stuff works with row type inference (this involves customizing a RowGen record with a custom Producer)

After that, I think it'd be great to have an example that pulled the data from a web site. I'm not sure how great an idea that is for the compile-time part, but it'd be neat to show that such a bad idea can be made real! 😃
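The "change the input and keep the rest" idea can be sketched with `pipes` alone, independently of Frames. Everything here is illustrative: `fromLines`, `fieldCounts`, and `countAll` are hypothetical names, and an in-memory list stands in for a file or an HTTP response body.

```haskell
import           Data.Functor.Identity (Identity)
import           Data.Text             (Text)
import qualified Data.Text             as T
import           Pipes                 (Pipe, Producer, each, (>->))
import qualified Pipes.Prelude         as P

-- | Stand-in source: any Producer of Text lines could replace this one.
fromLines :: Monad m => [Text] -> Producer Text m ()
fromLines = each

-- | Downstream stage that does not care where the lines came from:
-- it counts the comma-separated fields on each line.
fieldCounts :: Monad m => Pipe Text Int m r
fieldCounts = P.map (length . T.splitOn (T.pack ","))

-- | Run the pipeline purely and collect the results.
countAll :: Producer Text Identity () -> [Int]
countAll source = P.toList (source >-> fieldCounts)
```

Because `fieldCounts` depends only on the type of value flowing through it, a producer backed by a network stream could be swapped in without touching the downstream stages, at least conceptually.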

@dmvianna
Contributor Author

Hmm. Did you fix the test suite to build properly yet? And are you available on messaging? I am keen to work on this and learn.

@acowley
Owner

acowley commented Nov 23, 2017

Did running the getdata program not get the tests working for you?

@dmvianna
Contributor Author

dmvianna commented Nov 23, 2017

Oh wow. stack exec getdata. Would you consider a pull request with some trivial documentation for Haskell beginners? Only now do I understand how to use the executables. You assume a certain level of familiarity with the Haskell build system (cabal or stack), which people won't have when they are first dipping a toe in to check whether this unknown language might offer something useful in the realm of data analysis.

@acowley
Owner

acowley commented Dec 14, 2017

I think across this and the latin1 PR and the initial changes to the README, we've hit the main pieces of this. More documentation is an ongoing effort.

acowley closed this as completed Dec 14, 2017