encoding #96
Comments
I can't reproduce, but I don't think you're using the same data file as is in your repository. The file there has no record with id 1999042444, for example. |
I’m not using the same file. I didn’t want to include a massive text file in the repo. I can get a `head` of the file including the offending line, or the whole thing. Which do you think is the best approach?
|
You could try using something from here, but I must confess I have very little experience with that. Moreover, any such change to the runtime code has a danger of hiding a problem with some TemplateHaskell reading the file at compile time; I'm not sure how best to handle that. Perhaps `ParserOptions` would need a `TextEncoding` field that could be overridden.
If possible, I'd actually just write a program to convert the file encoding before further processing. If you think it would be helpful to handle different encodings from within `Frames`, the first step is figuring out how to read the lines of that file into a `Text` value, and then we can add what is needed. |
Munging files from external stakeholders is what I do for a living. I would think the most convenient approach would be to stream the file in one pass. It would be great if we could do that with `pipes`, that is, translate the encoding and then get `Frames` to break it into columns and types.
I’ll work on the `head` and the `Text` once I’m on the train going to work. That should be an hour or two away.
|
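For reference, the two steps suggested above (converting the file's encoding ahead of time, and reading its lines into `Text` values) might look roughly like the sketch below. It assumes the input is Latin-1 and uses only `bytestring` and `text`; the output path and the helper names `latin1ToUtf8` and `readLatin1Lines` are made up for illustration and are not part of `Frames`.

```haskell
module Main (main) where

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- | Decode Latin-1 bytes to Text, then re-encode them as UTF-8.
-- (Hypothetical helper; assumes the input really is Latin-1.)
latin1ToUtf8 :: B.ByteString -> B.ByteString
latin1ToUtf8 = TE.encodeUtf8 . TE.decodeLatin1

-- | Read a Latin-1 file strictly and split it into Text lines.
-- Note: T.lines splits on '\n' only, so CRLF files keep a trailing '\r'.
readLatin1Lines :: FilePath -> IO [T.Text]
readLatin1Lines path = T.lines . TE.decodeLatin1 <$> B.readFile path

main :: IO ()
main = do
  let src = "data/pat_abstracts.csv"       -- path used later in this thread
      dst = "data/pat_abstracts.utf8.csv"  -- hypothetical output path
  -- One-off conversion, so later processing can assume UTF-8.
  B.readFile src >>= B.writeFile dst . latin1ToUtf8
  -- Alternatively, read the lines directly as Text.
  ls <- readLatin1Lines src
  putStrLn ("Read " ++ show (length ls) ++ " lines")
```

This reads the whole file into memory, which is fine for a one-off conversion but is not the streaming, single-pass approach asked about above.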
Okay, end should incorporate this machinery at compile- and run-time. |
The offending line is now in the file, and the code points to it. Please see if you can reproduce. As a side note, I'm testing it on a Mac, but last Friday this ran perfectly fine on Windows 10. However, I will not be able to reproduce it on Windows, as I had to make many changes to my Haskell setup there. |
The below modification of your program now works with the `pipe-everything` branch:

```haskell
{-# LANGUAGE ConstraintKinds #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# OPTIONS_GHC -Wall #-}
module Patents where

-- import Lens.Micro ((%~), (^.))
import Pipes (Producer, (>->))
import qualified Pipes.Prelude as P
import Data.Vinyl (Rec)
import Frames (runSafeEffect, (:->), Text, MonadSafe)
import Frames.CSV (declareColumn,
                   pipeTableMaybe, readFileLatin1Ln)
import Frames.Rec
import PatAbstracts

declareColumn "patId" ''Text
declareColumn "abstract" ''Address

type PatColumns = '["id" :-> Text, "abstract" :-> Address]
type PA = Record PatColumns
type PAMaybe = Rec Maybe PatColumns

patStreamM :: MonadSafe m => Producer PAMaybe m ()
patStreamM = readFileLatin1Ln "data/pat_abstracts.csv" >-> pipeTableMaybe

printValidAddresses :: IO ()
printValidAddresses =
  runSafeEffect $ patStreamM >-> P.map recMaybe >-> P.concat >-> P.print
```
|
I love it, Anthony! Works for me, and being modular, we could change the input and keep the rest. Would that work with things like HTTP streams, at least conceptually? |
Yes, I think so! A couple things need to happen:
After that, I think it'd be great to have an example that pulled the data from a web site. I'm not sure how great an idea that is for the compile-time part, but it'd be neat to show that such a bad idea can be made real! 😃 |
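To illustrate the modularity being discussed, here is a minimal sketch using plain `pipes`, not the `Frames` API: the source of raw byte lines is kept separate from the Latin-1 decoding stage, so a different source (for example, an HTTP response body) could in principle replace the file handle while the rest of the pipeline stays the same. The names `byteLines` and `decodeLatin1Lines` are hypothetical, and the connection to `pipeTableMaybe` is not shown because its exact type on that branch is not given in this thread.

```haskell
module Main (main) where

import           Control.Monad.IO.Class (MonadIO, liftIO)
import qualified Data.ByteString.Char8  as BC
import qualified Data.Text              as T
import qualified Data.Text.Encoding     as TE
import qualified Data.Text.IO           as TIO
import           Pipes                  (Pipe, Producer, runEffect, yield, (>->))
import qualified Pipes.Prelude          as P
import           System.IO              (Handle, IOMode (ReadMode), hIsEOF, withFile)

-- | Yield the raw bytes of each line read from a handle.
-- Any other producer of byte lines could stand in for this one.
byteLines :: MonadIO m => Handle -> Producer BC.ByteString m ()
byteLines h = loop
  where
    loop = do
      eof <- liftIO (hIsEOF h)
      if eof
        then pure ()
        else do
          line <- liftIO (BC.hGetLine h)
          yield line
          loop

-- | Decode each Latin-1 line to Text, regardless of where the bytes came from.
decodeLatin1Lines :: Monad m => Pipe BC.ByteString T.Text m r
decodeLatin1Lines = P.map TE.decodeLatin1

main :: IO ()
main =
  withFile "data/pat_abstracts.csv" ReadMode $ \h ->
    -- Print the first five decoded lines; only the producer is file-specific.
    runEffect $
      byteLines h >-> decodeLatin1Lines >-> P.take 5 >-> P.mapM_ TIO.putStrLn
```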
Hmm. Did you fix the test suite to build properly yet? And are you available on messaging? I am keen to work on this and learn. |
Did running the |
Oh wow. |
I think across this and the latin1 PR and the initial changes to the README, we've hit the main pieces of this. More documentation is an ongoing effort. |
Hi, Andrew.
How would one change the encoding prior to parsing a CSV file? I thought of decodeLatin1, but there isn't an obvious place to put it.
current commit