This repository has been archived by the owner on Feb 19, 2021. It is now read-only.

Add Dockerfile for application and documentation #39

Merged

pitkley
Member

@pitkley pitkley commented Feb 15, 2016

This commit adds a Dockerfile to the root of the project, accompanied by a docker-compose.yml for simplified deployment. The Dockerfile is agnostic to whether the container runs the webserver, the consumer, or a one-off command (e.g. creation of a superuser, migration of the database, document export, ...).

The container's entrypoint is the scripts/docker-entrypoint.sh script. This script verifies that the required permissions are set, remaps the default user's and/or group's ID if required, and installs additional languages if the user requests them.

After initialization, it analyzes the command the user supplied:

  • If the command starts with a slash, it is expected that the user wants to execute a binary file and the command will be executed without further intervention. (Using exec to effectively replace the started shell-script and not have any reaping-issues.)
  • If the command does not start with a slash, the command will be passed directly to the manage.py script without further modification. (Again using exec.)

The default command is set to --help.
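
The dispatch rule described above can be sketched in shell roughly as follows. This is only an illustration, not the actual scripts/docker-entrypoint.sh: the `resolve_command` helper and the `MANAGE_PY` path are assumptions made for the example, and the function prints the command line instead of exec-ing it so it can be tried outside a container.

```shell
#!/bin/sh
# Hypothetical sketch of the dispatch step; not the real docker-entrypoint.sh.
# MANAGE_PY is an assumed location for manage.py inside the image.
MANAGE_PY="${MANAGE_PY:-/usr/src/paperless/src/manage.py}"

# Print the command line the entrypoint would hand to `exec`.
resolve_command() {
    case "${1-}" in
        /*) printf '%s\n' "$*" ;;              # leading slash: run the binary as given
        *)  printf '%s\n' "$MANAGE_PY $*" ;;   # anything else: treat as manage.py arguments
    esac
}
```

In a real entrypoint the two branches would presumably end in `exec "$@"` and `exec "$MANAGE_PY" "$@"`, so that the started process replaces the shell script.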

If the user wants to execute a command that is not meant for manage.py but doesn't start with a slash, the Docker --entrypoint parameter can be used to circumvent the mechanics of docker-entrypoint.sh.

Further information can be found in docs/setup.rst and in docs/migrating.rst.


Some additional points:

  • Given the discussions in issue #2 (Server setup via Docker) and PR #28 (Docker), this PR will probably supersede PR #28.

  • If you have skimmed through the migration-documentation you might have realized that I have left out how to restore data using Docker. I have actually written the corresponding documentation and implementation, but it requires a custom loaddata-command which can be found in this gist.

    Right now, the license the gist is under is unclear. I have already asked for clarification and as soon as I get a response, I am going to update this PR accordingly.

  • I have marked this as work in progress. While I have tested everything (at least I think so...) I have documented, I am not comfortable merging this before there are a few responses.

  • One big point remaining to discuss is Docker Hub. In my eyes it is essential that the official Docker container will be available on the hub, and as up-to-date as possible.

    Regarding integrity of the image, using Docker Hub's automated builds is also something I would see as a given. This leaves the issue of "namespacing" -- under what user or organization should this container live?

    If we were to bind it to a user, it would have to be @danielquinn as far as I can tell since Docker Hub requires linking your GitHub account to make automated builds work. Additionally, having Docker Hub rebuild the image as soon as master gets updated requires this link as well. (I don't know if we would want this build to happen fully automatically, but rather only if we know building the Docker image will not fail. I have something brewing regarding Travis-CI, this could possibly solve this.)

    I have no experience with organizations on Docker Hub, maybe someone with more knowledge on that can support here?


Overall, there are a few open points, although I think the whole Docker Hub issue should be discussed separately and should not block this PR from being merged.

@pitkley pitkley mentioned this pull request Feb 15, 2016
@pitkley pitkley force-pushed the feature/dockerfile branch 11 times, most recently from 47feabe to 328e17b February 17, 2016 08:28
@pitkley pitkley changed the title [WIP] Add Dockerfile Add Dockerfile for application and documentation Feb 17, 2016
@pitkley
Member Author

pitkley commented Feb 17, 2016

After having the licensing issue on the mentioned gist resolved (thanks again @bmispelon), I have added the missing documentation and command to allow for easy loading of exported tags.

I have double-checked the documentation and went through the whole process:

  • Creating and starting containers
  • Creating the superuser
  • Configuring tags and senders
  • Consuming documents
  • Setting senders and tagging documents
  • Exporting tags as a JSON dump
  • Exporting documents
  • Removing the containers
  • Creating and starting containers
  • Creating the superuser
  • Importing the exported JSON dump
  • Moving the exported documents into the consumption directory

From what I could tell, everything worked out. Thus, I have removed the work-in-progress notice and "clear" this PR for merge.

@danielquinn Feel free to merge whenever, but maybe @TheConnMan wants to check it out first, since he opened the initial Docker PR (#28).

@pitkley pitkley mentioned this pull request Feb 17, 2016
@danielquinn
Collaborator

Ok so I've started learning about Docker these past few days in anticipation of your wanting to merge this. I have a few comments so far:

  • I found a typo in the documentation. "including their three letter code" should probably be "including their three letter codes".

  • The documentation makes reference to adapting a series of values in the .yaml file, but only one of those values, PAPERLESS_PASSPHRASE is defined there. If I'm understanding correctly, that's because the others have defaults that will take over if they're not defined here, but it wasn't entirely clear when I was going through there. I opened docker-compose.yaml and was confused that most of the values mentioned in the doc aren't in there. I actually thought I had the wrong file. Perhaps we can set these values to None or something in the checked-in file, or have them set to their defaults and commented out?

  • I tried following the instructions you added to the documentation and I got as far as docker-compose:

    $ docker-compose up -d
    ERROR: In file './docker-compose.yml' service 'version' doesn't have any configuration options. All top level keys in your docker-compose.yml must map to a dictionary of configuration options.
    

    So now I'm stuck. If you can help me get past this point, I'll keep trying with the PR.

@pitkley
Member Author

pitkley commented Feb 17, 2016

I tried following the instructions you added to the documentation and I got as far as docker-compose

Are you running version 1.6.0? It is the minimum required version for the file to work. You can check by running docker-compose -v. The error seems like that might be the issue, and I've just tried it again on a fresh checkout and I'm not experiencing this.
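
As a rough sketch, the version requirement could be checked with two small shell helpers. Both function names are hypothetical; the parser assumes output shaped like the `docker-compose version 1.5.2, build 7240ff3` line quoted in this thread, and the comparison handles plain dotted numeric versions only.

```shell
#!/bin/sh
# Extract the version number from `docker-compose -v` style output on stdin.
parse_compose_version() {
    sed -n 's/^docker-compose version \([0-9][0-9.]*\),.*/\1/p'
}

# Succeed if dotted version $1 is greater than or equal to dotted version $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        # Compare component by component; missing components count as 0.
        for (i = 1; i <= len; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}
```

A possible use would be `version_ge "$(docker-compose -v | parse_compose_version)" 1.6.0 || echo "docker-compose 1.6.0 or newer is required"`. Comparing numerically rather than as strings matters here, since 1.10.0 is newer than 1.7.1.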

I'll correct the typo and look into the environment variable issue after I get some sleep, but in general your assumption is correct: the defaults are not repeated in the docker-compose.yml. The environment variables changed quite quickly during implementation, so the documentation might be lagging behind a bit too.

@danielquinn
Collaborator

$ docker-compose -v
docker-compose version 1.5.2, build 7240ff3

Well shit. It looks like that's the latest version in Gentoo. I'll have to take a look around regarding the "right" way to get a more modern version. Thanks for the heads up. Maybe that should go in the doc somewhere too?

Get some sleep, we can do this over a few days :-)

@waynew

waynew commented Feb 17, 2016

I need to actually download/run this, but at least from what I read of the Dockerfile, I'm a fan 👍

@pitkley
Member Author

pitkley commented Feb 18, 2016

Maybe that should go in the doc somewhere too?

If you are talking about mentioning that we need version 1.6.0, the docs already do. If you are talking about how to install a current version, linking the install page might not be a bad idea; it suggests executing the following commands as root:

curl -L https://github.com/docker/compose/releases/download/1.6.0/docker-compose-`uname -s`-`uname -m` > /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose

Or adapted to sudo:

curl -L https://github.com/docker/compose/releases/download/1.6.0/docker-compose-`uname -s`-`uname -m` | sudo tee /usr/local/bin/docker-compose > /dev/null
sudo chmod +x /usr/local/bin/docker-compose

[...] but at least from what I read of the Dockerfile, I'm a fan 👍

Cool, be sure to let us know if it worked out for you!

@tikitu
Contributor

tikitu commented Feb 18, 2016

Hi folks, I had a go following the instructions in the diff (checking out https://github.com/pitkley/paperless/tree/feature/dockerfile) and ran into the following issue when running docker-compose up -d:

Pulling webserver (paperless:latest)...
Pulling repository docker.io/library/paperless
ERROR: Error: image library/paperless not found

I don't know if this is something you need to fix before landing the PR: I'm a complete docker n00b so it's very possible I'm doing something non-standard that breaks things, or even that this is completely expected. Please ignore if so, I'm quite happy to wait until the PR lands and things are "official". If it is something you want to look into please let me know if I can give any useful details. I'm on OS X El Capitan, versions

$ docker --version
Docker version 1.10.1, build 9e83765
$ docker-compose --version
docker-compose version 1.6.0, build d99cad6

@pitkley
Member Author

pitkley commented Feb 18, 2016

Hey @tikitu, thanks for taking the time for trying it out!

Right now the image is not available from the Docker Hub. You have to clone this PR and execute docker build -t paperless . in the cloned directory first.

To clone this PR and build the image, the following is one of the ways:

$ git clone https://github.com/pitkley/paperless
$ cd paperless
$ git checkout feature/dockerfile
$ docker build -t paperless .

Then you should be able to continue on as you tried above.

@tikitu
Contributor

tikitu commented Feb 18, 2016

Aha! That's what "this should go to Docker Hub but let's land the PR first" is talking about. Thanks for helping me through the basics @pitkley, I'll give it a whirl!

@danielquinn
Collaborator

Ok I managed to start everything up and have a few more issues:

  • I can't figure out how to get the current status of things on the consumer. I see this is a branch from an older version, so there's no Logging module yet, which means I can't use the web interface to see if there are any issues, but I can't figure out how to actually see the output of the consumer either. Is it possible to get a shell in (not a django shell, but a bash one)?
  • The docker-compose.yml file is nicely laid out, but it has some assumptions built into it that require changing on a case-by-case basis. Specifically, the PAPERLESS_* variables and the volumes: section. I don't want to commit to the repo anything that is likely to be changed by the user. This would mean that any time they do a git pull, they risk a conflict -- even as a passive user. It's why I try to use environment variables for everything.
    If the compose file doesn't have the concept of using environment variables, perhaps it should be something like docker-compose.yml.example so the user can modify it for their own use without risking future problems?
  • I ran into both of these because while the consumer is running in Docker, it's not consuming anything at present, and I'm not sure why. I created a directory at /consume on my host machine (did I read the compose file right? That's what I'm supposed to do, right?) put a file in it with permissions of 666. The file never disappears and never appears in the Documents section of the admin. As a Docker-novice, it's tough for me to debug this at the moment, so I'm bugging you :-(

@pitkley
Member Author

pitkley commented Feb 18, 2016

I see this is a branch from an older version, so there's no Logging module yet

Actually 1c45ca1 is the current parent, so the logging should be working (it is for me at least). To look at stdout, you can run docker-compose logs to get both the logs from the webserver and the consumer or specifically docker-compose logs consumer to only get stdout from the consumer.

If the compose file doesn't have the concept of using environment variables

With docker-compose version 1.6.0 they introduced exactly that, I will push an updated version shortly.

[...] perhaps it should be something like docker-compose.yml.example so the user can modify it

That was pretty much the idea of the file: it should just be for the user's convenience and not a hard requirement. I'm going to rename it!

I created a directory at /consume on my host machine (did I read the compose file right? That's what I'm supposed to do, right?)

No, you would have to modify the YAML-file something like this:

diff --git a/docker-compose.yml b/docker-compose.yml
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -17,9 +18,8 @@ services:
         volumes:
             - paperless-data:/usr/src/paperless/data
             - paperless-media:/usr/src/paperless/media
-            - /consume
+            - /local/path/you/choose:/consume

where the part to the left of the colon is the local path you want to mount. I haven't included this in the documentation since these are Docker and docker-compose basics that I would assume a user coming to us via Docker already knows, but maybe I'll add this as an example.
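
To make the three forms of `volumes:` entries that appear in this thread easier to tell apart, here is a small sketch. The `classify_volume` helper is hypothetical, and the rule it encodes (a source starting with `/`, `.` or `~` is a host path, anything else a named volume) reflects my understanding of how docker-compose reads these entries rather than anything stated in this PR.

```shell
#!/bin/sh
# Classify one entry of a service's `volumes:` list (illustrative helper).
classify_volume() {
    case "$1" in
        *:*)
            # The text before the first colon is the source of the mount.
            case "${1%%:*}" in
                /*|.*|"~"*) echo "bind mount" ;;    # host path mounted into the container
                *)          echo "named volume" ;;  # volume managed by Docker
            esac
            ;;
        *)  echo "anonymous volume" ;;              # container path only, no host side
    esac
}
```

Under these assumptions, `classify_volume /consume` reports an anonymous volume, which is why creating /consume on the host has no effect, while `classify_volume /local/path/you/choose:/consume` reports a bind mount.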

Thanks for the feedback 👍

PS: The minimum Docker version required is 1.10.0, stable in Gentoo is only 1.7.1. If you have further issues, that might be the cause!

@tikitu
Contributor

tikitu commented Feb 18, 2016

👍 for including the volumes: example somewhere in the docs: for myself I'm following the Docker path without any previous experience, because of the hope it will Just Work (tm) and let me skip messy installation details. People like me (assuming there are any others!) need quite some hand-holding.

@pitkley
Member Author

pitkley commented Feb 18, 2016

Add docker-compose.yml to .gitignore since we're expecting users to create their own from docker-compose.yml.example.

Done.

Can we not remove this line and port that bit to the environment file?

The reason the line is there is so that the webserver that doesn't do any text recognition doesn't have to install unnecessary languages the user might have set in the env-file by overwriting the value with nothing.

So yes, we could remove the line with the potential drawback of the webserver having a higher startup time and higher space consumption. Moving it to the environment file is not possible since (a) it is already there in form of the commented out line and (b) it would always affect both the consumer and the webserver.

I'd like to rename docker-compose.env to docker-compose.conf to keep it consistent with typical UNIX config file naming conventions

Naming it .env has two main reasons. Firstly, it is what a user of docker-compose would expect, since it is recommended by the docker-compose manual. Secondly, .conf might imply that there is more to configure than environment variables, which there is not.

In the end it is up to you.

Can we change [the consume mount]?

Generally yes. The reason I didn't is mainly because I wanted the YAML-file to be drop-in and use without modifications, but that really doesn't make a whole lot of sense since a mounted consumption directory will be what the end-user wants most of the time, I assume.

Besides having a non-existent "random" path, we could change it to something like this:

diff --git a/docker-compose.yml b/docker-compose.yml
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -17,9 +18,8 @@ services:
         volumes:
             - paperless-data:/usr/src/paperless/data
             - paperless-media:/usr/src/paperless/media
-            - /consume
+            - $PWD/data:/consume

This would take the current path from the shell and append /data to it. We could also create a /consume directory in Git which could be used by default, but I need some input from you guys on this point.

Can you just explain to me the purpose of this section?

The volumes section defines volumes that one wants to share between containers. It allows you to specify the volume-driver you want to use. Since we always want to use the default, there is no more data to provide after the :.

I too was surprised by this, something like this doesn't work:

volumes:
    - paperless-media
    - paperless-data

If there are any more questions, just ask :)!

@pitkley pitkley force-pushed the feature/dockerfile branch 2 times, most recently from 0018516 to cad7e6b February 18, 2016 12:27
@waynew

waynew commented Feb 18, 2016

Note that YAML file syntax allows for

comments:
    # like this, so they could explain values
    - which_might_help
    # especially the values that might not be obvious to docker newbies

That might be the approach to take in the example file

@tikitu
Contributor

tikitu commented Feb 18, 2016

Hey, I've now successfully followed the Docker instructions and had my first document OCRed for me! One tiny bump on the way: the docs say I can find the webserver at localhost:8000 but docker for unknown reasons chose a different IP (192.168.99.100:8000). I'm guessing this is another of those things that would be obvious to someone a bit docker-experienced.

@waynew

waynew commented Feb 18, 2016

@tikitu Indeed. Docker creates a virtual machine with a different IP (in your case, 192.168.99.100).

If it's not already in the docs for this it would definitely be worth pointing out. You've already figured out how to get the machine IP, but (assuming one has docker-machine installed as well) one can get the IP via docker-machine ip <machine name, probably 'default'>
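
As a small illustration of the point above, the address to open in the browser can be built from whatever IP the Docker host has. The `webserver_url` helper is hypothetical; port 8000 and the `localhost` default are taken from the discussion in this thread.

```shell
#!/bin/sh
# Build the webserver URL from an optional host or IP; defaults to localhost.
webserver_url() {
    printf 'http://%s:8000\n' "${1:-localhost}"
}
```

On a native Linux install `webserver_url` alone would be enough, while with Docker Machine one could call `webserver_url "$(docker-machine ip default)"`, assuming the machine is named `default`.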

@pitkley
Member Author

pitkley commented Feb 18, 2016

@tikitu The guide was written with a "vanilla" Linux installation in mind. The 192.168.99.100 IP is something Docker machine introduces and can be determined with the command @waynew provided you.

In my opinion that is something that should not be in the guide, since we never mention Docker machine (and that would probably blow up the documentation in regards to mounting volumes, since machine starts a separate VM).

If the end-user decides to go with Docker machine, they should be able to find all the documentation they need in the machine docs.

@tikitu
Contributor

tikitu commented Feb 18, 2016

I'm not sure that "since we never mention Docker machine" is a strong enough reason: unless I missed something, this is what I got from starting at https://www.docker.com/ and following the vanilla install instructions for OS X (I'm guessing docker-machine is installed as part of Docker Toolbox, but that's not at all obvious). But it also comes down to how much docker-n00b-support you're willing to write into the docs, of course; after all, these are paperless docs not docker docs.

@pitkley
Member Author

pitkley commented Feb 18, 2016

vanilla install instructions for OS X

Indeed, that is the official way to run Docker under OS X, but again, this guide was written with Linux in mind, where you can access the container under localhost.

after all, these are paperless docs not docker docs

That is my main issue right now. This is slowly turning into a rabbit hole, rewriting already existing documentation. Some of it is clearly justified and needed to have an easy and pleasant start with Paperless on Docker (regardless of if you know Docker yet or not), but supporting the "non-native" operating systems (in regards to Docker) will blow up the documentation.

One point I take from this, though, is that we should at least mention that you can use Docker, and thus Paperless, on Windows and Mac OS X, and that the guide might not apply as written if you use either of those OSes.

What do you think?

@tikitu
Contributor

tikitu commented Feb 18, 2016

Yup, makes sense. (Once this PR lands I can turn my quibbles into PRs myself instead of bothering you with them.)

@tikitu
Contributor

tikitu commented Feb 18, 2016

I do have one more substantive question though: this might again just be my (complete) lack of docker-fu, or might be something to tweak. I've tried to add a couple of other OCR languages by editing docker-compose.env, and I can't make sense of what I'm seeing.

  • Removing the image entirely and rebuilding installs only eng in the image (this is expected I think, because only tesseract-ocr-eng is a dependency in the Dockerfile).
  • Running docker-compose up -d starts the containers instantly, no package update or install. (Expected?)
  • Running docker-compose run --rm consumer document_exporter /consume does update packages and install the two other tesseract OCR languages, but afaics they're not used in the consumer service that's reading from /consume (also not after a few up/down cycles).

Again: might easily be I'm just missing some docker basics. If so I promise not to demand whatever-is-needed makes its way into the docs!

@danielquinn
Collaborator

@tikitu ooh, I think I know this one. The Dockerfile is the basis of both the webserver and the consumer, but in reality, the consumer is the only one that needs Tesseract, and certainly only the consumer needs multiple languages. With this in mind, the base install includes Tesseract with only one language, but when it comes time to build the consumer from the base, we also build the additional languages since they're actually used there.

@pitkley thank you so much for all of your work on this. I've been at a conference all day so I've not been able to chime in when I would have liked. Can I assume that you're happy with what we have here and are cool with me merging it? Once I do, I may need some quick tips on how to push it into my Docker account so people can download Paperless from the cloud.

As for the discussion about documentation, I agree with @pitkley that this is definitely paperless documentation. I asked for clarification in a few parts because I was largely ignorant about how Docker works, but I think that the right perspective to take when writing docs for this project is that if the user has chosen the Docker route, she knows everything she needs to know about Docker to get it working. Otherwise she would have gone the bare-metal or Vagrant route instead.

@danielquinn
Collaborator

@pitkley I just realised I didn't respond to your comments, so here goes:

The reason the line is there is so that the webserver that doesn't do any text recognition doesn't have to install unnecessary languages the user might have set in the env-file by overwriting the value with nothing.

So yes, we could remove the line with the potential drawback of the webserver having a higher startup time and higher space consumption. Moving it to the environment file is not possible since (a) it is already there in form of the commented out line and (b) it would always affect both the consumer and the webserver.

I understand now. Just a thought though: given that the webserver doesn't actually use Tesseract (or ImageMagick for that matter), would it be better to tell the composer to install those only on the consumer rather than have it as part of the Dockerfile? Or is that just Not Done in the Docker way of things? I'm still trying to get a grip of standard patterns in Dockerland.

In the same vein, I see now why you named it .env and I agree that we shouldn't change it.

As for changing the consume mount, in the same way you've set PAPERLESS_PASSPHRASE=CHANGEME, I think we can safely set this path to something we know doesn't exist just to better point out that it needs to change. With both /consume and $PWD/consume, it wasn't immediately obvious (at least to me) that they needed to change. Perhaps this would be a good place to make use of @waynew's suggestion of adding comments to this file?

This last bit is semantics though, and shouldn't hold up this merge as clearly it's a popular feature that's long overdue. Perhaps the better route down the road would be to not have a .env file at all, but rather a heavily commented .yml.example, but we can work this out after I hit the Big Green Button there.

So, let me know if you want to change the stuff I said about Tesseract there, and if not, just post a 👍 and I'll merge this baby.

@pitkley
Member Author

pitkley commented Feb 18, 2016

@tikitu The additional languages get installed after the consumer-container starts, so issuing docker-compose up -d is "instant". If you execute docker-compose logs consumer directly afterwards you should see a bunch of apt messages while the additional languages get installed. So what @danielquinn said is pretty much correct, but replace "build" with "at runtime", since the image only gets built once.

Why do you think your additional languages don't get used? What seems to indicate this? I've tested it with tesseract-ocr-deu and a bunch of German documents and the automatic detection did a good job. Maybe it just doesn't recognize your language? If so, some demo documents would be great, if you have any non-confidential ones.


@danielquinn

would it be better to tell the composer to install those only on the consumer rather than have it as part of the Dockerfile?

Installing the dependencies for the consumer using docker-compose directly is not possible, only indirectly using e.g. an environment variable, just like we do with the additional languages.

While I could theoretically implement such a variable, it would drastically increase the startup time of the consumer-container since it needs to install all those hefty dependencies.

Also, what one needs to realize in regards to Docker: if you run a container from an image multiple times, it still only consumes the space of the single image. Everything you add or modify within a container overlays the base image and only requires that delta of space.

If we moved the installation of the consumer's dependencies to runtime, the end-user would not gain anything, since the consumed space doesn't change. What they would get is the disadvantage of having to download the dependencies every time the container gets recreated.

As for changing the consume mount [...]

The passphrase is a good point. I will change it to a non-existent arbitrary path (which will conveniently throw an error message in docker-compose, maybe nudging the user in the right direction to modify the docker-compose.yml). I'll definitely incorporate @waynew's idea of adding clarifying comments to the YAML-file.

Perhaps the better route down the road would be to not have a .env file at all

I only introduced the .env file at a later stage in the development, but for one main reason: having the passphrase in a single place. Before you had to set the passphrase on both the webserver and the consumer, and if the user would have mistakenly only changed one or introduced a typo, they would have probably gotten weird errors given that the webserver couldn't decrypt the documents the consumer encrypted.

let me know if you want to change the stuff

Yes, I will change the stuff mentioned above. When I am done with that, I'll ping you.

@pitkley
Member Author

pitkley commented Feb 18, 2016

@danielquinn One comment of yours that I forgot to address:

Is PAPERLESS_OCR_LANGUAGES=deu ita equivalent to PAPERLESS_OCR_LANGUAGES="deu ita"?

No. The quotes will be part of the variable, causing the installation to fail since neither tesseract-ocr-"deu nor tesseract-ocr-ita" are valid packages (or package names, for that matter). While I agree with you that it would be a lot clearer, it sadly doesn't work. (And I don't want to start sanitizing environment variables at container runtime either...)
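
The effect of the quotes can be reproduced with a short sketch. It assumes that the entrypoint builds package names by prefixing `tesseract-ocr-` to each whitespace-separated item of PAPERLESS_OCR_LANGUAGES, which is what the package names mentioned above suggest; the `ocr_packages` helper itself is hypothetical.

```shell
#!/bin/sh
# Turn a space-separated language list into tesseract package names, one per line.
ocr_packages() {
    # $1 is deliberately left unquoted so the shell splits it on whitespace.
    for lang in $1; do
        printf 'tesseract-ocr-%s\n' "$lang"
    done
}
```

Calling `ocr_packages 'deu ita'` yields the two expected package names, whereas `ocr_packages '"deu ita"'`, which mimics the literal quotes passed through from the env-file, yields names that start and end with a stray quote character. An empty value yields no packages at all.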


Other than that, I have fixed/updated the points mentioned above, and I think and hope that that should be it. Could you check the documentation one more time (specifically this paragraph and points 1 and 2 including the notes here)?

So, here we go: 👍

This commit adds a `Dockerfile` to the root of the project, accompanied
by a `docker-compose.yml.example` for simplified deployment. The
`Dockerfile` is agnostic to whether it will be the webserver, the
consumer, or if it is run for a one-off command (e.g. creation of a
superuser, migration of the database, document export, ...).

The container's entrypoint is the `scripts/docker-entrypoint.sh` script.
This script verifies that the required permissions are set, remaps the
default user's and/or group's ID if required, and installs additional
languages if the user wishes to.

After initialization, it analyzes the command the user supplied:

  - If the command starts with a slash, it is expected that the user
    wants to execute a binary file and the command will be executed
    without further intervention. (Using `exec` to effectively replace
    the started shell-script and not have any reaping-issues.)

  - If the command does not start with a slash, the command will be
    passed directly to the `manage.py` script without further
    modification. (Again using `exec`.)

The default command is set to `--help`.

If the user wants to execute a command that is not meant for `manage.py`
but doesn't start with a slash, the Docker `--entrypoint` parameter can
be used to circumvent the mechanics of `docker-entrypoint.sh`.

Further information can be found in `docs/setup.rst` and in
`docs/migrating.rst`.

For additional convenience, a `Dockerfile` has been added to the `docs/`
directory which allows for easy building and serving of the
documentation. This is documented in `docs/requirements.rst`.
@danielquinn
Collaborator

It all looks good to me. I'll give all the documentation a once-over later next week, but I haven't seen anything glaring that's worth holding up the merge. Here we go indeed!

danielquinn added a commit that referenced this pull request Feb 18, 2016
Add Dockerfile for application and documentation
@danielquinn danielquinn merged commit 99be40a into the-paperless-project:master Feb 18, 2016
@pitkley
Member Author

pitkley commented Feb 18, 2016

☺️ 👍

Thanks to everyone involved. I am very happy with the end-result and it wouldn't have been as good without your input, testing and feedback!

@pitkley pitkley deleted the feature/dockerfile branch February 18, 2016 22:05
@tikitu
Contributor

tikitu commented Feb 19, 2016

🎆

Why do you think your additional languages don't get used? What seems to indicate this? I've tested it with tesseract-ocr-deu and a bunch of German documents and the automatic detection did a good job. Maybe it just doesn't recognize your language? If so, some demo documents would be great, if you have any non-confidential ones.

@pitkley Thanks for the explanation of the build/run setup -- indeed I see the package management in the consumer logs, so I must be getting unlucky with the documents. If I can get an unambiguous example of recognition-gone-wrong, where would you like me to drop the document? (This PR should be left to rest on its laurels, I'm sure you'll agree :-) )

@pitkley
Member Author

pitkley commented Feb 19, 2016

@tikitu If you find a document, that would be great. What you could also do is check the logs, since they contain messages like "Parsing for eng" or "Parsing for deu". Note that there will always be a "Parsing for eng" to begin with, since the language is only detected after the first OCR run. If it detects a different language, it will output "Language detected: ..." and "Parsing for ..." directly after that message.
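
One way to pull just the detection results out of the consumer logs is a small filter like the one below. The `detected_languages` name is hypothetical, and the sketch assumes only that each relevant log line contains the text "Language detected: " followed by the language; any prefix that docker-compose adds to the line is ignored.

```shell
#!/bin/sh
# Print the language from every "Language detected: ..." line read on stdin.
detected_languages() {
    sed -n 's/.*Language detected: \(.*\)$/\1/p'
}
```

It could be used as `docker-compose logs consumer | detected_languages`. If every document reports the same language, the detection step rather than the installed language packages is the more likely place to look.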

In terms of where to put that document, I don't know. You can mail it to me directly if you want, but maybe uploading it somewhere and opening an issue linking to it so that everyone has access to the document (if you would want that) might be another option.

@tikitu
Contributor

tikitu commented Feb 19, 2016

@pitkley I've only ever seen "Parsing for eng" (multiple times for the same document) and "Language detected: en". That's for some Dutch and Greek documents I happened to use as my first test cases, but they're pretty messy: it's possible there's enough stuff that looks like English that it's mis-detecting. I'll try to find a clear example that I'm ok with sharing with the world, and open an issue for it.
