- Java >= 1.5
- JCC
$ python setup.py build
$ python setup.py install
Or,
$ pip install git+https://github.com/sudharsh/python-tika.git
To use the AutoDetectParser,
import tika
tika.initVM()
from tika import parser
# Parse an in-memory document; the parenthesized print works on Python 2 and 3
print(parser.from_buffer("<html><body>Hello World</body></html>"))
# Or directly from a file,
# print(parser.from_file("/tmp/foo.doc"))
Either call returns a dict with 'content' and 'metadata' keys, for example,
{'content': u'Hello Cruel World',
 'metadata': {u'Content-Encoding': u'ISO-8859-1',
              u'Content-Type': u'text/html',
              u'title': u'Hello world'}
}
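Since the result is a plain dict, it can be unpacked with ordinary dict access. The following is a minimal sketch, not part of python-tika: `summarize` is a hypothetical helper, and the sample dict only mirrors the shape of the example output above so that the snippet runs without the JVM. Whether a key can be absent for a given document is an assumption here, so the helper reads every key defensively.

```python
def summarize(parsed):
    """Return (content_type, title, text) from a parser result dict.

    Hypothetical helper; assumes only the 'content' and 'metadata'
    keys shown in the example output, and tolerates missing ones.
    """
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    return (
        metadata.get("Content-Type"),
        metadata.get("title"),
        content.strip(),
    )


# Sample result with the same shape as the example output above.
sample = {
    "content": "Hello Cruel World",
    "metadata": {
        "Content-Encoding": "ISO-8859-1",
        "Content-Type": "text/html",
        "title": "Hello world",
    },
}

content_type, title, text = summarize(sample)
print(content_type)
print(title)
print(text)
```

In real use, the dict returned by parser.from_buffer or parser.from_file would be passed in place of `sample`.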
The setup.py script is derived from aptivate/python-tika.