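This article explains how to install "textract", a library that can extract the information (text) embedded in Word, PowerPoint, PDF, and other files.
"textract (https://github.com/deanmalmgren/textract)" is a Python library that extracts the content of Word, PowerPoint, PDF, and other formats without any extraneous markup.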
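To give an idea of what "extracting content without markup" looks like once the library is installed, here is a minimal sketch. It assumes textract is already installed in the running interpreter; the helper name `extract_text` and the file name in the usage note are hypothetical and do not come from the textract project.

```python
def extract_text(path, encoding="utf-8"):
    """Return the plain text of a document as a str.

    textract is imported inside the function so that this sketch can be
    loaded even on a machine where textract is not installed yet.
    """
    import textract

    # textract.process() picks a parser from the file extension
    # and returns the extracted text as bytes.
    raw = textract.process(path)
    return raw.decode(encoding)
```

For example, `print(extract_text("sample.docx"))` would print the body text of a (hypothetical) Word file named sample.docx. Some formats need additional external tools, so this may not work for every file type out of the box.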
■Python
This article uses Python 3.8.5 on Windows 10 (the version was confirmed with the Python launcher).
■Installing textract
textract is installed through pip, so first open the Windows Command Prompt.
pip install textract
Once the Command Prompt is open, type the command above and press Enter.
In my case, I use the Python launcher and want to install into Python 3.8.5, so I select that version explicitly.
py -3.8 -m pip install textract
To select the version, type the command above and press Enter.
Defaulting to user installation because normal site-packages is not writeable
Collecting textract
  Downloading textract-1.6.5-py3-none-any.whl (23 kB)
Collecting docx2txt~=0.8
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Collecting pdfminer.six==20191110
  Downloading pdfminer.six-20191110-py2.py3-none-any.whl (5.6 MB)
     ---------------------------------------- 5.6/5.6 MB 2.8 MB/s eta 0:00:00
Collecting SpeechRecognition~=3.8.1
  Downloading SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8 MB)
     ---------------------------------------- 32.8/32.8 MB 4.2 MB/s eta 0:00:00
Collecting extract-msg<=0.29.*
  Downloading extract_msg-0.28.7-py2.py3-none-any.whl (69 kB)
     ---------------------------------------- 69.0/69.0 kB ? eta 0:00:00
Collecting chardet==3.*
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting python-pptx~=0.6.18
  Using cached python-pptx-0.6.21.tar.gz (10.1 MB)
  Preparing metadata (setup.py) ... done
Collecting beautifulsoup4~=4.8.0
  Downloading beautifulsoup4-4.8.2-py3-none-any.whl (106 kB)
     ---------------------------------------- 106.9/106.9 kB ? eta 0:00:00
Collecting argcomplete~=1.10.0
  Downloading argcomplete-1.10.3-py2.py3-none-any.whl (36 kB)
Requirement already satisfied: xlrd~=1.2.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from textract) (1.2.0)
Collecting six~=1.12.0
  Downloading six-1.12.0-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: sortedcontainers in c:\users\user_\appdata\roaming\python\python38\site-packages (from pdfminer.six==20191110->textract) (2.4.0)
Requirement already satisfied: pycryptodome in c:\users\user_\appdata\roaming\python\python38\site-packages (from pdfminer.six==20191110->textract) (3.10.1)
Requirement already satisfied: soupsieve>=1.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from beautifulsoup4~=4.8.0->textract) (2.3.2.post1)
Collecting compressed-rtf>=1.0.6
  Downloading compressed_rtf-1.0.6.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Collecting ebcdic>=1.1.1
  Downloading ebcdic-1.1.1-py2.py3-none-any.whl (128 kB)
     ---------------------------------------- 128.5/128.5 kB 3.8 MB/s eta 0:00:00
Collecting imapclient==2.1.0
  Downloading IMAPClient-2.1.0-py2.py3-none-any.whl (73 kB)
     ---------------------------------------- 74.0/74.0 kB 4.2 MB/s eta 0:00:00
Collecting olefile>=0.46
  Downloading olefile-0.46.zip (112 kB)
     ---------------------------------------- 112.2/112.2 kB 6.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: tzlocal>=2.1 in c:\users\user_\appdata\roaming\python\python38\site-packages (from extract-msg<=0.29.*->textract) (2.1)
Requirement already satisfied: lxml>=3.1.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (4.8.0)
Requirement already satisfied: Pillow>=3.3.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (9.1.1)
Requirement already satisfied: XlsxWriter>=0.5.7 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (1.4.3)
Requirement already satisfied: pytz in c:\users\user_\appdata\roaming\python\python38\site-packages (from tzlocal>=2.1->extract-msg<=0.29.*->textract) (2021.3)
Building wheels for collected packages: docx2txt, python-pptx, compressed-rtf, olefile
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3966 sha256=4540cb776f8ab42048a57891f2ec8286cca73bb657bb0c1bff519e59ebefd80a
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\55\f0\2c\81637d42670985178b77df6d41b9b6c6dc18c94818447414b9
  Building wheel for python-pptx (setup.py) ... done
  Created wheel for python-pptx: filename=python_pptx-0.6.21-py3-none-any.whl size=470935 sha256=c386193b0331adf01dfd9e82a780846dfca399105ef5aad9b933bf9ec0298a41
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\b0\38\58\8530ed1681bfee42349acf166867cc9fb369517b2fce83e599
  Building wheel for compressed-rtf (setup.py) ... done
  Created wheel for compressed-rtf: filename=compressed_rtf-1.0.6-py3-none-any.whl size=6186 sha256=83e92e4ed924be17abb161b66ea37b2c7a698a29a98dc54a5c48eba440bb8297
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\11\f5\c4\81acab65ab073b5a3e67fd82e4b9accf3dbcf1de39c7b246ec
  Building wheel for olefile (setup.py) ... done
  Created wheel for olefile: filename=olefile-0.46-py2.py3-none-any.whl size=35415 sha256=20a53285a90d14dd34915020435f83a11f259c80f294194992bcd52ce1985ebf
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\0b\d8\16\1e2d32ad7455728b8af9efdb9d2a0c3d03cd8f2e4be0191b8c
Successfully built docx2txt python-pptx compressed-rtf olefile
Installing collected packages: SpeechRecognition, ebcdic, docx2txt, compressed-rtf, chardet, argcomplete, six, python-pptx, olefile, beautifulsoup4, pdfminer.six, imapclient, extract-msg, textract
  Attempting uninstall: chardet
    Found existing installation: chardet 4.0.0
    Uninstalling chardet-4.0.0:
      Successfully uninstalled chardet-4.0.0
  Attempting uninstall: argcomplete
    Found existing installation: argcomplete 2.0.0
    Uninstalling argcomplete-2.0.0:
      Successfully uninstalled argcomplete-2.0.0
  Attempting uninstall: six
    Found existing installation: six 1.16.0
    Uninstalling six-1.16.0:
      Successfully uninstalled six-1.16.0
  Attempting uninstall: beautifulsoup4
    Found existing installation: beautifulsoup4 4.9.3
    Uninstalling beautifulsoup4-4.9.3:
      Successfully uninstalled beautifulsoup4-4.9.3
  Attempting uninstall: pdfminer.six
    Found existing installation: pdfminer.six 20220319
    Uninstalling pdfminer.six-20220319:
      Successfully uninstalled pdfminer.six-20220319
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xhtml2pdf 0.2.5 requires html5lib>=1.0, but you have html5lib 1.0b10 which is incompatible.
wagtail 2.14.1 requires Pillow<9.0.0,>=4.0.0, but you have pillow 9.1.1 which is incompatible.
tox 3.24.5 requires six>=1.14.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.5.0 requires numpy~=1.19.2, but you have numpy 1.21.5 which is incompatible.
tensorflow 2.5.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.5.0 requires typing-extensions~=3.7.4, but you have typing-extensions 4.0.0 which is incompatible.
streamlit 0.82.0 requires click<8.0,>=7.0, but you have click 8.0.4 which is incompatible.
seleniumbase 3.1.0 requires beautifulsoup4==4.11.1; python_version >= "3.6", but you have beautifulsoup4 4.8.2 which is incompatible.
seleniumbase 3.1.0 requires chardet==4.0.0; python_version >= "3.5", but you have chardet 3.0.4 which is incompatible.
seleniumbase 3.1.0 requires pdfminer.six==20220319; python_version >= "3.7", but you have pdfminer-six 20191110 which is incompatible.
seleniumbase 3.1.0 requires six==1.16.0, but you have six 1.12.0 which is incompatible.
pygooglenews 0.1.2 requires beautifulsoup4<5.0.0,>=4.9.1, but you have beautifulsoup4 4.8.2 which is incompatible.
pygooglenews 0.1.2 requires dateparser<0.8.0,>=0.7.6, but you have dateparser 1.1.1 which is incompatible.
lassie 0.11.11 requires beautifulsoup4<4.10.0,>=4.9.0, but you have beautifulsoup4 4.8.2 which is incompatible.
google-cloud-firestore 2.1.3 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.7.1 which is incompatible.
google-cloud-core 1.7.1 requires google-api-core<2.0.0dev,>=1.21.0, but you have google-api-core 2.7.1 which is incompatible.
firebase-admin 5.0.1 requires google-api-core[grpc]<2.0.0dev,>=1.22.1; platform_python_implementation != "PyPy", but you have google-api-core 2.7.1 which is incompatible.
easyocr 1.4.1 requires Pillow<8.3.0, but you have pillow 9.1.1 which is incompatible.
deep-translator 1.8.1 requires beautifulsoup4<5.0.0,>=4.9.1, but you have beautifulsoup4 4.8.2 which is incompatible.
aiohttp 3.7.4.post0 requires async-timeout<4.0,>=3.0, but you have async-timeout 4.0.2 which is incompatible.
Successfully installed SpeechRecognition-3.8.1 argcomplete-1.10.3 beautifulsoup4-4.8.2 chardet-3.0.4 compressed-rtf-1.0.6 docx2txt-0.8 ebcdic-1.1.1 extract-msg-0.28.7 imapclient-2.1.0 olefile-0.46 pdfminer.six-20191110 python-pptx-0.6.21 six-1.12.0 textract-1.6.5
WARNING: There was an error checking the latest version of pip.
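Pressing Enter starts the installation, and "Successfully installed" is displayed as shown above. This message means that textract itself was installed correctly. In this run, however, pip also printed "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts." followed by a series of "... which is incompatible." lines. Several of these conflicts arise because installing textract replaced the existing versions of six, chardet, beautifulsoup4, and pdfminer.six with older ones that other installed packages do not accept. For this reason, I recommend creating a virtual environment and installing textract there.
Note that the version of textract installed this time was 1.6.5.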
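As a rough illustration of the virtual-environment approach, the following sketch uses Python's standard `venv` module to create an isolated environment and build the pip command for it. The directory name `textract-env` and the helper names are my own assumptions, and the version pin `textract==1.6.5` simply mirrors the version installed in this article.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the path of the interpreter inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_env(env_dir: Path) -> Path:
    """Create a virtual environment with pip and return its interpreter."""
    venv.create(env_dir, with_pip=True)
    return venv_python(env_dir)


def install_command(python: Path, package: str = "textract==1.6.5") -> list:
    """Build the pip command that installs a package into that environment."""
    return [str(python), "-m", "pip", "install", package]


def install_into_env(env_dir: Path = Path("textract-env")) -> None:
    """Create the environment, then run pip inside it (needs network access)."""
    python = create_env(env_dir)
    subprocess.run(install_command(python), check=True)
```

Calling `install_into_env()` would create the `textract-env` folder in the current directory and install textract only there, so the downgraded six, chardet, beautifulsoup4, and pdfminer.six stay out of the packages used by other projects. The same result can be had by hand with `py -3.8 -m venv textract-env` and then running pip from inside that environment.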
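If you want to confirm the installed version from Python rather than from the pip log, something like the following should work on Python 3.8 or later. The helper name `installed_version` is hypothetical; it only wraps the standard library's `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Running `print(installed_version("textract"))` with the interpreter that textract was installed into should print the version recorded for it, and `None` if textract is not visible from that interpreter.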