Installing textract, a library for extracting text from Word, PowerPoint, PDF, and other files

This article explains how to install textract, which can extract the text embedded in Word, PowerPoint, PDF, and other files.

textract (https://github.com/deanmalmgren/textract) is a Python library that extracts content from formats such as Word, PowerPoint, and PDF without any extraneous markup.
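As a rough illustration of what the library does once it is installed, the sketch below wraps `textract.process`, the entry point shown in the project's documentation, which returns the extracted text as bytes. The helper name `extract_text` and the file name in the comment are hypothetical, and the UTF-8 decoding assumes textract's default output encoding.

```python
def extract_text(path, encoding="utf-8"):
    """Return the text content of a document as a str, using textract."""
    # Imported lazily so this file can be loaded before textract is installed.
    import textract

    raw = textract.process(str(path))  # textract.process returns bytes
    return raw.decode(encoding)


# Example call (hypothetical file name):
# print(extract_text("sample.docx"))
```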

■Python

This article uses Python 3.8.5 on Windows 10 (version confirmed with the Python launcher).
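When several Python versions are installed side by side, it can help to confirm which interpreter a command actually runs. The following is a minimal sketch for that check; the function name `describe_interpreter` is made up for illustration.

```python
import platform
import sys


def describe_interpreter():
    """Return (version, executable path) for the interpreter running this code."""
    return platform.python_version(), sys.executable


version, executable = describe_interpreter()
print(f"Python {version} at {executable}")
```

Run through the launcher with the `-3.8` option, this should report a 3.8.x version if the launcher resolves to the interpreter used in this article.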

■Installing textract

textract is installed via pip here, so first open the Windows Command Prompt.

pip install textract

Once it is open, type the command above and press Enter.

Note that this environment uses the Python launcher, and the package should go into Python 3.8.5, so the target version is selected explicitly.

py -3.8 -m pip install textract

To target that version, type the command above and press Enter.

Defaulting to user installation because normal site-packages is not writeable
Collecting textract
Downloading textract-1.6.5-py3-none-any.whl (23 kB)
Collecting docx2txt~=0.8
Downloading docx2txt-0.8.tar.gz (2.8 kB)
Preparing metadata (setup.py) ... done
Collecting pdfminer.six==20191110
Downloading pdfminer.six-20191110-py2.py3-none-any.whl (5.6 MB)
---------------------------------------- 5.6/5.6 MB 2.8 MB/s eta 0:00:00
Collecting SpeechRecognition~=3.8.1
Downloading SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8 MB)
---------------------------------------- 32.8/32.8 MB 4.2 MB/s eta 0:00:00
Collecting extract-msg<=0.29.*
Downloading extract_msg-0.28.7-py2.py3-none-any.whl (69 kB)
---------------------------------------- 69.0/69.0 kB ? eta 0:00:00
Collecting chardet==3.*
Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting python-pptx~=0.6.18
Using cached python-pptx-0.6.21.tar.gz (10.1 MB)
Preparing metadata (setup.py) ... done
Collecting beautifulsoup4~=4.8.0
Downloading beautifulsoup4-4.8.2-py3-none-any.whl (106 kB)
---------------------------------------- 106.9/106.9 kB ? eta 0:00:00
Collecting argcomplete~=1.10.0
Downloading argcomplete-1.10.3-py2.py3-none-any.whl (36 kB)
Requirement already satisfied: xlrd~=1.2.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from textract) (1.2.0)
Collecting six~=1.12.0
Downloading six-1.12.0-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: sortedcontainers in c:\users\user_\appdata\roaming\python\python38\site-packages (from pdfminer.six==20191110->textract) (2.4.0)
Requirement already satisfied: pycryptodome in c:\users\user_\appdata\roaming\python\python38\site-packages (from pdfminer.six==20191110->textract) (3.10.1)
Requirement already satisfied: soupsieve>=1.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from beautifulsoup4~=4.8.0->textract) (2.3.2.post1)
Collecting compressed-rtf>=1.0.6
Downloading compressed_rtf-1.0.6.tar.gz (5.8 kB)
Preparing metadata (setup.py) ... done
Collecting ebcdic>=1.1.1
Downloading ebcdic-1.1.1-py2.py3-none-any.whl (128 kB)
---------------------------------------- 128.5/128.5 kB 3.8 MB/s eta 0:00:00
Collecting imapclient==2.1.0
Downloading IMAPClient-2.1.0-py2.py3-none-any.whl (73 kB)
---------------------------------------- 74.0/74.0 kB 4.2 MB/s eta 0:00:00
Collecting olefile>=0.46
Downloading olefile-0.46.zip (112 kB)
---------------------------------------- 112.2/112.2 kB 6.4 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: tzlocal>=2.1 in c:\users\user_\appdata\roaming\python\python38\site-packages (from extract-msg<=0.29.*->textract) (2.1)
Requirement already satisfied: lxml>=3.1.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (4.8.0)
Requirement already satisfied: Pillow>=3.3.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (9.1.1)
Requirement already satisfied: XlsxWriter>=0.5.7 in c:\users\user_\appdata\roaming\python\python38\site-packages (from python-pptx~=0.6.18->textract) (1.4.3)
Requirement already satisfied: pytz in c:\users\user_\appdata\roaming\python\python38\site-packages (from tzlocal>=2.1->extract-msg<=0.29.*->textract) (2021.3)
Building wheels for collected packages: docx2txt, python-pptx, compressed-rtf, olefile
Building wheel for docx2txt (setup.py) ... done
Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3966 sha256=4540cb776f8ab42048a57891f2ec8286cca73bb657bb0c1bff519e59ebefd80a
Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\55\f0\2c\81637d42670985178b77df6d41b9b6c6dc18c94818447414b9
Building wheel for python-pptx (setup.py) ... done
Created wheel for python-pptx: filename=python_pptx-0.6.21-py3-none-any.whl size=470935 sha256=c386193b0331adf01dfd9e82a780846dfca399105ef5aad9b933bf9ec0298a41
Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\b0\38\58\8530ed1681bfee42349acf166867cc9fb369517b2fce83e599
Building wheel for compressed-rtf (setup.py) ... done
Created wheel for compressed-rtf: filename=compressed_rtf-1.0.6-py3-none-any.whl size=6186 sha256=83e92e4ed924be17abb161b66ea37b2c7a698a29a98dc54a5c48eba440bb8297
Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\11\f5\c4\81acab65ab073b5a3e67fd82e4b9accf3dbcf1de39c7b246ec
Building wheel for olefile (setup.py) ... done
Created wheel for olefile: filename=olefile-0.46-py2.py3-none-any.whl size=35415 sha256=20a53285a90d14dd34915020435f83a11f259c80f294194992bcd52ce1985ebf
Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\0b\d8\16\1e2d32ad7455728b8af9efdb9d2a0c3d03cd8f2e4be0191b8c
Successfully built docx2txt python-pptx compressed-rtf olefile
Installing collected packages: SpeechRecognition, ebcdic, docx2txt, compressed-rtf, chardet, argcomplete, six, python-pptx, olefile, beautifulsoup4, pdfminer.six, imapclient, extract-msg, textract
Attempting uninstall: chardet
Found existing installation: chardet 4.0.0
Uninstalling chardet-4.0.0:
Successfully uninstalled chardet-4.0.0
Attempting uninstall: argcomplete
Found existing installation: argcomplete 2.0.0
Uninstalling argcomplete-2.0.0:
Successfully uninstalled argcomplete-2.0.0
Attempting uninstall: six
Found existing installation: six 1.16.0
Uninstalling six-1.16.0:
Successfully uninstalled six-1.16.0
Attempting uninstall: beautifulsoup4
Found existing installation: beautifulsoup4 4.9.3
Uninstalling beautifulsoup4-4.9.3:
Successfully uninstalled beautifulsoup4-4.9.3
Attempting uninstall: pdfminer.six
Found existing installation: pdfminer.six 20220319
Uninstalling pdfminer.six-20220319:
Successfully uninstalled pdfminer.six-20220319
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
xhtml2pdf 0.2.5 requires html5lib>=1.0, but you have html5lib 1.0b10 which is incompatible.
wagtail 2.14.1 requires Pillow<9.0.0,>=4.0.0, but you have pillow 9.1.1 which is incompatible.
tox 3.24.5 requires six>=1.14.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.5.0 requires numpy~=1.19.2, but you have numpy 1.21.5 which is incompatible.
tensorflow 2.5.0 requires six~=1.15.0, but you have six 1.12.0 which is incompatible.
tensorflow 2.5.0 requires typing-extensions~=3.7.4, but you have typing-extensions 4.0.0 which is incompatible.
streamlit 0.82.0 requires click<8.0,>=7.0, but you have click 8.0.4 which is incompatible.
seleniumbase 3.1.0 requires beautifulsoup4==4.11.1; python_version >= "3.6", but you have beautifulsoup4 4.8.2 which is incompatible.
seleniumbase 3.1.0 requires chardet==4.0.0; python_version >= "3.5", but you have chardet 3.0.4 which is incompatible.
seleniumbase 3.1.0 requires pdfminer.six==20220319; python_version >= "3.7", but you have pdfminer-six 20191110 which is incompatible.
seleniumbase 3.1.0 requires six==1.16.0, but you have six 1.12.0 which is incompatible.
pygooglenews 0.1.2 requires beautifulsoup4<5.0.0,>=4.9.1, but you have beautifulsoup4 4.8.2 which is incompatible.
pygooglenews 0.1.2 requires dateparser<0.8.0,>=0.7.6, but you have dateparser 1.1.1 which is incompatible.
lassie 0.11.11 requires beautifulsoup4<4.10.0,>=4.9.0, but you have beautifulsoup4 4.8.2 which is incompatible.
google-cloud-firestore 2.1.3 requires google-api-core[grpc]<2.0.0dev,>=1.22.2, but you have google-api-core 2.7.1 which is incompatible.
google-cloud-core 1.7.1 requires google-api-core<2.0.0dev,>=1.21.0, but you have google-api-core 2.7.1 which is incompatible.
firebase-admin 5.0.1 requires google-api-core[grpc]<2.0.0dev,>=1.22.1; platform_python_implementation != "PyPy", but you have google-api-core 2.7.1 which is incompatible.
easyocr 1.4.1 requires Pillow<8.3.0, but you have pillow 9.1.1 which is incompatible.
deep-translator 1.8.1 requires beautifulsoup4<5.0.0,>=4.9.1, but you have beautifulsoup4 4.8.2 which is incompatible.
aiohttp 3.7.4.post0 requires async-timeout<4.0,>=3.0, but you have async-timeout 4.0.2 which is incompatible.
Successfully installed SpeechRecognition-3.8.1 argcomplete-1.10.3 beautifulsoup4-4.8.2 chardet-3.0.4 compressed-rtf-1.0.6 docx2txt-0.8 ebcdic-1.1.1 extract-msg-0.28.7 imapclient-2.1.0 olefile-0.46 pdfminer.six-20191110 python-pptx-0.6.21 six-1.12.0 textract-1.6.5
WARNING: There was an error checking the latest version of pip.

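Pressing Enter starts the installation, and "Successfully installed" is displayed as shown above. That message means textract itself was installed. In this run, however, pip also printed "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts." followed by a series of "which is incompatible." lines. The log shows where several of these conflicts come from: textract 1.6.5 pins older versions of its dependencies, so pip replaced chardet, argcomplete, six, beautifulsoup4, and pdfminer.six with older releases that other installed packages (seleniumbase, tensorflow, tox, and others) do not accept. For this reason, I recommend creating a virtual environment and installing textract there.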
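The virtual-environment recommendation can be sketched with the standard library's `venv` module. This is one possible approach under stated assumptions, not a procedure taken from this article: the function names and the directory name `textract-env` are hypothetical, and the install step needs network access.

```python
import os
import subprocess
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the Python executable inside a virtual environment."""
    env_dir = Path(env_dir)
    if os.name == "nt":  # Windows layout: <env>\Scripts\python.exe
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"  # POSIX layout: <env>/bin/python


def install_in_fresh_venv(env_dir, package="textract"):
    """Create a virtual environment at env_dir and pip-install package into it."""
    venv.create(env_dir, with_pip=True)  # with_pip=True bootstraps pip via ensurepip
    subprocess.run(
        [str(venv_python(env_dir)), "-m", "pip", "install", package],
        check=True,  # raise CalledProcessError if pip exits with an error
    )


# Example call (hypothetical directory name; downloads packages):
# install_in_fresh_venv("textract-env")
```

Because the new environment starts without the packages listed in the conflict messages, the older versions that textract pins cannot break them.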

The version of textract installed here is 1.6.5.
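The installed version can also be checked from Python itself. The sketch below uses `importlib.metadata`, which is part of the standard library from Python 3.8; the helper name `installed_version` is an assumption for illustration.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


print("textract:", installed_version("textract"))
```

The result depends on the interpreter that runs the script, so it should be run with the same Python that pip installed into.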
