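This article explains how to install "newspaper3k", a Python library for automatically collecting news articles.
newspaper3k (https://github.com/codelucas/newspaper/) performs web scraping: it retrieves articles from news sites (extracting their text, for example) and can extract keywords from that text. It supports more than 10 languages, including English, Chinese, and German, and it also handles Japanese. Because newspaper3k scrapes websites, sending several requests at the same time may get you blocked by the site, so keep this in mind when using it.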
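As a rough illustration of the caution above, the following sketch fetches articles one at a time with a pause between requests instead of sending them in parallel. The helper names (`fetch_article_text`, `fetch_sequentially`) and the 2-second delay are assumptions made for this example and are not part of newspaper3k; only `Article`, `download()`, `parse()` and `text` come from the library.

```python
import time


def fetch_article_text(url, language="ja"):
    """Download one article with newspaper3k and return its body text."""
    # Imported here so the rest of the sketch can be read (and tested)
    # even on a machine where newspaper3k is not installed yet.
    from newspaper import Article

    article = Article(url, language=language)
    article.download()  # one HTTP request to the site
    article.parse()     # extract title, text, etc. from the HTML
    return article.text


def fetch_sequentially(urls, fetch=fetch_article_text, delay_seconds=2.0,
                       sleep=time.sleep):
    """Fetch the URLs one by one, waiting delay_seconds between requests.

    The delay value is an arbitrary assumption; choose one that suits
    the site you are accessing.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results
```

Calling `fetch_sequentially([...])` with a list of article URLs returns the extracted texts in the same order. The `fetch` and `sleep` parameters exist only so the pacing logic can be tried out without touching the network.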
■Python
This article uses Python 3.8.5 on Windows 10 (the version was confirmed with the Python launcher).
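If you prefer to confirm the version from inside Python rather than through the launcher, a minimal sketch such as the following can be used. The function name `version_string` is an assumption for this example.

```python
import sys


def version_string(info=sys.version_info):
    """Format the first three fields of a version tuple as 'major.minor.micro'."""
    return "{}.{}.{}".format(info[0], info[1], info[2])


# Shows the version of the interpreter that is running this script.
print(version_string())
```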
■Installing newspaper3k
We will install newspaper3k via pip, so first open the Windows Command Prompt.
pip install newspaper3k
Once it has started, type the command above and press Enter.
Note that this environment uses the Python launcher. To install into Python 3.8.5, pip is run through a launcher command that selects that version.
py -3.8 -m pip install newspaper3k
To select the version, type the command above and press Enter.
Defaulting to user installation because normal site-packages is not writeable
Collecting newspaper3k
  Downloading newspaper3k-0.2.8-py3-none-any.whl (211 kB)
     |████████████████████████████████| 211 kB 819 kB/s
Collecting cssselect>=0.9.2
  Downloading cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Requirement already satisfied: beautifulsoup4>=4.4.1 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (4.9.3)
Collecting tldextract>=2.0.1
  Downloading tldextract-3.1.2-py2.py3-none-any.whl (87 kB)
     |████████████████████████████████| 87 kB 1.5 MB/s
Collecting jieba3k>=0.35.1
  Downloading jieba3k-0.35.1.zip (7.4 MB)
     |████████████████████████████████| 7.4 MB 3.3 MB/s
  Preparing metadata (setup.py) ... done
Collecting feedfinder2>=0.0.4
  Downloading feedfinder2-0.0.4.tar.gz (3.3 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: feedparser>=5.2.1 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (5.2.1)
Collecting tinysegmenter==0.3
  Downloading tinysegmenter-0.3.tar.gz (16 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: PyYAML>=3.11 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (5.4.1)
Requirement already satisfied: lxml>=3.6.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (4.6.3)
Requirement already satisfied: Pillow>=3.3.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (8.2.0)
Collecting nltk>=3.2.1
  Downloading nltk-3.6.5-py3-none-any.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 2.2 MB/s
Requirement already satisfied: requests>=2.10.0 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (2.25.1)
Requirement already satisfied: python-dateutil>=2.5.3 in c:\users\user_\appdata\roaming\python\python38\site-packages (from newspaper3k) (2.8.1)
Requirement already satisfied: soupsieve>1.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from beautifulsoup4>=4.4.1->newspaper3k) (2.2.1)
Requirement already satisfied: six in c:\users\user_\appdata\roaming\python\python38\site-packages (from feedfinder2>=0.0.4->newspaper3k) (1.15.0)
Requirement already satisfied: joblib in c:\users\user_\appdata\roaming\python\python38\site-packages (from nltk>=3.2.1->newspaper3k) (1.0.1)
Collecting regex>=2021.8.3
  Downloading regex-2021.11.10-cp38-cp38-win_amd64.whl (273 kB)
     |████████████████████████████████| 273 kB 3.2 MB/s
Requirement already satisfied: click in c:\users\user_\appdata\roaming\python\python38\site-packages (from nltk>=3.2.1->newspaper3k) (7.1.2)
Requirement already satisfied: tqdm in c:\users\user_\appdata\roaming\python\python38\site-packages (from nltk>=3.2.1->newspaper3k) (4.60.0)
Requirement already satisfied: chardet<5,>=3.0.2 in c:\users\user_\appdata\roaming\python\python38\site-packages (from requests>=2.10.0->newspaper3k) (4.0.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\user_\appdata\roaming\python\python38\site-packages (from requests>=2.10.0->newspaper3k) (1.26.5)
Requirement already satisfied: idna<3,>=2.5 in c:\users\user_\appdata\roaming\python\python38\site-packages (from requests>=2.10.0->newspaper3k) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\user_\appdata\roaming\python\python38\site-packages (from requests>=2.10.0->newspaper3k) (2021.5.30)
Collecting requests-file>=1.4
  Downloading requests_file-1.5.1-py2.py3-none-any.whl (3.7 kB)
Requirement already satisfied: filelock>=3.0.8 in c:\users\user_\appdata\roaming\python\python38\site-packages (from tldextract>=2.0.1->newspaper3k) (3.3.1)
Building wheels for collected packages: tinysegmenter, feedfinder2, jieba3k
  Building wheel for tinysegmenter (setup.py) ... done
  Created wheel for tinysegmenter: filename=tinysegmenter-0.3-py3-none-any.whl size=13552 sha256=5731157b65d0e30ddc7f2a931218bb9194911c5f5d42fd85c480b67f64e86f6a
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\99\74\83\8fac1c8d9c648cfabebbbffe97a889f6624817f3aa0bbe6c09
  Building wheel for feedfinder2 (setup.py) ... done
  Created wheel for feedfinder2: filename=feedfinder2-0.0.4-py3-none-any.whl size=3356 sha256=f945d0b813c1be9179ff9076b31dbbfd897873921b5d85ec3f9659b25e382d28
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\b6\09\68\a9f15498ac02c23dde29f18745bc6a6f574ba4ab41861a3575
  Building wheel for jieba3k (setup.py) ... done
  Created wheel for jieba3k: filename=jieba3k-0.35.1-py3-none-any.whl size=7398405 sha256=f8da32e98d5417ca22a45cac6c55a112739decf8bd160297b8a44e1e04c8174e
  Stored in directory: c:\users\user_\appdata\local\pip\cache\wheels\1f\7e\0c\54f3b0f5164278677899f2db08f2b07943ce2d024a3c862afb
Successfully built tinysegmenter feedfinder2 jieba3k
Installing collected packages: requests-file, regex, tldextract, tinysegmenter, nltk, jieba3k, feedfinder2, cssselect, newspaper3k
  Attempting uninstall: regex
    Found existing installation: regex 2021.4.4
    Uninstalling regex-2021.4.4:
      Successfully uninstalled regex-2021.4.4
Successfully installed cssselect-1.1.0 feedfinder2-0.0.4 jieba3k-0.35.1 newspaper3k-0.2.8 nltk-3.6.5 regex-2021.11.10 requests-file-1.5.1 tinysegmenter-0.3 tldextract-3.1.2
After you press Enter, the installation starts and, as shown above, "Successfully installed" is displayed. When this message appears, newspaper3k has been installed successfully.
In this case, newspaper3k version 0.2.8 was installed.
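To double-check the result from Python itself, a small sketch like the one below reports which interpreter is running and which version of the package it can see. It uses the standard library's `importlib.metadata` (available from Python 3.8). The helper name `installed_version` is an assumption for this example.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Shows which Python is running and whether it can see newspaper3k.
print("Interpreter:", sys.executable)
print("newspaper3k:", installed_version("newspaper3k"))
```

Running it through the launcher (for example `py -3.8 script.py`, where `script.py` is a hypothetical file name) checks the same Python that the pip command targeted. If the second line prints `None`, the package was probably installed into a different Python version.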