Scrapyd deploys a Scrapy crawler on CentOS offline

What is Scrapyd

scrapyd = scrapy + deploying
A set of Python services for publishing Scrapy projects online

Steps

1. Installation

  • 1. Install scrapyd
    After installing scrapyd, start the service first:
pip3 install scrapyd
# Start the service
scrapyd

  • 2. Install the scrapyd client
    This is the tool for deploying Scrapy projects to the scrapyd service
pip3 install scrapyd-client

This package provides the scrapyd-deploy command, which is the key step for publishing the crawler project to the scrapyd service. The prerequisite for using it is that the scrapyd service must be running (step 1 above). Note: this step requires additional processing on Windows, but not on Linux.
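
Once the service is started, a quick way to confirm it is listening is to query the daemonstatus.json endpoint (a minimal check, assuming the default address 127.0.0.1 and port 6800):

# A running scrapyd answers with a small JSON status blob
curl http://127.0.0.1:6800/daemonstatus.json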

2. Project preparation

  • 1. Modify the configuration file
    First, modify the Scrapy configuration file: enter the Scrapy project folder, where you will find the configuration file scrapy.cfg. Modify it as follows:
[settings]
default = Demo.settings

[deploy:Demo]   # Deployment name; the project name is used here
url = http://localhost:6800/   # 6800 is the access port used later; if you change it, remember to change it in the scrapyd configuration as well
project = Demo
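
To verify that scrapy.cfg is picked up, you can list the configured deploy targets from inside the project folder. A small sketch, assuming scrapyd-client's scrapyd-deploy -l option and a placeholder project path:

cd /path/to/Demo    # the folder containing scrapy.cfg
scrapyd-deploy -l   # should list the Demo target together with its URL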

  • 2. Modify the scrapyd configuration file (the path may differ depending on how it was installed)
sudo find / -name default_scrapyd.conf   # First find the configuration file path
vim /your_path/default_scrapyd.conf      # Edit the configuration file with vim

# Contents of the configuration file
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10    # The product of this value and the number of CPUs is the maximum number of crawlers running at the same time; changed to 10 for convenience
finished_to_keep = 100   
poll_interval = 5.0
bind_address = 0.0.0.0   # Binding to 0.0.0.0 makes the service reachable from other machines
http_port   = 6800   # The corresponding port
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus

  • 3. Test startup
    Make sure port 6800 is open on the server
# First go to the Scrapy project folder and check whether the project can start normally
scrapy list  # If there is an error or a missing package, see my previous blog
# If the previous step works, test whether scrapyd starts normally
scrapyd  # When testing scrapyd startup, make sure port 6800 is available
# If this step reports an error, you can usually find the answer on Stack Overflow; my blog has also covered related mistakes

If you can now access http://127.0.0.1:6800 and see the scrapyd page, the startup succeeded.
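
If the page is reachable locally but not from another machine, the port is probably blocked by the firewall. On CentOS with firewalld, opening port 6800 looks roughly like this (a sketch; adapt it if you use iptables instead):

sudo firewall-cmd --permanent --add-port=6800/tcp   # allow TCP traffic on 6800
sudo firewall-cmd --reload                          # apply the new rule
sudo firewall-cmd --list-ports                      # confirm 6800/tcp is listed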

  • Tips

To keep scrapyd running in the background, you can use supervisor to manage the process
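
A minimal supervisor setup might look like the sketch below; the config path /etc/supervisord.d/scrapyd.ini and the scrapyd binary path are assumptions, so adapt them to your environment:

# Write a supervisor program entry for scrapyd
sudo tee /etc/supervisord.d/scrapyd.ini > /dev/null <<'EOF'
[program:scrapyd]
; adjust command to the output of `which scrapyd`
command=/usr/local/bin/scrapyd
autostart=true
autorestart=true
stderr_logfile=/var/log/scrapyd.err.log
stdout_logfile=/var/log/scrapyd.out.log
EOF

# Reload supervisor and check that scrapyd is running
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl status scrapyd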

  • 4. Deploy the project
# Deploy the project; one scrapyd instance can host multiple Scrapy projects
scrapyd-deploy <deployment-name> -p <project-name>  # the deployment name is the one set in scrapy.cfg ("default" if none was given); -p takes the project name from the cfg
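
A concrete run with the names from the scrapy.cfg above, followed by a check that the project was registered (listprojects.json is one of the endpoints shown in the scrapyd configuration; the project path is a placeholder):

cd /path/to/Demo                               # run from the project folder
scrapyd-deploy Demo -p Demo                    # deployment name and project name from scrapy.cfg
curl http://127.0.0.1:6800/listprojects.json   # Demo should now appear in the projects list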

3. Controlling scrapyd from the command line

The scrapyd API on port 6800 can be called with curl to operate the crawlers remotely, for example to start or stop a crawler:

  • Start a crawler: the project name is the top-level Scrapy project name, not the deployment name above
curl http://<ip>:6800/schedule.json -d project=<project name> -d spider=<spider name>
  • Stop a crawler: the job ID can be found on the web page at port 6800 (or via listjobs.json, see the sketch after this list)
curl http://<ip>:6800/cancel.json -d project=<project name> -d job=232424434
  • Delete an unwanted project; make sure none of its crawlers are running
curl http://<ip>:6800/delproject.json -d project=<project name>
  • See which projects remain
curl http://<ip>:6800/listprojects.json
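
To find the job ID needed by cancel.json, you can also query listjobs.json (another endpoint from the scrapyd configuration above) instead of reading it off the web page; a sketch:

# Lists pending, running and finished jobs for a project; the "id" field is the job ID
curl "http://<ip>:6800/listjobs.json?project=<project name>"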

It's done

As you can see, these curl commands simply call scrapyd's API endpoints. For quicker programmatic control you can use the scrapyd_api package instead.
