What is scrapyd
scrapyd = scrapy + deploying
A Python service for publishing Scrapy projects online.
Steps
1. Installation
- 1. Install scrapyd
After installing scrapyd, the service has to be started before anything can be deployed to it (the `scrapyd` command starts it; see the test-start step below).
```
pip3 install scrapyd    # install the scrapyd service
```
- 2. Install the scrapyd client
This is a tool for deploying Scrapy projects to the scrapyd service.
```
pip3 install scrapyd-client
```
This package provides the `scrapyd-deploy` command, which is the key step for publishing a crawler project to the scrapyd service. The prerequisite for using it is that the scrapyd service must already be running (step 1 above). Note that on Windows this step requires additional handling; on Linux it does not.
2. Project preparation
- 1. Modify the Scrapy configuration file
First, modify the Scrapy configuration file. Go into the Scrapy project folder, where you will find the configuration file scrapy.cfg, and change it as follows:
```
[settings]
default = Demo.settings

[deploy:Demo]                   # put the deployment name for the project here
url = http://localhost:6800/    # uncomment this line; 6800 is the access port used later.
                                # If you change it, remember to change it in the scrapyd config as well
project = Demo
```
- 2. Modify the scrapyd configuration file (the file path may differ depending on how it was installed)
```
sudo find / -name default_scrapyd.conf   # first locate the configuration file
vim /your_path/default_scrapyd.conf      # edit it with vim

# The configuration file looks like this:
[scrapyd]
eggs_dir    = eggs
logs_dir    = logs
items_dir   =
jobs_to_keep = 5
dbs_dir     = dbs
max_proc    = 0
max_proc_per_cpu = 10   # this value times the number of CPUs is the maximum number of
                        # crawlers running at the same time; raised to 10 for convenience
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0  # bind to 0.0.0.0 so the service can be reached from outside
http_port   = 6800      # the access port
debug       = off
runner      = scrapyd.runner
application = scrapyd.app.application
launcher    = scrapyd.launcher.Launcher
webroot     = scrapyd.website.Root

[services]
schedule.json     = scrapyd.webservice.Schedule
cancel.json       = scrapyd.webservice.Cancel
addversion.json   = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json  = scrapyd.webservice.ListSpiders
delproject.json   = scrapyd.webservice.DeleteProject
delversion.json   = scrapyd.webservice.DeleteVersion
listjobs.json     = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
```
- 3. Test the startup
Make sure the server has port 6800 open
```
# First go to the Scrapy project folder and check whether the project can start normally
scrapy list     # if there is an error or a missing package, see my previous blog posts
# If the previous step works, test whether scrapyd itself starts normally
scrapyd         # when testing scrapyd, make sure port 6800 is open
# If this step reports an error, you can usually find the answer on Stack Overflow;
# my blog also covers mistakes I ran into here before
```
If you can now open http://127.0.0.1:6800 and see the scrapyd page, the startup was successful.
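If you prefer checking from code instead of a browser, a minimal sketch with Python's `requests` library against the `daemonstatus.json` endpoint (one of the endpoints listed in the scrapyd config above) could look like the following; the URL assumes the local service and default port used in this tutorial:

```python
import requests

# Assumes the scrapyd service from this tutorial, running locally on the default port 6800.
SCRAPYD_URL = "http://127.0.0.1:6800"

def scrapyd_is_up(base_url: str = SCRAPYD_URL) -> bool:
    """Return True if the scrapyd daemon answers its status endpoint."""
    try:
        resp = requests.get(f"{base_url}/daemonstatus.json", timeout=5)
        resp.raise_for_status()
        # A healthy daemon answers with a JSON body containing "status": "ok".
        return resp.json().get("status") == "ok"
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("scrapyd reachable:", scrapyd_is_up())
```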
- Tips
To keep scrapyd running in the background, you can use supervisor to manage the background process.
- 4. Deploy the project
```
# Deploy a project. One scrapyd instance can host multiple Scrapy projects.
scrapyd-deploy Demo -p <crawler project name>
# Demo is the deployment name set in scrapy.cfg ([deploy:Demo]); use "default" if none was given.
# -p takes the crawler project name (the "project" entry in scrapy.cfg).
```
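To confirm that the deployment actually landed, one option is to query the `listprojects.json` and `listspiders.json` endpoints from Python. This is only a sketch; the project name `Demo` and the local URL are the assumptions used throughout this tutorial:

```python
import requests

SCRAPYD_URL = "http://127.0.0.1:6800"   # assumed local scrapyd from this tutorial
PROJECT = "Demo"                        # assumed project name from scrapy.cfg

# All projects known to this scrapyd instance; the freshly deployed one should appear here.
projects = requests.get(f"{SCRAPYD_URL}/listprojects.json", timeout=5).json()
print("projects:", projects.get("projects", []))

# The spiders of the deployed project; an empty list usually means the deploy did not work.
spiders = requests.get(
    f"{SCRAPYD_URL}/listspiders.json", params={"project": PROJECT}, timeout=5
).json()
print("spiders:", spiders.get("spiders", []))
```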
3. Controlling scrapyd with commands
You can call the API endpoints on port 6800 with curl to operate the crawlers remotely, for example to start or stop a crawler.
- Start a crawler: the project name here is the crawler's top-level project name, not the deployment name above
```
curl http://<ip address>:6800/schedule.json -d project=<project name> -d spider=<spider name>
```
- Stop a crawler: the job id can be looked up on the port 6800 web page
```
curl http://<ip address>:6800/cancel.json -d project=<project name> -d job=232424434
```
- Delete a project you no longer need; make sure none of its crawlers are still running
```
curl http://<ip address>:6800/delproject.json -d project=<project name>
```
- See which projects remain
```
curl http://<ip address>:6800/listprojects.json
```
That's it.
As you can see, curl is just one way of calling these API endpoints; you can also use the scrapyd_api package (python-scrapyd-api) for quick operations, as in the sketch below.
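A minimal sketch with the third-party `python-scrapyd-api` wrapper (installed via `pip3 install python-scrapyd-api`); the project name `Demo` comes from this tutorial, while `my_spider` is a placeholder spider name you would replace with your own:

```python
# pip3 install python-scrapyd-api
from scrapyd_api import ScrapydAPI

# Assumes the scrapyd service from this tutorial on the default local port.
scrapyd = ScrapydAPI("http://127.0.0.1:6800")

# Start a crawler (same as the schedule.json call above); a job id is returned.
job_id = scrapyd.schedule("Demo", "my_spider")   # "my_spider" is a placeholder
print("started job:", job_id)

# Stop a running crawler (cancel.json).
scrapyd.cancel("Demo", job_id)

# List projects and delete one that is no longer needed (listprojects.json / delproject.json).
print("projects:", scrapyd.list_projects())
scrapyd.delete_project("Demo")
```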