SA-MP Forums

Old 12/11/2018, 04:44 PM   #1
SyS
High-roller
 
Join Date: Oct 2015
Posts: 1,952
Reputation: 497
PawnScraper

A powerful scraper plugin that provides an interface for utilising HTML parsers and CSS selectors in Pawn.

Installing

Thanks to Southclaws, plugin installation is now much easier with sampctl:

Code:
sampctl p install Sreyas-Sreelal/pawn-scraper
OR
  • Download the binary files for your operating system from releases
  • Add them to your plugins folder
  • Add PawnScraper to the plugins line in server.cfg (PawnScraper.so on Linux)
  • Add pawnscraper.inc to your includes folder
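
As a rough sketch of the manual route above (assuming a stock SA-MP server.cfg with no other plugins loaded; append to your existing plugins line otherwise), the server.cfg entry would look something like this on Windows:

```
plugins PawnScraper
```

On Linux the same line would name PawnScraper.so instead. Assuming the include keeps the file name pawnscraper.inc, the script then pulls in the natives with:

```pawn
#include <pawnscraper>
```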

Building
  • Clone the repo

    Code:
    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git
  • Compile the plugin using the nightly Rust toolchain
    • Windows
      Code:
      cargo +nightly-i686-pc-windows-msvc build --release
    • Linux
      Code:
      cargo +nightly-i686-unknown-linux-gnu build --release

API
  • ParseHtmlDocument(document[])
    • Params
      • document[] - string containing the HTML document
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document fails to parse
    • Example Usage

      pawn Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
          "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);
  • ResponseParseHtml(Response:id)
    • Params
      • id - HTTP response id returned from HttpGet
    • Returns
      • Html document instance id
      • INVALID_HTML_DOC if the document fails to parse
    • Example Usage

      pawn Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      new Html:doc = ResponseParseHtml(response);
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);
  • HttpGet(url[],Header:headerid=INVALID_HEADER)
    • Params
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader
    • Returns
      • Response id if successful
      • INVALID_HTTP_RESPONSE if the request fails
    • Example Usage

      pawn Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      ASSERT(response != INVALID_HTTP_RESPONSE);
      DeleteResponse(response);
  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)
    • Params
      • playerid - id of the player
      • callback[] - name of the callback function that handles the response
      • url[] - URL of a website
      • headerid - id of a header object created using CreateHeader
    • Example Usage
      pawn Code:
      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
      //********
      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
      }

  • ParseSelector(string[])
    • Params
      • string[] - CSS selector
    • Returns
      • Selector instance id if successful
      • INVALID_SELECTOR if the selector fails to parse
    • Example Usage

      pawn Code:
      new Selector:selector = ParseSelector("h1 .foo");
      ASSERT(selector != INVALID_SELECTOR);
      DeleteSelector(selector);
  • CreateHeader(...)
    • Params
      • key, value pairs of string type
    • Returns
      • Header instance id if successful
      • INVALID_HEADER if the header could not be created
    • Example Usage

      pawn Code:
      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      ASSERT(header != INVALID_HEADER);
      new Response:response = HttpGet("https://sa-mp.com/",header);
      ASSERT(response != INVALID_HTTP_RESPONSE);
      ASSERT(DeleteHeader(header) == 1);
  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the nth occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element name is stored
      • size - size of string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      pawn Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("i");
      ASSERT(selector != INVALID_SELECTOR);

      new i = -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name)!=0){
          ASSERT(strcmp(element_name,"i") == 0);
      }

      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the nth occurrence of the element in the document (starts from 0)
      • string[] - buffer in which the element text is stored
      • size - size of string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      pawn Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1.foo");
      ASSERT(selector != INVALID_SELECTOR);

      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);

      new check = strcmp(element_text,"Hello, world!");
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);
  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))
    • Params
      • docid - Html instance id
      • selectorid - CSS selector instance id
      • idx - the nth occurrence of the element in the document (starts from 0)
      • attribute[] - the attribute of the element
      • string[] - buffer in which the attribute value is stored
      • size - size of string
    • Returns
      • 1 if successful
      • 0 if failed
    • Example Usage

      pawn Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1");
      ASSERT(selector != INVALID_SELECTOR);

      new element_attribute[20];
      ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);

      new check = strcmp(element_attribute,"foo");
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);

  • DeleteHtml(Html:id)
    • Params
      • id - html instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteSelector(Selector:id)
    • Params
      • id - selector instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteResponse(Response:id)
    • Params
      • id - response instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed

  • DeleteHeader(Header:id)
    • Params
      • id - header instance to be deleted
    • Returns
      • 1 if successful
      • 0 if failed
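
Since every object created by the natives above has to be released with its matching Delete function, it can help to funnel all exit paths through one cleanup block. The following is only a minimal sketch of that idea, not code from the plugin: PrintHeadings is a made-up helper name, the URL, the "h1" selector and the 128-cell buffer are arbitrary choices, and it assumes GetNthElementText simply returns 0 once idx runs past the last match.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not a plugin native): prints the text of every <h1>
// on the page and returns how many were found.
stock PrintHeadings()
{
    new Response:response = HttpGet("https://wiki.sa-mp.com");
    if(response == INVALID_HTTP_RESPONSE)
        return 0; // nothing has been created yet, so nothing to delete

    new Html:html = ResponseParseHtml(response);
    new Selector:selector = ParseSelector("h1");
    new count;

    // Only walk the document if both the parse and the selector succeeded.
    if(html != INVALID_HTML_DOC && selector != INVALID_SELECTOR){
        new text[128];
        while(GetNthElementText(html,selector,count,text)){
            printf("%d: %s",count,text);
            ++count;
        }
    }

    // Single cleanup point: delete only what was actually created.
    if(selector != INVALID_SELECTOR) DeleteSelector(selector);
    if(html != INVALID_HTML_DOC) DeleteHtml(html);
    DeleteResponse(response);
    return count;
}
```

Compared with deleting objects separately in each early return, this keeps the Delete calls in one place, which makes it harder to forget one when a new object is added later.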


Example Usage

A small example that fetches all links on wiki.sa-mp.com:

pawn Code:
new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}

new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}

new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}

new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}
//delete created objects after the usage..
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);

The same example with a threaded HTTP call would be:

pawn Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }

    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }

    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }

    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }

    //delete created objects after the usage..
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}

More examples can be found in examples

Repository
https://github.com/Sreyas-Sreelal/pawn-scraper

Note

The plugin is at an early stage, and more tests and features need to be added. I'm open to any kind of contribution; just open a pull request if you have anything to improve or new features to add.

Special thanks

Last edited by SyS; 13/01/2019 at 06:35 AM.
Old 12/11/2018, 04:59 PM   #2
Gabriel432135
Little Clucker
 
Join Date: Nov 2018
Posts: 29
Reputation: 0
Default Re: PawnScraper

cool
Old 12/11/2018, 05:12 PM   #3
kristo
Banned
 
Join Date: Jun 2012
Location: Estonia
Posts: 370
Reputation: 179
Default Re: PawnScraper

hot.
Old 12/11/2018, 07:44 PM   #4
Ermanhaut
Gangsta
 
Join Date: Apr 2016
Location: 2369.5547, -1681.9297, 15.0078
Posts: 634
Reputation: 47
Default Re: PawnScraper

This is really good.
__________________
try, try and try again
Old 15/11/2018, 08:38 PM   #5
Chaprnks
Gangsta
 
Join Date: Sep 2007
Location: Soviet America
Posts: 568
Reputation: 69
Default Re: PawnScraper

Amazing! Finally a well-rounded solution to the HTTP() function
Old 24/11/2018, 12:39 PM   #6
SyS
High-roller
 
Join Date: Oct 2015
Posts: 1,952
Reputation: 497
Default Re: PawnScraper

New version released!

https://github.com/Sreyas-Sreelal/pa...ases/tag/0.1.0

Changes
  • Added HttpGetThreaded
  • Changed reqwest to minihttp
  • Smaller binary

The basic functionality works as expected, although it might still need more tests. Big thanks to Eva, who patiently listened to my questions and doubts and gave me guidance in certain parts.

Usage of HttpGetThreaded
pawn Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }

    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }

    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }

    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }

    //delete created objects after the usage..
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}

Last edited by SyS; 30/11/2018 at 12:52 PM.
Old 24/11/2018, 05:13 PM   #7
Infin1ty
Banned
 
Join Date: Feb 2018
Posts: 118
Reputation: 52
Default Re: PawnScraper

no
no you didnt
:O
Old 26/11/2018, 12:14 PM   #8
AmirSavand
Big Clucker
 
Join Date: Sep 2018
Location: Behind Schedule
Posts: 79
Reputation: 8
Default Re: PawnScraper

SA-MP HTTP requests are known to fail without a reason, so do the HTTP calls here always succeed without bugs?
Old 26/11/2018, 12:18 PM   #9
SyS
High-roller
 
Join Date: Oct 2015
Posts: 1,952
Reputation: 497
Default Re: PawnScraper

Quote:
Originally Posted by AmirSavand
SA-MP HTTP requests are known to fail without a reason, so do the HTTP calls here always succeed without bugs?
HTTP requests are working fine as per the tests; if you encounter any bugs, open an issue on GitHub. But do note that the main scope of this plugin is not sending HTTP requests (the plugin can only send GET requests) but parsing HTML documents and using CSS selectors. Southclaws' requests plugin already gives a better solution for HTTP requests.
Old 26/11/2018, 01:52 PM   #10
fiki574
Gangsta
 
Join Date: Mar 2011
Location: Croatia
Posts: 849
Reputation: 169
Default Re: PawnScraper

Nice work!

However, is there any way to send an HTTP request to the SA-MP server itself instead of only to external URLs?
All times are GMT. The time now is 01:57 PM.

