Technical Support Forums


stopping robots from taking data

Thread began 8/13/2009 6:51 pm by aaron322044 | Last modified 8/18/2009 7:52 am by Ray Borduin | 2316 views | 4 replies |

aaron322044

stopping robots from taking data

I built a site that has a lot of data. The detail pages end with
detail.php?num=503084
detail.php?num=503085
etc.
How can I stop someone from running a program to get all that data?


Danilo Celic

If what is hitting you is really a search engine robot, then perhaps you can use robots.txt to limit or prevent parsing of certain areas of your site.
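
For example, a minimal robots.txt at the site root that asks well-behaved crawlers to skip the detail pages could look like this (the path is just illustrative, and it only helps against robots that honor robots.txt):

    User-agent: *
    Disallow: /detail.php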

Assuming that the pages are open for anyone to see (as in not password protected), then the quick answer is that you can't stop someone from doing that. All you can really do is make it more cumbersome to accomplish what they are trying to do. I've not needed to do this myself, so I'm not sure what is really effective, but I'd suggest reading up on "throttling". Since you supplied PHP links, a web search for "php throttling" is a good start.

One thing I immediately thought of was to track requests by IP address and, if there are more than XXX requests in YY seconds, make requests from that IP address take longer, perhaps using sleep(). With that, you'd need some way to track the number of requests over a period of time, perhaps with a database. Setting a session value will likely not do any good, as the robot is probably making individual requests and not saving any session state on its end.
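
A rough sketch of that idea (the request_log table, thresholds, and connection details here are placeholders, not anything from your site):

    <?php
    // Assumed table: request_log (ip VARCHAR, requested_at DATETIME).
    $pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

    $ip            = $_SERVER['REMOTE_ADDR'];
    $windowSeconds = 60;   // look back over the last minute
    $maxRequests   = 30;   // allow this many requests per window

    // Log the current request.
    $insert = $pdo->prepare('INSERT INTO request_log (ip, requested_at) VALUES (?, NOW())');
    $insert->execute(array($ip));

    // Count recent requests from the same address.
    $cutoff = date('Y-m-d H:i:s', time() - $windowSeconds);
    $count  = $pdo->prepare('SELECT COUNT(*) FROM request_log WHERE ip = ? AND requested_at > ?');
    $count->execute(array($ip, $cutoff));

    if ((int)$count->fetchColumn() > $maxRequests) {
        sleep(10); // slow the response down rather than refusing it outright
    }
    // ...then continue with the normal detail.php logic...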

Maybe someone else has some good suggestions.


aaron322044

mod_rewrite

I don't know the code for the .htaccess file, but would putting a mod_rewrite rule in for the URL work? That would change all the different URLs to one URL (like example.com/data_results).


Danilo Celic

I've not worked with mod_rewrite, so I could easily be wrong, but my understanding is that mod_rewrite could change a request to remove the query string. Unless you create a specific rule for each possible value of the "num" parameter, though, the page won't "know" what the parameter should be and therefore won't be able to display the proper data. That would stop the robot from getting to your content, but it would also stop the humans who need it. ;-)
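
For what it's worth, a typical .htaccess rewrite (hypothetical paths) just maps a "clean" URL onto the existing query string, so the ID is still right there in the URL and just as easy to enumerate:

    RewriteEngine On
    # example.com/data/503084 is rewritten internally to detail.php?num=503084
    RewriteRule ^data/([0-9]+)$ detail.php?num=$1 [L]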

Another option could be to not use autoincrement for the ID of the item to display, and instead generate the ID in some other fashion, such as a unique text ID, perhaps using uniqid(), which creates a 13-character ID (23 characters long with the more_entropy parameter set to true).

This would require changes to your database, and you would need to make sure the ID column value is generated each time a record is inserted.
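
Something along these lines at insert time (the items table, column names, and connection details are just assumptions for the sketch):

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

    // uniqid() returns a 13-character ID; passing true for more_entropy returns 23 characters.
    $itemId = uniqid('', true);
    $title  = 'Example item';

    $stmt = $pdo->prepare('INSERT INTO items (item_id, title) VALUES (?, ?)');
    $stmt->execute(array($itemId, $title));

    // Detail pages would then be linked as detail.php?num=<unique id> instead of a
    // guessable sequential number.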

However, many robots will follow links throughout the site to find all of the available data items rather than simply incrementing query string parameter values. If the robots coming to your pages spider rather than increment, then the unique ID won't help in that regard: if there is some way to get to the page, the spider will find it. You'd be better off limiting visits by IP over a certain timeframe.

In addition to limiting the number of requests, you could also "ban" IP addresses: if you find that one is running through your site, add it to a "bad IP" table in your DB and then send back a 404 or some other HTTP error when a request comes in from that address. The problem with this approach is that with some internet connections IP addresses change all the time, so an outright ban could end up stopping legitimate traffic. That's why I initially suggested the sleep() response, which merely slows down the response to the bad requests. You could even take a progressive approach where, if requests come in above a certain rate per minute/hour/day, you increase the delay on the response.
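
The ban check itself could be a quick lookup at the top of the page; the banned_ips table and connection details here are assumptions, not anything already in your site:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mysite', 'dbuser', 'dbpass');

    $ip   = $_SERVER['REMOTE_ADDR'];
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM banned_ips WHERE ip = ?');
    $stmt->execute(array($ip));

    if ($stmt->fetchColumn() > 0) {
        // Address has been flagged; return an error instead of the data.
        header('HTTP/1.1 404 Not Found');
        exit;
    }
    // ...otherwise continue with the normal page...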

HTH


Ray Borduin (WebAssist)

I think you would have to use something like SecurityAssist and session-level security to keep robots, or anyone else who doesn't have access, from viewing and capturing your data.

In general, if your information can be viewed on the web, then somebody will be able to capture and record that data. Only making it impossible to view without authorization would prevent somebody, or some robot, from taking it.
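
This isn't the SecurityAssist API itself, just a generic sketch of what session-level gating on a detail page looks like:

    <?php
    session_start();

    // Only serve the data to visitors who have already authenticated.
    if (empty($_SESSION['user_id'])) {   // session key is an assumption
        header('Location: login.php');   // hypothetical login page
        exit;
    }
    // ...query and display the record for $_GET['num']...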

