Fetching remote content with curl

Lately I see lots of problems with website scripts that still use the fopen() function to retrieve remote URLs. Usually the problem isn’t in the scripts themselves but in the fact that remote fopen (the allow_url_fopen setting) is more and more often disabled in the web host’s PHP configuration. I can’t say I blame them; the mere thought of all the possible exploits that rely on remote fopen isn’t very comforting. What bothers me is PHP developers not wanting to accept other ways of retrieving remote content. I’m pretty sure that almost all web hosts have curl enabled, and curl is a powerful extension in PHP. Still, we can find tons of fopen calls fetching remote content in almost all popular content management systems.

In the rest of the post I will present alternatives to fopen, along with some examples and guidelines for retrieving remote URL content. The post is a bit long, but bear with me.

I’m not saying I’m an expert in PHP programming, but common sense led me to a few simple questions and conclusions.

  • The first problem is unnecessary traffic and slow load times!

Our side of the story: I’ve made an excellent portal that displays all the relevant data on my homepage in blocks. The blocks are visible on all pages and display content retrieved from my municipal sites and servers. Those servers aren’t very reliable in terms of uptime, so I serve the information of interest on my portal instead. My portal is slowly gaining popularity, but as its daily visitor base grows it is becoming more and more unstable (oh, did I mention... I use plain fopen in the blocks to get the remote content).

The provider’s side of the story: The municipal system admin suddenly sees a great increase in web traffic. As it seems, some web hosting server is doing intensive crawling and information gathering. At first the additional network traffic and server load are not so alarming, but as time goes by, the traffic and load on the web server from that specific host keep getting bigger and bigger. At some point the sysadmin is forced to ban the offending IP.

What just happened?

Every single visitor on my imaginary portal triggers several fopen calls, which on every page refresh fetch content from the remote server. They cause unnecessary traffic and load on the remote server. They affect the performance of my site as well, since we always have to wait for the remote server to respond with content. When the remote server finally blocks my server’s IP, my site will stop working, since fopen will keep waiting for the remote server to respond.

So how to overcome this problem?

A nice and elegant solution is periodic retrieval. Check how often the content changes on the remote server, and retrieve it periodically (every 2-6 hours). Save the retrieved content on the local filesystem and serve it from there; it will dramatically improve your page loading times.
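
The freshness check at the heart of periodic retrieval can be sketched roughly like this. The helper name cache_is_fresh() and its parameters are made up for illustration; it only decides whether the local copy is still young enough to serve without asking the remote server.

```php
<?php
// Hypothetical helper: true when $localfile exists and was modified
// less than $maxAgeHours hours ago, so no remote request is needed.
function cache_is_fresh($localfile, $maxAgeHours)
{
    clearstatcache(true, $localfile); // make sure we see the current mtime
    if (!file_exists($localfile)) {
        return false; // nothing cached yet
    }
    $ageInSeconds = time() - filemtime($localfile);
    return $ageInSeconds < $maxAgeHours * 3600;
}
```

With something like this in place, a block would only go to the remote server when cache_is_fresh($localfile, 2) returns false, and read the local file otherwise.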

  • The second problem is site dependence.

It’s closely related to the first problem, but the big question is what happens when the remote server is offline, or the remote content is for some reason unavailable? Mostly, using plain fopen will cause our site to stop showing any data from the remote server, or in the worst case our whole site can stop functioning. Even if we download a copy of the remote content to the local filesystem and serve it from there, it will only be sufficient until the next download interval.
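
One way to soften this dependence is to replace the local copy only when a fetch has clearly succeeded, and otherwise keep serving the old one. A minimal sketch, assuming a hypothetical helper store_if_ok() that is handed the body and HTTP status from whatever fetch you use:

```php
<?php
// Hypothetical helper: overwrite $localfile only with a good response.
// $body is the fetched content (false on failure), $status the HTTP code.
// Returns true if the local copy was replaced, false if it was left alone.
function store_if_ok($localfile, $body, $status)
{
    if ($body === false || $body === '' || $status != 200) {
        // Failed fetch: leave the old copy untouched, but push its mtime
        // forward so we wait a full interval before trying again.
        if (file_exists($localfile)) {
            touch($localfile);
        }
        return false;
    }
    // Write to a temporary file first, then rename it into place,
    // so visitors never read a half-written local copy.
    $tmp = $localfile . '.tmp';
    if (file_put_contents($tmp, $body) === false) {
        return false;
    }
    return rename($tmp, $localfile);
}
```

The trade-off is that visitors may see somewhat stale content while the remote server is down, which is usually better than an empty block or a broken page.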

As far as I know, plain fopen gives you no convenient way to check the response headers before using the content. So if the site is moved, unavailable, or the service is interrupted (by solar flares :) ) fopen can’t handle it. With curl you can actually check the response headers of the remote URL before you download the content. That way you can be sure that you won’t download someone’s 404 page.
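
Checking the status before downloading might look like the sketch below. It asks curl for the headers only (CURLOPT_NOBODY) and returns the HTTP status code. The function name remote_status() is made up for this example, and some servers answer header-only requests differently from normal ones, so treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical helper: return the HTTP status code for $url,
// or 0 if no response was received (timeout, DNS failure, refused...).
function remote_status($url, $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // headers only, no body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the final URL
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // never echo the response
    curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

You would then download the content only when remote_status($remoteurl) returns 200, and keep the local copy in every other case.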

So let’s cut to the chase! Most of my clients still use some form of fopen on some sites. What I wanted to do is write some simple copy/paste PHP code which can replace the current fopen call.

So what we want to do is replace:

$handle = fopen("http://www.example.com/", "r");

with:

// variables to set
$remoteurl = "http://www.example.com"; // URL you want to retrieve
$chtime    = 2;  // hours a local copy stays fresh
$timeout   = 10; // seconds
// turn the URL into a safe local file name
$localfile = preg_replace("/[^A-Za-z0-9_\.]/", "_", $remoteurl);

// fetch when there is no local copy yet or the copy is older than $chtime hours
if (!file_exists($localfile) || filemtime($localfile) < strtotime("-$chtime hours")) {
    $ch = curl_init($remoteurl);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $content = curl_exec($ch);
    $info    = curl_getinfo($ch);
    $error   = curl_error($ch); // read the error before closing the handle
    curl_close($ch);

    if (!empty($content) && $info['http_code'] == 200) {
        // good response: replace the local copy
        file_put_contents($localfile, $content);
    } elseif (file_exists($localfile)) {
        // bad response: keep the old copy and retry after the next interval
        touch($localfile);
    } else {
        // nothing cached and nothing fetched
        die("Could not retrieve $remoteurl: $error");
    }
}
$handle = fopen($localfile, "r");

Yes, it’s a bit longer, but in my humble opinion this is the right way of doing it.


  1. For those who need intensive url_fopen usage, this can be converted into a function. You can see an example here:
    http://toic.org/curl/

  1. October 27th, 2009