Technology Blog

Magic Links

The first time I used the new link attachment tool found on sites like Facebook, I thought I had experienced something magical.  After pasting an external webpage URL into my wall-post form, an image, page title, and page description of the linked webpage appeared beneath my post as though out of a puff of smoke from the Internet cloud.  I could cycle through a variety of thumbnails associated with the linked webpage and choose one, such as a logo or avatar, that represented the linked page appropriately.  The attached hyperlink was no longer simply a hyperlink - it was a rich, enticing advertisement for the webpage that lay beyond.  However, when it came to designing such a tool, it soon became apparent that conjuring such Internet wizardry required only basic tools that are firmly rooted in web-development fundamentals.

There are likely many approaches to creating a link attachment lookup tool, but ours was two-fold:

Part 1 - look up the title, description, and images for the user-provided URL.  Let the user choose a thumbnail for their link from the images that exist on the linked page, and let them customize the title and description text returned by the lookup.

Part 2 - when the user submits the post, save the page title, description, and link URL to a database. Download, resize, and save the thumbnail image that the user selected.

In this post, we’ll focus more on Part 1, as Part 2 mostly deals with the more standard Python/Django operations of resizing images using PIL and saving text and image data to a database.

Part 1 of this process, the image and metadata lookup, makes use of the commonplace HTML tags <title>, <meta name="description">, and <meta name="title"> to find information about the linked webpage.  The lookup begins when the user pastes a URL into a “post box” form and clicks a lookup button.  The button click triggers a JavaScript Ajax HTTP GET request, passing the pasted URL as a GET parameter to a Django back-end running on the local webserver.  The back-end code then performs the actual lookup of the linked page.

On the back-end, we use the built-in Python library urllib2 to fetch the webpage referenced by the given link, and we use the common Python-based webpage parser, BeautifulSoup, to parse the returned webpage into an object tree.

     import urllib2
     from BeautifulSoup import BeautifulSoup

     # link_url holds the URL the user pasted (received as a GET parameter)
     request = urllib2.Request(link_url)
     response = urllib2.urlopen(request)
     html = response.read()
     soup = BeautifulSoup(html)

From there, it is easy to grab the linked website's title/description metadata using BeautifulSoup's tree traversal syntax:

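     import re
     title = ''; description = ''

     # Try to get the description from the meta tag named description
     try:
         description = soup.findAll('meta',
             attrs={'name':re.compile("^description$", re.I)})[0].get('content')
     except IndexError:
         pass

     # Try to get the page title from the meta tag named title
     try:
         title = soup.findAll('meta',
             attrs={'name':re.compile("^title$", re.I)})[0].get('content')
     except IndexError:
         pass

     # If the meta tag does not exist, grab the title from the title tag.
     if not title and soup.title:
         title = soup.title.string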
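
For readers on Python 3, where urllib2 and the BeautifulSoup 3 import are not available, the same title/description lookup can be sketched with the standard library alone.  This is an illustration under assumptions, not the implementation described in this post: it swaps BeautifulSoup for html.parser, and the names MetaExtractor and extract_metadata are made up for the example.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the title/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}          # lower-cased meta name -> content
        self.title_text = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta':
            # Meta names are matched case-insensitively; first match wins.
            name = (attrs.get('name') or '').lower()
            if name in ('title', 'description') and name not in self.meta:
                self.meta[name] = attrs.get('content') or ''

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title_text += data


def extract_metadata(html):
    """Return (title, description), preferring the meta tag for the title."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    title = parser.meta.get('title') or parser.title_text.strip()
    description = parser.meta.get('description', '')
    return title, description
```

As in the snippet above, the <title> tag is only used when no usable <meta name="title"> tag is present.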

Next, we want to find suitable images on the linked webpage that the user can select as a thumbnail for their post.  BeautifulSoup makes finding the images on the page simple, and returns a list of image URLs found on the linked page.  Good candidates for thumbnail images are most likely going to appear near the top of the webpage, such as a logo or an article image.  Therefore, we limit the number of image URLs returned to the first twenty or so.  This is also important for performance reasons when the user provides a webpage link with a lot of images on it.

     max_images = 20
     image_tags = soup.findAll('img', limit=max_images)
     image_urls_list = []
     for image_tag in image_tags:
         url = image_tag.get('src')
         # Skip <img> tags that have no src attribute
         if url:
             image_urls_list.append(url)
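
One detail the loop above leaves open is that src attributes are often relative ("/logo.png"), which the browser cannot display out of context.  The following Python 3 sketch resolves each src against the page URL before returning it.  It is a standard-library stand-in for the BeautifulSoup call, not the code from this post; ImageCollector and collect_image_urls are hypothetical names, and unlike limit=max_images it caps the number of usable URLs rather than the number of tags inspected.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Gathers up to max_images absolute image URLs, in page order."""

    def __init__(self, base_url, max_images=20):
        super().__init__()
        self.base_url = base_url
        self.max_images = max_images
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img' or len(self.image_urls) >= self.max_images:
            return
        src = dict(attrs).get('src')
        if src:
            # Resolve relative paths against the linked page's URL.
            absolute = urljoin(self.base_url, src)
            if absolute not in self.image_urls:
                self.image_urls.append(absolute)


def collect_image_urls(html, base_url, max_images=20):
    collector = ImageCollector(base_url, max_images)
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Here base_url would be the link the user pasted, so that "pics/b.gif" on "http://example.com/news/story.html" becomes "http://example.com/news/pics/b.gif".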

Very large images are likely to be background images and very small images are likely to be page styling elements, such as button backgrounds or spacing images.  Therefore, pruning these image URLs from the list helps avoid returning images that are unlikely to make good thumbnails.  A simple way to estimate an image's size without downloading the entire image is to loop through the discovered image URLs, use urllib2 to do an HTTP HEAD request for each image to read its Content-Length (the file size in bytes), and prune any images that fall outside a given size range.
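
A minimal Python 3 sketch of that pruning step follows, using urllib.request in place of urllib2.  The function names and the byte thresholds are assumptions chosen for illustration, not values from our implementation, and a server that omits Content-Length simply causes the image to be dropped.

```python
import urllib.request


def content_length(url, timeout=5):
    """Return the Content-Length reported by an HTTP HEAD request, or None."""
    request = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            value = response.headers.get('Content-Length')
    except (OSError, ValueError):
        # Network errors, HTTP errors, and malformed URLs all land here.
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def prune_by_size(urls, min_bytes=2000, max_bytes=500000,
                  get_size=content_length):
    """Keep only the URLs whose reported size is within the given range."""
    kept = []
    for url in urls:
        size = get_size(url)
        if size is not None and min_bytes <= size <= max_bytes:
            kept.append(url)
    return kept
```

The get_size parameter exists so the filter can be exercised without network access, for example prune_by_size(urls, get_size=fake_sizes.get) with a dictionary of made-up sizes.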

Once we have the linked page's title, description, and a list of suitable thumbnail images, it's time to pass this data back to the front end.  Serializing the results as JSON makes them easy for the front-end JavaScript to read and process.

     import simplejson as json
     image_list = []
     for url in image_urls_list:
         image_list.append({'url': url})
     return_dict = {'title': title, 'description': description}
     return_dict.update({'images': image_list})
     # Serialize the results for the Ajax response
     json_data = json.dumps(return_dict)

The front-end JavaScript code parses the JSON data fed to it and assigns the title and description to hidden HTML inputs of the same names.  JavaScript functions will allow the user to modify the contents of these input elements if the user wishes to change the title or description text before they submit the post.  JavaScript code also enables the user to rotate through the image thumbnails returned by the JSON data and assigns the image that they choose as the link thumbnail to an “image_url” hidden input.

     <input type="text" id="id_link_url" name="link_url">
     <input type="hidden" id="id_title" name="title">
     <input type="hidden" id="id_description" name="description">
     <input type="hidden" id="id_image_url" name="image_url">

Part 2 of this process, submitting the title/description, resizing the image, and saving the posted data to a database, is more straightforward.  The page title, URL, and description are saved to the database using standard Django forms.  The selected image is downloaded from the user-chosen image URL, resized to thumbnail dimensions using PIL, and then written to a file on the server.
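
The resizing step mostly comes down to choosing target dimensions that preserve the image's aspect ratio.  PIL's Image.thumbnail method handles this kind of aspect-preserving downscale; the sketch below only illustrates the arithmetic in plain Python.  The function name and the 100-pixel default are assumptions, not values from our code.

```python
def thumbnail_size(width, height, max_side=100):
    """Scale (width, height) to fit a max_side square without enlarging."""
    if width <= 0 or height <= 0:
        raise ValueError('image dimensions must be positive')
    # Use one scale factor for both sides so the aspect ratio is kept.
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, an 800x600 image scales to 100x75, while an image already smaller than the bound is left at its original size.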

The posted link, its title, description, and associated thumbnail can then be displayed as an entry in a list using Django templates.

These examples are far from complete, but hopefully get the idea across.  This technique can help make web bulletin boards into more dynamic and interactive sites.  In conclusion, no magic required.

Possible improvements:

Research use of lxml or html5lib instead of BeautifulSoup for website parsing.

If the page parser fails to find any images, use a regular expression to manually search the page for <img> elements.

Split the metadata lookup into two separate asynchronous requests: one to look up the title and description and one to look up the images on the linked page.
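
As a starting point for the regular-expression fallback suggested above, here is a rough Python 3 sketch.  The pattern and the name find_img_srcs are assumptions for illustration; a regular expression will not cope with every malformed page, so this is only meant as a last resort when the parser finds nothing.

```python
import re

# Matches the src attribute of an <img> tag, whether the value is
# double-quoted, single-quoted, or unquoted.
IMG_SRC_RE = re.compile(
    r'<img\b[^>]*?\ssrc\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE)


def find_img_srcs(html, max_images=20):
    """Return up to max_images distinct src values, in page order."""
    urls = []
    for match in IMG_SRC_RE.finditer(html):
        # Exactly one of the three groups holds the attribute value.
        src = next(group for group in match.groups() if group is not None)
        if src and src not in urls:
            urls.append(src)
        if len(urls) >= max_images:
            break
    return urls
```

Requiring whitespace directly before "src" keeps attributes such as data-src from being mistaken for the real image source.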
