Refactoring an article system

In this article I am going to take you through a multi-stage refactor of a PHP blog system. The only factor preventing me porting the blog system to C# is PHP’s prevelance on shared hosting servers. I will use the result live on OOP Architect and my personal homepage.

High level plan

The first stage in the refactor is to figure out if it is worthwhile. Common reasons to refactor include:

  1. Increased flexibility to allow further development
  2. Improved readability and therefore reduced maintenance costs

I want to increase modularisation and code reuse across multiple websites. At the moment development is stunted as the system has become too complicated to maintain. A change in one website requires changes in all other websites using the same framework.

I will concentrate on the class ArticleStorage as it is the interface between the blog website and the backing article store. It is highly coupled to website specific configuration and has too many responsibilities.

Here are the top level structures inside ArticleStorage.class.php:

<?php
class ArticleStorage { }

function feed($flux, $strType, $category, $articleSet) { }

function feedTitleFull($category, $date) { }

function paginate($articleSet, $output) { }

This file is doing too many things. The first step in making this file manageable is to separate it into multiple files each with a single responsibility. The easiest single change I can make at this point is to extract all utility functions into a separate file.

ArticleStorage.class.php now contains a single class:

<?php
class ArticleStorage { }

This simple refactor is one of the most powerful. We can now consider ArticleStorage as a separate entity.

Simplifying the article store

As there is no documentation, we will first look at usages of the class article storage to understand its purpose. I will add comments at this point to assist my short term memory:

<?php
// Create a class to retrieve articles
// - Where do the articles come from?
// - We are not specifying a directory or data source
$articleStorage = new ArticleStorage();

// Get a cached list of categories
// - Why do we have to get the entire list?
// - Caching should be transparent, I should not know it exists here
// - Why do I need another structure when I just want a list of articles?
$categoryArchive = $articleStorage->getCachedCategoryList();

// Get articles inside the category archive
// - Passing an empty array makes no sense, what is it doing?
// - Is it already filtered?
$catArticles = $categoryArchive->getArticlesByTag(array());

There are a number of issues with the code and a common problem at this point is knowing where to start the refactor. I like to start by understanding the interface and adapting it to conform to the single responsibility principle.

Here is the current interface:

<?php
interface ArticleStorage {
    public function getCompleteList();

    public function getMainList();

    public function getCachedMainList();

    public function getCategoryList();

    public function getCachedCategoryList();

    public function getHomepageList();

    public function getCachedHomepageList();
}

There are accessor methods for a bunch of article lists and corresponding cached accessor methods for the same lists. The first stage in this refactor is to split these two functions into two classes, leaving everything else constant. This gives us two new classes.

The first class is ArticleStorage minus the caching:

<?php
interface ArticleStorage {
    public function getCompleteList();

    public function getMainList();

    public function getCategoryList();

    public function getHomepageList();
}

The second is a caching layer:

<?php
interface CachedArticleStorage {
    public function getMainList();

    public function getCategoryList();

    public function getHomepageList();
}

Now it is important to document what each of these methods do. Then we can take the refactor to the next stage:

<?php
interface ArticleStorage {
    /**
     * Gets the complete list of articles from the data source
     */
    public function getCompleteList();

    /**
     * Takes a complete list of articles and filters 
     * a number of hardcoded categories out
     */
    public function getMainList();

    /**
     * Takes the filtered list of articles and creates
     * a list that can be browsed by category
     */
    public function getCategoryList();

    /**
     * Applies a different filtering onto the complete
     * list to only retrieve content used by the website
     * itself, not the blog system
     */
    public function getHomepageList();
}

From these definitions, it appears that some functions are specialised for unique use-cases. Ideally, these cases can be generalised and specialised using method parameters instead of having a separate method for each case. Fundamentally we are getting a list of articles or a category hierarchy, both of which we can filter and remove a list of tags. If we change the interface to reflect that then we can change the filters using a configuration file instead of having them hardcoded. This will give us another degree of flexibility in the design.

<?php
interface ArticleStorage {
    /**
     * Get all of articles filtered by a list of tag names
     */
    public function getArticles($filterTags = array());

    /**
     * Retrieve a filtered list of articles and put into categories
     */
    public function getCategories($filterTags = array());

    /**
     * Gets the complete list of articles from the data source
     */
    private function getCompleteList();
}

At this point both the storage and caching layer can implement the same interface.

Generating feeds

Continuing the refactoring process, moving the feed logic to OOP entities will allow reuse across products. The following code generates the feed:

<?php
$feedInputType = strtolower($_REQUEST['feed']);

$feedType = array(
    'rss2' => new RSS2FeedWriter(),
    'rss1' => new RSS1FeedWriter(),
    'atom' => new ATOMFeedWriter(),
);

if (!isset($feedType[$feedInputType])) {
    exit;
}

$feed = $feedType[$feedInputType];

$feed->setTitle('Michael Clark on software development and technology');
$feed->setLink('http://mjac.co.uk/features?feed=rss2');
$feed->setDescription('Featured articles');

array_splice($catArticles, 6);

foreach ($catArticles as $article) {
    $newItem = $feed->createNewItem();

    $newItem->setTitle($article->getTitle());
    $newItem->setLink('http://mjac.co.uk' . Config::urlArticle($article->getId()));
    $newItem->setDate($article->getDate()->getTimestamp());

    $ac = new ArticleCache(Config::$pathArticleCache);
    $newContent = $ac->get($article);

    $newItem->setDescription($newContent);

    $feed->addItem($newItem);
}

$feed->generateFeed();

The function of the above code was not immediately obvious. In these circumstances I find it best to add comments to aid the refactoring process:

<?php
// Read and normalize a feed type
$feedInputType = strtolower($_REQUEST['feed']);

// There is no point constructing multiple classes if we are not
// going to use them
$feedType = array(
    'rss2' => new RSS2FeedWriter(),
    'rss1' => new RSS1FeedWriter(),
    'atom' => new ATOMFeedWriter(),
);

// There should be a default feed type
// rss should be an alias for rss2
if (!isset($feedType[$feedInputType])) {
    // Exiting is not useful, there should be an error message
    exit;
}

$feed = $feedType[$feedInputType];

// These should be in a configuration file
$feed->setTitle('Michael Clark on software development and technology');
$feed->setLink('http://mjac.co.uk/features?feed=rss2');
$feed->setDescription('Featured articles');

// Not obvious what the following statement does…
// It is limiting the feed to 6 articles.
// 6 should be a constant from a configuration file.
array_splice($catArticles, 6);

// $catArticles is not a good variable name
// cat meow, category would be better
foreach ($catArticles as $article) {
    $newItem = $feed->createNewItem();

    $newItem->setTitle($article->getTitle());

    // This hardcoded URL should be extracted
    $newItem->setLink('http://mjac.co.uk' . Config::urlArticle($article->getId()));
    $newItem->setDate($article->getDate()->getTimestamp());

    // $ac is a bad variable name
    // ArticleCache does not indicate what it actually does
    $ac = new ArticleCache(Config::$pathArticleCache);
    $newContent = $ac->get($article);

    $newItem->setDescription($newContent);

    $feed->addItem($newItem);
}

// Ideally this would not output straight away so we could have
// control over how the script exits (plus compression)
$feed->generateFeed();

This code resembles a factory pattern at the highest level. We provide a feed type and get a feed writer.

<?php
interface FeedFactory {
    public function getFeed($feedType);
}

The FeedWriter library is an example of the builder pattern and has a well-defined interface. We can adapt this interface to use a custom article format instead of FeedWriter items:

<?php
interface Feed {
    public function setTitle($title);
    public function setLink($link);
    public function setDescription($description);

    public function addArticles($articleList);

    public function generateFeed();
}

Feed generation is now readable and can be reused across multiple websites:

<?php
$feedFactory = new FeedFactory();

$feed = $feedFactory->getFeed($feedInputType);

$feed->setTitle('Michael Clark on software development and technology');
$feed->setLink('http://mjac.co.uk/features?feed=rss2');
$feed->setDescription('Featured articles');

$feed->addArticles($categoryArticles);

$feed->generateFeed();

View improvements

An important aspect of the article system is the view and template code. The templates are written in the PHPSavant template system. This template system uses PHP directly and allows you access to all of PHP’s power.

This freedom resulted in logic creeping into the template system:

<div class="grid_2 categories">
    <h2>Categories</h2>
<?php
$activeCatComp = explode('/', $this->category);
$catComp = array();
$target =& $catComp;
foreach ($activeCatComp as $cat) {
    $target[$cat] = array();
    $target =& $target[$cat];
}

echo ArticleUtility::catsToUl(NULL, $this, $this->categoryList, array(), $catComp);
?>
</div>

The inline function converts a string category into a recursive hierarchical key using reference swapping. This is complex code that is unrelated to presentation. It does not belong here.

After the inline function, there is a function call to generate the HTML category list. This uses the recursive hierarchical key to mark the active category to the user.

<?php
public static function catsToUl($linkage, $tpl, $cats, $seen, $active = array())
{
    if ($cats instanceof \Escribir\TagLibrarySearch) {
        $cats->ksort();
    } else {
        ksort($cats);
    }

    $str = '<ul>';
    foreach ($cats as $catname => $catcontents) {
        if ($catname === '/') {
            continue;
        }

        $newArr = $seen;
        $newArr[] = $catname;

        $newActive = isset($active[$catname]) ? $active[$catname] : false;

        $str .= '<li><a href="' . $tpl->escape(Config::urlArticles(implode('/', $newArr)))
            . '"' . ($newActive !== false ? ' class="active"' : '')
            . '>' . $tpl->escape($catname) . '</a>' .
            (count($catcontents) > 1 ?
            ArticleUtility::catsToUl($linkage, $tpl, $catcontents, $newArr,
            $newActive === false ? array() : $newActive) : '') . '</li>';
    }
    $str .= '</ul>';
    return $str;
}

If you understand that straight off, congratulations. Certainly it does not represent the reusable literate code we would like to maintain. Sometimes in these circumstances it is best to rewrite, and a good way to rewrite is to examine the end result.

<h2>Categories</h2>
<ul>
    <li><a href="/coding">coding</a></li>
    <li><a href="/features" class="active">features</a></li>
    <li>
        <a href="/personal">personal</a>
        <ul>
            <li><a href="/personal/sixth form">sixth form</a></li>
            <li><a href="/personal/university">university</a></li>
        </ul>
    </li>
    <li><a href="/portfolio">portfolio</a></li>
    <li><a href="/projects">projects</a></li>
    <li><a href="/reading">reading</a></li>
    <li><a href="/sport">sport</a></li>
    <li><a href="/technology">technology</a></li>
</ul>

The template only needs to know the attributes and text values here, it does not need access to the entire model.

Category DTO

Each category has multiple children and its own link plus the meta data that will mark it as the active category. This is a simple data transfer object. In PHP we can use an associative array instead of creating a concrete type:

<?php
$category = array(
    'name' => 'personal',
    'link' => '/personal',
    'categories' => array(/* recursive */),
);

Reading and writing this structure will require some kind of recursive function, or at least some tracking of the depth of the tree in combination with a non-recursive function. It is now clear that this constraint caused much of the complexity in the function we are trying to re-factor.

One solution to this complexity is to limit article categories to be two tier. This will allow us to write the structure without using PHP functions, and will add flexibility for distinguishing between categories and subcategories.

<h2>Categories</h2>
<ul>
    <?php foreach ($this->categoryList as $category): ?>
        <li>
            <a href="<?= $this->escape($category['link']) ?>"
                <?= $category['active'] ? ' class="active"' : '' ?>>
                <?= $this->escape($category['name']) ?>
            </a>
            <?php if (!empty($category['categories'])): ?>
                <ul>
                    <?php foreach ($category['categories'] as $subcategory): ?>
                    <li>
                        <a href="<?= $this->escape($subcategory['link']) ?>"
                            <?= $subcategory['active'] ? ' class="active"' : '' ?>>
                            <?= $this->escape($subcategory['name']) ?>
                        </a>
                    </li>
                    <?php endforeach; ?>
                </ul>
            <?php endif; ?>
        </li>
    <?php endforeach; ?>
</ul>

This code is much less complex and more maintainable than the previous recursive function. We can easily edit that HTML structure of both categories and subcategories.

Generating the DTOs

We will have to implement code to create the new data structure.

<?php
class RecursiveCategoryNavigation
{
    private $activeCategory = '';

    public function setActiveCategory($activeCategory)
    {
        $this->activeCategory = $activeCategory;
    }

    public function create(\Escribir\TagLibrarySearch $categories)
    {
        return $this->_create($categories);
    }

    private function _create($categories, $categoryComponents = array())
    {
        $navigation = array();

        foreach ($categories as $category => $categoryContents) {
            if ($category === '/') {
                continue;
            }

            $subcategoryComponents = $categoryComponents;
            $subcategoryComponents[] = $category;

            $subcategoryLinks = implode('/', $subcategoryComponents);

            $sectionLinks = array(
                'name' => $category,
                'link' => Config::urlArticles($subcategoryLinks),
                'categories' => $this->_create($categoryContents, $subcategoryComponents),
                'active' => $this->_isActiveCategory($subcategoryLinks),
            );

            $sectionTitle = ucwords($category);
            $navigation[$sectionTitle] = $sectionLinks;
        }

        usort($navigation, function ($a, $b) {
            return strcmp($a['name'], $b['name']);
        });

        return $navigation;
    }

    private function _isActiveCategory($category)
    {
        return $category === $this->activeCategory;
    }
}
php