Website Build Workflow Overview

The new website build pipeline

Let's call it Fawkes, after Dumbledore's phoenix from the Harry Potter movies. This new pipeline rises from the ashes of the old, and it's Fawkes and it does Fawkesy things, and this is meant in the most satirical way so that the Rowling estate attorneys will leave this fair use well enough alone.

Let's start with how Fawkes helps the customer, then work back through the website contributor, and finally the DevOps engineer.

From the customer's point of view

View relevant and useful information
1. As of early 2018, over 80% of customers on the website would immediately switch to the master version. This was when the site was displaying the latest release. This created an extra click every visit to the site, so it was decided on the developer email list to change the site to default to master. If the customers on the website start leaning towards wanting to see the latest stable release, then the site can be quickly updated to do so. Or, if incremental versions of a stable site are created, these can be published as default instead.
2. With Fawkes, the default version of the website can be set in the Jenkins UI, without any programming. A non-technical site administrator could then change the behavior of the site simply by updating the website build job in Jenkins.
The website should be easy to use
1. Prior to Fawkes, the customer would have to find the versions dropdown (hidden in a plus sign), switch versions, then navigate the site to find the installation page, tutorials, or APIs for that version.
2. Now with Fawkes, the website has a consistent look and feel no matter the version. The version dropdown is always visible. Installation options for every version are on the installation page. There is no dropdown or difference for this page.

From a website contributor's point of view

Add or edit content on the website, test it, and create a pull request that shows successful testing
1. Prior to Fawkes, this was not possible. You could make changes in a local branch, and then building docs would fail due to either Docker issues or dependency issues, or both. You could try on Ubuntu with some success, but it would be a preview with a single version. You would not have tested the full website, so in some situations your merged code would break the website. Testing fixes would involve a lot of time-consuming trial and error.
2. At the time of this writing you are still limited to using Ubuntu, but you can run a script that will set up a server with all the dependencies, then generate the full website and deploy a preview for sharing on a pull request. End to end, starting from launching a new EC2 instance to viewing the website, takes about 20 minutes. Once the server is set up, incremental preview builds for a single version can take seconds, and full version builds can take several minutes.
Limit website builds to the versions and APIs you want to preview.
1. Again, this was not possible before. It was all or nothing. You would be in for at least 45 minutes per build, so making a mistake was very costly.
2. Fawkes has a settings file where you can set which APIs to run for which versions.
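
To make the idea concrete, here is a minimal sketch of how a per-version settings lookup could work. The real settings file format is not shown on this page, so the dictionary layout and the API names below are hypothetical; only the version names are taken from this page.

```python
# Hypothetical settings: which API docs to build for each site version.
# The layout and API names are illustrative assumptions, not Fawkes' actual file.
SETTINGS = {
    "master": {"python": True, "scala": True, "r": False},
    "v0.12.1": {"python": True, "scala": False, "r": False},
}


def apis_to_build(settings, version):
    """Return the sorted list of APIs enabled for the given version.

    A version missing from the settings builds nothing, so a preview
    stays limited to what was explicitly requested.
    """
    enabled = settings.get(version, {})
    return sorted(api for api, on in enabled.items() if on)


if __name__ == "__main__":
    for version in SETTINGS:
        print(version, apis_to_build(SETTINGS, version))
```

Keeping the switches per version is what lets a contributor preview, say, only the Python docs for master instead of paying for a full build.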

From a DevOps engineer's point of view

Deployments are parameterized and modular
1. The earlier incarnation was tightly bound to whatever was already merged into master, and the settings were hard-coded there. It was impossible to run specific versions of the site.
2. Now with parameterized website build jobs in Jenkins, DevOps can make on-the-fly updates to how the site is built and deployed, just through the Jenkins UI. The options include selecting which versions to build by branch or tag, selecting the version names that appear in the website, selecting the default version for the site, and selecting which repository source to use (such as development forks for testing).
Deployments are testable and reproducible
1. Modifications to the website build flow used to involve editing the underlying code and manually copying in build artifacts to patch bugs.
2. A test harness is now available, so that changes to CI scripts and Jenkins parameters can be tested in full.
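
As an illustration of the parameterized job described above, the sketch below shows one way the build parameters could be validated before a build starts. The parameter names, the semicolon-separated format, and the sample values are assumptions made for this example; the actual Jenkins job may define them differently.

```python
def parse_build_params(tags, display_names, default_version):
    """Pair each branch or tag with its display name and check the default.

    All three arguments are hypothetical job parameters. The two lists are
    semicolon-separated strings, as they might be typed into a text field.
    """
    tag_list = [t.strip() for t in tags.split(";") if t.strip()]
    name_list = [n.strip() for n in display_names.split(";") if n.strip()]
    if len(tag_list) != len(name_list):
        raise ValueError("each tag or branch needs exactly one display name")
    if default_version not in name_list:
        raise ValueError("default version must be one of the display names")
    # Map display name -> branch or tag to check out.
    return dict(zip(name_list, tag_list))


if __name__ == "__main__":
    versions = parse_build_params(
        "master;v0.12.1;v0.12.0", "master;0.12.1;0.12.0", "master"
    )
    print(versions)
```

Failing fast on a mismatched list or an unknown default is cheaper than discovering the problem after a long site build.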

The history

To explain the current workflow it is useful to cover its history, since the current workflow was built on that foundation. This can help you understand where we were, where we are, and where we can go in the near term. Or, if you're impatient and just want to get flying with Fawkes, jump to How to Build the Website.

The website was designed to have multiple site versions. There would be a website for the oldest version still maintained, v0.11.0, and one for v0.12.0, and one for v0.12.1, and so forth, as well as a site for the master branch. Each site would be a complete representation of the website at the time of that version's release and would be generated from the release tag. Then, the sites would be combined into one overall site with the latest tag being the default view. You could switch between the sites by clicking the plus button on the top right side near the search box, and a versions drop down selector would appear. The API docs would appear per version, and views would be rewritten dynamically first by Python code in the website build and then by client-side code during site run-time.

The website build tools were split into two parts. The first, and easiest, was to ignore all of the versions logic and just build what you have locally. It used a shell script that called a Docker container, which mounted a shared directory and generated the website and all of the documentation in this folder. The second, more complicated route was similar, but it also generated the full site with all of the site versions. This output would be copied into the mxnet-site repo and then published.

This approach was modular and self-contained. It was platform independent. It was a true representation of each software release.

What was broken?

This pipeline had several downsides. The first was the inflexibility in choosing a default website view. It assumed you always wanted to use the latest tag, which may not be the case. It also only used tags and not branches, so if there were any patches to a branch post-tag, you would not see this information in the website, tutorials, or docs. It also created a time-travel-like feel, so that when you went to the older versions, the entire website would change. The design would be different and the navigation options would change. Finally, and most critically, it wasn't documented. So when various parts began to break, it was difficult for people who were not involved in its creation to help. There was also the issue of maintenance of the front-end. Someone unfamiliar with the back-end process would try to help, but would find themselves blocked by the complicated nature of the website: its various version incarnations, how it would overwrite code that generated the docs with undocumented build functions, and how it would inject client code to override expected functionalities of common frameworks.

In hindsight, after you go through a round of troubleshooting and fixing, you often find the root cause only after you have fixed several other symptoms or unrelated issues. As with your mechanic, troubleshooting is why your bill is often higher than expected. The diagnostic says "Check Engine" and while there are a couple of other codes as hints for the mechanic, you're usually going to get hit with a couple of charges for problems you didn't even know you had. And in some cases, a sensor or part gets swapped out because it could be the problem and the mechanic won't know until after he fixes it. Only then might the mechanic realize that parts A and B were probably OK, and C was the culprit. But you will surely be charged for the time, if not all of the parts. Troubleshooting the website build pipeline was much the same.

The project build became larger over time and Docker's default memory allocation eventually was not enough. The error handling for this event is abysmal. You end up crashing at various points in the build, and each time you investigate that part of the build to find out what happened. Only after many, many iterations, when you realize that you should search for "why is my process in docker crashing randomly", do you get the answer. By the way, the current answer is to increase your limit to 4 GB on macOS.
The dependencies for the website and docs build were not pinned to specific versions. As dependencies get updated they can exhibit incompatibilities. In one case, RecommonMark became incompatible with the latest Sphinx. After research and trial and error a stable state could be found. However, the website build's warnings and errors had been so long ignored, it's impossible to know the exact combination of dependencies that worked perfectly, if that even existed.
Bugs would be found in one of the versions of the website. However, these were generated by a tag, so the code was locked for that version. It became practice to manually patch the website after each build.
You couldn't update the home page until a new release was tagged.
Launching new versions required website updates to be included in the tag, increasing the latency and number of steps in a release.
Building the full site with all of its versions was all or nothing. To test the build you had to trigger a production job, using production code. You could not test your own fork/branch with the versioned site pipeline. You only had access to build your current site locally which wouldn't uncover issues introduced after the build pipeline changed the code outputs.
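
One common remedy for the dependency drift described in the list above is to pin exact versions. The check below is a minimal sketch of that idea and is not part of the Fawkes tooling; the sample package list and version numbers are invented for illustration.

```python
import re

# A line counts as pinned only if it uses an exact "==" version specifier.
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[A-Za-z0-9.*+!_-]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Hypothetical requirements; the version numbers are made up.
    sample = ["sphinx==1.5.6", "recommonmark", "# docs tooling", "breathe>=4.0"]
    print(unpinned_requirements(sample))
```

A check like this could run at the start of a docs build so that an unpinned dependency is flagged before it has a chance to drift.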
There's more, but let's move on to what we have today.

Website build goals (that haven't changed)

The intention of the website build is still the same. There are several versions of documentation to display, there is news on the home page, and there are instructions and tutorials. The manner in which these are created has been given more flexibility, but there is a long way to go to achieve a state where it satisfies its customers by being useful and easy to use, is easy to maintain by people of different disciplines, and exhibits few or no errors during build and deployment. In fact, these three delivery aspects will probably never go away. As the audience mix and tastes mature, the underlying website technologies will change, as will the people that service them.

Other goals include the ability to run the website build as part of a CI pipeline, local testing for developers, easy editing of the home page and other parts of the site, easy addition of more API docs, internationalization, support for analytics, marketing, and social media tools, reporting capability, and near-zero downtime.

Next up

How to build the website (with Fawkes)
