Developing the Content Workflow System for Programiz
As of July 2020, I have been working as a Python Developer and Senior Content Editor at Programiz for the past 10 months. This blog was originally posted on the Programiz Blog on May 19, 2020, and can be accessed here.
At Programiz, one of our daily tasks is to create beginner-friendly programming tutorials that eventually reach millions of users all over the world. Behind the scenes, we are also constantly experimenting with different tools and techniques to improve our products and enhance the user experience.
One such tool I recently developed for Programiz is the Content Workflow System. It allows content writers to write, edit, manage, review, and publish content directly from Google Docs. This content includes programming tutorials, quizzes, and challenges for the Programiz website as well as the mobile app.
What could possibly have gone wrong that made us abandon conventional writing tools to develop an entire Content Workflow system? As a matter of fact, this was not our first time trying to do so.
Problems with Conventional Writing Tools
If you are developing a blogging site or any site with similar static web pages, you know that it is a lot of hassle to manually type in HTML for each article post. In today's context, most content writers probably end up using some sort of visual (WYSIWYG) text editor or markdown language to speed up the writing process.
When Programiz first started as a small company, we also used one such rich text editor, CKEditor, to write our programming tutorials. This editor would translate our writing to HTML, and we would publish it directly on the Programiz website.
Our articles usually consist of plain and stylized text, images, lists, tables, and preformatted code blocks. All of these items are non-fancy HTML elements and are supported by CKEditor.
Even so, we soon became aware of CKEditor's many flaws. Smart WYSIWYG editors like this tend to be overly clever, which caused problems such as incorrectly formatted special symbols within preformatted code blocks and HTML bloated with empty paragraph tags. Additionally, we had to manually write the items that required Programiz-specific HTML structure, which was quite inconvenient. Due to these difficulties, we felt compelled to search for better alternatives.
We wanted a way to automate the Programiz-specific HTML generation process without having the content writers explicitly follow these semantics every time they wrote an article. Since we required a deeper level of customization, our next step was to start from the ground up and build our very own text editor.
We used Slate.js (a customizable framework for building rich text editors) to create our custom text editor. We shaped this tool to fit our needs, and it worked perfectly for some time.
As we began to scale up our team and programming tutorials over the years, however, we realized that this might not have been the best solution moving forward. One of the main problems with this tool was that it did not provide us with a layer for moderation and article review. So, most of the articles that our content writers previously wrote were only proofread by themselves before publishing. This gave room for mistakes and compromised the quality of our tutorials.
If you were to review some of our old programming articles, you would find numerous grammatical and linguistic errors even though they are technically correct. This went against our foundational belief in delivering the highest quality content and prioritizing quality over quantity. To uphold these principles, we decided to invest significant time in developing a robust Content Workflow System, time we could otherwise have spent writing many more tutorials. We were back to square one: we now needed a medium not only to write but also to review and manage our tutorial articles. This was only possible with a medium that supported seamless, real-time collaboration and sharing.
The only option that came to mind at that time was Google Docs, a burgeoning web-based word processor by Google. I began experimenting with the Google Docs API to convert simple Google Documents into HTML, and it produced decent results. Besides, Google Docs also provided us with a reliable, state-of-the-art interface for real-time collaboration and article review (commenting). Subsequently, we decided to develop this idea into a full-fledged content-writing tool.
Most of the tutorial articles that you see today on the Programiz website are derived from Google Docs!
Update: Today, the Content Workflow System is not only used for Web articles, but also for the Mobile App & Programiz Pro content including Articles, Quizzes, and Challenges—all of which are internally written in Google Docs.
Google Docs does not provide us with these customizable features right out of the box. So how was I able to exploit the features of Google Docs in our favor?
Our Solution: Content Workflow System
docsToHTML: Generating HTML Content from Google Documents
A Google Document comes with all the typical elements that are available in HTML, such as headers, paragraphs, lists, tables, and basic stylizing options (bold, italics, hyperlinks, subscript, superscript).
However, as previously mentioned, there was no out-of-the-box way to convert Google Documents into the HTML we needed. There were also no third-party libraries to achieve this task, so I had to build one using Python. docsToHTML is a Python module that converts Google Documents into HTML using the Google Docs API.
In addition to basic HTML elements, our tutorial articles use other HTML tags for preformatted code blocks, other inline stylings, and note-tips.
These custom elements are not natively available in Google Docs. So I also had to find a way to incorporate our Programiz HTML semantics into the Google Docs interface. This involved finding methods to represent these elements as Google Docs components, which could then be detected and parsed by a Python script.
We decided to use a combination of different styling options—like fonts and foreground/background colors—to distinguish these elements. The following image shows one such option for the preformatted code that I mentioned earlier.
While this approach effectively resolved our challenges, it would have been overwhelming for our content writers to remember all the distinct styling options for each element.
After some research, I found out that Google Apps Script (Google's scripting language for G Suite) could be embedded in Google Documents. These scripts could then be used to modify the Google Document and even alter its User Interface.
We utilized Google Apps Script to modify the User Interface of Google Docs, which was particularly useful for generating custom buttons in the Menu Bar to execute the styling tasks mentioned above.
Now the content writers could simply highlight the required text and perform styling options like changing the font and color, or inserting predefined tables into the Google Document using these custom buttons.
The feature to insert predefined table templates in the Google Document allows the content writers to add metadata information about articles or images. These special tables are parsed differently by our Python module. The Page Info Table, for instance, is used at the beginning of a Google Document for Web Page and Article Information:
After we complete writing a properly formatted Google Document, we can use our Python module to send a request to the Google Docs API using the Document's unique DOC_ID. Google Docs API sends back a JSON response corresponding to the contents of the Google Document.
You can visit Google Docs API Response JSON to learn more about Google Docs API and its JSON response.
Our Python module in the backend then converts this somewhat disordered JSON into a structured HTML.
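To give a feel for this conversion step, here is a minimal sketch (not the actual docsToHTML code) that walks the `body.content` list of a Docs API document resource and renders paragraphs as HTML. The field names (`paragraph`, `elements`, `textRun`, `textStyle`, `namedStyleType`) come from the Docs API response format; the function names and the style-to-tag mapping are assumptions made for illustration, and tables, lists, and images are deliberately ignored.

```python
from html import escape

# Maps Google Docs named paragraph styles to HTML tags (an illustrative subset).
TAG_FOR_STYLE = {
    "HEADING_1": "h1",
    "HEADING_2": "h2",
    "HEADING_3": "h3",
    "NORMAL_TEXT": "p",
}


def render_text_run(text_run):
    """Convert one textRun into escaped HTML with inline styling."""
    text = escape(text_run.get("content", "").rstrip("\n"))
    style = text_run.get("textStyle", {})
    if not text:
        return ""
    if style.get("bold"):
        text = f"<strong>{text}</strong>"
    if style.get("italic"):
        text = f"<em>{text}</em>"
    link = style.get("link", {}).get("url")
    if link:
        text = f'<a href="{escape(link, quote=True)}">{text}</a>'
    return text


def document_to_html(document):
    """Render the paragraphs of a Docs API document resource as HTML."""
    parts = []
    for element in document.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if paragraph is None:
            continue  # Tables, section breaks, etc. are skipped in this sketch.
        named_style = paragraph.get("paragraphStyle", {}).get(
            "namedStyleType", "NORMAL_TEXT"
        )
        tag = TAG_FOR_STYLE.get(named_style, "p")
        inner = "".join(
            render_text_run(run["textRun"])
            for run in paragraph.get("elements", [])
            if "textRun" in run
        )
        if inner:
            parts.append(f"<{tag}>{inner}</{tag}>")
    return "\n".join(parts)
```

Each paragraph in the response ends with a newline character inside its last text run, which is why the sketch strips trailing newlines before wrapping the text in a tag.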
Let's look at how a preformatted code block is parsed by our Python module to generate the HTML content:
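As a rough illustration of the idea, the sketch below treats any paragraph whose text runs all use a particular font as code and merges consecutive code paragraphs into one `<pre><code>` block. The marker font (`Consolas`) and the function names are assumptions for this example; the post does not state which font or colors Programiz actually uses. The `weightedFontFamily.fontFamily` field is part of the Docs API text style.

```python
from html import escape

# Assumed marker font; the real styling convention is not given in the post.
CODE_FONT = "Consolas"


def paragraph_text(paragraph):
    """Concatenate the raw text of all text runs in a paragraph."""
    return "".join(
        el["textRun"].get("content", "")
        for el in paragraph.get("elements", [])
        if "textRun" in el
    )


def is_code_paragraph(paragraph):
    """True when every text run in the paragraph uses the marker font."""
    runs = [el["textRun"] for el in paragraph.get("elements", []) if "textRun" in el]
    return bool(runs) and all(
        run.get("textStyle", {}).get("weightedFontFamily", {}).get("fontFamily")
        == CODE_FONT
        for run in runs
    )


def render_blocks(paragraphs):
    """Merge runs of consecutive code paragraphs into single <pre><code> blocks."""
    html_parts = []
    code_lines = []

    def flush_code():
        # Emit the pending code lines (if any) as one preformatted block.
        if code_lines:
            code = escape("".join(code_lines).rstrip("\n"), quote=False)
            html_parts.append(f"<pre><code>{code}</code></pre>")
            code_lines.clear()

    for paragraph in paragraphs:
        if is_code_paragraph(paragraph):
            code_lines.append(paragraph_text(paragraph))
        else:
            flush_code()
            text = escape(paragraph_text(paragraph).rstrip("\n"), quote=False)
            if text:
                html_parts.append(f"<p>{text}</p>")
    flush_code()
    return html_parts
```

Grouping matters because Google Docs stores every line of a code snippet as its own paragraph, while the HTML output needs a single preformatted element with the line breaks preserved.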
The Image Info Table mentioned earlier is parsed in the following way:
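A simplified sketch of how such an info table could be read from the Docs API response is given below. It assumes a two-column layout with a field name in the first column and its value in the second; that layout, the function names, and any field names used with it are hypothetical, since the post does not show the real templates. The `tableRows`, `tableCells`, and `content` fields are part of the Docs API table structure.

```python
def cell_text(cell):
    """Concatenate the text of every paragraph inside a table cell."""
    pieces = []
    for element in cell.get("content", []):
        for run in element.get("paragraph", {}).get("elements", []):
            pieces.append(run.get("textRun", {}).get("content", ""))
    return "".join(pieces).strip()


def parse_info_table(table):
    """Read a two-column table as {key: value}, skipping rows with an empty key."""
    info = {}
    for row in table.get("tableRows", []):
        cells = row.get("tableCells", [])
        if len(cells) < 2:
            continue  # Not a key/value row under the assumed layout.
        key = cell_text(cells[0])
        if key:
            info[key] = cell_text(cells[1])
    return info
```

The resulting dictionary can then drive the HTML generation, for example by supplying the attributes of an `<img>` tag instead of rendering the table itself.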
After completing these basic features, I further polished the docsToHTML module to perform a sanity check on the HTML produced from the Google Document. Currently, the module does the following:
- Detects the programming language described in the article. (Some HTML semantics are language-specific)
- Checks if any image names clash with other pre-existing images.
- Replaces Smart Quotes and different Unicode whitespaces with normal ones. (can cause errors if they occur in code blocks)
- Checks if all Comments and Suggestions in the Google Document have been resolved. (warning if moderator reviews aren't addressed)
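The smart-quote and whitespace cleanup from the list above can be sketched in a few lines. This is an illustrative version rather than the module's actual code; the function name and the exact set of characters handled are assumptions.

```python
import unicodedata

# Typographic quotes that word processors substitute for plain ASCII quotes.
QUOTE_MAP = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}


def normalize_text(text):
    """Replace smart quotes and Unicode space separators with ASCII equivalents."""
    cleaned = []
    for char in text:
        if char in QUOTE_MAP:
            cleaned.append(QUOTE_MAP[char])
        elif unicodedata.category(char) == "Zs":
            # "Zs" covers the ordinary space plus no-break space, en/em spaces, etc.
            cleaned.append(" ")
        else:
            cleaned.append(char)  # Tabs and newlines are left untouched.
    return "".join(cleaned)
```

For example, `normalize_text("print(\u201chello\u201d)\u00a0# ok")` returns `print("hello") # ok`, which is safe to paste into an interpreter.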
The Table of Contents bar that you see on the Programiz website is also automatically generated by docsToHTML.
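One way a table of contents could be derived from generated HTML is sketched below, using only the standard library. It collects every `<h2>` heading and links to an anchor built from the heading text. The choice of `<h2>`, the slug format, and the assumption that the headings carry matching `id` attributes are all illustrative and not taken from the docsToHTML source.

```python
import re
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h2> element in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


def slugify(text):
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_toc(article_html):
    """Return an HTML list linking to each <h2> heading in the article."""
    collector = HeadingCollector()
    collector.feed(article_html)
    collector.close()
    items = [
        f'<li><a href="#{slugify(heading)}">{escape(heading.strip())}</a></li>'
        for heading in collector.headings
        if heading.strip()
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Because the headings already exist in the Google Document, the writers never have to maintain the table of contents by hand.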
HTMLToDocs: Converting HTML Contents To Google Document
After developing docsToHTML, we started writing our new tutorials in Google Docs and using the module to convert them into HTML. However, there were still a lot of old articles for which we only had the HTML copies.
Since we strive for perfection, we constantly update our old articles by fixing technical and grammatical errors wherever necessary. We also rewrite obsolete sections of old articles with up-to-date developments in that certain topic.
The editing of these existing articles had to be done manually by changing their HTML as we did not have their corresponding Google Doc (docsToHTML could not be used). Rewriting the entire article in Google Docs was also not a very feasible solution.
To solve this problem, I developed HTMLToDocs. HTMLToDocs is a similar Python module that converts raw HTML content into a Google Document. It performs the exact reverse operation of the docsToHTML module.
HTMLToDocs takes in a Programiz Article Web Page URL and converts the tutorial content into a Google Document (what we would have originally written had we used Google Docs). It is programmed to parse both our old and new HTML semantics.
This Python module also uses the Google Drive API to organize the Google Docs it generates into a proper hierarchical structure in the user's Google Drive.
Now that we had completed the full cycle of converting Google Docs to HTML and HTML back to Google Docs, we were ready to take this project one step further.
Content Workflow System: Bringing Everything Together
Initially, every content writer had to set up a Python environment and install various dependencies on their local machine to run docsToHTML and HTMLToDocs. They also had to keep track of the Google Document ID to run these scripts.
The Content Workflow System was developed to create an interface that connects both docsToHTML and HTMLToDocs while running everything in the cloud. It would also keep records of every entry made by the user. We could then use this tool to write, edit, review, and publish content on the Programiz website even more easily.
For this, I first converted the docsToHTML and HTMLToDocs modules into an API and hosted them in the cloud. Next, we built a User Interface to send requests to and retrieve responses from this API.
Users can log in to the Content Workflow System (CWS) with their Programiz credentials. They have the option of using either docsToHTML or HTMLToDocs. In either case, the process ends with docsToHTML, which generates the HTML content.
Update: The users now also have the option to write content, quizzes, and challenges for the Mobile App and Programiz Pro. For this, there is a docsToJSON module that can convert a Google Doc to a specific JSON format.
When the user submits an article, it goes to the review section. Reviewers can then either approve the article or send it back to the user for further editing, and the user is notified accordingly. Suggestions and comments are handled in the Google Document itself.
The reviewed article, along with images, can finally be uploaded to the website with our tool. The article will go for revision and can be published by the admin or the moderator.
The Content Workflow System has become significantly better since its first release. However, it is far from perfect, and there are still occasional errors that go unhandled.
Challenges
Using Google Docs came with its own shortcomings. One of the major problems I faced while developing HTMLToDocs was that inserting and editing all the styling elements in Google Documents with Python (or, in fact, with any of the supported programming languages) was far less intuitive than doing the same with Google's native Apps Script. The Google Docs API documentation was also quite ambiguous.
For instance, we can add a table in Google Docs using Apps Script with a structure similar to:
var cells = [
    ['CELL11', 'CELL12'],
    ['CELL21', 'CELL22']
];
var table = element.insertTable(index, cells);
If instead, we were to use other programming languages, we would have to send the following JSON request to the Google Docs API.
[
    {"insertTable": {"rows": 2, "columns": 2, "location": {"index": 2}}},
    {"insertText": {"location": {"index": 13}, "text": "CELL22"}},
    {"insertText": {"location": {"index": 11}, "text": "CELL21"}},
    {"insertText": {"location": {"index": 8}, "text": "CELL12"}},
    {"insertText": {"location": {"index": 6}, "text": "CELL11"}}
]
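Note: The cells are filled in reverse order so that the text inserted into one cell does not shift the indices of the cells that are still to be filled.
Here, the index of any cell is given by the following formula, where CURRENT_ROW and CURRENT_COLUMN are counted from zero. For the last cell in the request above (TABLE_INDEX = 2, two columns, row 1, column 1), it gives 4 + 2 + 5 * 1 + 2 * 1 = 13.
4 + TABLE_INDEX + (1 + NO_OF_COLUMNS * 2) * CURRENT_ROW + 2 * CURRENT_COLUMN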
This is not mentioned anywhere in the Google Docs API documentation. Moreover, some basic elements like a Horizontal Rule could not even be added, at least at the time of writing this article.
Nonetheless, we were able to find workarounds for most of the problems we faced. The Content Workflow System has been able to meet most of our requirements, and it has definitely made the content writing process smoother than ever.
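The index formula and the reverse-order trick can be combined into a small helper that builds the request list for any table. This is a sketch based only on the formula given in this post, which reflects the observed API behavior at the time of writing; the function names are made up for illustration, and the resulting list would still need to be sent through the API's batch update call.

```python
def cell_index(table_index, num_columns, row, column):
    """Index where text for the zero-based cell (row, column) starts."""
    return 4 + table_index + (1 + num_columns * 2) * row + 2 * column


def table_requests(table_index, cells):
    """Build insertTable and insertText requests for a rectangular table."""
    num_rows = len(cells)
    num_columns = len(cells[0])
    requests = [{
        "insertTable": {
            "rows": num_rows,
            "columns": num_columns,
            "location": {"index": table_index},
        }
    }]
    # Fill from the last cell to the first so that earlier indices stay valid.
    for row in reversed(range(num_rows)):
        for column in reversed(range(num_columns)):
            requests.append({
                "insertText": {
                    "location": {
                        "index": cell_index(table_index, num_columns, row, column)
                    },
                    "text": cells[row][column],
                }
            })
    return requests
```

Calling `table_requests(2, [["CELL11", "CELL12"], ["CELL21", "CELL22"]])` reproduces the five requests shown earlier, with text inserted at indices 13, 11, 8, and 6 in that order.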
Final Words
Creating your own Content Writing Tool can seem like a daunting task at the beginning. You are sure to encounter loads of unintended errors along the way. It will take time to get used to the quirks of how different services like Google Docs handle information and how we can use them to our advantage.
However, I believe it is well worth the effort if you really want to add custom features to your Content Writing Project (or any other project for that matter) and save yourself some nasty inconvenience in the future. Moreover, you come out learning more about what actually happens under the hood of various services & frameworks and why things work the way they do.
P.S. HTML Content for this blog post was also generated via docsToHTML.