Localization is crucial for reaching a global audience, yet it's often an afterthought for developers and non-trivial to implement. Traditionally, game developers have outsourced this task because of its time-consuming nature.

But it doesn’t have to be this way.

Yan Cui will show you a simple technique his team used at GameSys which allowed them to localize an entire story-driven, episodic MMORPG (with over 5000 items and 1500 quests) in under an hour of work and 50 lines of code, with the help of PostSharp.

Watch the webinar and learn:

  • The common practices of localization
  • The challenges and problems with these common practices
  • How to rethink the localization problem as an automatable implementation pattern
  • Pattern automation using PostSharp

Solving localization challenges with design pattern automation on Vimeo.

You can find the slide deck here: http://www.slideshare.net/sharpcrafters/solving-localization-challenges-with-design-pattern-automation

Video Content

  1. Six Sins of Traditional Approach to Localization (6:20)
  2. Automating Patterns with PostSharp (16:08)
  3. Q&A (20:50)

Webinar Transcript

Hi, everyone. Good evening to those of you who are in the UK. My name is Yan Cui, and I often go by the online alias of The Burning Monk because I'm a massive fan of this '90s rock band called Rage Against the Machine. 

I'm actually joined here by Alex from PostSharp as well. 

Alex: Hi, everyone. 

Yan:

And before we start, a quick bit of housekeeping. If you have any questions, feel free to enter them in the questions box in the GoToWebinar control panel you've got. We'll try to answer as many of them as we can at the end of the session, and anything we can't cover, we will try to get back to you guys via email later on. 

We're going to be talking about some of the work I did while I was working for a gaming company called Gamesys in London. This was up until October last year, and one of the games I worked on was this MMORPG, or massively multiplayer online RPG, called Here Be Monsters. One interesting thing about Here Be Monsters was that it had lots of content, and when it came time to localize the whole game, we had some interesting challenges that we wanted to find a novel way of solving. Just from this simple screen, you can see there are a couple of pieces of text: the name of the character, the dialogue you say, as well as a UI control here that says, “out of bait.” All of these need to be localized, for the entire game. And as I mentioned earlier, this game is full of content. In fact, in terms of text, we have more text than the first three Harry Potter books combined. And there are many different screens in the game, one of which is what we call the almanac. You can think of this as an in-game Wikipedia of sorts, where you can find information about the different items and monsters in the game. Here is an example of the almanac page for Santa's Gnome, which is only available during Christmas. 

So anyway, there's some information about the monster itself: a name, a description, its type, et cetera. Those all need to be localized, as well as all the UI elements: the labels, the bait, the text at the bottom, and so on. So even for a very simple screen like this, there are actually a lot of different places where you need to apply localization. 

A few years back, Atlus, who make very popular, fairly niche RPG games, did a post explaining why localization is such a painful process that can sometimes take four to six months, and it touched on many of the different aspects involved in the localization process. You can see from that list that, by their estimation, programming alone takes between one and one and a half months with the traditional approach to localization. Traditionally, each of your client platforms will ingest a gettext file, which contains a bunch of translations in a plain text format like this: you've got, basically, key-value pairs of what the original text is and what the localized text should be. 

Alex:

Yan, so what is this PO file format? Is this the standard for localization, or do you have some tools for that? 

Yan:

Yep. So the gettext file is the industry standard for localization. Beyond that, I don't know of any standard tooling for translators. The translation agency we were working with had internal tools to help their translators work more effectively with the gettext file format. And for different programming languages, there are also libraries available for consuming those gettext files. We'll look at one for .NET later on. 
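
For reference, a PO file is essentially a list of msgid/msgstr pairs; here is a tiny, made-up example (the Portuguese strings are illustrative only):

```
# Made-up gettext entries, for illustration only.
msgid "Out of bait"
msgstr "Sem isca"

msgid "Santa's Gnome"
msgstr "Gnomo do Papai Noel"
```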

Alex: Okay. Yeah, thanks. 

Yan:

Once you've consumed those translation files, you then need to substitute all the text that you have with the localized versions of that text. You've got buttons that display some text. This is just pseudo code, it's not taken from our real code base, but it gives you an idea of where you need to apply localization: to labels and buttons and so on, as well as to your domain objects. So where you've got a domain object that represents a monster, the names and descriptions, et cetera, will need to be localized.
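
As a rough illustration of that manual approach (this mirrors the kind of pseudo code shown on the slide; the names are made up and it is not runnable as-is):

```csharp
// Made-up sketch of the traditional, by-hand approach: every UI control and
// every domain object gets an explicit translation call scattered through the code.
okButton.Text = translator.Translate("OK");
outOfBaitLabel.Text = translator.Translate("Out of bait");

var dto = new MonsterVO
{
    Name = translator.Translate(monster.Name),
    Description = translator.Translate(monster.Description)
};
```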

 

Once you've done that, you do your data binding. Then, assuming you haven't missed anything while localizing, this screen shows you all the information about Santa's Gnome. I think this is in Brazilian Portuguese. I can't read any of it, so I don't know how accurate the translations are, but at least you can see that localization has been applied in all the different places. So you pat yourself on the back for a job well done, probably go get a beer with your colleagues, and then you realize: oh wait, what happens if we make some changes or add new types to our domain? You're going to have to keep doing this work over and over, for each platform that we support.

And then, to add salt to the injury, look at how much time Atlus reckons you tend to spend on QA. The reason it takes so much time is that there's a massive scope to test. Not only do you have to spot check and make sure the translations aren't drastically wrong, but there are also loads of bugs that can creep in during the client integration work: maybe a particular screen was missed, or some button was left in English. And you've got to do this for every single platform, and because you're doing releases so frequently, or at least I hope you are, that means you have to put in repeated effort to test localization whenever you make any kind of change on the client, which broadens the scope of your regression testing and puts a lot more pressure on your QA teams. 

 

6 Sins of Traditional Approach to Localization 

Here's pretty much a laundry list of the problems that I tend to find with the traditional approach to localization. A lot of up-front development effort, and the team keeps doing more work as you introduce more domain types and extend your game. It's also hard to test, and it's prone to regressions. You see, it's normal to feel doom and gloom whenever localization is mentioned in the company, because it's a pain: once it's there, it just doesn't go away. Which is why, when it came time for us to implement localization in our game, we decided to think outside the box and see whether or not we could do it better, in a way that would be easier and more maintainable for our team. 

To give you a bit of background on what our pipeline looked like at the time: we built a custom CMS, a content management system, which we internally call TNT. It's really just a very thin layer on top of Git, where all the game design data - about the monsters, the different quests, the locations - is stored as JSON files, and we built in some integration with the Git flow branching strategy, so that we could apply the same Git flow branching strategy that we already used for developers and get our testers to do the same thing. 

Alex:

By the way, why did you choose to build a custom CMS instead of using something else? And what's the benefit of Git and Git flow in this case?

Yan:

Right. We decided to build a custom CMS because we also wanted to bake into it some basic validations that apply to our particular domain. And the reason we have a thin layer on top of Git is because we want to have source control for all our game data. We do that for all our source code, and the game data is really part of the source code of your game, which can't exist without the things that make up its content. 

And Git flow is just a way to allow our game designers to work in tandem with each other, with a well-understood process of how to merge things and how to release things when they get merged back to master. So when you look at the master branch, you know you're looking at exactly what has been deployed to production, and so on. 

We had a team of game designers working on different branches of stories. One person might be working on a storyline that's being released next week, whilst another person might be working on a storyline coming up in a month's time, and you want them to be able to work in parallel without stepping on each other's toes. Git flow comes into play as part of the mechanism for allowing them to do that. 

Does that answer your question?

Alex:

Yeah. That seems to work well, yes. Good idea. Thanks. 

Yan:

Cool. 

So inside TNT, you have some very simple UI controls so the game designers can do cherry-picks, as well as merges of different branches. Once they're happy with the game design work they've done, they can then publish to a particular environment to test it out there and see whether the quest is interesting, and whether all the mechanics they hoped for are in place.

Right, so at this point the custom CMS packages up all the JSON files and sends them to a publisher service, which performs deeper validation against the game rules. For example, if an item is of a higher level than the player can be at a particular quest, then that quest shouldn't be able to give out the item as a reward. We also do quite a bit of pre-computation and transform the data from the original JSON into formats more suitable for the different client platforms to consume. 

Now, all of that then gets pushed to S3, and from TNT, as a publisher, I press a button and can see everything that's happened. As you can see from the logs, we do this an awful lot, which is one of the reasons we decided to invest the effort into building tooling such as the CMS, so that we can constantly iterate on our game design just as much as on our code. 

Once the publisher has done its job, it uploads the game specs, as we call them, into S3, into versioned folders. As a game designer, I just press a button to publish my work, and I get an email with a link at the top that I can click to load up the web version of the game with just the changes from my branch, so that if someone else is using the same test environment to test their changes, I won't be stepping on their toes. 

You may notice that there's a link here to run an economic report. This refers to some other work we've done using a graph database to help us understand how different aspects of the game connect with each other. An item could be used in a recipe to make another item, which can then be used to catch a monster, which then drops loot, which can then be used in another quest, and so on and so forth. Our domain is very highly connected, and even a small change, like upping the price of water, can have a huge number of knock-on effects that ripple through the entire economy of the game. So we use the graph database to automate a lot of those validations and the auto-balancing. 

You can also see, further down the email, the results of the game rule validations, which we report back to the game designer. At this point, all the specs are ready: the server, the Flash client, as well as the iPad client will be able to consume that data in their different formats, and you'll be able to load up the game and test out your changes. 

Alex:

Another question here; so why do you need to produce different formats for all those different platforms? Is that a requirement? 

Yan:

Right. So for example, on the server application, we don't really care about the name of a monster, or the description of a monster. Stripping those out helps reduce the size of the file and how long it takes to load, as well as the memory footprint of the server application. So we strip that client-only information from the server spec. We also precalculate a bunch of secret values - coefficients and things like that - and bake them into the server spec, but don't make them available in the client spec. 

Of course, the client specs are public, so anyone who's a bit more tech savvy will be able to download the spec, and then work out its format and understand some secret values that we have embedded into our domain. They'll be able to cheat in the game, essentially. 

Alex: Uh-huh, okay. 

Yan:

And also, the Flash client, because it's all web based, prefers to load the whole thing as one big zip, whereas the iPad client prefers to have smaller file sizes, but many of them -

Alex: Okay, that makes sense. 

Yan:

So at this point, we thought, “well, if we're doing localization, what if we bundle it into our publishing process, so that by the time all the files have been generated, they're already localized for the client?” We wouldn’t have to do some of the things we saw earlier, where you have to apply localization to your domain objects all the time. You can then publish the localized versions of the game specs to language-specific folders. Notice that, as I mentioned earlier, the server doesn't care about most of our text being localized, so we don't actually need to apply that same process to the server spec. 

So with that, you remove the duplicated effort on each of the client platforms. At the same time, you reduce the number of things that can change between releases, because you have an automated process for doing this, so there are fewer things to test. But we still have the problem of having to spend a large amount of effort up front: all the things you were doing on the client before now have to be done by something else - in this case, the server team - which has to, like I said before, ingest a gettext file to know all the translations, then check the domain objects for string fields and properties that need to be localized, apply the localization when transforming those domain objects into DTOs, and then do the same thing for multiple languages if you're localizing for several targets.

Automating Patterns with PostSharp 

 But notice that steps two and three are really just an implementation pattern, one that can be automated to future-proof yourself against changes, so that as you add more domain objects, you get localization for free. 

And in .NET, you can consume a gettext file and get the translations from it using the SecondLanguage package. And obviously, because we're here, I'm going to be talking about the implementation pattern and how to automate it with PostSharp. 

So for those of you who are not familiar with PostSharp: you can write different aspects, which then apply post-compilation modifications to your code, so that you can bake additional logic and behavior into it. Here, what I've got is a very simple aspect which is applied only to fields or properties of type string, so that when you call the setter on those fields or properties, this bit of code runs. As part of that, we check a LocalizationContext object to see whether or not we are in a localization context. If not, we just move on. 
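
To make that concrete, here is a rough sketch of what such an aspect could look like. This is a reconstruction, not the actual GameSys code: LocalizationContext is a hypothetical helper (sketched further below), while LocationInterceptionAspect, OnSetValue and ProceedSetValue are PostSharp's real API.

```csharp
using System;
using PostSharp.Aspects;
using PostSharp.Reflection;

[Serializable]
public class LocalizeAttribute : LocationInterceptionAspect
{
    // Only attach the aspect to string fields and properties.
    public override bool CompileTimeValidate(LocationInfo locationInfo)
    {
        return locationInfo.LocationType == typeof(string);
    }

    public override void OnSetValue(LocationInterceptionArgs args)
    {
        var context = LocalizationContext.Current;  // hypothetical ambient context
        if (context == null)
        {
            // Not inside a localization context: set the value unchanged.
            args.ProceedSetValue();
            return;
        }

        // Swap in the translated string (if the catalogue has one) before
        // proceeding with the original setter.
        var original = (string)args.Value;
        args.Value = context.Translate(original) ?? original;
        args.ProceedSetValue();
    }
}
```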

Alex:

So I can see here that the localization context is apparently doing all the translation work. How do you set it up, or how do you initialize it? Because here, you just see that you call Translate. 

Yan:

Yep. So as I mentioned, we use the gettext translation files. When the custom CMS, TNT, calls the service with a big zip file, the publisher unpackages it, and as part of that package you'll find the PO files. For each of those files, the publisher loads it with SecondLanguage and then creates a localization context. And within that context, it then transforms the domain objects created from those JSONs into DTOs. 

So when the DTO transformation is happening, and you're creating new DTO objects and setting the string values of their fields and properties, this code kicks in. And because it's called inside a localization context, it has the information we loaded from the gettext file. So the next line, this guy, all it's doing is checking against the gettext translations: do we have a match for the string you're trying to localize? If there is one, then we use the localized string instead. So what we're doing here is proceeding with the call to the setter as if you had called it with the localized string instead of the original string. 

Does that make sense?

Alex: Yeah, that's perfect. 
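
To tie the pieces together, here is a minimal, hypothetical sketch of such a localization context and the publisher-side flow described above. The real implementation would load its translations from the PO files (for example via the SecondLanguage package); everything named here is illustrative.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical ambient context the aspect above consults.
public sealed class LocalizationContext : IDisposable
{
    [ThreadStatic] private static LocalizationContext current;
    private readonly IDictionary<string, string> translations;

    private LocalizationContext(IDictionary<string, string> translations)
    {
        this.translations = translations;
    }

    public static LocalizationContext Current => current;

    public static LocalizationContext Push(IDictionary<string, string> translations)
    {
        return current = new LocalizationContext(translations);
    }

    // Returns the translated string, or null if the catalogue has no match.
    public string Translate(string text)
    {
        return translations.TryGetValue(text, out var localized) ? localized : null;
    }

    public void Dispose() => current = null;
}

// Hypothetical publisher-side usage:
// using (LocalizationContext.Push(translationsLoadedFromPoFile))
// {
//     var dto = new MonsterVO
//     {
//         Name = monster.Name,              // setter intercepted, value translated
//         Description = monster.Description
//     };
// }
```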

Yan:

So with this, we can then just multicast the aspect onto all of our DTO objects, which by convention have the suffix VO, for legacy reasons. And this one line of code, plus the 30 or so we just saw, pretty much covers over 90% of the localization work we had to do. As we create new domain objects and new types, those types will be localized automatically without us having to do additional work. 
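
The one line he refers to would look roughly like this; the namespace is made up, and the wildcard in AttributeTargetTypes is PostSharp's standard multicasting mechanism:

```csharp
// Hypothetical assembly-level multicast: attach the localization aspect to every
// type whose name ends with the "VO" suffix used for DTOs.
[assembly: Localize(AttributeTargetTypes = "HereBeMonsters.Dtos.*VO")]
```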

So with that, we eliminate the whole up-front development cost, because the whole thing took me less than an hour to implement. And because we're multicasting the attribute to all DTO types, any new DTO type we add in the future will be localized automatically by default. 

Again, with more automation there's less chance for regressions to kick in, because people are not changing things and constantly implementing new things by hand. You can still have regressions, but in my experience it's far less likely. And since we implemented localization this way, we actually haven't had any localization-related regressions or bugs at all, which is pretty cool for not a lot of work. The combined effect of all these changes is far less pressure on your QA team to test the changes you're making to the game - new quest lines, new storylines, UI changes, server changes, and localization as well - so they can better focus their time and effort on testing the things that have actually changed and are likely to cause problems. 

Q&A

“Okay,” you may ask, “well, how do I exclude certain DTO types from the localization process?” Fortunately, there's a built-in mechanism for doing that: you can just set the exclude property on the attribute for particular types. In this case, I know the leaderboard player DTO only has IDs, such as the profile ID, and the name of the user, none of which should be localized. So we can simply exclude this guy from the whole localization process.
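
In PostSharp terms, that exclusion is typically expressed with the AttributeExclude property; a sketch with hypothetical type and property names:

```csharp
// Hypothetical DTO excluded from localization: IDs and user names should never be translated.
[Localize(AttributeExclude = true)]
public class LeaderboardPlayerVO
{
    public string ProfileId { get; set; }
    public string UserName { get; set; }
}
```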

Then you may ask, “well, where do you get these gettext files from?”, which is a great question. As I mentioned earlier, we store those gettext files as part of TNT, so that when we kick off a publish from there, it includes the localization files as well. And to get those files into TNT, there's a page in the tool where the game designers can go once they're happy with all the content and say, “okay, now let's localize all the new quest lines we've just created.” 

There's a button they click, which takes the existing localization file, because we don't want to re-localize text that hasn't changed. We actually use comments to attach a unique identifier to each piece of text, so that we can tell when a particular dialogue, name, description or whatever has changed, and reset its entry in the gettext file. The new gettext file is then sent over to the translators, who, with their tools, can pick out the new strings that need translating. They only charge us for the new strings they have to translate, not everything else we send them. 
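
As a made-up illustration, such an entry might look like this; the "#." extracted-comment syntax is standard gettext, while the identifier scheme is hypothetical:

```
#. item:1042:name   (made-up stable identifier used to detect when the source text changes)
msgid "Santa's Gnome"
msgstr "Gnomo do Papai Noel"
```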

Once they send back the translated PO file, we upload it into TNT, and when we do the next publish, it will have all the localization for the new content. When we release new content, there is a bit of time where the English version is ahead of the Brazilian Portuguese version, so if a player is up to the latest quest, chances are they will end up playing a bit of the game in English instead of the translated version. 

 

With that, that's everything I've got, and thank you very much for listening. 

 

About the speaker, Yan Cui

Yan Cui

Yan Cui is a Server Architect Developer at Yubl and a regular speaker at code camps and conferences around the world, including NDC, QCon, Code Mesh and Build Stuff. Day to day, Yan works primarily in a mixture of C# and F#, but he has also built some components in Erlang. He's a passionate coder and takes great pride in writing clean, well-structured code.
Yan's blog.

 

When addressing website performance issues, developers typically jump to conclusions, focusing on the perceived causes rather than uncovering the real causes through research.

Mitchel Sellers will show you how to approach website performance issues with a level of consistency that ensures they're properly identified and resolved so you'll avoid jumping to conclusions in the future.

Watch the webinar to learn:

  • What aspects of website performance are commonly overlooked

  • What metrics & standards are needed to validate performance

  • What tools & tips are helpful in assisting with diagnostics

  • Where you can find additional resources and learning

Applying a methodical approach to website performance on Vimeo.

You can find the slide deck here: http://www.slideshare.net/sharpcrafters/applying-a-methodical-approach-to-website-performance

Video Content

  1. Why Do We Care About Performance? (3:20)
  2. What Indicates Successful Performance (6:06)
  3. Quick Fixes & Tools (27:27)
  4. Diagnosing Issues (37:10)
  5. Applying Load (40:49)
  6. Adopting Change (50:30)
  7. Q&A (52:14)

 

Webinar Transcript

Mitchel Sellers:

Hello, everybody. My name's Mitchel Sellers and I'm here with Gael from PostSharp.

Gael:Hi.

Mitchel:

And we're here to talk about website performance, taking a bit of a unique look at performance optimization to get a better understanding of tips and tricks for quickly diagnosing and addressing problems using a methodical process. As I mentioned, my name's Mitchel Sellers. I run a consulting company here in the Des Moines, Iowa area.

My contact information's here. I'll put it up at the very end for anybody that has questions afterwards. During the session today, feel free to raise questions using the question functionality and we will try to be able to address as many questions as possible throughout the webinar today, however if we're not able to get to your question today, we will ensure that we are able to get those questions answered and we'll post them along with the recording link once we are complete with the webinar and the recording is available.

Before we get into all of the specifics around the talk today, I did want to briefly mention that I will be talking about various third party tools and components. The things that I am talking about today are all items that I have found beneficial in all of the time working with our customers. We'll be sure to try to mention as many alternative solutions and not necessarily tie it to any one particular tool other than the ones that have worked well for us in the past. With that said, we're not being compensated by any of the tools or providers that we're going to discuss today.

If you have questions about tooling and things afterwards, please again feel free to reach out to me after the session today and we'll make sure that we get that addressed for you. So what are we going to talk about? Well, we're going to talk a little bit about how and why we care about performance - why it is, and should be, top of mind in our organizations - and then we'll transition into what the identifiers of successful performance are, trying to get away from some of the common assumptions that limit developers.

Once we have an understanding of what we're trying to achieve, we'll start with an understanding of how web pages work. Although fundamental, there's a reason why we start here and then work down. Then we'll start working through the diagnostic process and a methodical way to look at items, address the changes, and continue moving forward as necessary to get our performance improvements. We'll finish up with a discussion around load testing and a little bit around adopting change within your particular organization, to help make performance a first-class citizen rather than that thing you have to do whenever things are going horribly wrong.

Why Do We Care About Performance?

One of the biggest things we see with developers is that there's sometimes a bit of a disconnect between developers and the marketing people or the business people - anybody that is in charge of deciding how important performance is to the organization. So one of the things we always like to cover is: why do we care? One of the biggest reasons, as we look at trends in technology, is that for publicly facing websites, Google and other search engines are starting to take performance into consideration in search engine placement. We don't necessarily have exact standards or exact thresholds for what is good versus bad, but there are some educated guesses that we can make.

We want to make sure that we're going to have our site performing as good as possible to help optimize search engine optimization aspects. The other reason that it is important to take a look at performance is user perception. There's a lot of various studies that have been completed over the years that link user perception of quality and or security to the performance of a website. If something takes too long or takes longer than they're expecting, it starts to either A, distract them and get them to think, "Oh, you know what, I'll go find somebody else," but then they start asking questions about the organization and should I really trust this business to perform whatever it is that we're looking for.

One of the last things that is very important with the differences in user trends within the last 24 months or so is device traffic. We have a lot of customers. We work with a lot of individuals where their traffic percentages for mobile is as high as 60 to 80% depending on their audience. What that means is we're no longer targeting people with a good desktop internet connection and large, wide-screen monitors. We are actually working with people that have various network abilities, anything from still being attached to that high speed home internet, but they could be all the way down to a low end cellular edge network that isn't fast.

We have to make sure that we take into consideration what happens with throttle connections, page sizes, all of those things that we used to really take an emphasis on back ten years ago, become front and center with what we're doing today. 

What Indicates Successful Performance 

With this said, one of the things we find in a lot of cases is that nobody can put a tangible indicator on a site that performs well. That's the hardest thing. Most of the time, when we work with development groups, marketing people, or anyone really, to take an existing site and resolve a performance issue, the initial contact is: our website performance is horrible.

Although it gives us an idea of where their mind is at, it does not necessarily give us a quantitative answer because if we know it's horrible and we want to make it better, how can we verify that we really made it better? This is where we have to get into the process of deciding what makes sense for our organization. It's not an exact science and it's not necessarily the same for every organization. Almost all of the attendees on the webinar today, I'm going to guess that each of you will have slightly different answers to this indication and it's important to understand what makes sense for your organization.

There are a few metrics we can utilize to help us shape this opinion. Google provides tools to analyze page speed and will give you scores on how your site does in relation to performance. If your site responds in more than 250 milliseconds, you start to receive warnings from Google's page speed tools saying that your website is performing slowly and that you should improve performance to increase your score. That starts at 250 milliseconds, and there's another threshold that appears to be somewhere around the 500 millisecond range.

We can take another look and start looking at user dissatisfaction surveys. Usability studies have been completed by all different organizations and a fairly consistent trend across those studies that we've researched show that user dissatisfaction starts in the two to three second mark for total page load and what we're talking about here from a metric would be a click of a link, a typing in of a URL, and the complete page being rendered and interactive to the visiting user.

We can also look at abandonment studies. Various companies track abandonment in ecommerce, and we can see that abandonment rates increase by as much as 25 to 30% after six seconds. This gives us a general idea that we want initial page loads to be really fast to keep Google happy, and we want the total page load to be fairly fast to keep our users happy. There are exceptions to these rules - things such as login, checkout, et cetera - where the users' threshold for acceptable performance is going to be slightly different, but it's all about finding a starting point and something we can all use to communicate effectively about what we're looking for in a performant website.

Gael:

Mitchel, you mentioned that Google will not be happy. What does it mean? What would Google do with my website if the page is too slow?

Mitchel:

Right now there is no concrete answer as to what ramifications are brought on a website if you exceed any of these warning levels within the Google page speed tools. It's a lot like your other Google best practices and things. We know that we need to do them. We know that if we do them, we will have a better option. We'll have better placement, but we don't necessarily know the concrete if we do this, it's a 10% hit, or if we don't meet this threshold, we're no longer going to be in the top five or anything like that. We don't have anything other than knowing that it is a factor that plays into your search rank.

Gael:Okay, thanks.

Mitchel:

Which really gives us a few options. Google looks at individual page assets, so they look at how long it takes your HTML to come back. Another way to look at this that may make sense within an organization is to focus not necessarily on true raw metrics per se, but focus on the user experience. This is something that is much harder to analyze, but it allows you to utilize some different programming practices to be able to optimize your website, to make things work well. Examples of this would be things like Expedia.com, Kayak.com, the travel sites.

Oftentimes, at various events, I've asked people how long they think Expedia.com or Kayak.com takes to execute a search, whether you're searching for hotels, whether you're searching for airfare, anything in between. The general answer is that people think it takes only a few seconds when in all reality, from when you click on search to when the full results are available and all of the work is completed, it's oftentimes 20 to 35 seconds before the whole operations been completed.

However, these sites employ tactics to keep the user engaged. Expedia goes from the moment you click search to showing a "Hey, we're working hard to find you the best deal" message, and then takes you to another page, which basically distracts the user during a process that takes some time. Kayak uses techniques to simply show you the results as they come in: you get an empty results page, "we're working on this for you," and then the items start to come in.

These are ways to handle situations where we can't optimize, for whatever reason. In the case of Expedia or Kayak, they have to ensure the appropriate availability of their resources: if there are flights available, they have to make sure those seats are still there, so they can't necessarily cache it, and it's going to take some time. It's a great way to keep the user experience acceptable while optimizing in other ways.

The last indicator, which is fairly absolute for everybody, is that we can focus on reducing the number of requests needed to render a website. That is the one metric where, in all circumstances, if we reduce the number of things we need to load, we will improve performance. Regardless of whether it's a target for your organization, we really recommend that you track it, and the reason is that, as we step into understanding how web pages work, we'll see that the impact here can be fairly exponential. Regardless of whether you're a developer, a designer, a front-end person, a back-end person or even a business user, it's important to understand the high-level process of how a web page works before we start actually trying to optimize it.

The reason for this is that we have often encountered situations where people are trying to optimize based on an assumption they have: it has to be the database, it has to be our web code. It may not be either of those. It may simply be your server, or user-supplied content, or something else. When we look at a web page, the process is fairly logical and top-down in terms of how browsers render a page. We start by requesting our HTML document; that is the initial request the user has initiated. The server, after doing whatever processing it needs to do, returns an HTML document. The web browser then processes that document and looks for the individual assets it needs to load.

These will be your CSS, your JavaScript, and any additional resources, including images, that need to be downloaded to render the page properly. Those items are all cataloged and they start to download. Then, with certain development practices, we may have a repeated chain: if we have an HTML document that links to a CSS file, and once that CSS file is downloaded we find it references another CSS file, the chain continues. That's when we start to see some of our processing take additional time. The other aspect that's important to understand here is that web browsers limit us to somewhere between four and ten concurrent requests per domain.

Most modern web browsers from the last year or so will get you to that ten-item limit, but what that means is that if we have an HTML document that then references 30 JavaScript and CSS files to render a page, we have at least three full rounds of round-trip processing using those ten concurrent requests, which pushes out our page load. If we could take those 30 and get them down to ten resources, we could go out and grab one batch of responses and be ready to render our page, and that's something that really helps with overall performance.
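
To put rough, illustrative numbers on that (these aren't from the talk): with a ten-request-per-domain limit, 30 assets mean at least 30 / 10 = 3 serial waves of downloads, so if each wave costs around 100 milliseconds of round-trip time, that's roughly 300 milliseconds spent just queuing requests before rendering. Collapsing the same assets into ten or fewer files reduces that to a single wave.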

To help illustrate this, I have taken some screenshots of the network view of a local news station's website. This was taken from their website about a month ago, maybe a little more; they've since redesigned it, so it's not quite as extreme now as in this example. What we see here are two screenshots showing the HTTP requests made to load this web page. In reality, there would be 18 screenshots if I were to show you the entire timeline view for loading their home page. In the end, a total of 317 HTTP requests were necessary to render this web page. Now, obviously this is not something that we want to be doing with our web pages, but it allows us to see what happens as we work to optimize.

We can see in this trend that we start out by retrieving the main KCCI website, which then redirects us to the WWW address. In this case, the user typed KCCI.com and we do a redirect. We can see here that it took 117 milliseconds to get the redirect and then 133 milliseconds to actually get the HTML document, so we've already consumed 250 milliseconds just retrieving the HTML document. Then we start to see the groups of information being processed: we have a first group of resources being processed up through this point, and then the green bars indicate the resources that are waiting to complete. We'll see that these get pushed out, so the additional resources along the way keep getting pushed out as the rest of the items are working.

Then the timeline expands and expands. We can see here, even by our second section, that we're downloading a bunch of small images, some as small as a few hundred bytes, and they push out our timeline. We get all the way out here and it keeps going. You can get these types of views for your own websites, regardless of which toolset you want to use, by using the web developer tools in your browser of choice. On Windows machines, the F12 key brings up the developer tools, and you get a Network tab. That Network tab will show you this exact same timeline, so you'll be able to see this in your own applications.

What we want to do is we want to be able to reduce this. If we simply cut out half of the items that we're utilizing, we can then simply move on and not have to do anything specific to optimize by simply reducing the request. Moving forward from this, let's take a look at what a content management system driven website in a fairly default configuration ends up looking like.

In this case, we see that we have a much longer initial page response time here, at 233 milliseconds, but we have a far smaller number of HTTP requests and the blue line here indicates the time at which the webpage was visible to the requesting user. It's at that point in time that sure, there may be additional processing, but that's not typically stopping somebody from being able to see the content that they want to be able to see.

As we look at optimization, this is one of those things we want to keep a close eye on to make sure we're improving: metrics we can validate quickly so we can compare and contrast. There are a couple of key metrics around HTTP requests that can give you an idea of whether this is something you want to pursue. We can look at the total page load. We compared the KCCI site we talked about, an out-of-the-box content management system solution, and then that same content management solution with the images minified and the CSS and JavaScript minified and bundled together. What we were able to see were improvements in total page size and the total number of HTTP requests.

KCCI is sitting at an almost five and a half megabyte download. This could cause problems for certain users, because if I'm on a mobile network, that's a fairly large download, so we would definitely want to look at ways to minimize it. The same thing goes for the out-of-the-box content management system solution, where we're at almost two megabytes. With the same content, just with properly sized and compressed images and CSS, we were able to drop that all the way down to 817 kilobytes, making it so that we have less information to transfer over the wire.

Same thing goes from an HTTP request perspective. 26 is way better than 59 and 59 is still ridiculously better than 317. We see a fairly decent trend. The one thing that's a little bit misleading is sometimes this page load speed, if you utilize third party tools, that number may not always be exactly what you want it to be because it factors in things such as network latency and some of the tools even add an arbitrary factor to the number. Well, at this point-

Gael:

I'm wondering what the process is - what did you actually do to go from column one to column three? How do I know what I need to do, or what are steps one, two, three?

Mitchel:

In a lot of these cases, we use one of two tools, or sometimes both, to help us identify some of these things more quickly. There's Google PageSpeed Insights - sorry, they renamed it - and the PageSpeed Insights tool is one that allows us to point it at any website, and it will give us a score out of 100. That score out of 100 then looks at things such as: how many HTTP requests do you have? Are your images properly sized? Are your images properly compressed? It also talks a little bit about bundling and minification of JavaScript files.

What we can do with that tool specifically is it gives us a number we can use to help identify some of those quick hit common problems. It goes as far as if you have images that are not properly scaled or compressed for the web, they actually give you a link that you can click on and you can download properly scaled and sized images. Then you can simply replace the images on your own site with those optimized ones that come from Google.

Gael:Okay, that's cool.

Mitchel:

It's a fantastic tool, but one of the things we find is that its score is not necessarily indicative. Developers are often led to believe that if there's a score out of 100, they have to get 100. With the Google PageSpeed Insights tool, it's not exactly that clear cut, and the reasoning behind that is that, just looking at these three examples, it gave a very slow news site a score of 90 out of 100 - because they checked all of those little boxes. They checked the boxes to keep Google happy, but maybe they didn't actually keep their users happy.

We typically cross check this tool with another tool provided by Yahoo called YSlow. This tool takes a slightly different approach to scoring, but it looks at similar metrics and similar things with your website. The KCCI site received an E of 58%. Our CMS that wasn't horrible was 70, but after slight modifications on our side, we were able to get it all the way up to a 92% or an A. We find that a lot of times you want to utilize these two different tools or two other tools of your choice to help aggregate and come to an overall conclusion to what may need to change.

Furthermore, with Google's PageSpeed Insights, anybody utilizing Google Analytics will never be able to get a score above 98, because the Google Analytics JavaScript violates one of the PageSpeed Insights rules. I call that out to people just because I don't want to see people get tunnel vision and lunge all the way to, "Oh, we have to do this because we have to get 100."

Quick Fixes & Tools

To help with some of this, one of things that we often will use is a site called GTMetrics.com. It is a free tool. It is actually the tool that was utilized to grab all of the metrics on that table that I just showed and what's nice about it is you can generate a PDF report. You can even do some comparison and contrasting between your results before and after a test, which makes it a lot easier.

What does this give us? At this point, we've looked at how a web page is structured and what it takes to load it, and one of the things we get from this is a lot of quick fixes with which we may be able to improve our websites dramatically. Images that aren't compressed and JavaScript files that aren't combined both add to our content quite greatly. Images that aren't compressed or aren't properly sized are going to bloat your page load; that's why we may have a five megabyte page instead of a two megabyte one.
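
As a concrete illustration of the combining point (this is not from the talk), classic ASP.NET offers bundling and minification via System.Web.Optimization; the file paths below are purely illustrative:

```csharp
using System.Web.Optimization;

// One common way to reduce request counts in a classic ASP.NET site:
// bundle and minify scripts and styles so dozens of files collapse into a few.
public static class BundleConfig
{
    public static void RegisterBundles(BundleCollection bundles)
    {
        bundles.Add(new ScriptBundle("~/bundles/site").Include(
            "~/Scripts/jquery-{version}.js",
            "~/Scripts/site.js"));

        bundles.Add(new StyleBundle("~/Content/css").Include(
            "~/Content/normalize.css",
            "~/Content/site.css"));

        // Force bundling/minification even with debug="true", to verify the effect.
        BundleTable.EnableOptimizations = true;
    }
}

// Typically called once at startup, e.g. from Application_Start:
// BundleConfig.RegisterBundles(BundleTable.Bundles);
```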

A couple quick examples of this would be in the responsive era, we'd have a known maximum of how big our images are going to be. Uploading anything larger than that size is simply a waste of server space and it's a waste of bandwidth because we don't need to work with it. We worked with a customer that had the initial argument of, "Our website performance is horrible." We went to look at their site and we agreed. It was. The pages were loading incredibly slow and it was consistent across all of the pages. What we found is that for the overall look and feel and design of the website, the design firm never prepared web ready imagery. The homepage was somewhere along the lines of 75 to 80 megabytes in size to download and interior pages were 30 to 40.

The reason why we start here is that that would have been a lot harder to catch on the server side, because there's nothing wrong with the coding, nothing wrong with the database, and our web server looked really good performance-wise. When we start here, we can catch those kinds of things. This one is really important in terms of image sizes if you allow user-uploaded content into your web properties. The other things we can look at are static images, hidden images or hidden HTML elements - and for those of you who are ASP.NET developers, older Web Forms projects that use view state can often add some overhead. The other big thing that is pointed out by PageSpeed Insights as well as by YSlow is a lack of static file caching.
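
As a concrete illustration (again, not from the talk), one common way to control that cache header for static content in a classic ASP.NET/IIS site is a clientCache rule in web.config; the seven-day max-age is purely an example:

```xml
<configuration>
  <system.webServer>
    <staticContent>
      <!-- Ask browsers to cache static files (images, CSS, JS) for seven days. -->
      <clientCache cacheControlMode="UseMaxAge" cacheControlMaxAge="7.00:00:00" />
    </staticContent>
  </system.webServer>
</configuration>
```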

There's a common misconception with ASP.NET projects that static files will automatically be cached. That's not necessarily the case, so you may need to tweak exactly that kind of cache header value in the response - all things that both of these tools will point out and direct you on how to fix. The goal here is to take care of the low-hanging fruit, take care of the things that may be stressing an environment in ways we don't know about, and see what we can do. With websites that have never had a focused approach to making these kinds of changes, we've typically seen a 40 to 80% improvement over time from these changes alone. For those of you who don't have public-facing websites, GTMetrics.com and PageSpeed Insights will not work for you, because they do need access to reach the website.

Yahoo YSlow, on the other hand, is available for anybody to use because it's also a browser-based plugin, so you'll be able to use it on internal networks. One of the things we hear a lot of times is: okay, we've optimized, but we still have a lot of static files and they're adding a lot of activity to our server. Another quick fix would be to implement a content delivery network. A content delivery network's purpose is to reduce the load on your server and to serve content closer to the user whenever possible.

The situations that we're looking for here would be things where maybe we can take the content, load it to a CDN and when we have a customer visiting our website from California, they can get the content from California. When they are in Virginia, maybe the content delivery network has something closer to them regionally. These types of systems would definitely help. It's a little bit of a moving the cheese kind of game where we've moved potential problems to leverage something else. The difference is the content delivery networks are designed to provide incredibly high throughput for static content.

They're a great way to start improving things. I'll talk a little bit more about a couple implementations specifics in just a second. Another thing that often gets overlooked is we've reduced our JavaScript files. We've reduced our CSS. We've got everything good, but now we have these pesky images where we have 25 little images that are being used as part of our design. Most commonly we see them in things such as Facebook, Twitter, LinkedIn, Instagram. All of them as little one kilobyte images.

Some people opt to use font libraries to handle that type of thing, but one of the things we don't want to lose sight of is that we can use an image sprite - the simple combination of multiple images into one larger image, with CSS used to display the appropriate portion - which is definitely an option to help minimize the number of HTTP requests. It does require developer implementation, but it's something that can be done fairly easily.

If we've made these changes and want to roll out a content delivery network solution, we have two options primarily. First, there are pull-type CDNs, which don't require us to make a bunch of changes to our application. Service providers such as CloudFlare and Incapsula - and I believe Amazon also has support for this - work by having you point your DNS at the CDN provider, and the CDN provider knows where your web server is. Basically, they become the middle man for your application.

What's nice about it is you can make the change pretty quickly, simple DNS propagation and they usually give you on/off buttons, so you can turn on the CDN, everything is good. If you notice a problem, you can turn it off until you're able to make changes to resolve things. They're great, however, unusual situations typically will arise at least at some point in your application because you have introduced something, so if it accidentally thinks something is static content when it's not, you may need to add rules to change behaviors. The implementation benefits are massive. We'll go through an example here with some experience I have with Incapsula that makes things easy.

The selling point here, though, is that you can make this change in an afternoon, barring any unusual situations. The other option would be to integrate a more traditional CDN, which basically involves, rather than loading static content directly onto your web server, using a CDN and storing that information elsewhere. Rather than linking to an images folder inside your application, you would link to an images directory on the CDN, or something of that nature. It's definitely more granular, and easier in some ways to manage, but it requires manual configuration: you have to actually put the content where you want it to be, and that's the only way it'll work as expected.

Diagnosing Issues 

We've talked about quick fixes. We've talked about things such as images and tools to validate and quantify things, but maybe that just didn't get us enough, and we need to dig into the server side. To do that, we want to make sure that, before we start anything, we're monitoring the health of our environments. Whenever possible, we want to have information on our web server and database server: CPU and memory usage. We want to know how many web transactions are actually happening, how long they're taking, and what kind of SQL activity is going on.

On top of that, we want to know exactly what our users are doing in our app if at all possible. Regardless of if you're diagnosing a problem today or if you're just simply supporting a web property that you want to make sure that you're keeping an eye on performance overall, you want to make sure that you are collecting these metrics and the reason why this is so important is sometimes you may not be able to jump on the problem a second that it comes up, but by having metrics, we'll be able to go back in a point of time and understand exactly what has happened, what was changed, what was going on yesterday at 2:00 PM in the afternoon when the website was slow?

How you go about doing that, there's a number of different tools out there. I personally am a fan of NewRelic, but we've worked with CopperEgg. We've worked with Application Insights and many other vendors. The key is to collect this information as much as humanly possible to make sure that you really truly have a picture of what's going on with your environment. Most of these tools operate either with minimal or almost no overhead to your server. We typically see a half percent to one percent of server capacity being consumed by these monitoring tools.

Gael:

Mitchel, so how long do you recommend to keep data? How much history should we keep?

Mitchel:

My personal recommendation is to keep as much history as humanly possible, understanding that that's not exactly a practical answer. Really what I like to see is at minimum 30 days, but ideally, six to 12 months is preferred, simple because sometimes you have things that only happen once a year, or twice a year. We encounter a lot of situations such as annual registration for our members happens on October 25th and our website crashes every October 25th, but we're fine 364 other days of the year.

In situations like those where having some historical information to be able to track definitely makes things a little bit easier because we can try to recreate what happened and then validate that, "Oh, well, we recreated this and now we're not using as much CPU. We're not using as much memory as we do our testing."

Gael:Okay, thanks.

Applying Load

With this, we now know that we may need to start looking at our website under load. One developer clicking around on your dev site just isn't recreating the problems we're seeing in production, so how do we go about doing this? Well, we need to take all of the information we were talking about earlier and apply it to the way we do load testing, and what we mean by that is that the load tests must not just hit static content. In other words, we can't go after just our HTML file.

We have to make sure that whatever our load test does recreates real traffic behaviors, including making all of the HTTP requests, waiting between pages as a user would, and making sure that everything the user's browser does, we do in our load test. We also need to make sure we're being realistic. If we know that the most we ever see is 200 people logging in within a five hour window, we don't necessarily want to test our system with 500 people logging in within ten minutes, because we aren't necessarily going to be successful at that regardless. The other key - and this is the hardest part for most organizations - is that our hardware environments need to be similar.

If we're testing against a production load, we need to be utilizing production grade hardware. There are certain scenarios where we can scale, but it doesn't always work that way and the general recommendation is that we want to make sure that we really are working in a similar environment. What should happen when we do a load test?

When we do a load test, we should see that everything continues in a linear fashion. If we do a load test and add more users, the number of requests we process should go up, and the amount of bandwidth we use should go up, and those really go up in lockstep with each other, because more users equals more traffic and more traffic equals more bandwidth. What tools can we use for this? Well, there's a large number of tools out there; if you Google load testing tools, you're going to come up with a whole bunch of them.

In my situation, I've found that LoadStorm.com is a great tool. The reason I like it is that it does what browsers do: you actually record the script you want to run using Google Chrome, Internet Explorer, et cetera. The other thing is that it gives me some data center options, so I can choose with LoadStorm that I want my traffic to come from the central part of the US, from Europe, or split between the two. The benefit here is that some load testing tools will send you European traffic when you have no European traffic at all, so your numbers are going to look a little bit different because of internet latency.

Really, what you need to do is make sure that whatever tool you're utilizing matches your expectations of what your users should be doing and what they actually do. When we do load testing, we also want to track some additional metrics beyond what we've already covered; our load testing tool or our server monitoring tool should be giving us a few more things. How many requests per second were we issuing? How many concurrent users were on the site? This is a metric that's very important to understand. A concurrent user doesn't necessarily mean two users clicking on a button at the same time. A concurrent user typically refers to a user who has visited the site and whose session is still active, so two "concurrent" users could be as much as ten to 20 minutes apart.

We want to make sure we understand that metric so we can get a gauge of what's going on with our users. We want to look at the average server response time, and we want to look at that from the client side. We want to look at the percentage of failed requests, and we want to look at how many requests are being queued. The goal here, in terms of what number you have to hit or how you hit it, is going to depend on your environment. Each server configuration and each application is going to be capable of performing at a different level than something else.
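Since "concurrent users" trips people up, here is a small sketch of how the session-window definition can be counted from an access log. The log entries, user names, and 20-minute window below are fabricated for illustration:

```python
# A sketch of counting "concurrent users" from an access log: a user counts
# as concurrent if their session has seen activity within a window (here 20
# minutes), not only if they clicked at the same instant.
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(minutes=20)

# (user_id, timestamp of a request) -- fabricated sample data
log = [
    ("alice", datetime(2015, 6, 1, 12, 0)),
    ("bob",   datetime(2015, 6, 1, 12, 5)),
    ("alice", datetime(2015, 6, 1, 12, 18)),
    ("carol", datetime(2015, 6, 1, 12, 40)),
]

def concurrent_users(at: datetime) -> int:
    """Users whose most recent request falls inside the session window."""
    last_seen = {}
    for user, ts in log:
        if ts <= at:
            last_seen[user] = max(ts, last_seen.get(user, ts))
    return sum(1 for ts in last_seen.values() if at - ts <= SESSION_WINDOW)

print(concurrent_users(datetime(2015, 6, 1, 12, 20)))   # alice and bob -> 2
```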

Now, with that in mind, one of the things that we do want to make sure of is that our load testing tools, as part of being realistic, need to be realistically located. If our data center is in our office, doing a load test from our office to that data center is not necessarily going to give us a realistic example, because we factor out the entire public internet. Keep that in mind as you're configuring things. With this, we've got about ten minutes left, so we'll get through a quick scenario here, and then we can take some time for a couple of quick questions.

One of our success stories was working with a customer that had six virtual web servers with 18 gigs of RAM apiece and a SQL Server cluster with 64 gigs of RAM and 32 cores, and load tests basically resulted in failures at 600 to 700 concurrent users, or about 100 users per server. We had a goal: we needed to reach 75,000 concurrent users, and we needed to do it in two weeks. We did exactly what we've been talking about here today. We started with optimizing images and CSS, minified things, bundled things, and implemented a CDN, and then we have a before-and-after load test scenario.

Really, the actual numbers don't matter, so I've even taken the scales off of these images. The difference is the behavior that we see. In both of these charts, the top one is the before and the bottom one is the after. The green and purple lines are requests per second and throughput. The light blue line that goes up in a solid line and then levels out is the number of concurrent users. The blue line that is bouncing around all over the top chart is maximum page response time, and the yellow one that's a little bit hard to see is average page response time.

What we saw with the initial test in this situation was that the maximum response time was bouncing around quite a bit. As load was added, we'd see a spike, it would stabilize, then we'd add more and it would spike again. Our responses weren't where we wanted them to be, and it just wasn't behaving in as linear or smooth a manner. In the bottom chart, we get a much better example of what we'd like to see, and that is: as our users go up, throughput and bandwidth go up, and they keep pulling ahead. They should never dip below that line of users. Where they dip below the line of users in the top example is where we start to see that the server is distressed and not able to handle the load; with more users, throughput should keep moving further and further away from that line.

We can see at the very bottom that average response time was great and error rates were totally acceptable, all the way until this magical failure at 53 minutes. The reason why I bring this up, and why we utilize this as an example, is to showcase why metrics are so important. In this case, we're doing a load test with 75,000 concurrent users, we are using over 21 gigabytes of bandwidth going in and out of the data center, and all of a sudden, things start to fail.

Well, we have monitoring on each web server. We have monitoring on the database server. Everything is saying they're healthy. We only have seven minutes left of this load test to really be able to identify a root cause. We're able to remote desktop into the server and validate that we can browse the website from the server. In the end, we were able to find that we had a network failure of a load balancer. It had been overloaded and overheated and stopped serving traffic.

With the metrics, we were able to confirm that at this load point, right before the failure, we were at 30% CPU usage on our web nodes, which meant we had tons of room to grow yet. We were at 20% on the database server, which also meant we had tons of room to grow yet. The problem was that our pipe just wasn't able to deliver the users to us, which allowed us to determine what made sense as our next step. Now, we got lucky in this case: it was a completely implemented solution, and we had to make a bunch of changes.

Adopting Change 

As we start working in organizations, I encourage people to try to promote a proactive approach to performance in your applications.

Bring it up as a metric for any new project. Set a standard that your new projects will have an X-second page load time, or that you're going to keep your HTTP requests under a certain number. If you can start there and validate it through your development process, it'll be a lot easier to make things work than having to go back and make a bunch of changes later.
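One way to make that kind of standard stick is to turn it into a check the build can run. The sketch below is a hypothetical performance-budget check; the URL, asset list, and budget numbers are placeholders, not values from the talk:

```python
# A sketch of turning "set a standard for new projects" into something a
# build can enforce: fetch a page, count requests and total bytes, and fail
# if the numbers exceed an agreed budget. Budget values and URL are hypothetical.
import sys
import time

import requests

BUDGET = {
    "max_requests": 40,          # total HTTP requests per page
    "max_page_bytes": 1_500_000, # roughly 1.5 MB of total page weight
    "max_load_seconds": 3.0,     # back-of-the-envelope load time
}

def check_page(base_url: str, asset_paths: list[str]) -> bool:
    start = time.monotonic()
    total_bytes = 0
    request_count = 0
    for path in [""] + asset_paths:          # the page itself plus known assets
        resp = requests.get(base_url + path, timeout=30)
        total_bytes += len(resp.content)
        request_count += 1
    elapsed = time.monotonic() - start
    ok = (request_count <= BUDGET["max_requests"]
          and total_bytes <= BUDGET["max_page_bytes"]
          and elapsed <= BUDGET["max_load_seconds"])
    print(f"{request_count} requests, {total_bytes} bytes, {elapsed:.2f}s -> "
          f"{'PASS' if ok else 'FAIL'}")
    return ok

if __name__ == "__main__":
    # Hypothetical page and asset list; a real check would crawl these instead.
    if not check_page("https://example.com/", ["site.css", "app.js"]):
        sys.exit(1)   # fail the build when the budget is blown
```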

If you do have to work in that reactive environment, it's important to remember the scientific method that we learned in elementary and middle school: change one thing, validate that the change really did something, and then evaluate what you're going to do next. If you change several things at once, you're going to run into situations where it's not as easy to resolve an issue, because you don't know what actually fixed it. It was one of those ten things we changed on Friday, but which of those ten was really the one that made it work?

The whole goal here is if you start with a process, if you start with a purpose, you should be able to get through these situations in a manner that makes it easier to get the complete job taken care of. I'll put my contact information back up for everybody and Gael, do we have a couple questions? We've got about five minutes or so here.

Q&A

Q: Will GTmetrix or similar tools work within an authenticated area of a website?

A: With regard to validating performance for authenticated traffic, there are a couple of options. Google PageSpeed Insights really only works for public-facing pages. YSlow, as a browser plugin, will work on any page, so authenticated, unauthenticated, internal to your network, or public facing does not matter. That's probably going to be your best tooling choice for anything that requires a login to get the client-side activity, along with the browser developer tools themselves to get the timeline view.

Q: What is a normal/acceptable requests-per-second (RPS) rate for a CMS site?

A: The acceptable number really depends on the hardware. From my experience working with the DotNetNuke content management system on an Azure A1 server, we start to see failures at about 102 - 105 requests per second. However, it really varies and will depend on what your application is doing (number of requests, number of static files it has, etc.). While there's not necessarily a gold standard, if you see anything under 100, it's typically a sign of an application configuration issue or some other problem, rather than a sign that you don't have enough hardware.

Q: When would you start reaching for tools like MiniProfiler from Stack Overflow or similar profilers?

A: Typically, the progression is this: up to this point, we're trying to factor out other environmental issues. From here, once our HTTP requests look good, our file sizes look good, and our content looks good, and we know that we have high CPU usage on the web server while the database server isn't busy, that's where I'm going to want to dive in with MiniProfiler or any of the other profiling tools and start to look. If I notice that the web server looks good but my database server is spiking, that's where we're going to want to start looking more at SQL Profiler and other tools.

The goal here is to take the guesswork out of the puzzle of "am I making IIS work too hard because it's serving static content and everything else?" Focus first on getting that optimized, and then dive in and really get our hands dirty with our .NET code, or whatever server-side code we're working with.
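MiniProfiler itself is a .NET library, so the snippet below is only a language-agnostic sketch of the step-timing idea it provides: wrap named steps of a request in timers so you can see where the time goes before reaching for deeper tools such as SQL Profiler. The step names and sleeps are placeholders:

```python
# Language-agnostic sketch of step-level timing (the idea behind tools like
# MiniProfiler), not MiniProfiler's actual API. Step names are placeholders.
import time
from contextlib import contextmanager

timings = []

@contextmanager
def step(name: str):
    start = time.monotonic()
    try:
        yield
    finally:
        timings.append((name, time.monotonic() - start))

def handle_request():
    with step("load data"):
        time.sleep(0.05)          # stand-in for a database call
    with step("render page"):
        time.sleep(0.01)          # stand-in for view rendering

handle_request()
for name, seconds in timings:
    print(f"{name}: {seconds * 1000:.1f} ms")
```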


About the speaker, Mitchel Sellers


Mitchel Sellers is a Microsoft C# MVP, ASPInsider and CEO of IowaComputerGurus Inc, focused on custom app solutions built upon the Microsoft .NET Technology stack with an emphasis on web technologies. Mitchel is a prolific public speaker, presenting topics at user groups and conferences globally, and the author of "Professional DotNetNuke Module Programming" and co-author of "Visual Studio 2010 & .NET 4.0 Six-in-One".
Mitchel's blog.