
Best believe it: your product grid is your digital shopfront. You need everything nicely on display to generate as many sales as possible, just like a regular store.
A huge advantage e-commerce businesses have over traditional shopfronts is that we don't need foot traffic analysis or video monitoring to see the impact of layout changes. We can use online usability testing to analyse design and layout variations in a fraction of the time.
This case study demonstrates and tests alternative product grid layouts, with an aim to answer one simple question — what does the perfect product grid look like?
I’m not a maths guy. I approached this case study from the viewpoint of a keen designer armed with a small budget, a short timeline and a basic question.
There’s no comprehensive statistical analysis here. The intention was to be as pragmatic as possible while still having confidence in the results.
In late 2015, I worked at a grocery delivery startup, YourGrocer, based near my office. I’d been in corporate e-commerce for a while, so a change of gear and more ownership of an end-to-end customer experience appealed to me.
The crazy pace of a seed-stage business meant we didn't have time to test every idea in a meaningful way. There were stronger priorities: ideas with more potential for customer impact, or a better use of developer and designer time.
When we revised the product layout, we went through a quick cycle of the user-centred design process, tested our updates with customers through moderated user testing and contextual inquiry, and then released them.
After implementation, we recorded conversion, two-month retention, time spent per order, and a host of other metrics.
However, the results of these tests often didn’t show huge changes, and we didn’t have enough customer volume to run a series of meaningful A/B tests in a short amount of time.
Often, we concluded that an update hadn’t broken anything, had good qualitative feedback from users, and looked better to us – so we moved on to the next problem.
Without expensive paid traffic through Google or Facebook, it would have taken months to get to statistically significant answers from real site data. That’s far too long for a seed-stage business. We needed answers in days or weeks, not months.
On reflection, we should have tried online testing on a heap of small variations before shipping updates. I only really discovered how much online testing platforms had matured towards the end of my time at YourGrocer. Now we can get some data back in minutes, instead of months.
This is fortunate for you, reader, because I can now present a case study of how changes to your product grid affect the scanability of your design! ^_^
Improving the product layout was driven by a series of hypotheses:
… all of which would contribute to customers returning more often.
In this case study, we’re only looking to test one piece of the puzzle — the bigger everything == shorter scanning time part.
For this research, I used a multi-variate click test on UsabilityHub. This allows us to test multiple layout concepts with exactly the same task setup every time.
Here’s what the starting point looked like:
A pretty bog-standard product grid. Art.
I tested five variations, along with the original. Each variation was seen by 25 unique participants. This gave a total participation count of 150.
Test participants only saw one of the variations. The intention was to record the first, fresh impression of each participant, with past usage or familiarity with the product having as little influence on the results as possible.
I asked the participants to imagine they were doing grocery shopping online. The task was to look at the ‘Fresh vegetables’ category page, and add one bunch of bok choy to their shopping cart.
This specific language was used because we wanted to test the scanability of the grid within the context of a common user journey, in this case the purchasing flow.
The data you get from a test like this is limited, but it’s enough for our purposes. The mean completion time is returned, as well as a heatmap showing where the participants clicked. The raw data can also be exported to CSV format for further analysis.
Comparing the mean completion time for each test gives us a basic indication about which variation performs the best. If you can find an item quickly, that’s success. If you struggle to scan the grid and it takes a long time to find an item, that’s failure.
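If you do pull down the CSV export, a few lines of pandas are enough to reproduce the mean comparison yourself. This is a minimal sketch: the file name and column names are assumptions about the export format, not UsabilityHub's actual schema.

```python
import pandas as pd

# Load the exported click-test results.
# "variation" and "completion_time_seconds" are assumed column names.
results = pd.read_csv("usabilityhub_export.csv")

# Mean completion time per layout variation, fastest first.
mean_times = (
    results.groupby("variation")["completion_time_seconds"]
    .mean()
    .sort_values()
)
print(mean_times)
```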
Here’s a gif of 3 different grids, and the average completion time for each. There’s a transcript underneath.
3, 4 and 5 column layout comparison
Test 1: existing version with 3 columns
Test 2: 4 columns
Test 3: 5 columns with slightly smaller product cards
This version had the product cards slightly scaled down to fit within the same resolutions as the previous tests.
From this first dataset, it seems that there is a point of diminishing returns for how many products you cram into the grid.
The average scanning time increased 34% per row in the jump from 4 to 5 columns.
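If you want to derive a comparable per-row figure from your own data, one simple approach is to divide the mean completion time by the number of rows a participant scans before reaching the target item. The sketch below shows that calculation with made-up inputs; treat it as an approximation, not the exact formula behind the numbers above.

```python
def per_row_scan_time(mean_completion_s: float, rows_to_target: int) -> float:
    """Spread the total completion time evenly over the rows scanned
    before the target item is reached."""
    return mean_completion_s / rows_to_target


def pct_increase(baseline: float, variant: float) -> float:
    """Percentage increase of `variant` over `baseline`."""
    return (variant - baseline) / baseline * 100


# Made-up inputs purely to show the shape of the calculation;
# the real per-variation numbers live in the results sheet linked below.
four_col = per_row_scan_time(mean_completion_s=20.0, rows_to_target=5)
five_col = per_row_scan_time(mean_completion_s=20.0, rows_to_target=4)
print(f"{pct_increase(four_col, five_col):.0f}% slower per row")
```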
Is it the decrease in size of the product cards that causes this effect? Or the fact that there are more products in each row, overwhelming the test participants?
To get some more data on this, I repeated a previous test, but with the cards scaled down to the same size as the 5 column test.
I can’t get enough of these gifs.
Test 4: 4 columns with smaller product cards
This result is close to the previously tested 4 column version: its mean completion time is 1.2 seconds faster, but the average row scanning times are within 0.2 seconds of each other. To me, this data suggests that the extra column is what slows participants down, not the smaller product card size. But it's not super convincing.
Ok. So what if we play with the scale of the layout?
One of the things to keep a sharp eye out for when conducting moderated user sessions is whether a participant is struggling to make out details on a page because they're too small.
You might notice a participant lean in to see a detail, or they might verbalise it in an off-the-cuff way.
If someone pulls out reading glasses when you first start a test session, it’s not an issue. When they pull them out and you’re 20 minutes in? Red flag.
For marketing pages, campaign sites and other consumer-focused content, it’s critical that your design passes the squint test: can you squint your eyes and still make out the main CTAs?
For the next test, I took the original 3 column design and scaled up as many elements on the page as possible. Does this kind of change have a measurable impact on scanning time?
Yes, it does.
Test 5: 3 columns, scaled up
This change took nearly a second off the average scanning time per row. The scale of this improvement is a surprising result.
I wonder how many other seemingly small changes would have dramatic results?
Having good images is critical to your design — but just how important is the size of them?
To test this, I hacked together a variation of the scaled up 3 column version with comically large images, and threw it into design-concept-Thunderdome:
The difference in image scale for this test.
Test 6: Comedic images
This is 0.07 seconds per row faster than the previous winner, which I’d argue is within the error margins for these tests.
From this data, it seems that bigger images make a slight difference, but not much of one. I'd argue it's not worth corrupting your visual style with images that run right up to the content container's padding for a 0.07 seconds per row gain in scanability.
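If you want to check that claim against the raw numbers, a quick Welch's t-test on the two sets of completion times is one way to do it. A minimal sketch, again assuming hypothetical column names and variation labels in the export:

```python
import pandas as pd
from scipy import stats

results = pd.read_csv("usabilityhub_export.csv")

# Completion times for the two variations being compared.
# The variation labels are placeholders, not real export values.
scaled_up = results.query("variation == 'test_5_scaled_up'")["completion_time_seconds"]
big_images = results.query("variation == 'test_6_big_images'")["completion_time_seconds"]

# Welch's t-test: with roughly 25 participants per cell, a 0.07 s/row gap
# is unlikely to stand out from the noise.
t_stat, p_value = stats.ttest_ind(big_images, scaled_up, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```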
Combining the mean completion time, the position of the target item in each variation, and the number of products per row gives us a representation of how many products a customer could scan through in a 30 second period.
This is a fuzzy metric, because it’d be rare for a user to do nothing but scan products when they were shopping. However, it gives us more context to talk about the differences in designs and how they might affect the key user journeys for this product.
Extrapolating this further out to a longer session, a customer would be able to scan through ~50–70 more items every five minutes using one of the optimised 3 or 4 column layouts compared to the baseline layout.
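As a rough illustration of the arithmetic behind that extrapolation (the per-row times below are placeholders, not the measured values):

```python
def items_scanned(per_row_scan_s: float, products_per_row: int, window_s: float) -> float:
    """How many products a shopper could scan in a time window,
    assuming they do nothing but scan the grid."""
    rows_scanned = window_s / per_row_scan_s
    return rows_scanned * products_per_row


# Placeholder per-row times over a five-minute window.
baseline = items_scanned(per_row_scan_s=5.0, products_per_row=3, window_s=300)
optimised = items_scanned(per_row_scan_s=4.0, products_per_row=3, window_s=300)
print(f"~{optimised - baseline:.0f} more items every five minutes")
```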
I think this is a significant enough result to justify a closer look at more variations and optimisations we can find in the layout of product grids via user testing.
My usual process for writing finishes with a 'sanity check' of the final draft by a small email list of STEM professionals from different backgrounds.
They introduced me to a different way to analyse this data: histograms. I'm not a maths guy, and it's clear to me now that I need to take some biostatistics classes.
If you're not familiar with histograms, they're a great way of showing the spread of timing data in an experiment. Here's what our test data looks like in this format:
The standout insight from this histogram for me is that almost all of the layout variations performed well compared to the original, with the exception of the 5 column layout.
The big black bulge in the 60–90 second bucket shows the bad performance of the original. The 3 column and 4 column alternative layouts all have tighter groupings of results.
The 5 column design (in light blue) has a wider spread of responses, including 3 responses between 60–90 seconds, and the longest response of all tests (not shown) at 147 seconds.
I believe these results aren't 'outliers' in the data, because ultra-long completion times for isolated participants are 100% relevant. They represent the fact that someone sat in front of the screen and struggled to complete the task. It's another red flag.
The beauty of the histogram is that it shows you this data without having to take in the numbers themselves. It’s obvious there are designs that seem to work better than the original — but I’m also fascinated about what would happen to this diagram if we increased the participant numbers.
Would gaps between the variations develop? Or would the curves become more similar? I feel another article coming on.
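If you'd like to rebuild the histogram view from the exported CSV yourself, something along these lines would do it, again assuming the same hypothetical column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

results = pd.read_csv("usabilityhub_export.csv")

# 30-second buckets, matching the chart above.
bins = range(0, 180, 30)

fig, ax = plt.subplots()
for variation, group in results.groupby("variation"):
    ax.hist(group["completion_time_seconds"], bins=bins, alpha=0.5, label=variation)

ax.set_xlabel("Completion time (seconds)")
ax.set_ylabel("Participants")
ax.legend()
plt.show()
```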
If you only compare the mean completion time, then the last version, with the comedy-sized images, is the winner.
Taking the histograms into account, I’d say there’s no clear winner — but there’s definitely a couple of losers that I would count out of the running in further tests: the 5 column variation and the original.
If you put a gun to my head, I'd take the version from test 5: the 3 column layout with many page elements scaled up.
It definitely outperforms the original 3 column version, and the wider spread of data from the 5 column tests is not a good sign when you’re optimising for scanability.
The difference in performance between this version and the comedy-sized images version is small, and it’s not nearly as jarring to look at, especially when you load in some less-than-perfect product images.
To that end, here are the complete test results in a Google Sheet. If you’ve got an eye for this sort of thing, I’d love to hear about different ways to look at this data.
If you want to try the experiment out for yourself on your own product grid, hack a couple of concepts together in your layout tool of choice, and get testing.
I don’t often see this kind of research done in the open with real data. If you’ve got different testing tactics, then please let the world know in the responses.
I’d love to hear about similar tests you can talk about in a public forum. Even if you can’t share the exact results or dataset, I’d love to hear about the methodology and whether you thought it was effective (or not).
And lastly, of course — please subscribe to User Testing Monthly here on Medium! ^_^
Next month, I’ll show you the results of testing these layouts with eye-tracking software, as we try to move towards an optimal state for this product grid.
See you then!