How do you measure success of your individual experiment in relation to your North Star & Objective if you run multiple experiments in a week?
I'm not sure whether you're asking about things like statistical significance or not (and pls clarify if you are) but I look at everything from the perspective of learnings.
In other words, irrespective of whether the test was able to prove the hypothesis of that idea to be true or not, what did I learn about how I can provide value to our users?
Learnings will either fall into "don't do that" or "do more of that" buckets.
The "do more of that" tests, by definition, should move your objective metrics in the right direction, which in turn ideally moves your NSM in the right direction.
Now it's entirely possible that you could move your objective metrics but not impact your NSM. If that happens, then you're likely focusing on the wrong objective and you should rethink where to focus.
Does that help?
I want to break this down because you have a lot going on:
1. North Star Metric - The key metric that helps your company have long-term, sustainable growth (paraphrasing here)
2. Objective - A mini-milestone you reach en route to improving your North Star Metric (again, paraphrasing)
3. Individual/Multiple Experiments - Focused experiments that help you optimize your marketing/growth efforts.
If I am understanding your question correctly, you want to know how to measure the success of one experiment compared to other experiments you are running concurrently and if one experiment will give you the better chance of improving your North Star metric (NSM)?
Let's use Just a Baby for example purposes (disclaimer: I know nothing about the business/monetization model). My guess is that the NSM for Just a Baby is the number of successful matches completed. To improve that metric, one of your objectives would be improving the retention rate of each app user (you can't make matches if no one is using your app). Since you have been blogging since February 2017, I am going to assume that you have a year's worth of data that you can use as a control. When you create your experiment to improve your retention rate, you should devise a hypothesis to test, such as the following: By sending the user notifications every seven days we will improve retention rates and increase the likelihood that the user will find a match. Once you have launched the experiment, you should specify how long you will wait to ensure you have enough data for a statistically significant winner (here's a blog post I just wrote about the importance of stat sig data: https://www.perfectpixelmarketing.com/time-to-get-statistically-significant-data/). Compare past data (control) against the new data from the experiment to determine if it was successful. If your experiment improves the chance of you achieving your objective and improving your NSM, then you should be able to call it a success.
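To make the control-versus-experiment comparison concrete, here is a minimal sketch of a two-proportion z-test, one common way to check whether a difference in retention rates is statistically significant. It is not necessarily the method the linked post uses; the function name and the user counts are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates.

    conv_a / n_a: retained users / total users in the control period.
    conv_b / n_b: retained users / total users in the experiment period.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 200 of 1,000 control users retained
# versus 250 of 1,000 users who received notifications.
z, p = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below the threshold you picked before the test started (0.05 is a common choice) suggests the difference is unlikely to be random noise. This sketch assumes reasonably large samples and does not handle a pooled rate of exactly 0 or 1.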
For multiple experiments running at the same time, you should employ the same hypothesis method as mentioned previously and establish a control as well. If you are running multiple experiments against each other, you should run them as variants (i.e. mini-experiments that have the same optimizing objective but have one small difference). I would advise against running multiple comparative experiments (i.e. variants) that are not related to each other or do not have a common metric/objective.
Bad experiment design: We will increase retention rates by advertising to users who have not installed the mobile app, advertising to users who haven't used the app in two weeks, and sending users notifications to their smartphones. Why it is bad: You are running the experiment across two different audiences (non-users and inactive users) and three methods (user acquisition advertising, re-engagement advertising, mobile notifications).
Good experiment design: We will increase retention rates of existing users by sending them notifications to their smartphone if they stop using the app after 3 days, 7 days, and 10 days; users will be split into three segments and will only receive notifications according to the timeframe they are in (i.e. 3 days) but they will all receive the same messaging. It is our hypothesis that users will re-engage with the app more when notified in a shorter period of time. Why it is good: Targeting the same audience split into three segments using the same delivery method but they are receiving notifications at different time increments (see: variation).
Is that the perfect example? Probably not. But it is supposed to drive the importance of creating similar experiments so you can evaluate the results.
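For the "good" design above, one way to split users into three stable segments is to hash the user ID, so the same user always lands in the same timeframe. This is only a sketch: the function name and the ID format are hypothetical, and any consistent random assignment would work just as well.

```python
import hashlib

# The three notification timeframes from the example above.
DELAYS_IN_DAYS = [3, 7, 10]

def notification_delay(user_id: str) -> int:
    """Assign a user to one of the timeframes, deterministically."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(DELAYS_IN_DAYS)
    return DELAYS_IN_DAYS[bucket]

# Hypothetical user IDs; the same ID always maps to the same segment.
for uid in ["user-1", "user-2", "user-3"]:
    print(uid, "->", notification_delay(uid), "days")
```

Because the assignment depends only on the user ID, a user cannot drift between segments halfway through the experiment, which keeps the three groups comparable.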
What if you are running an experiment with five variants and the top-two performing variants perform better than the control but a statistically significant winner cannot be determined? Great! Two winners! You can eliminate the three other variants and continue the experiment or employ both variants as long as it doesn't hinder the customer experience.
I hope this helps - I ended up writing a lot more than I originally expected.
Thanks Justin for your comprehensive response. Awesome. Btw, our NSM is actually Weekly Active Users and our Objective is to Acquire 'x' users by end of May 2018. We believe that the more people we acquire in the app, the more users will go online weekly.
We've been running the following experiments:
1.) Provide an additional login method via mobile number (Our hypothesis is that a huge number of potential users don't want to use their Facebook accounts to log in to Just a Baby; thus, if we let them sign up with a mobile number, it would increase acquisition.)
2.) Optimize keywords in the Play Store and App Store (Our hypothesis is that when we rank high on certain keywords in the App Store, people will download our app and proceed to sign up, thus increasing acquisition.)
3.) Provide Fertility Pro feature: users can directly contact donors without having to undergo a match (Our hypothesis is that when users find a way to easily contact donors, they'd download the app and sign up, thus increasing acquisition.)
Now, we've seen sign-ups increase by 8% in the last 5 days, but how do we figure out which of the experiments above has caused the most significant impact?
Can you shed light?
Correct me if I'm wrong but your funnel/workflow is the following:
Search/Browse > Install > Sign-up > Free/Trial Users > Paid/Pro User
Test 1 - Your hypothesis is fair; someone gave the app zero stars because of the Facebook login method. You should be able to track how someone registers for your service in your database; if that is something you are not tracking, you should implement that immediately so you can see how many people arrive at your register page, the registration method, and the overall conversion rate. Once the new registration methods have been implemented, you can compare the total registration conversion rate against a previous period and see how each registration method performs compared to the others.
Test 2 - This is almost a lead quality test; if someone is searching for a specific keyword in each store, are they more likely to sign up and start using the app? This keyword optimization test affects searchability, impressions, installs, registrations, and possibly retention.
Test 3 - This is a paid product feature that could influence user acquisition and retention rates. However, I do not see it mentioned in the App Store description. If you have your app analytics set up correctly, you should be able to view how many people have viewed the feature and if they have paid for it/used it.
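As a rough illustration of the tracking described for Test 1, the sketch below takes the number of people who reached the register page plus a list of completed registrations tagged by method, and returns the overall conversion rate and the rate per method. The function name, method labels, and counts are all hypothetical.

```python
from collections import Counter

def registration_report(page_visits, registrations):
    """page_visits: number of people who reached the register page.
    registrations: one method label per completed sign-up.
    """
    if page_visits == 0:
        return {"overall_rate": 0.0, "rate_by_method": {}}
    by_method = Counter(registrations)
    return {
        "overall_rate": len(registrations) / page_visits,
        "rate_by_method": {m: c / page_visits for m, c in by_method.items()},
    }

# Hypothetical period: 400 register-page visits,
# 60 Facebook sign-ups and 40 mobile-number sign-ups.
report = registration_report(400, ["facebook"] * 60 + ["mobile"] * 40)
print(report)
```

Running the same report for a period before the mobile-number option existed gives the baseline to compare against.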
All of that being said, all three experiments (probably) shouldn't be run concurrently. Unless you can track how a specific keyword performs from search to registration (anyone?), you won't really know if your keyword(s) are driving more registrations or if a specific registration method or feature should take the credit.
In my opinion, the best way to break this all down is to create two sales funnel visuals of each step in the registration and onboarding process; the first visual is a previous period used as your control and the second is the same period after your experiment has been launched. Calculate the conversion rate between each step in the funnel to see if there is any significant improvement in each period; Kissmetrics' Stat Sig calculator is great for this: https://www.kissmetrics.com/growth-tools/ab-significance-test/. Every week/month create a new sales funnel visual and compare that data to the most recent period; this will help you gain some insight on how your experiments affect acquisition rates in the long-term. For example, if your install rate continues to increase but your registration conversion rate stays the same that is (possible) evidence that your keywords have a greater effect on your acquisition rates than the registration methods. That doesn't mean that the registration method experiment is a failure, it just gives you a better insight on what drives more acquisitions.
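The period-over-period funnel comparison can be sketched in a few lines. The step names below are a trimmed version of the funnel earlier in this thread, and the counts are invented so that the install rate rises while the sign-up rate stays flat, matching the example above.

```python
# Trimmed funnel for illustration; add later steps the same way.
FUNNEL_STEPS = ["search", "install", "sign_up"]

def step_conversion_rates(counts):
    """Conversion rate from each funnel step to the next one."""
    rates = {}
    for prev, nxt in zip(FUNNEL_STEPS, FUNNEL_STEPS[1:]):
        rates[f"{prev} -> {nxt}"] = counts[nxt] / counts[prev] if counts[prev] else 0.0
    return rates

def compare_periods(control, experiment):
    """Change in each step's conversion rate (experiment minus control)."""
    before = step_conversion_rates(control)
    after = step_conversion_rates(experiment)
    return {step: after[step] - before[step] for step in before}

# Hypothetical counts for the control period and the experiment period.
control = {"search": 10000, "install": 1000, "sign_up": 300}
experiment = {"search": 10000, "install": 1500, "sign_up": 450}

for step, change in compare_periods(control, experiment).items():
    print(f"{step}: {change:+.1%}")
```

With these made-up numbers, the install step improves while install-to-sign-up stays the same, which would point toward the keyword work rather than the registration method. Each step's change would still need a significance check before you act on it.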
Also, you should leverage cohort analysis to see if acquired users are more likely to be active on a weekly basis after those experiments have been enabled.
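A bare-bones version of that cohort analysis might look like the sketch below: group users by the week they signed up, then check what share of each cohort was active a given number of weeks later. The data shape (a signup week plus a set of active weeks per user) is an assumption for illustration, not something taken from any particular analytics tool.

```python
from collections import defaultdict

def weekly_retention(users, weeks_after=1):
    """users: list of (signup_week, active_weeks) with weeks as integers.
    Returns {signup_week: share of that cohort active `weeks_after` weeks later}.
    """
    cohort_size = defaultdict(int)
    cohort_active = defaultdict(int)
    for signup_week, active_weeks in users:
        cohort_size[signup_week] += 1
        if signup_week + weeks_after in active_weeks:
            cohort_active[signup_week] += 1
    return {week: cohort_active[week] / size for week, size in cohort_size.items()}

# Hypothetical users: week 1 is before the experiments, week 2 is after.
users = [
    (1, {1, 2}),
    (1, {1}),
    (2, {2, 3}),
    (2, {2, 3, 4}),
]
print(weekly_retention(users))
```

Comparing cohorts that signed up before and after the experiments went live shows whether the newly acquired users actually come back weekly, which is what the Weekly Active Users NSM depends on.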
I hope that helps. I do think you'll need to wait longer than 5 days to really tell if your experiments are improving registration numbers or if it is just a random spike in performance.
I'll go really basic on you, mainly because I am not familiar with the platforms you mention. Have you tried measuring your experiments with unique utm tags to see the data in Google Analytics Campaign reports?
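If you go the UTM route, tagging links by hand gets error-prone, so a small helper can keep the tags consistent. This is a sketch using Python's standard library; the URL and the campaign names are placeholders, not real links.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_utm(url, source, medium, campaign):
    """Append utm_source, utm_medium and utm_campaign to a URL."""
    parts = urlsplit(url)
    # Keep any query parameters the URL already has.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [
        ("utm_source", source),
        ("utm_medium", medium),
        ("utm_campaign", campaign),
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))

# Placeholder landing page and campaign name, one tag set per experiment.
print(add_utm("https://example.com/download", "newsletter", "email", "mobile_login_test"))
```

Giving each experiment its own utm_campaign value lets the traffic be separated per experiment in campaign reports.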