mtn
MegaDork
6/1/17 12:49 p.m.
Been too long since I messed with this stuff in school. Don't remember what test I'm supposed to use.
I have a group of branches of a business. I have the full population of their clients. I'm trying to determine the differences between the branches.
I have the following information, and I need to compare it. The number is the average (mean) number of purchases per customer—and that is a complete number for each branch (i.e. every purchase from every customer was taken into account). We have 12 months of data for this. Minimum number of customers is about 118 at a single branch in a single month; maximum is 1500. Average around 600.
Moscow Branch..... 5.05
Paris Branch...... 4.95
London Branch..... 4.87
Prague Branch..... 5.31
Tokyo Branch...... 4.97
For each branch, I also have the average (mean) number of purchases per customer by month:
Moscow Branch Jan..... 5.03
Moscow Branch Feb..... 5.08
Moscow Branch Mar..... 4.73
(There are 29 total branches, all these numbers made up, but close enough to the real thing)
I have a few things that I’m trying to figure out. First:
Comparing Moscow to Paris to Tokyo to… (all the rest)
H0: μ1 = μ2 = … = μn
H1: at least one mean differs from the others
(Significance level TBD, but probably α = .05)
Second, within the branches themselves:
Comparing, for Moscow, January to February to March to… (all the rest)
H0: μ1 = μ2 = … = μn
H1: at least one mean differs from the others
(Significance level TBD, but probably α = .05)
What tests should I be using for this? Assume that the underlying data (# of purchases/customer) is a right skewed distribution--most customers only buy one thing, some buy two, few buy three, fewer buy four...
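A minimal sketch of the data shape described above. The geometric distribution here is only a stand-in assumption for the real purchase counts (most customers buy one, fewer buy two, and so on); the real data may follow a different count distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# One plausible model for "most buy one, some buy two, few buy three...":
# a geometric distribution (support starts at 1). This is illustrative only.
purchases = rng.geometric(p=0.6, size=10_000)

# Frequency of each purchase count: counts[k] = number of customers who made k purchases.
counts = np.bincount(purchases)

# The long right tail pulls the mean above the median.
print("mean:", purchases.mean(), "median:", np.median(purchases))
```

With this shape, 1 is by far the most common count and each higher count is rarer, which is exactly the right skew that rules out normal-theory assumptions on the raw data.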
Sorry. I only know lies and damned lies.
All I can remember is that it's certainly not a normal distribution. Other than that, I no longer have my stats books to recall what kind of distribution it is; you can't have fewer than 0 sales, and that's the problem with the distribution.
What are you looking for and who is paying for the study?
RossD
UltimaDork
6/1/17 1:16 p.m.
In our engineering curriculum, the statistics class taught only the most basic concepts, plus one additional lesson: when you need actual statistical analysis, you hire an actual statistician.
I rang the bell for RX Reven
I have not done stats in a long while, but I am pretty sure you will need sample sizes (n) for all of those measurements for any calculations.
mtn
MegaDork
6/1/17 1:23 p.m.
aircooled wrote:
I rang the bell for RX Reven
I have not done stats in a long while, but I am pretty sure you will need sample sizes (n) for all of those measurements for any calculations.
We can get n for all of them.
I am not a statistician. I know enough to butcher my own data. My lack of knowledge of your problem or statistics in general could make this advice terrible. Be warned.
For normal stuff, you'd start with an ANOVA to determine if any of your means are significantly different from each other. If so, you'd pairwise.t.test the crap out of everything (add false discovery correction as needed) to determine which branches are over- or under-performing.
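The normal-theory workflow above can be sketched in Python with SciPy standing in for R's pairwise.t.test. The branch data below are made up (Poisson counts with one branch deliberately shifted), and Bonferroni is used as a simple stand-in for the false-discovery correction mentioned:

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-customer purchase counts for three branches;
# Prague is deliberately shifted so there is something to find.
branches = {
    "Moscow": rng.poisson(5.05, 600),
    "Paris":  rng.poisson(4.95, 600),
    "Prague": rng.poisson(6.00, 600),
}

# Step 1: one-way ANOVA -- "are any means different at all?"
f_stat, p_anova = stats.f_oneway(*branches.values())

# Step 2: if ANOVA rejects, pairwise t-tests, Bonferroni-corrected.
if p_anova < 0.05:
    pairs = list(combinations(branches, 2))
    for a, b in pairs:
        t, p = stats.ttest_ind(branches[a], branches[b])
        print(a, b, "different" if p < 0.05 / len(pairs) else "similar")
```

With 29 real branches the pairwise step is 406 comparisons, which is why a multiple-comparison correction is not optional.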
Nonparametric testing is the term for the not-normal stuff you might want to do. To determine if your data is or isn't normally distributed, you can run a Shapiro-Wilk test. Once you've established your stuff is not normal, you might want to look into the Wilcoxon Rank Sum Test (or Mann-Whitney stuff). The Kruskal-Wallis Rank Sum Test is the "ANOVA" equivalent for your initial "should I bother to look further" test.
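The same idea down the nonparametric route, again on made-up skewed branch data (geometric counts are an assumption, chosen to mimic the right skew described in the original post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical right-skewed purchase counts; Prague deliberately shifted.
moscow = rng.geometric(p=1 / 5.05, size=600)
paris  = rng.geometric(p=1 / 4.95, size=600)
prague = rng.geometric(p=1 / 7.00, size=600)

# Shapiro-Wilk: small p => reject normality (expected here, the data are skewed).
w, p_norm = stats.shapiro(moscow)

# Kruskal-Wallis: the rank-based "should I bother to look further" test.
h, p_kw = stats.kruskal(moscow, paris, prague)

# Mann-Whitney U as the pairwise follow-up for one specific pair.
u, p_mw = stats.mannwhitneyu(moscow, prague, alternative="two-sided")

print("normality p:", p_norm, "Kruskal-Wallis p:", p_kw, "Mann-Whitney p:", p_mw)
```

The rank tests never look at the raw means, so the skew and the hard floor at one purchase stop being a problem.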
I'm here, here!
I'm jumping on a plane to go home from, well, getting paid to do statistics, so a baked, good-enough-to-put-my-name-on answer will need to wait until tomorrow morning. Please PM me so we can arrange to shoot the raw data over to me. If you can't do that, I'll walk you through the appropriate steps (test for normality, attempt to transform data, etc.).
If my plane goes down, call on Kylini as he's got a good handle on the situation, except for his insane reference to paired T-tests...those are for two groups only; running them on more inflates the risk of committing a Type I error (concluding a significant difference exists when in fact there is none). Additionally, a paired T-test controls for part-to-part variation, which isn't what we're trying to do here. Other than that, though, Kylini has it right.
mtn
MegaDork
6/1/17 3:37 p.m.
RX, that would be awesome--I need an email though; I cannot send PMs through this site for whatever reason.
mtn: I messaged you his email address. Let me know if you get it.
oldtin
PowerDork
6/1/17 3:53 p.m.
Glad RX Reven' is chiming in. First question: are the products sold at each branch the same? Are there variations in customers between branches? Another thought is to ditch averages; they tend to obscure more than reveal. Look up ANOVA for comparing data sets.
If the goal is to evaluate the performance of each branch, wouldn't you also want to know the median (as opposed to mean) number of purchases per customer, as well as the dollar amounts per transaction?
mtn
MegaDork
6/1/17 6:07 p.m.
Stealthtercel wrote:
If the goal is to evaluate the performance of each branch, wouldn't you also want to know the median (as opposed to mean) number of purchases per customer, as well as the dollar amounts per transaction?
No--this is in the banking world, and we're trying to analyze to see if there is any Wells Fargo nefariousness going on. It's not, and anyone with any sense could see that, but the powers that be want a bunch of statistical analysis to prove that.
RX Reven' wrote: ...except for his insane reference to paired T-tests...
Glad I was mostly on the ball! I'm also glad I included a strong enough warning to not use my advice.
If you're into that kinda stuff, have you listened to this Radiolab story or heard about it?
Radiolab link
It talks about using Benford's Law (which I hadn't heard of before) to catch discrepancies in accounting and how this forensic accountant has been catching people with it. It's pretty interesting.
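The Benford's Law check from that story is easy to sketch: the expected frequency of leading digit d is log10(1 + 1/d). The "amounts" below are synthetic log-uniform values (real forensic work would use actual transaction amounts), but they illustrate the comparison:

```python
import math
import numpy as np

rng = np.random.default_rng(3)
# Data spanning several orders of magnitude tends to follow Benford's Law;
# log-uniform amounts are a simple way to generate such data.
amounts = 10 ** rng.uniform(0, 5, size=100_000)

# Leading digit of each amount (all values are >= 1, so the first
# character of the decimal string is a digit from 1 to 9).
first_digits = np.array([int(str(a)[0]) for a in amounts])
observed = np.bincount(first_digits, minlength=10)[1:] / len(amounts)
expected = np.array([math.log10(1 + 1 / d) for d in range(1, 10)])

for d in range(1, 10):
    print(d, round(observed[d - 1], 3), round(expected[d - 1], 3))
```

Fabricated numbers tend to have leading digits spread far more evenly than the roughly 30% of ones Benford predicts, which is what the forensic accountants look for.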
Kylini wrote:
RX Reven' wrote: ...except for his insane reference to paired T-tests...
Glad I was mostly on the ball! I'm also glad I included a strong enough warning to not use my advice.
Hi Kylini,
I sincerely apologize for selecting the unnecessarily harsh term "insane"...the combination of having very little available time and the tall IPA I was inhaling at the sky bar kicking in resulted in my not sending the second part of my text, which made it clear I was exaggerating for comedic value.
Thanks for being a good sport.
jstand
HalfDork
6/2/17 12:04 p.m.
Not to hijack, but statistics related:
Any good reference books for calculating sample size requirements?
Not for surveys, but for design verification testing of new products.
Basically if I know the tolerance and experimental variability, how many units do I need to test to achieve a desired confidence and reliability?
Binomial distributions work for some cases, but may result in larger sample sizes than necessary (or desired) for capital equipment, but I'm sure there are other methods for variable data.
Thanks.
In reply to mtn: OK, I see what you're doing. That makes sense.
jstand wrote:
Not to hijack, but statistics related:
Any good reference books for calculating sample size requirements?
Not for surveys, but for design verification testing of new products.
Basically if I know the tolerance and experimental variability, how many units do I need to test to achieve a desired confidence and reliability?
Binomial distributions work for some cases, but may result in larger sample sizes than necessary (or desired) for capital equipment, but I'm sure there are other methods for variable data.
Thanks.
Isn't that just power testing?
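It is: jstand's question is a power/sample-size calculation. A minimal sketch using the standard normal-approximation formula for detecting a mean shift delta with known standard deviation sigma (the delta and sigma values below are hypothetical placeholders, not from any real test plan):

```python
import math
from scipy.stats import norm

def sample_size(delta, sigma, alpha=0.05, power=0.80):
    """Sample size needed to detect a mean shift of `delta` (two-sided test)
    with significance `alpha` and power `power`, given unit-to-unit `sigma`:
    n = ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g. detect a 0.5-unit shift when unit-to-unit sigma is 1.0
n = sample_size(delta=0.5, sigma=1.0)
print(n)  # -> 32
```

Tightening the requirements (higher power, smaller detectable shift) drives n up fast, which is exactly the capital-equipment tradeoff jstand describes; dedicated power-analysis tools (e.g. statsmodels' power classes, or G*Power) handle the exact noncentral-t version of this.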
I took a statistics for engineers class a few years ago. Great teacher (with a Bond-villain-thick Russian accent), and everything was taught with Excel, knowing that's what we would actually use. I can pull out the old laptop and send you what I have if you want. The class walks you through each stat, when to use it, what reasonable assumptions are, and then step by step how to use Excel to get the answers.
Jeff