Rank | City | Count
---|---|---
1 | Denver, CO | 90 |
2 | Portland, OR | 82 |
3 | Chicago, IL | 75 |
4 | Seattle, WA | 69 |
5 | San Diego, CA | 67 |
6 | Austin, TX | 44 |
7 | Albuquerque, NM | 40 |
8 | Brooklyn, NY | 37 |
9 | Houston, TX | 37 |
10 | San Francisco, CA | 36 |
11 | Minneapolis, MN | 35 |
12 | Cincinnati, OH | 34 |
13 | Philadelphia, PA | 33 |
14 | Colorado Springs, CO | 32 |
15 | Milwaukee, WI | 32 |
16 | Pittsburgh, PA | 31 |
17 | Nashville, TN | 31 |
18 | Asheville, NC | 30 |
19 | Charlotte, NC | 30 |
20 | Portland, ME | 28 |
I was recently having a discussion with a friend about which city has the most breweries. I couldn’t find anything on the internet that answered that simple question, so I took it as an opportunity for a little project.
I decided to use BeerAdvocate as my source. I was almost certain there was an API for BeerAdvocate, but there wasn’t, so I had to scrape their listings pages, which was easy enough.
I saved off the HTML files so I wouldn’t have to hit the site more than once. The listings are paginated 20 places per page, so 1,510 pages covers the full US list of roughly 30,000 entries.
```python
import requests

for p in range(1510):
    params = {
        'start': 20 * p,  # 20 places per page
        'c_id': 'US',
    }
    page = requests.get('https://www.beeradvocate.com/place/list', params=params)
    with open(f'./beer_advocate/page_{p}.html', 'w') as fid:
        # page.text, not str(page.content), so we don't write a bytes repr to disk
        fid.write(page.text)
```
I then used lxml and some XPath magic to pull the relevant data. Each place spans a pair of table rows, indexed as `x[0]` and `x[1]` below; the `tryf` helper, which swallows errors on malformed rows, is sketched after this block.
```python
# Each place `p` is a pair of rows: p[0] holds the name and stats columns,
# p[1] holds the address cell.
get_name = lambda x: x[0].xpath('./td[1]')[0].text_content()
get_address = lambda x: x[1].xpath('./td[1]/text()')[0]
get_zip = lambda x: x[1].xpath('./td[1]/text()')[2].split(', ')[1]
get_city = lambda x: ' '.join(x[1].xpath('./td[1]/a[1]/text()'))
get_state = lambda x: ' '.join(x[1].xpath('./td[1]/a[2]/text()'))
get_country = lambda x: ' '.join(x[1].xpath('./td[1]/a[3]/text()'))
get_score = lambda x: x[0].xpath('./td[2]')[0].text_content()
get_ratings = lambda x: x[0].xpath('./td[3]')[0].text_content()
get_beer_avg = lambda x: x[0].xpath('./td[4]')[0].text_content()
get_num_beers = lambda x: x[0].xpath('./td[5]')[0].text_content()

# Spot-check the extractors on a sample place `p`:
print(get_name(p))
print(get_address(p))
print(tryf(get_zip, p))
print(tryf(get_city, p))
print(tryf(get_state, p))
print(tryf(get_country, p))
print(tryf(get_score, p))
print(tryf(get_ratings, p))
print(tryf(get_beer_avg, p))
print(tryf(get_num_beers, p))
```
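These snippets lean on three helpers that never made it into the post: `tryf`, `get_page`, and `get_places`. Here is a minimal sketch of what they plausibly look like, assuming each listing occupies two consecutive `<tr>` data rows in the results table (the row-pairing logic is my guess at the page structure, not anything BeerAdvocate documents):

```python
from lxml import html

def get_page(page_num):
    # Re-parse a page saved by the download loop above.
    with open(f'./beer_advocate/page_{page_num}.html') as fid:
        return html.fromstring(fid.read())

def get_places(page):
    # Assumption: each place occupies two consecutive data rows, so a
    # "place" is the pair [stats_row, address_row] the lambdas index into.
    rows = [r for r in page.xpath('//table//tr') if r.xpath('./td')]
    return [rows[i:i + 2] for i in range(0, len(rows) - 1, 2)]

def tryf(f, *args):
    # Some listings are missing fields; return None instead of raising.
    try:
        return f(*args)
    except (IndexError, AttributeError):
        return None
```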
I then ran through all the files, pushing each brewery into a list and finally creating a Pandas DataFrame.
```python
import pandas as pd

out = []
for page_num in range(1510):
    page = get_page(page_num)
    places = get_places(page)
    for p in places:
        out.append({
            'name': get_name(p),
            'address': tryf(get_address, p),
            'city': tryf(get_city, p),
            'state': tryf(get_state, p),
            'country': tryf(get_country, p),
            'zip': tryf(get_zip, p),
            'score': get_score(p),
            'ratings': get_ratings(p),
            'beer_avg': get_beer_avg(p),
            'num_beers': get_num_beers(p),
        })

df = pd.DataFrame(out)
```
Getting the top 30 beer cities was as easy as the one-liner below; filtering out rows where `num_beers` is the `'-'` placeholder keeps only places that actually list beers of their own.
```python
(df[df.num_beers != '-']
 .groupby(['city', 'state'])
 .size()
 .sort_values(ascending=False)
 .head(30))
```
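That chain returns a Series keyed by a `(city, state)` MultiIndex. To get the `City, ST` labels used in the table at the top, the index just needs collapsing; one way to do it (the `top` variable name is mine):

```python
top = (df[df.num_beers != '-']
       .groupby(['city', 'state'])
       .size()
       .sort_values(ascending=False)
       .head(20))

# Collapse the (city, state) MultiIndex into single "City, ST" labels.
top.index = [f'{city}, {state}' for city, state in top.index]
print(top.to_string())
```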