Building a subreddit analyzer with Preswald

Amrutha Gujjar, 4 min read

Category: Community


The Challenge: Not Knowing When To Post

Reddit users who want to maximize a post's visibility run into three problems:

  1. Finding the best posting time and the content a community responds to takes repeated trial and error.
  2. Predicting user behavior and spotting emerging trends with basic analysis is slow and produces unreliable results.
  3. Balancing subreddit guidelines, content value, and the need to avoid duplicate posts is an ongoing struggle, and getting it wrong wastes promotional opportunities.

The Solution: Preswald

Preswald lets you build interactive data apps quickly in Python, without complex JavaScript frameworks or expensive BI tools.

  • Lightweight: you build dynamic dashboards entirely in Python.
  • Charts update automatically as the data changes.
  • No complicated front-end code or vendor lock-in.


The Implementation: From Static Reports to Live Dashboards

Step 1: Installing Preswald

We start by installing Preswald:

pip install preswald

Step 2: Build a Preswald Dashboard

We’ll build three Plotly charts and present them with Preswald:

  1. A heat map showing how many top posts were created at each hour of each day of the week

  2. A bar chart of the total upvotes on the subreddit's posts for each day of the week

  3. A bar chart of the total upvotes on the subreddit's posts for each hour of the day
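
To make the heat map concrete, the sketch below shows the kind of day-by-hour count grid it is drawn from. The `posts` DataFrame and its values are made up for illustration; only the `Day` and `Hour` column names follow the code in this post.

```python
import pandas as pd

# Hypothetical sample: one row per post, with the day name and hour it was created.
posts = pd.DataFrame({
    "Day": ["Monday", "Monday", "Tuesday", "Monday"],
    "Hour": [9, 9, 14, 18],
})

# Count the posts for each (Day, Hour) pair, then pivot the hours into columns.
# Day/hour combinations with no posts are filled with 0.
grid = posts.groupby(["Day", "Hour"]).size().unstack(fill_value=0)
print(grid)
```

Each row of `grid` is a day and each column is an hour, so the cell values map directly onto the colors of the heat map.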

Here is the code for the three charts, plus a helper that finds the best times to post:

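import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

def get_histogram_object(df: pd.DataFrame) -> go.Figure:
    # Heat map of post counts by day of week and hour of day.
    # Days are listed in reverse so that Monday is drawn at the top of the y-axis.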
    day_order = ['Sunday', 'Saturday', 'Friday', 'Thursday', 'Wednesday', 'Tuesday', 'Monday']
    day_type = pd.CategoricalDtype(categories=day_order, ordered=True)
    df['Day'] = df['Day'].astype(day_type)

    figure_df = df.groupby(['Day', 'Hour']).size().unstack(fill_value=0)

    fig = go.Figure(
        data=go.Heatmap(
            z=figure_df.values,
            x=figure_df.columns,
            y=figure_df.index,
            colorscale='Viridis'
        )
    )

    fig.update_layout(
        title='All top posts',
        xaxis_title='Hour of Day',
        yaxis_title='Day of Week',
    )
    return fig

def top_post_by_day(df: pd.DataFrame) -> go.Figure:
    # Bar chart of total upvotes (summed score) for each day of the week.
    day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    day_type = pd.CategoricalDtype(categories=day_order, ordered=True)
    df['Day'] = df['Day'].astype(day_type)
    
    figure_df = df.groupby('Day', as_index=False)['score'].sum()
    fig = px.bar(figure_df, x='Day', y='score', title='Top Posts by Day', category_orders={'Day': day_order})
    fig.update_layout(
        xaxis_title='Day of Week',
        yaxis_title='# Upvotes',
    )
    return fig

def top_post_by_hour(df: pd.DataFrame) -> go.Figure:
    # Bar chart of total upvotes (summed score) for each hour of the day.
    figure_df = df.groupby('Hour', as_index=False)['score'].sum()
    fig = px.bar(figure_df, x='Hour', y='score', title='Top Posts by Hour')
    fig.update_layout(
        xaxis_title='Hour of Day',
        yaxis_title='# Upvotes',
    )
    return fig

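def best_time_to_post(df: pd.DataFrame) -> list[str]:
    # For each of the three highest-scoring days, find the hour, and then the
    # minute within that hour, with the most upvotes; return readable strings.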
    top_days = df.groupby('Day', as_index=False)['score'].sum().sort_values(by='score', ascending=False).head(3)

    top_days_list = top_days['Day'].tolist()
    best_times = []
    for day in top_days_list:
        filtered_df = df[df['Day'] == day]
        grouped_by_hour = filtered_df.groupby('Hour', as_index=False)['score'].sum().sort_values(by='score', ascending=False).head(1)
        
        top_hour = grouped_by_hour['Hour'].iloc[0]
        hour_df = filtered_df[filtered_df['Hour'] == top_hour]

        grouped_by_minute = hour_df.groupby('Minute', as_index=False)['score'].sum().sort_values(by='score', ascending=False).head(1)
        top_minute = grouped_by_minute['Minute'].iloc[0]

        am_pm = 'AM' if top_hour < 12 else 'PM'
        hour_12 = top_hour % 12
        hour_12 = 12 if hour_12 == 0 else hour_12
        readable_time = f"{day} at {hour_12}:{top_minute:02d} {am_pm}"

        best_times.append(readable_time)

    return best_times
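
The AM/PM formatting inside `best_time_to_post` can be pulled out into a small helper, which makes the midnight and noon cases easy to check. This is only a sketch: `to_12_hour` is a hypothetical name, not part of the app or of Preswald.

```python
def to_12_hour(hour: int, minute: int) -> str:
    """Format a 24-hour clock time as a 12-hour string such as '1:05 PM'."""
    suffix = "AM" if hour < 12 else "PM"
    # Hours 0 and 12 both display as 12 on a 12-hour clock.
    hour_12 = hour % 12 or 12
    return f"{hour_12}:{minute:02d} {suffix}"


print(to_12_hour(0, 5))    # 12:05 AM
print(to_12_hour(12, 0))   # 12:00 PM
print(to_12_hour(13, 30))  # 1:30 PM
```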

Step 3: Load and Prepare the Data

import pandas as pd
import pytz
from preswald import connect, get_df, selectbox

connect()  # load the configured data sources
df = get_df('reddit_csv')

subreddit = selectbox("Subreddit", df['subreddit'].unique(), default='dataengineering')
all_timezones = pytz.all_timezones
local_timezone = selectbox("Timezone", all_timezones, default='US/Pacific')
df = df[df['subreddit'] == subreddit].copy()  # copy so the new columns below are not written to a view

df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')
df['created_utc'] = df['created_utc'].dt.tz_localize('UTC').dt.tz_convert(local_timezone)
df['Date'] = df['created_utc'].dt.date
df['Day'] = df['created_utc'].dt.day_name()
df['Month'] = df['created_utc'].dt.month_name()
df['Year'] = df['created_utc'].dt.year
df['Hour'] = df['created_utc'].dt.hour
df['Minute'] = df['created_utc'].dt.minute
df['score'] = df['score'].astype(int)
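
Converting to a local timezone matters because it can change both the hour and the day a post is counted under. The following standard-library sketch illustrates the same idea as the pandas `tz_localize` and `tz_convert` calls above. `local_day_and_hour` is a hypothetical helper, and the timestamp is an arbitrary example rather than a value from the dataset.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_day_and_hour(epoch_seconds: int, tz_name: str) -> tuple[str, int]:
    """Return the weekday name and hour of a Unix timestamp in the given timezone."""
    utc_time = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    local_time = utc_time.astimezone(ZoneInfo(tz_name))
    return local_time.strftime("%A"), local_time.hour


# 1700017200 is 2023-11-15 03:00:00 UTC.
epoch = 1700017200
print(local_day_and_hour(epoch, "UTC"))                  # ('Wednesday', 3)
print(local_day_and_hour(epoch, "America/Los_Angeles"))  # ('Tuesday', 19)
```

A post made early on Wednesday in UTC is a Tuesday evening post for a Pacific-time reader, which is why the app asks for a timezone before computing the `Day` and `Hour` columns.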

Step 4: Run & View Your Interactive Reddit Analyzer

With everything in place, run the script using Preswald:

preswald run my_script.py

This launches a local server. Open the URL it prints in your web browser, and you’ll see the heat map and both bar charts. From here, you can:

  • Determine the best time to post on a subreddit
  • Refresh the script for near-instant updates
  • Share the app link with associates for real-time analysis

Try Preswald today!

https://github.com/StructuredLabs/preswald

Special thanks to [Varnit Singh](https://www.linkedin.com/in/varnitsingh) for creating this app.