HubSpot to BigQuery: Automating Data Pipelines with Cloud Functions

Seamless ETL Process from HubSpot to BigQuery Using Google Cloud Services and Python

Introduction:

Integrating HubSpot with BigQuery enables businesses to centralize customer and marketing data for deeper analysis and real-time insights. While there are no-code tools available, building a custom ETL pipeline using Google Cloud’s Cloud Functions and Python gives you more control and flexibility. This approach ensures that your HubSpot data is automatically extracted, transformed, and loaded (ETL) into BigQuery, empowering your team to make data-driven decisions without worrying about manual updates or complex workflows.

In this tutorial, we’ll walk through how to set up an automated ETL pipeline from HubSpot to BigQuery using Google Cloud Scheduler, Cloud Functions, and the HubSpot API. Python will be used to write custom logic for data extraction and transformation. By the end, you’ll have a fully functional pipeline that schedules regular data loads, handles API calls, and seamlessly pushes your HubSpot data into BigQuery.

We’ll cover:

  • Setting up Google Cloud Scheduler to trigger the pipeline

  • Writing a Cloud Function in Python to extract data from HubSpot via API

  • Transforming and loading data into BigQuery

  • Automating the pipeline to run on a set schedule for real-time insights

Let’s dive into the technical setup and see how you can automate your data flow!

Technology Stack for Automating the ETL Process

Google Cloud Scheduler:
Cloud Scheduler is a fully managed cron service that allows you to schedule jobs at regular intervals. In this pipeline, it is used to trigger the ETL process, ensuring that data from HubSpot is pulled and processed at the specified time.

Google Cloud Functions:
Cloud Functions is a lightweight, serverless compute service that allows you to run code in response to events. In this pipeline, Cloud Functions are used to make API requests to HubSpot, extract the required data, transform it if necessary, and prepare it for loading into BigQuery.

BigQuery:
BigQuery is a fully managed data warehouse that allows for fast SQL queries using the processing power of Google’s infrastructure. It is the final destination for the HubSpot data, where it can be stored, analyzed, and used for reporting.

Architecture and Data Flow

The architecture, titled "Automating HubSpot Data Load into BigQuery with Cloud Functions," describes the process of automatically loading data from HubSpot into BigQuery using Google Cloud services: Cloud Scheduler triggers a Cloud Function, which extracts data from the HubSpot API, transforms it, and loads it into BigQuery.

Step 1:
Automated Triggering by Cloud Scheduler

Cloud Scheduler is set up to trigger the entire ETL process at regular intervals (e.g., daily or hourly). It sends an HTTP request to a Cloud Function that initiates the data extraction.

To schedule the automated data transfer from HubSpot to BigQuery, you will need to set up a job in Google Cloud Scheduler. Here's how the configuration looks:

  1. Name: The job is named hubspot_data_to_bigquery, ensuring the job is uniquely identified in the region it operates in, which in this case is europe-west3 (Frankfurt).

  2. Description: A concise description such as "Job that triggers the data query and data import function" provides clarity on the function of the job.

  3. Frequency (Cron Expression): The job is configured to run every day at 8:00 AM. The cron expression used for this is 0 8 * * *, which means:

    • 0: Trigger at the 0th minute.

    • 8: Trigger at the 8th hour of the day (i.e., 8:00 AM).

    • * * *: Run this job every day of every month, regardless of the day of the week or the month.

  4. Time Zone: The selected time zone is Central European Summer Time (CEST). Note that jobs scheduled in time zones affected by daylight saving time may shift slightly when DST changes; scheduling in UTC avoids such discrepancies.

Next, configure the job's execution target:

  1. Target Type: HTTP

    • You select HTTP because the scheduler will send an HTTP request to trigger the Cloud Function (a minimal sketch of such an entry point follows this list).

  2. URL

    • The HTTPS endpoint of the Cloud Function that runs the ETL logic.

  3. HTTP Method: POST

    • You use POST to trigger the function, since the request initiates an action (data extraction and load) rather than simply retrieving a resource.

  4. Authentication

    • Optionally, you can configure authentication in the Auth Header to ensure only authorized calls can trigger your Cloud Function.
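
On the Cloud Function side, the target of this POST request is a plain HTTP entry point. Below is a minimal sketch using the Functions Framework for Python; the function name etl_entry_point is illustrative, and hubspot_search() and insert_data_into_bigquery() are the helpers introduced in Steps 2 and 3.

import functions_framework

@functions_framework.http
def etl_entry_point(request):
    """HTTP handler invoked by Cloud Scheduler's POST request."""
    # Extract data from HubSpot (Step 2) and load it into BigQuery (Step 3)
    results = hubspot_search()
    insert_data_into_bigquery(results)
    return f"Loaded {len(results)} rows into BigQuery", 200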

Step 2:
Fetching Data from HubSpot via Cloud Functions

In this step, the Cloud Function is triggered to retrieve data from the HubSpot API. The Python function hubspot_search() is responsible for making API requests to HubSpot and fetching relevant data based on predefined filters.

Key Parts:

  1. HubSpot API Setup:
    The function uses the hubspot.Client object to interact with the HubSpot API, passing in an access token for authentication (ACCESS_TOKEN = 'hubspot_api_key' is a placeholder for your token).

  2. Search Filters:
    A filter (COMPANIES_FILTER) is configured to retrieve specific properties from HubSpot, such as "country" and "status". The filter also limits the results to 100 companies per request.

  3. Handling Pagination:
    The function handles pagination by checking if there are more results (paging.next.after) and continues fetching until all data is retrieved. It also adds a 1-second delay between requests to avoid hitting rate limits.

  4. Collecting Data:
    The data for each company is formatted into a list of dictionaries (all_results), which includes fields like customer_id, country, hubspot_id, and name. The client and filter setup assumed by the function is sketched below.
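
Before the search function itself, here is a minimal setup sketch, assuming the official hubspot-api-client package; the exact filter values in COMPANIES_FILTER are illustrative and should be adapted to the properties in your portal.

import time
from datetime import datetime

import hubspot
from hubspot.crm.companies import PublicObjectSearchRequest, ApiException

ACCESS_TOKEN = 'hubspot_api_key'  # placeholder for your HubSpot access token
client = hubspot.Client.create(access_token=ACCESS_TOKEN)

# Illustrative filter: which companies to fetch and which properties to return
COMPANIES_FILTER = {
    "filterGroups": [
        {"filters": [{"propertyName": "status", "operator": "HAS_PROPERTY"}]}
    ],
    "properties": ["customer_id", "country", "status", "name"],
    "limit": 100  # up to 100 companies per request
}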

def hubspot_search():
    print("Starting HubSpot API Search request...")
    
    after = None
    more_results = True
    all_results = []

    while more_results:
        try:
            # Create a search request with the defined filter and paging token
            search_request = PublicObjectSearchRequest(
                filter_groups=COMPANIES_FILTER["filterGroups"], 
                properties=COMPANIES_FILTER["properties"], 
                limit=COMPANIES_FILTER["limit"],
                after=after
            )
            
            # Execute the search API request
            search_results = client.crm.companies.search_api.do_search(public_object_search_request=search_request)

            # Append each company's data to the all_results list
            for company in search_results.results:
                all_results.append({
                    "customer_id": company.properties.get('customer_id'),
                    "country": company.properties.get('country'),
                    "hubspot_id": company.id,
                    "timestamp": datetime.now().isoformat(),
                    "name": company.properties.get('name', 'undefined')
                })

            # Handle pagination
            paging = search_results.paging
            if paging and paging.next and paging.next.after:
                after = paging.next.after
                time.sleep(1)  # Pause to avoid hitting the rate limit
            else:
                more_results = False
            
        except ApiException as e:
            print(f"Exception when calling search_api->do_search: {e}\n")
            more_results = False

    return all_results

Step 3:
Data Transformation and Loading into BigQuery

After retrieving raw data from HubSpot, the Cloud Function transforms the data into a format suitable for BigQuery by selecting key fields (e.g., customer_id, country, hubspot_id, and name) and adding necessary metadata like timestamps. The transformed data is then loaded into BigQuery using the insert_rows_json() method, which streams the rows into the target table as JSON objects.

Key Parts:

  1. Data Structure:
    The data fetched from HubSpot is already structured in dictionaries within a list (all_results), which is compatible with BigQuery's JSON format.

  2. Transformation:
    While fetching, the function ensures that only necessary fields are included, such as the company’s name, country, and ID. Additional formatting (e.g., adding a timestamp) is done to align with BigQuery’s schema.

  3. Loading Data:
    The function insert_data_into_bigquery() handles loading the transformed data into BigQuery. It sends the list of dictionaries directly to the BigQuery table using the insert_rows_json method. This method accepts JSON objects and inserts them into the specified BigQuery table.

from google.cloud import bigquery

# BigQuery client and fully qualified target table
# (placeholder IDs, matching the query example in Step 4)
bq_client = bigquery.Client()
table_id = "your_project_id.your_dataset_id.hubspot_data"

def insert_data_into_bigquery(data_list):
    # Stream the list of dictionaries into BigQuery as JSON rows
    errors = bq_client.insert_rows_json(table_id, data_list)  # BigQuery API request
    
    # insert_rows_json returns a list of per-row errors; an empty list means success
    if errors == []:
        print("Data successfully inserted into BigQuery.")
    else:
        print(f"Error inserting data: {errors}")

Step 4:
Querying Data in BigQuery

Now that the transformed data has been successfully loaded into BigQuery, it’s ready for querying and analysis. This final step allows you to leverage the power of BigQuery to perform complex queries, generate insights, and integrate your HubSpot data into reports or dashboards.

Setting up BigQuery for Querying

Before querying the data, it’s essential to ensure that your Google Cloud Project and BigQuery dataset are properly configured. Here's what you need to have in place:

  1. Google Cloud Project:

    • You need to create a Google Cloud Project to manage your resources. If you haven’t created a project yet, you can do this from the Google Cloud Console by navigating to the Project Selector and creating a new project.

  2. BigQuery Dataset:

    • In BigQuery, data is stored in datasets, which are containers that hold tables. To set this up:

      • Go to the BigQuery Console and create a new dataset under your project.

      • Specify the Dataset ID and configure the necessary settings, such as data location and default table expiration time, if required.

  3. Table Creation:

    • After creating the dataset, the table (e.g., hubspot_data) is filled by the Cloud Function as part of the data load process. Note that insert_rows_json streams rows into an existing table, so the table and its schema must be created once before the first load; a setup sketch follows this list.
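
Here is a minimal sketch of that one-time table setup, assuming the field names collected in Step 2 (customer_id, country, hubspot_id, timestamp, name) and placeholder project and dataset IDs:

from google.cloud import bigquery

bq_client = bigquery.Client()
table_id = "your_project_id.your_dataset_id.hubspot_data"

# Schema mirrors the fields collected in Step 2
schema = [
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("country", "STRING"),
    bigquery.SchemaField("hubspot_id", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField("name", "STRING"),
]

table = bigquery.Table(table_id, schema=schema)
bq_client.create_table(table, exists_ok=True)  # no-op if the table already exists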

Querying Your Data

Once the data is loaded into the BigQuery table, you can start running SQL queries to analyze it. BigQuery supports standard SQL (GoogleSQL), allowing you to extract insights from your HubSpot data.

Here’s an example SQL query you can run to get a list of companies along with their countries:

SELECT customer_id, name, country, timestamp
FROM `your_project_id.your_dataset_id.hubspot_data`
WHERE country IS NOT NULL
ORDER BY timestamp DESC;

This query retrieves all records where the country field is populated, sorts them by the timestamp of data insertion, and returns the customer ID, company name, country, and the time the data was inserted.
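
If you prefer to run the query from Python rather than the BigQuery console, a small sketch using the same google-cloud-bigquery client could look like this (the project and dataset IDs remain placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client()

query = """
    SELECT customer_id, name, country, timestamp
    FROM `your_project_id.your_dataset_id.hubspot_data`
    WHERE country IS NOT NULL
    ORDER BY timestamp DESC
"""

# Run the query and iterate over the result rows
for row in bq_client.query(query).result():
    print(row.customer_id, row.name, row.country, row.timestamp)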

Visualization and Reporting

You can now use this data for reporting, connect it to visualization tools like Google Data Studio, or integrate it into custom dashboards for monitoring your business metrics. BigQuery’s ability to handle large datasets with fast query execution makes it ideal for processing and analyzing HubSpot data efficiently.

By completing this setup, you’ve established a fully automated ETL (Extract, Transform, Load) pipeline that allows you to move data from HubSpot to BigQuery and make it available for analysis with minimal manual intervention.