This tutorial is based on the MongoDB tutorials on Data Modeling.
If you’ve learned how to design an SQL database, you likely learned to construct a table for each entity involved in the application and to use foreign keys that reference primary keys to establish relationships between entities. You will also have learned that duplication within a table or among multiple tables is removed through the process of normalization. With this structure we can perform queries that bring together data from multiple tables by performing JOIN operations.
A NoSQL database is designed differently. When designing a MongoDB database, keep these things in mind:
- Design your schema to match the way your application uses data. Consider the views that the user will interact with.
- Anticipate the queries that an application will send and design the schema so that the queries minimize if not eliminate joins across multiple collections.
- Don’t feel obligated to create a separate collection for every entity. Perhaps adding subdocuments to documents better serves your application.
- Documents within the same collection do not need to have the same fields, and a shared field can even hold values of different types in different documents.
- Duplicating data in multiple documents is a feature of NoSQL – not an indication of a design flaw.
- Consider the implications of requests to update data that is duplicated in multiple documents.
Below we discuss the following schema design process.
- Identify your application’s workload.
- Understand common design patterns.
- Design schema that support the application’s entity relationships.
1. Identify Your Application’s Workload
To begin, we should understand how the application will flow. What will the user see in each view of the application? From the views we can identify the data that the application will need to store in the database.
Next we need to establish the CRUD operations (create, read, update, delete) that will need to be performed on the data.
One way to document this information is to populate a table similar to the one below during a planning event. Typically, the product owner will be present and have design documents (e.g., designs of the views) that help the database engineers understand the application requirements.
| Description of Action | CRUD Operation | Data In | Data Out | Frequency (1 = low, 10 = high) | Priority (1 = low, 2 = moderate, 3 = critical) |
|---|---|---|---|---|---|
| User creates an account | Create | email, username, password | User object | 5 | 3 |
2. Understand Common Design Patterns
Below is a list of common design patterns identified by consultants at MongoDB. Follow the links to learn about their design, use cases, risks, and benefits. Once you are familiar with these common design patterns then try to identify which patterns fit your application demands and begin to design your schema.
The Extended Reference Pattern
If a query requires data from two collections, then rather than store in one document only a reference (ObjectId) to a second document in a different collection, we can copy the data that is needed by the query from the second document into the first document.
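For example, in a hypothetical order-management application (the collection and field names here are illustrative assumptions, not taken from the tutorial), an order document might copy the customer’s name and shipping address from the customers collection so that displaying the order requires no $lookup:

{
  _id: <ObjectId>,
  orderDate: <Date>,
  customerId: <Customer ObjectId>,   // reference kept for updates
  shippingAddress: {                 // copied from the customer document
    name: "Joe Smith",
    street: "123 Main St",
    city: "Springfield"
  }
}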
The Inheritance Pattern
If you have data that can be viewed as instances of subclasses of a parent class, and a query will request instances of different subclasses together, you can store all of the data in a single collection, with the fields that are common across all documents listed as required. You can also use a discriminator field to identify each document’s subclass.
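As a sketch (the collection, field names, and sample values below are assumptions), a single media collection could hold books, audiobooks, and ebooks, with a media_type discriminator marking each document’s subtype:

// All media share _id, title, and author; each subtype adds its own fields.
{ _id: 1, media_type: "book",      title: "Atlas Shrugged", author: "Ayn Rand", pages: 1168 }
{ _id: 2, media_type: "audiobook", title: "Atlas Shrugged", author: "Ayn Rand", narrator: "Jane Doe", lengthMinutes: 3780 }
{ _id: 3, media_type: "ebook",     title: "Atlas Shrugged", author: "Ayn Rand", fileSizeMB: 3 }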
The Attribute Pattern
When you have multiple queries that search different fields of a document, you may need multiple indexes. If the searchable fields have a similar structure, you can reorganize the schema so that the searchable fields become key/value objects in an array and then search the array for objects that satisfy the search criteria. This strategy requires only a single compound index on the array’s key and value fields.
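A common illustration (the products collection and field names here are assumed) is a catalog in which each product has many searchable characteristics; moving them into an attributes array lets one index cover them all:

{
  _id: <ObjectId>,
  title: "Widget",
  attributes: [
    { k: "color", v: "red" },
    { k: "size", v: "large" },
    { k: "manufacturer", v: "Acme" }
  ]
}

// One compound index covers every attribute search.
db.products.createIndex({ "attributes.k": 1, "attributes.v": 1 })

// Find all red products.
db.products.find({ attributes: { $elemMatch: { k: "color", v: "red" } } })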
The Tree Pattern
When documents in a collection have a hierarchical, tree-like relationship, and queries require information in a document along with information in its ancestor documents, we can duplicate in each document the ancestor data that the query requires.
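For instance, a hypothetical categories collection (names assumed) could duplicate each category’s ancestors in an array so that a single read returns the full path:

{
  _id: "MongoDB",
  parent: "Databases",
  ancestors: [ "Books", "Programming", "Databases" ]
}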
The Preallocation Pattern
When the value of a field holds a subset of a known set of values, it may be appropriate to store the entire set of values in each document and indicate which values are selected. The authors note that this pattern is used less often than in the past, since the removal of the MMAPv1 storage engine reduced the benefit of preallocating space within a document.
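A sketch of the idea (the field names are illustrative): a theater-booking document stores every seat up front and flags which ones are reserved:

{
  _id: <ObjectId>,
  showing: <Date>,
  seats: [
    { seat: "A1", reserved: true },
    { seat: "A2", reserved: false },
    { seat: "A3", reserved: false }
    // ...one entry preallocated for every seat in the theater
  ]
}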
The Bucket Pattern
When collecting time-series data (data stamped with the date and time at which it was collected), rather than create a document for each reading, we can reduce the number of documents, and thereby the size of a search index, by grouping the data into buckets, where each bucket contains all of the data collected during a specified interval of time. Each bucket is stored as a single document along with fields that specify the interval it covers.
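As an example (the collection and field names below are assumptions), hourly sensor readings could be grouped into one document per sensor per day instead of one document per reading:

{
  _id: <ObjectId>,
  sensorId: "thermostat-12",
  bucketStart: ISODate("2024-05-01T00:00:00Z"),
  bucketEnd: ISODate("2024-05-02T00:00:00Z"),
  readings: [
    { time: ISODate("2024-05-01T00:00:00Z"), temperature: 20.1 },
    { time: ISODate("2024-05-01T01:00:00Z"), temperature: 19.8 }
  ],
  readingCount: 2
}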
The Outlier Pattern
Suppose the documents in a collection each have an array field that typically contains 10 to 100 elements, but one document needs to store thousands of elements in that field. One way to handle this is to add a field to the schema that indicates (when true, for example) that the document has an unusually large number of elements, and to store the overflow elements in a separate document in a separate collection.
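One way to sketch this (field names assumed): a book document whose purchasers array normally holds a modest number of customer ids gains a flag when the list overflows into another collection:

// Typical document: the purchasers array fits comfortably in the document.
{ _id: 1, title: "Atlas Shrugged", purchasers: [ "u101", "u102", "u103" ] }

// Outlier: only the first batch of ids is kept here; the remainder lives in
// overflow documents in a separate collection keyed by this book's _id.
{ _id: 2, title: "Bestseller", purchasers: [ "u104", "u105", "u106" ], hasOverflow: true }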
The Subset Pattern
When documents in a collection get too large we may need to create a new collection and transfer data that is not frequently used by the application from the documents in the original collection to documents in the new collection.
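For example (collection and field names assumed), a product document might keep only its most recent reviews embedded while the full review history moves to a separate reviews collection:

// products collection: only the subset needed to render the product page.
{
  _id: 1,
  name: "Widget",
  recentReviews: [
    { user: "joe", rating: 5, text: "Great!" },
    { user: "ann", rating: 4, text: "Good value." }
  ]
}

// reviews collection: the complete history, referenced by productId.
{ _id: <ObjectId>, productId: 1, user: "joe", rating: 5, text: "Great!", date: <Date> }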
The Computed Pattern
When we have frequent read requests that require computed statistics, rather than computing the stats when the requests are made, we can compute the stats when updates are made to the documents in the collection, or compute the stats at scheduled times.
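A minimal sketch, assuming a movies collection whose precomputed totals are maintained whenever a new screening is written (the variable and field names are assumptions):

// Update the running totals at write time so reads never have to aggregate.
db.movies.updateOne(
  { _id: movieId },
  { $inc: { totalViewers: screening.viewers, totalRevenue: screening.revenue } }
)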
The Approximation Pattern
As an application scales, resource-intensive statistical computations may not need to be performed every time a create or update operation occurs. If an approximate value is acceptable, we can limit how often the computations are performed, for example to only 1% of write operations.
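A rough sketch in shell/application code (the pages collection and views counter are assumptions): instead of incrementing a page-view counter on every hit, increment it by 100 on roughly 1% of hits:

// Only about 1 in 100 requests touches the counter, so it stays
// approximately correct while the write load drops dramatically.
if (Math.random() < 0.01) {
  db.pages.updateOne({ _id: pageId }, { $inc: { views: 100 } })
}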
The Document Versioning Pattern
When we need to keep a historical record of the changes that were made on a document we can 1) add a field to the document that indicates the current version number of the document and 2) create a separate collection to hold the old versions of the document.
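Sketching this with an insurance-policy example (collection and field names assumed): the current document carries a revision number, and each superseded revision is copied into a policy_revisions collection before the update is applied:

// policies collection: only the latest revision.
{ _id: 1, policyHolder: "Joe Smith", coverage: 250000, revision: 3 }

// policy_revisions collection: one document per superseded revision.
{ _id: <ObjectId>, policyId: 1, policyHolder: "Joe Smith", coverage: 100000, revision: 2 }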
The Schema Versioning Pattern
When a schema changes and we wish to allow in the collection both documents that satisfy the new schema and documents that satisfy the old schema, we can add a schema_version field to the schema. We can convert documents that satisfy the older schema to the new schema when we find it appropriate (e.g. on update or when the load is low).
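For example (field names assumed), a contacts collection could hold both the old document shape, which had a single phone field, and the new shape, which groups contact methods and records its schema_version:

// Old schema (no schema_version field).
{ _id: 1, name: "Joe Smith", phone: "555-1234" }

// New schema, marked with schema_version: 2.
{
  _id: 2,
  name: "Ann Jones",
  schema_version: 2,
  contactMethods: { cell: "555-5678", work: "555-9012", email: "ann@example.com" }
}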
3. Design Schema that Support the Application’s Entity Relationships
Queries that require data from multiple collections require join operations. In MongoDB, join operations are implemented with a $lookup stage in an aggregation pipeline (discussed later). These operations are relatively slow and should be avoided.
In this step we determine the relationships between entities and decide how to structure schema to minimize, if not avoid, $lookup operations.
Below are steps we can use to design schema that support the entity-relationships:
- Create an entity-relationship (ER) model using the application workload table.
- For each query that requires data from multiple entities, choose a strategy for storing the data.
Below are heuristics for determining which strategy you should choose for step 2.
Embedded Data Models
Embedded data models combine related data in a single document. For example, a database for a school library may embed data about a borrower in the same document that contains a book’s information. Here, we say that borrower is a subdocument.
{
  _id: <ObjectId>,
  title: "Atlas Shrugged",
  author: "Ayn Rand",
  borrower: {
    studentId: "12345",
    firstName: "Joe",
    lastName: "Smith"
  },
  dueDate: <Date>
}
At the same time, we might have a User document that contains the information about the books that Joe has checked out.
{
  _id: <ObjectId>,
  studentId: "12345",
  name: "Joe",
  checkedOut: [
    {
      title: "Atlas Shrugged",
      author: "Ayn Rand"
    },
    {
      title: "1984",
      author: "George Orwell"
    }
  ]
}
Embedded data models often result in denormalized data because data may be duplicated in multiple documents. This may be OK – duplication of data is a feature of NoSQL databases, not a bug. This strategy results in better performance for read operations, but may not be beneficial if the embedded data changes frequently. When using denormalized data we have to ensure consistency when updating data.
References
Alternatively, we can store in one document references to other documents. This strategy results in normalized data.
Using the same scenario as above we can store a reference (the _id property) to a student document in a book document.
{
  _id: <ObjectId>,
  title: "Atlas Shrugged",
  author: "Ayn Rand",
  borrower: <Student ObjectId>,
  dueDate: <Date>
}
We can also, if we choose, store the _id properties of the books in a student document.
{
  _id: <ObjectId>,
  studentId: "12345",
  name: "Joe",
  checkedOut: [
    <Book1 ObjectId>,
    <Book2 ObjectId>
  ]
}
If the client has the student document in memory (e.g., in localStorage), then to retrieve information on all of the books that the student has checked out, we need only perform one query using the ids in the checkedOut field.
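For instance, assuming a books collection and that the student document is held in a variable named student, a single $in query fetches every checked-out book:

// One query returns all of the books whose _id appears in checkedOut.
db.books.find({ _id: { $in: student.checkedOut } })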
Querying Multiple Data Models
If we need to perform a query that pulls data from two or more models, we can use a $lookup or $graphLookup stage in an aggregation pipeline.
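As a sketch of the book/borrower example above (the collection names are assumptions), a $lookup stage joins each book to its borrower’s student document:

db.books.aggregate([
  {
    $lookup: {
      from: "students",          // collection to join with
      localField: "borrower",    // ObjectId stored in the book document
      foreignField: "_id",       // matching field in the students collection
      as: "borrowerInfo"         // joined student documents land in this array
    }
  }
])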