Running a website or having an online product of some form generally means that certain objects will always grow in your system.
If you are a content publisher like YouTube or Medium, then you must write your software expecting "always more" posts, pages, articles, comments, etc.
If you are selling a product on a platform like Odoo, then you should expect your sale orders, reviews, and payments to be ever-growing as well.
But what about products? Well, some sites sell a limited product set; maybe they manufacture it themselves and only add a new product every couple of months. On the flip side, if you're adding bulk products, potentially you're trying to be "the Amazon/Newegg of ABC".
Odoo has gone on the record (link to tweets or white papers) that they can support millions of products. This is true, but there are many different things you can do with products.
Scaling bulk products in Odoo can vary a lot depending on how you use them!
How long does it take to duplicate a product?
Let's say that when you duplicate a product it takes a single second. You would not expect it to take 10 seconds the next time you duplicate it.
It started with a simple enough discovery. "When I duplicate this product the site times out."
Indeed it did! In reality it was hitting memory limits, but that is only because we already had a higher than usual timeout set. #knowyourlimits
After some troubleshooting and profiling, it was discovered that:
1) it wasn't happening with all products
2) most of the time was being spent in an @api.depends computed field in product.attribute
The particular code exists in Odoo 13 to 16 (maybe earlier with additional modules/config).
for pa in self:
    pa.with_context(active_test=False).product_tmpl_ids = pa.attribute_line_ids.product_tmpl_id
As of today, it also exists in the soon-to-be-released 17.0.
If you know your way around Odoo code, you'll probably have an immediate "aha" and see where this is going, but the code is so short and clean that it's easy to analyze line by line.
The @api.depends line tells us how often or under which conditions this code will run. This function runs essentially whenever you add (or remove) an attribute line onto a product (.template). Basically, if you know how to add attributes and values to products, like making a red T-shirt from a T-shirt product, then this is the code that is running each time you do this. Of course it also runs when duplicating a product that has an attribute.
This immediately explains why (1) exists - the products that duplicate fast don't use attributes (or use attributes that are less commonly used...)
In fact, regular product management workflows, like using export/import mechanisms to add attributes or to add products in bulk that use attributes, will trigger this code over and over and compound the behavior.
So why is most of the time being spent inside this function when duplicating a product? You'd think that just adding a row to this many2many relationship wouldn't be expensive, right?
What is the ORM doing here?
There is a kind of 'hidden mapped' going on here: we're assigning a recordset to a many2many field.
pa.mapped('attribute_line_ids.product_tmpl_id') is an equivalent form that I feel expresses the idea better. The effect is that we're finding every product template that has an attribute line for pa, the product.attribute.
The more common the attribute, like "Brand" (which might be on every product you sell), the closer this recordset's size is to your overall product catalog.
Essentially, the recordset being 'written' to pa.product_tmpl_ids is the entire set of products that use pa.
If you have 100k products, duplicating a product with an attribute on all of the products will already be a problem. By 300k products, it likely won't complete anymore.
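The scaling problem can be illustrated in plain Python. This toy model (class and field names mirror the Odoo ones, but it is an illustration, not the ORM) shows that every time the compute runs for a common attribute, it rebuilds a set proportional to the whole catalog:

```python
# Toy stand-ins for product.attribute and its attribute lines.
class AttributeLine:
    def __init__(self, product_tmpl_id):
        self.product_tmpl_id = product_tmpl_id

class ProductAttribute:
    def __init__(self, lines):
        self.attribute_line_ids = lines
        self.product_tmpl_ids = set()

    def compute_products(self):
        # the 'hidden mapped': union of product_tmpl_id over every line
        self.product_tmpl_ids = {l.product_tmpl_id for l in self.attribute_line_ids}

# A "Brand"-style attribute used by 100,000 products:
brand = ProductAttribute([AttributeLine(n) for n in range(100_000)])
brand.compute_products()
print(len(brand.product_tmpl_ids))  # 100000, rewritten on *every* trigger
```

Adding one more product with this attribute re-runs the compute over all 100,001 lines; in the real ORM that also means flushing a 100,001-record write to the relation table.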
How to fix it
The reason for this field to exist is to aid in search and filtering on the website. This is reasonable and expected. Faceted search is a staple on eCommerce platforms and is relied upon to filter to your preferred brand, color, etc.
My goal was to make duplicating a product, with its attributes and values, take no additional time at all for this field's calculation.
I was able to achieve this by re-declaring the field, along with the computed field number_related_products, as stored and no longer computed.
Now we 'own' the process of updating these fields in the database and can optimize and time them to our needs.
For both (the relationship table and the number/count field), pure SQL is the obvious answer. We avoid the ORM and let Postgres do the same "work".
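A set-based rebuild can be sketched with sqlite3 from the standard library so it runs anywhere; the table and column names here are illustrative, not the exact Odoo schema, and the real implementation executes similar statements against Postgres via cr.execute:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Illustrative schemas; real Odoo table/column names differ.
cur.execute("CREATE TABLE attribute_line (attribute_id INT, product_tmpl_id INT)")
cur.execute("CREATE TABLE attr_tmpl_rel (attribute_id INT, product_tmpl_id INT)")
cur.execute("CREATE TABLE product_attribute (id INT, number_related_products INT)")
cur.execute("INSERT INTO product_attribute VALUES (1, 0)")
cur.executemany("INSERT INTO attribute_line VALUES (1, ?)",
                [(n,) for n in range(50_000)])

# Set-based rebuild of the m2m table: two statements, no recordsets.
cur.execute("DELETE FROM attr_tmpl_rel WHERE attribute_id = 1")
cur.execute("""
    INSERT INTO attr_tmpl_rel (attribute_id, product_tmpl_id)
    SELECT DISTINCT attribute_id, product_tmpl_id
    FROM attribute_line
    WHERE attribute_id = 1
""")
# Refresh the stored count from the rebuilt relation.
cur.execute("""
    UPDATE product_attribute
    SET number_related_products = (
        SELECT COUNT(*) FROM attr_tmpl_rel
        WHERE attr_tmpl_rel.attribute_id = product_attribute.id
    )
""")
print(cur.execute(
    "SELECT number_related_products FROM product_attribute WHERE id = 1"
).fetchone()[0])  # 50000
```

The database scans the source rows once and writes the relation in bulk, instead of materializing 50,000 records in Python memory first.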
Though there are enormous speed and efficiency gains from doing the computation in SQL, I still do not do it every time a new product is added. Instead, a new scheduled action recomputes these fields regularly. Additionally, a user might want to run it for an attribute or two, so a server action helps the UX.
But what about all the call sites? These fields are used to prevent deleting attributes that have existing relationships!
Patch them: when they would succeed because the stored count of products is zero, call the re-index method on the attribute first to make sure that zero is current. If it is truly zero, the operation is very cheap and analogous to the original intent of the computed field.
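The guard logic reduces to a small pattern, sketched here in plain Python (the real patch wraps Odoo's unlink checks; the names and dict shapes below are hypothetical stand-ins):

```python
def can_delete(attribute, reindex):
    # Only trust a zero count after refreshing it. Re-indexing an
    # attribute that really has no products is cheap by definition.
    if attribute["number_related_products"] == 0:
        reindex(attribute)
        return attribute["number_related_products"] == 0
    return False  # non-zero stored count: definitely still in use

def fake_reindex(attribute):
    # Stand-in for the SQL re-index: recount from the source rows.
    attribute["number_related_products"] = len(attribute["lines"])

stale = {"number_related_products": 0, "lines": [1, 2]}  # stale zero
empty = {"number_related_products": 0, "lines": []}      # truly unused

print(can_delete(stale, fake_reindex))  # False, the zero was stale
print(can_delete(empty, fake_reindex))  # True
```

A non-zero stored count short-circuits immediately, so the expensive path only runs when a delete looks like it might be allowed.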
Some of the uses, like the smart button action action_open_related_products, were optimized to return their results faster by using the ORM's aggregation methods.
The main "less" here is that we no longer touch this relationship at all when creating or updating products. This makes things like import scripts/files perform much, much better and restores most of the UX feel of working as a product manager with large product sets.
However, another significant "less" is using optimized SQL instead of the ORM. The ORM method ends up spending a lot of time handling very large recordsets that don't benefit in any significant way from being recordsets/objects. No in-Python filtering or methods were called in the original code, so the possibility of hitting timeouts and memory limits can simply be avoided.
All in all, significantly "less" IO, postgres locks, and memory usage.