Lots of people have asked where most of the allocations in Mozilla come from. I’ve gzipped some dtrace output that shows the number of malloc calls, broken down by allocation size, for each call stack (5 frames deep). Note: This log only shows allocations <= 2048 bytes. This data is pretty raw, but if people want to take a look at it and see if they have ideas for how to improve some of the code paths in question, that would be great.
An example:
libSystem.B.dylib`malloc+0x37
XUL`nsStringBuffer::Alloc(unsigned long)+0x15
XUL`nsACString_internal::MutatePrep(unsigned int, char**, unsigned int*)+0xce
XUL`nsACString_internal::ReplacePrep(unsigned int, unsigned int, unsigned int)+0x46
XUL`nsACString_internal::Assign(char const*, unsigned int)+0xc8
           value  ------------- Distribution ------------- count
               4 |                                         0
               8 |@@@@@@@@@@@@@                            9724
              16 |@@@@@@@@@@@@@@@@                         11940
              32 |@@@@@@                                   4359
              64 |@                                        887
             128 |@@                                       1604
             256 |@                                        497
             512 |@                                        676
            1024 |                                         47
            2048 |                                         0
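This shows that there were 9724 allocations in the 8 byte bucket, 11940 in the 16 byte bucket and so on. Each row counts allocations at least as large as its value and smaller than the next row’s value, so the 8 row covers sizes 8 through 15.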
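To place a given allocation size in one of these rows, here is a minimal C++ sketch. It assumes the histogram comes from dtrace's quantize() aggregation with power-of-two rows. QuantizeBucket is a hypothetical helper written for this illustration, not something from dtrace or the Mozilla tree.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns the label of the power-of-two row that a
// positive allocation size would be counted under, i.e. the largest power
// of two that is <= size.
std::size_t QuantizeBucket(std::size_t size) {
  assert(size > 0);
  std::size_t bucket = 1;
  // Double the bucket while the next power of two still fits in size.
  while (bucket <= size / 2) {
    bucket *= 2;
  }
  return bucket;
}
```

Under that assumption, a 12 byte string buffer lands in the 8 row and a 16 byte one lands in the 16 row.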
Things to look for include:
- Things with lots of allocations
- Things that could be stack allocated to avoid memory churn
- Things whose lifetime is well understood, which we could put into pools
- etc…
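As a rough illustration of the stack allocation idea in the list above, here is a minimal C++ sketch of a buffer that keeps small contents in inline storage and only calls the allocator for large requests. SmallBuffer and its inline capacity are hypothetical names chosen for this example. It is not code from the Mozilla tree.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical buffer: requests of up to N bytes use the inline array
// (which lives on the stack when the object is a local variable); only
// larger requests go to the heap.
template <std::size_t N>
class SmallBuffer {
 public:
  explicit SmallBuffer(std::size_t size) : size_(size) {
    if (size > N) {
      heap_ = std::make_unique<char[]>(size);  // the only heap allocation
    }
  }

  char* data() { return heap_ ? heap_.get() : inline_; }
  std::size_t size() const { return size_; }
  bool on_heap() const { return heap_ != nullptr; }

  // Bounds-checked element access (checked in debug builds only).
  char& at(std::size_t i) {
    assert(i < size_);
    return data()[i];
  }

 private:
  char inline_[N];
  std::unique_ptr<char[]> heap_;
  std::size_t size_;
};
```

With an inline capacity of 64, the 8 to 32 byte requests that dominate the histogram above would never reach malloc. The cost is a larger object on the stack.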
Edit: I’ve also posted another log with 8 frame deep stacks as well as a log that only includes allocations post-startup (also 8 frames deep).
Tags: dtrace, firefox, leaks, memory fragmentation, Mozilla, performance
November 13, 2007 at 6:50 pm |
perhaps CoW strings are causing more harm than good.
November 13, 2007 at 8:38 pm |
Pav, is it possible to get the information for large allocs also? At the moz2 meetup today we were pondering whether pldhash/jsdhashtable allocations (which can be very large blocks of memory) could be causing unnecessary fragmentation, and whether a plhash (chaining) system might be better for certain allocation types.
November 13, 2007 at 9:30 pm |
Benjamin: I can get data for larger allocations. I’ll do that tomorrow.
November 13, 2007 at 10:15 pm |
COW is not the problem per se, really. The main callers of MutatePrep I see in the logs there, including startup, are:
* CDATA section parsing (1500)
* AppendASCIItoUTF16 (~1000)
* nsCookieService::GetCookieInternal (~1500)
* nsHttpHeaderArray::Flatten (~1200)
* AppendUTF16toUTF8 (~2200)
* AppendUTF8toUTF16 (~800)
* nsStandardURL::BuildNormalizedSpec (~3000)
* nsHttpHeaderArray::SetHeader (~8600)
* nsStandardURL::BuildNormalizedSpec (~3500)
* nsCacheService::CreateRequest (~2500)
* nsStandardURL::SetRef (~6000)
* nsStandardURL::Resolve (~700)
* nsStandardURL::GetPath (~400)
I sort of have to stop, but unfortunately I’ve mostly looked at pretty big allocations there (a lot of the URI stuff does things in the 256+ byte range). For strings, it might be nice to do another log focusing on that 8-32 range.
The SetRef calls are from LoadBindingDocumentInfo. I wonder why we even hit that code 6000 times… I guess we have a lot of bindings.
Could we only allocate once in BuildNormalizedSpec somehow?
nsCacheService::CreateRequest should at the very least preallocate the right string size. Another option would be to use a key with two strings in it, not just a concatenation of the two strings as a single string key (and hope that if we use two strings they can both share their buffers with caller).
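A minimal sketch of the preallocation idea, using std::string as a stand-in for the Mozilla string classes. MakeCacheKey and the ':' separator are assumptions made for illustration and are not taken from nsCacheService.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: builds "<client_id>:<key>" after reserving the
// final length up front, so the appends never have to grow the buffer.
std::string MakeCacheKey(const std::string& client_id,
                         const std::string& key) {
  const std::size_t total = client_id.size() + 1 + key.size();
  std::string result;
  result.reserve(total);  // at most one buffer allocation, sized exactly
  result += client_id;
  result += ':';
  result += key;
  assert(result.size() == total);
  return result;
}
```

Without the reserve call, each append may outgrow the current buffer and trigger a reallocation plus a copy, which is the kind of churn these logs show.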
We should really make cookies output nsACString, not a char** and see how much of a difference that makes. I expect some. I also wonder whether it’s worth precomputing the length we’ll need and allocating it all at once instead of going through and doing one append per cookie. Depends on the typical cookie count, I guess…
Maybe we should do 9-frame stacks to see the callers of the string conversion functions?
November 14, 2007 at 10:18 pm |
Benjamin: Using a single large allocation for a double hashtable instead of lots of little ones for a hashtable with chaining is the right answer if the entry size is small enough. See the big comment near the top of jsdhash.h. If there is a user of {js,pl}dhash configuring too large an entry size, you get a warning in DEBUG builds.
But you didn’t mention entry size, so I’m wondering why you think a single large allocation would fragment worse than a bunch of small ones? The table growth uses malloc and free, not realloc. Ignoring overlarge entry size, a double hashtable is strictly less fragmenting than an open table with chaining.
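To make the comparison concrete, here is a minimal C++ sketch of a double hashed table whose entries all live inline in one array. It is a hypothetical illustration and not the jsdhash or pldhash implementation: it has a fixed capacity, supports no removal, and its hash functions are arbitrary choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical set of nonzero 32-bit keys. The whole table is a single
// array allocated once in the constructor; key 0 marks an empty slot.
class DoubleHashSet {
 public:
  explicit DoubleHashSet(std::size_t capacity) : slots_(capacity, 0) {
    // Capacity must be a nonzero power of two.
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
  }

  // Returns false only if the table is full and the key is absent.
  bool Insert(std::uint32_t key) {
    assert(key != 0);
    const std::size_t mask = slots_.size() - 1;
    std::size_t index = Hash1(key) & mask;
    // An odd step is coprime with a power-of-two size, so the probe
    // sequence visits every slot.
    const std::size_t step = Hash2(key) | 1;
    for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
      if (slots_[index] == 0 || slots_[index] == key) {
        slots_[index] = key;
        return true;
      }
      index = (index + step) & mask;
    }
    return false;
  }

  bool Contains(std::uint32_t key) const {
    const std::size_t mask = slots_.size() - 1;
    std::size_t index = Hash1(key) & mask;
    const std::size_t step = Hash2(key) | 1;
    for (std::size_t probes = 0; probes < slots_.size(); ++probes) {
      if (slots_[index] == key) return true;
      if (slots_[index] == 0) return false;  // empty slot ends the probe
      index = (index + step) & mask;
    }
    return false;
  }

 private:
  static std::uint32_t Hash1(std::uint32_t key) { return key * 0x9E3779B9u; }
  static std::uint32_t Hash2(std::uint32_t key) {
    return (key >> 16) ^ (key * 0x85EBCA6Bu);
  }

  std::vector<std::uint32_t> slots_;
};
```

A chained table would instead allocate one node per entry on top of its bucket array, so with small entries it produces many small blocks where this design produces one.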
/be
November 15, 2007 at 1:37 am |
[...] got a great community and people eager to solve these problems. We’re now equipped with data and ready to fight this [...]
November 15, 2007 at 5:45 am |
Heavy use of the stack has its own problems.
If a given function uses more than 4096 bytes of stack space, Microsoft’s compiler will add code to ensure that subsequent pages of memory are actually present. This makes the code larger and slower.
I wouldn’t be surprised to see other compilers generating similar code.
https://bugzilla.mozilla.org/show_bug.cgi?id=359453