Unicode surrogate-pair text cutting and pasting and undoing produces corruption

• Dec 23, 2018 - 01:56
Reported version
S4 - Minor

Text edit code is not correctly aware or doesn't properly handle surrogate pairs always.

Edit: the description that follows I've split off as a separate issue #280522: TextEdit edit mode lost after undo...shouldn't be deselected as I've determined that issue isn't even related to surrogate pairs. So disregard the rest of this description:

Was making a christmas tree out of different ways to say happy new year, but discovered that a cut and undo for some unicode would produce corrupted text.

Repro steps:
1. Open unicode test.mscz
2. Select this line:

Screenshot (30).png

  1. Cut via ctrl->x.
  2. Paste it in the space above.
  3. Undo with ctrl->z twice.
  4. Reselect that line
  5. Cut again via ctrl->x.

Result: That cut line is replaced with an up arrow:

Screenshot (31).png

That character shouldn't be there. Also doing some other operations with undo will produce similar unusual characters.

I think this might be a residual part of one of these high & low surrogate pair characters being detached from it's partner and unintentionally left behind or produced.

On 3.0 release candidate OS: Windows 10 (10.0), Arch.: x86_64, MuseScore version (64-bit):, revision: 96c1f7b


Here is a minimal score which can produce the problem: unicode-surrogate-pair-test.mscx . It is just one text box with one surrogate pair chinese character (U+2070E 𠜎). Unfortunately it doesn't actully render in musescore...looks like a rectangle with a question mark, so I guess that's a limitation of my fonts? But you can still select it, cut it, paste it, undo, undo, and select it and cut it again to produce the glitch.

I can also produce similar glitch in 2.3.2, although not quite with the same steps...the same steps produces a popup that action unavailable in current state.

so if I start with that chinese character: (U+2070E 𠜎)

when I it into the program from outside, I see the clipboard text being pasted in is the two values (a surrogate pair):

But when I then select it inside musescore's text box, cut it, and paste it again, the clipboard text is just one value (half a surrogate pair), and the value is 0x070e, which is missing the initial most significant digit of 2 of its unicode scalar value U+2070E

It seems during that initial paste, it properly determines that it it is a surrogate pair, and it does indeed grab the next QChar from the input text string and assigned it to the variable lowSurrogate. But then when it calls

insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

the QString generated is simply one 16-bit value:

    [0x00000000]    0x070e

And that value is the incorrect representation of that chinese character clearly, as it would need more bits than that.

The function

uint QChar::surrogateToUcs4(QChar high, QChar low)

returns a uint, which should be at least 32-bits, to hold the 31-bit UCS-4 representation of that surrogate pair. I'm noticing that the command:

                                insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

Is making a QString out of that returned value. But I wonder if QString constructor knows how to create a string out of a 32-bit uint?

None of the QString constructors seem to know how to deal with a uint as input:

QString(const QChar *unicode, int size = -1)
QString(QChar ch)
QString(int size, QChar ch)
QString(QLatin1String str)
QString(const QString &other)
QString(QString &&other)
QString(const char *str)
QString(const QByteArray &ba)

So That would expalin why the top-value of the surrogate pair is getting chopped of. Maybe there is a better way to create a QString from a surrogate pair?

I'm noticing werner's commit from 7 months ago (https://github.com/ericfont/MuseScore/commit/c6ccd635e2e8cb85a8438fe91a…) which enabled undo ability for pasted text (which is a nice ability to have), but it looks like this function got changed from:

insert(_cursor, QChar::surrogateToUcs4(highSurrogate, lowSurrogate));

into what it is now:

insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

Why did werner try to make a qstring out of the returned uint32 value that came from the UCS-4 representation of the surrogate pair? I'm thinking he should have created a QString of two QChars, the high surrogate followed by the low surrogate maybe...since that function is dealing with QStrings...

Basically the call QString(0x2070E) is forced to truncate that 32-bit uint representation into a simply the lowest two bytes of it: 0x070E, which explains why the paste is loosing that most significant digit (2).

changing that function into passing a 2-QChar string of the high & low surrogate pair then allows TextBase::insertText() to correctly recieve the chinese character as a surrogate pair, at least according to the debugger.

Interesting thing about the undo stack being corrupted...the corruption happens if I don't click outside of the textbox between cutting the text and pressing undo, but doesn't happen if I do click outside the textbox inbetween.

And I've determined the reason why the corruption doesn't happen if click outside the textbox in between is because the TextBase::endEdit() operation has a command score()->undoStack()->remove(ted->startUndoIdx) which does remove all undo/redo records for that text editing.

And I'm almost down to narrowing the problem during Cut...what I'm noticing in the debugger is only the highSurrogate of the pair (0xd841), not the full string of the surrogate pair, is being pushed to the undo stack during UndoStack::push() from the undo() in the editCut() call...

And what I'm noticing is that TextBase::deleteSelectedText() line 733 is almost likely the culprate...because it is calling score()->undo() with RemoveText that only consistent of the current Character at the cursor. But it seems clear to me that it needs to remove both QChars of the surrogate pair.

Indeed TextCursor::currentCharacter() returns a QChar, but what we want is a possibly a surrogate pair.

I almost wonder if this function should be changed to return a QString which is just one character long if not dealing with extended unicode, but which is two-character long surrogate pair if returning a surrogate pair.

Well maybe maybe not. It seems that currentCharacter() is used a few times with .isSpace()...and I don't want to confuse that behavior.

So what I think I will do instead is every time score()->undo() is called with RemoveText, I will make sure it remove surrogate pairs together.

Let me note the different places that currentCharacter() is used in a RemoveText undo operation:

inside TextBase::edit() for key Delete and key Backspace:
line 233: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed);
line 243: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed);

And the current place I know for sure it needs to be changed, in TextBase::deleteSelectedTex()
line 733: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed)

I think the behavior should be the same in all 3 calls...that the undo or deletion operates on a surrogate pair together as one unit.

I beleive curros move left and right according to units of Unicode characters at a time, so it takes into account adjusting text index twice if moving across a surrogate pair. So considering that in TextBase::deleteSelectedText() that just prior to the undo it moves the cursor left one, that would indicate to me that for sure the intended behavior of the RemoveText should be to remove both QChars of the unicode characters if it happens to be a surrogate pair.

Maybe I should make another function analogous to TextCursor::currentCharacter() which instead of returning just a QChar, would return a QString of what could potentially be a two-character surrogate pair.

It seems QString TextBlock::text(int col1, int len) is called by TextCursor::currentCharacter(), but actually does take into account surrogate pairs and does return a two-QChar surrogate pair if input len parameter is 1 if it happens that input parameter col1 lands on a surrogate pair.

ok...here's what I'm going to do...I'm going to differentiate between a "QChar" and a "UnicodeCharacter". A QChar is just a 16-bit QChar. While a UnicodeCharacter is a QString which could either be a single character long if not an extended unicode, or which would be two chacters long if an extended unicode.

I've split TextCursor::currentCharacter() into two separate functions for clarity:

currentVariableLengthUnicodeCharacter() returns a QString, possibly representing a unicode formed from a surrogate pair of QChars
currentQChar() returns a QChar, with same behavior of previous TextCursor::currentCharacter()

This is important distinction, because a "Unicode Character" could possibly refer to two QChars, if the unicode belongs to a supplemental plane.

I've also determined with confidence that in the cases where score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed); was called were for dealing with the concept of a complete "Unicode Character", not a single QChar.

I've also determined that TextBlock::remove(int column) operates removal on the granularity of Unicode Characters, not single 16-bit QChars. However, ChangeText::removeText(), which calls TextBlock::remove() doesn't, and instead operates on granularity of signle 16-bit QChars. This was causing problems when deleting, because removeText() would basically remove two unicode characters if the unicode character being removed was a Supplemental Unicode Character.