Unicode surrogate-pair text cutting and pasting and undoing produces corruption

ericfontainejazz

• Dec 23, 2018 - 01:56

Reported version

3.0

Type

Functional

Frequency

Once

Severity

S4 - Minor

Reproducibility

Always

Status

active

Regression

Workaround

Project

MuseScore

Text edit code is not correctly aware or doesn't properly handle surrogate pairs always.

Edit: the description that follows I've split off as a separate issue #280522: TextEdit edit mode lost after undo...shouldn't be deselected as I've determined that issue isn't even related to surrogate pairs. So disregard the rest of this description:

Was making a christmas tree out of different ways to say happy new year, but discovered that a cut and undo for some unicode would produce corrupted text.

Repro steps:
1. Open unicode test.mscz
2. Select this line:

Screenshot (30).png

Cut via ctrl->x.
Paste it in the space above.
Undo with ctrl->z twice.
Reselect that line
Cut again via ctrl->x.

Result: That cut line is replaced with an up arrow:

Screenshot (31).png

That character shouldn't be there. Also doing some other operations with undo will produce similar unusual characters.

I think this might be a residual part of one of these high & low surrogate pair characters being detached from it's partner and unintentionally left behind or produced.

On 3.0 release candidate OS: Windows 10 (10.0), Arch.: x86_64, MuseScore version (64-bit): 3.0.0.4747, revision: 96c1f7b

Comments

ericfontainejazz

• Dec 23, 2018 - 01:56

The good news is clicking outside of the text box will remove that undesired up arrow. But it shouldn't be there in the first place.

ericfontainejazz

• Dec 23, 2018 - 02:04

there will also be error when pasting after that final cut...the clipboard seems to contain the prior contents of the clipboard.

ericfontainejazz

• Dec 23, 2018 - 04:36

Here is a minimal score which can produce the problem: unicode-surrogate-pair-test.mscx . It is just one text box with one surrogate pair chinese character (U+2070E 𠜎). Unfortunately it doesn't actully render in musescore...looks like a rectangle with a question mark, so I guess that's a limitation of my fonts? But you can still select it, cut it, paste it, undo, undo, and select it and cut it again to produce the glitch.

ericfontainejazz

• Dec 23, 2018 - 04:44

I can also produce similar glitch in 2.3.2, although not quite with the same steps...the same steps produces a popup that action unavailable in current state.

ericfontainejazz

• Dec 23, 2018 - 06:29

so if I start with that chinese character: (U+2070E 𠜎)

when I it into the program from outside, I see the clipboard text being pasted in is the two values (a surrogate pair):
0xd841
0xdf0e

But when I then select it inside musescore's text box, cut it, and paste it again, the clipboard text is just one value (half a surrogate pair), and the value is 0x070e, which is missing the initial most significant digit of 2 of its unicode scalar value U+2070E

ericfontainejazz

• Dec 23, 2018 - 06:37

It seems during that initial paste, it properly determines that it it is a surrogate pair, and it does indeed grab the next QChar from the input text string and assigned it to the variable lowSurrogate. But then when it calls

insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

the QString generated is simply one 16-bit value:

    [0x00000000]    0x070e

And that value is the incorrect representation of that chinese character clearly, as it would need more bits than that.

ericfontainejazz

• Dec 23, 2018 - 06:42

The function

uint QChar::surrogateToUcs4(QChar high, QChar low)

returns a uint, which should be at least 32-bits, to hold the 31-bit UCS-4 representation of that surrogate pair. I'm noticing that the command:

                                insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

Is making a QString out of that returned value. But I wonder if QString constructor knows how to create a string out of a 32-bit uint?

ericfontainejazz

• Dec 23, 2018 - 06:44

None of the QString constructors seem to know how to deal with a uint as input:

QString()
QString(const QChar *unicode, int size = -1)
QString(QChar ch)
QString(int size, QChar ch)
QString(QLatin1String str)
QString(const QString &other)
QString(QString &&other)
QString(const char *str)
QString(const QByteArray &ba)

So That would expalin why the top-value of the surrogate pair is getting chopped of. Maybe there is a better way to create a QString from a surrogate pair?

ericfontainejazz

• Dec 23, 2018 - 06:50

I'm noticing werner's commit from 7 months ago (https://github.com/ericfont/MuseScore/commit/c6ccd635e2e8cb85a8438fe91a…) which enabled undo ability for pasted text (which is a nice ability to have), but it looks like this function got changed from:

insert(_cursor, QChar::surrogateToUcs4(highSurrogate, lowSurrogate));

into what it is now:

insertText(ed, QString(QChar::surrogateToUcs4(highSurrogate, lowSurrogate)));

Why did werner try to make a qstring out of the returned uint32 value that came from the UCS-4 representation of the surrogate pair? I'm thinking he should have created a QString of two QChars, the high surrogate followed by the low surrogate maybe...since that function is dealing with QStrings...

ericfontainejazz

• Dec 23, 2018 - 06:56

Basically the call QString(0x2070E) is forced to truncate that 32-bit uint representation into a simply the lowest two bytes of it: 0x070E, which explains why the paste is loosing that most significant digit (2).

ericfontainejazz

• Dec 23, 2018 - 07:07

changing that function into passing a 2-QChar string of the high & low surrogate pair then allows TextBase::insertText() to correctly recieve the chinese character as a surrogate pair, at least according to the debugger.

ericfontainejazz

• Dec 23, 2018 - 07:08

And I see Score::undo() has properly received that entirety of the surrogate pair chinese character as input.

ericfontainejazz

• Dec 23, 2018 - 07:10

And that chinese character got pasted correctly into musescore:

ericfontainejazz

• Dec 23, 2018 - 07:11

and I can cut that chinese character and repaste it now!!!:

ericfontainejazz

• Dec 23, 2018 - 07:12

Undo stack is still corrupted though...so I've only solved half of the problem so far...

ericfontainejazz

• Dec 23, 2018 - 07:28

Interesting thing about the undo stack being corrupted...the corruption happens if I don't click outside of the textbox between cutting the text and pressing undo, but doesn't happen if I do click outside the textbox inbetween.

ericfontainejazz

• Dec 23, 2018 - 08:07

And I've determined the reason why the corruption doesn't happen if click outside the textbox in between is because the TextBase::endEdit() operation has a command score()->undoStack()->remove(ted->startUndoIdx) which does remove all undo/redo records for that text editing.

ericfontainejazz

• Dec 23, 2018 - 08:12

And I'm almost down to narrowing the problem during Cut...what I'm noticing in the debugger is only the highSurrogate of the pair (0xd841), not the full string of the surrogate pair, is being pushed to the undo stack during UndoStack::push() from the undo() in the editCut() call...

ericfontainejazz

• Dec 23, 2018 - 08:14

And what I'm noticing is that TextBase::deleteSelectedText() line 733 is almost likely the culprate...because it is calling score()->undo() with RemoveText that only consistent of the current Character at the cursor. But it seems clear to me that it needs to remove both QChars of the surrogate pair.

ericfontainejazz

• Dec 23, 2018 - 08:16

Indeed TextCursor::currentCharacter() returns a QChar, but what we want is a possibly a surrogate pair.

I almost wonder if this function should be changed to return a QString which is just one character long if not dealing with extended unicode, but which is two-character long surrogate pair if returning a surrogate pair.

Well maybe maybe not. It seems that currentCharacter() is used a few times with .isSpace()...and I don't want to confuse that behavior.

So what I think I will do instead is every time score()->undo() is called with RemoveText, I will make sure it remove surrogate pairs together.

ericfontainejazz

• Dec 23, 2018 - 08:23

Let me note the different places that currentCharacter() is used in a RemoveText undo operation:

inside TextBase::edit() for key Delete and key Backspace:
line 233: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed);
line 243: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed);

And the current place I know for sure it needs to be changed, in TextBase::deleteSelectedTex()
line 733: score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed)

I think the behavior should be the same in all 3 calls...that the undo or deletion operates on a surrogate pair together as one unit.

ericfontainejazz

• Dec 23, 2018 - 08:28

I beleive curros move left and right according to units of Unicode characters at a time, so it takes into account adjusting text index twice if moving across a surrogate pair. So considering that in TextBase::deleteSelectedText() that just prior to the undo it moves the cursor left one, that would indicate to me that for sure the intended behavior of the RemoveText should be to remove both QChars of the unicode characters if it happens to be a surrogate pair.

ericfontainejazz

• Dec 23, 2018 - 08:31

Maybe I should make another function analogous to TextCursor::currentCharacter() which instead of returning just a QChar, would return a QString of what could potentially be a two-character surrogate pair.

ericfontainejazz

• Dec 23, 2018 - 08:37

It seems QString TextBlock::text(int col1, int len) is called by TextCursor::currentCharacter(), but actually does take into account surrogate pairs and does return a two-QChar surrogate pair if input len parameter is 1 if it happens that input parameter col1 lands on a surrogate pair.

ericfontainejazz

• Dec 23, 2018 - 08:41

ok...here's what I'm going to do...I'm going to differentiate between a "QChar" and a "UnicodeCharacter". A QChar is just a 16-bit QChar. While a UnicodeCharacter is a QString which could either be a single character long if not an extended unicode, or which would be two chacters long if an extended unicode.

ericfontainejazz

• Dec 23, 2018 - 09:05

I've split TextCursor::currentCharacter() into two separate functions for clarity:

currentVariableLengthUnicodeCharacter() returns a QString, possibly representing a unicode formed from a surrogate pair of QChars
currentQChar() returns a QChar, with same behavior of previous TextCursor::currentCharacter()

This is important distinction, because a "Unicode Character" could possibly refer to two QChars, if the unicode belongs to a supplemental plane.

I've also determined with confidence that in the cases where score()->undo(new RemoveText(_cursor, QString(_cursor->currentCharacter())), &ed); was called were for dealing with the concept of a complete "Unicode Character", not a single QChar.

ericfontainejazz

• Dec 23, 2018 - 09:31

I've also determined that TextBlock::remove(int column) operates removal on the granularity of Unicode Characters, not single 16-bit QChars. However, ChangeText::removeText(), which calls TextBlock::remove() doesn't, and instead operates on granularity of signle 16-bit QChars. This was causing problems when deleting, because removeText() would basically remove two unicode characters if the unicode character being removed was a Supplemental Unicode Character.

ericfontainejazz

• Dec 23, 2018 - 12:16

I've discovered something while debugging which turned out to be a side issue:

#280522: TextEdit edit mode lost after undo...shouldn't be deselected

I will fix together because I think deals with similar code...but these are separate issues.

ericfontainejazz

• Dec 23, 2018 - 12:57

Temporary PR which basically solves this issue: https://github.com/musescore/MuseScore/pull/4476

But I need to finalize and also solve that other related issue.

ericfontainejazz

• Dec 27, 2018 - 16:50

That other linked issue #280522: TextEdit edit mode lost after undo...shouldn't be deselected turns out to actually be not related to this. So the two PR's can be evaluated independently.

Philzen

• Apr 13, 2020 - 00:34

Severity

S2 - Critical

⇒

S4 - Minor

#303762: [EPIC] Collection of Undo issues