4 changes: 4 additions & 0 deletions twikit/tweet.py
@@ -198,6 +198,10 @@ def view_count_state(self) -> str | None:
    def has_community_notes(self) -> bool:
        return self._data.get('has_birdwatch_notes')

    @property
    def source(self) -> str | None:
        return self._data.get('source')

Comment on lines +201 to +204
🛠️ Refactor suggestion

Add documentation for the source property.

The property is missing from the class docstring. Add documentation to describe its purpose and return type.

Add the following to the class docstring after line 93:

    urls : :class:`list`
        Information about URLs contained in the tweet.
+   source : :class:`str` | None
+       The HTML source indicating the client/device used to post the tweet.
    full_text : :class:`str` | None
        The full text of the tweet.

💡 Verification agent

🧩 Analysis chain

Add HTML sanitization for the source property.

The property returns raw HTML, which is unsafe if it is rendered without sanitization. Consider either parsing the HTML to extract just the client name or sanitizing it, so that downstream consumers are not exposed to XSS attacks.
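As a rough sketch of the "extract just the client name" option, the standard library's `html.parser` can strip the anchor markup without adding a dependency. The helper name `extract_client_name` and the sample source string are assumptions made for illustration; neither is part of twikit.

```python
from html.parser import HTMLParser
from typing import List, Optional


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores all tags and attributes."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def extract_client_name(source_html: Optional[str]) -> Optional[str]:
    """Hypothetical helper: return only the visible text of the source value."""
    if not source_html:
        return None
    parser = _TextExtractor()
    parser.feed(source_html)
    parser.close()
    text = ''.join(parser.parts).strip()
    return text or None


# Illustrative input in the anchor-tag shape the review describes.
sample = '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>'
print(extract_client_name(sample))
```

This only removes markup. The returned text is not escaped, so it should still be escaped before being inserted into an HTML page.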

Let's check if there are any HTML sanitization libraries available for Python:

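Consider using an HTML sanitization library to handle the HTML content safely. bleach is the best-known option, but it is deprecated; the survey below lists maintained alternatives. Add the chosen library as a dependency in setup.py and implement sanitization in the property.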


🌐 Web query:

What are the recommended Python libraries for HTML sanitization?

💡 Result:

Python offers several robust libraries for HTML sanitization, each with distinct features catering to different security and performance needs:


1. pybluemonday (GitHub)

A high-performance library leveraging Go's bluemonday via bindings:

  • Features:
    • Whitelist-based sanitization with policies (UGCPolicy, StrictPolicy)
    • Globally allow/deny attributes per element
    • Benchmarked as ~60x faster than alternatives like Bleach[1]
  • Example:
    from pybluemonday import UGCPolicy
    s = UGCPolicy()
    print(s.sanitize("<script>alert(1)</script><b class='stuff'>test</b>"))  # <b>test</b>

2. nh3 (GitHub)

Rust-based bindings for Ammonia, optimized for speed:

  • Features:
    • Sanitizes HTML via strict allowlists
    • Benchmarked at ~20x faster than Bleach[10]
  • Installation: pip install nh3

3. Bleach (Docs)

A deprecated but widely used library:

  • Features:
    • Configurable allowlists for tags/attributes via Cleaner class[4][9]
    • Supports CSS sanitization and protocol restrictions (e.g., http, https)
  • Limitations: Slower than Rust/Go alternatives[1][10], no longer actively maintained[9].

4. html-sanitizer (PyPI)

Opinionated library for strict cleanup:

  • Features:
    • Converts inline styles to semantic tags (e.g., <span style="font-weight: bold"> becomes <strong>)[8]
    • Merges redundant tags and normalizes whitespace

Honorable Mentions: Custom Solutions

  • BeautifulSoup/lxml: Often used with custom rules but require manual implementation[6].
  • Django utilities: Built-in tools for basic sanitization in web frameworks[5][7].

Recommendations: For critical applications requiring speed and security, pybluemonday or nh3 are optimal. For legacy systems or simpler needs, Bleach or html-sanitizer remain viable despite lower performance.

Citations:


Sanitize HTML output in the source property (twikit/tweet.py: 201-204)

The source property currently returns raw HTML, which poses an XSS risk if it is rendered unescaped. Based on the library survey above, consider integrating an HTML sanitization library. There are a few good options:

  • pybluemonday or nh3 are high-performance, secure choices for sanitizing HTML.
  • Bleach remains an option for legacy code, but note that it is slower and deprecated.

Please add the chosen library as a dependency in setup.py and update the source property to sanitize the HTML—either by extracting and returning just the client name or by properly cleaning the HTML content.
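If adding a dependency is not desirable, one possible middle ground is to leave `source` as it is and expose an escaped variant next to it. The sketch below uses the standard library's `html.escape`. `TweetSketch` is a minimal stand-in for the real class, and `source_escaped` is a hypothetical property name; neither exists in twikit.

```python
import html
from typing import Optional


class TweetSketch:
    """Minimal stand-in for the real Tweet class (illustration only)."""

    def __init__(self, data: dict) -> None:
        self._data = data

    @property
    def source(self) -> Optional[str]:
        # Same behavior as the property added in this PR: the raw API value.
        return self._data.get('source')

    @property
    def source_escaped(self) -> Optional[str]:
        # Hypothetical addition: escape HTML metacharacters so the value is
        # displayed as literal text instead of being interpreted as markup.
        raw = self.source
        return None if raw is None else html.escape(raw, quote=True)


tweet = TweetSketch({'source': '<a href="x">App</a>'})
print(tweet.source_escaped)
```

Escaping differs from sanitizing: it keeps the full string but makes it inert, whereas a sanitizer such as nh3 would keep an allowlisted subset of the markup.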

    @property
    def quote(self) -> Tweet | None:
        if self._data.get('quoted_status_result'):